Abstract
The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown promise in solving a wide variety of language and vision understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capabilities. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter that extends LLMs and VLMs to the audio domain by injecting audio embeddings into the input of LLMs, namely soft prompting. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both the input text and sounds, as inputs to the language model. To mitigate data scarcity in the audio domain, we propose a curriculum learning strategy that formulates diverse audio tasks in a sequential manner. Moreover, we improve the audio-language model by using interleaved audio-text embeddings as the input sequence. Because this improved model imposes no constraints on the input format, it is capable of tackling diverse modelling tasks, such as few-shot audio classification and audio comparison. To further evaluate the advanced abilities of audio networks, we introduce natural language audio reasoning (NLAR), a new task that analyses two audio clips through comparison and summarisation. Experiments show that APT-enhanced LLMs (namely APT-LLMs) achieve results competitive with expert models (i.e., networks trained on the target datasets) across various tasks. Finally, we demonstrate APT's ability to extend frozen VLMs to the audio domain without fine-tuning, achieving promising results on audio-visual question answering. Our code and model weights will be released at https://github.com/JinhuaLiang/APT.

Index Terms—Audio understanding, large language model, audio-language learning, audio recognition, automated audio captioning, natural language audio reasoning.
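As a rough illustration of the soft-prompting idea summarised above (a minimal sketch, not the authors' implementation; the module name, aligner design, and dimensions are assumptions), audio features can be projected into the LLM embedding space and prepended to the text token embeddings of a frozen language model:

```python
import torch
import torch.nn as nn

class SoftPromptAudioAdapter(nn.Module):
    """Illustrative sketch: project audio features into the LLM embedding
    space and prepend them to the text token embeddings (soft prompting).
    Names and dimensions are assumptions, not the paper's implementation."""

    def __init__(self, audio_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Simple linear aligner; APT's instruction-aware aligner is richer,
        # conditioning on both the input text and the sounds.
        self.aligner = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, n_audio_tokens, audio_dim) from an audio encoder
        # text_embeds: (batch, n_text_tokens, llm_dim) from the frozen LLM's embedding layer
        soft_prompts = self.aligner(audio_feats)            # (batch, n_audio_tokens, llm_dim)
        # Prepending is the simplest case; interleaving audio and text
        # embeddings generalises this concatenation.
        return torch.cat([soft_prompts, text_embeds], dim=1)


# Usage sketch: the concatenated sequence is fed to the frozen LLM as input embeddings.
adapter = SoftPromptAudioAdapter()
audio = torch.randn(1, 32, 768)       # assumed audio-encoder output shape
text = torch.randn(1, 16, 4096)       # assumed text-prompt embedding shape
inputs_embeds = adapter(audio, text)  # (1, 48, 4096)
```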