AudioSetCaps: An Enriched Audio-Caption Dataset Using Automated Generation Pipeline With Large Audio and Language Models

Jisheng Bai; Haohe Liu; Mou Wang; Dongyuan Shi; Wenwu Wang; Mark D. Plumbley; Woon-Seng Gan; Jianfeng Chen

doi:10.1109/TASLPRO.2025.3583354

Journal article

IEEE Transactions on Audio, Speech and Language Processing, Vol.33, pp.2817-2829

26/06/2025

DOI: https://doi.org/10.1109/TASLPRO.2025.3583354

Abstract

Keywords: Annotations; Audio-language learning; audio-language models; audio-text retrieval; automated audio captioning; Electronic mail; Instruments; Large language models; Pipelines; Scalability; Speech processing; Transforms; Acoustics; Data Mining

With the emergence of audio-language models, constructing large-scale paired audio-language datasets has become essential yet challenging for model development, primarily due to the time-intensive and labour-heavy demands involved. While large language models (LLMs) have improved the efficiency of synthetic audio caption generation, current approaches struggle to effectively extract and incorporate detailed audio information. In this paper, we propose an automated pipeline that integrates audio-language models for fine-grained content extraction, LLMs for synthetic caption generation, and a contrastive language-audio pretraining (CLAP) model-based refinement process to improve the quality of captions. Specifically, we employ prompt chaining techniques in the content extraction stage to obtain accurate and fine-grained audio information, and we use the refinement process to mitigate potential hallucinations in the generated captions. Leveraging the AudioSet dataset and the proposed approach, we create AudioSetCaps, a dataset comprising 1.9 million audio-caption pairs, the largest audio-caption dataset at the time of writing. The models trained with AudioSetCaps achieve state-of-the-art performance on audio-text retrieval, with R@1 scores of 46.3% for text-to-audio and 59.7% for audio-to-text retrieval, and on automated audio captioning, with a CIDEr score of 84.8. As our approach has shown promising results with AudioSetCaps, we create another dataset containing 4.1 million synthetic audio-language pairs based on the YouTube-8M and VGGSound datasets. To facilitate research in audio-language learning, we have made our pipeline, datasets with 6 million audio-language pairs,

Details

Title: AudioSetCaps: An Enriched Audio-Caption Dataset Using Automated Generation Pipeline With Large Audio and Language Models
Creators: Jisheng Bai (Author) - School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China
Haohe Liu (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Mou Wang (Author) - Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Dongyuan Shi (Author) - School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China
Wenwu Wang (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Mark D. Plumbley (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Woon-Seng Gan (Author) - School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore
Jianfeng Chen (Corresponding Author) - School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China
Publication Details: IEEE Transactions on Audio, Speech and Language Processing, Vol.33, pp.2817-2829
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 13
Publication Date: 26/06/2025
Date accepted for publication: 12/06/2025
Grants: AI for Sound, EP/T019751/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
EP/Y028805/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
Grant note: This work was partly supported by the China Scholarship Council during a visit of Jisheng Bai to Nanyang Technological University. This research was partly supported by Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/T019751/1 and EP/Y028805/1, a Research Gift from Adobe, and a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey and BBC R&D.
Identifiers: 991011566502346; WOS:001527198900001
Academic Unit: School of Computer Science and Electronic Engineering
Language: English
Resource Type: Journal article
