WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

Xinhao Mei; Haohe Liu; Qiuqiang Kong; Tom Ko; Mark D. Plumbley; Yuexian Zou; Wenwu Wang

doi:10.1109/TASLP.2024.3419446

Journal article

Open access

Peer reviewed

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.32, pp.3339-3354

26/06/2024

DOI: https://doi.org/10.1109/TASLP.2024.3419446

Keywords: Audio captioning; audio-language dataset; multimodal learning; ChatGPT; deep learning; acoustics

Abstract

Audio-language (AL) multimodal learning has advanced significantly in recent years, yet the limited size of existing audio-language datasets poses challenges for researchers because collecting such data is costly and time-consuming. To address this data scarcity, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, in which ChatGPT, a large language model, is used to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. We hope that WavCaps will facilitate research in audio-language multimodal learning and demonstrate the potential of large language models (LLMs) to enhance academic research. Our dataset and code are available at https://github.com/XinhaoMei/WavCaps.

Files and links (1)

WavCaps_A_ChatGPT-Assisted_Weakly-Labelled_Audio_Captioning_Dataset_for_Audio-Language_Multimodal_Research (PDF, 5.14 MB)

Author's Accepted Manuscript Open Access

Details

Title: WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Creators: Xinhao Mei (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Haohe Liu (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Qiuqiang Kong (Author) - Chinese University of Hong Kong
Tom Ko (Author) - Speech, Audio & Music Intelligence (SAMI), ByteDance
Mark D. Plumbley (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Yuexian Zou (Author) - Peking University
Wenwu Wang (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Publication Details: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.32, pp.3339-3354
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Date published: 26/06/2024
Grants: AI for Sound, EP/T019751/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
Automated Captioning of Image and Audio for Visually and Hearing Impaired, 623805725, British Council (United Kingdom, London)
Grant note: This work is supported partly by a Newton Institutional Links Award from the British Council, titled “Automated Captioning of Image and Audio for Visually and Hearing Impaired” (Grant number 623805725), and a grant EP/T019751/1 from the Engineering and Physical Sciences Research Council (EPSRC).
Identifiers: 99922264802346
Copyright: © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising. The authors wish to thank the associate editor and the reviewers for their helpful comments to further improve this work.
Academic Unit: University of Surrey
Language: English
Resource Type: Journal article
