Sound to Text: Automated Audio Captioning using Deep Learning

Xinhao Mei

doi:10.15126/thesis.901197

Doctoral Thesis

Open access

University of Surrey

Doctor of Philosophy (PhD), University of Surrey

DOI: https://doi.org/10.15126/thesis.901197

Abstract

Audio Understanding

Language Generation

Audio Captioning

Multimodal Learning

Automated audio captioning (AAC) is the task of describing the ambient sound within an audio clip using a natural language sentence, bridging the gap between auditory perception and linguistic expression. AAC requires not only identifying the sound events and acoustic scenes within an audio clip but also interpreting their relationships and summarizing the audio content in descriptive language. AAC has gained significant attention and seen considerable progress in recent years, yet the field continues to face numerous challenges. This thesis studies automated audio captioning from three perspectives: model architectures, data scarcity, and diversity in generated captions.

Because AAC translates sound into text, it is a sequence-to-sequence task, and existing approaches therefore follow an encoder-decoder paradigm built on deep learning. Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are commonly used as the audio encoder, but both have limitations in modelling long audio signals. We introduce the Audio Captioning Transformer (ACT), a novel, fully Transformer-based model that overcomes these limitations. The self-attention mechanism of ACT enables better modelling of the local and global dependencies in audio signals. Our findings highlight the critical role of the audio encoder in an AAC system.

Data scarcity is a major issue in the field of AAC. Collecting audio captioning datasets manually is expensive and time-consuming; as a result, existing audio captioning datasets are all limited in size. To address this issue, we source audio clips and their metadata (e.g., raw descriptions and audio tags) from three web platforms and one audio tagging dataset. We devise a three-stage processing pipeline to filter and transform noisy raw descriptions into audio captions with the help of ChatGPT, a powerful conversational large language model. Consequently, we introduce the WavCaps dataset, the first large-scale, weakly-labelled audio captioning dataset for audio-language multimodal research, containing 403,050 audio clips with paired captions. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and achieve new state-of-the-art results on the main AAC benchmarks.

Finally, different people may interpret and describe the same audio scene in diverse ways, leading to a wide range of possible captions for a single audio clip. However, captions generated by existing audio captioning systems are deterministic, simple and generic. We argue that an effective audio captioning system should be capable of producing diverse captions for both a single audio clip and across similar clips. To achieve this, we introduce an adversarial training framework utilizing a conditional generative adversarial network (C-GAN) to enhance the diversity of audio captioning systems. Our experiments on the Clotho dataset demonstrate that our model outperforms state-of-the-art methods in generating more diverse captions.

Files and links (5)

pdf

Xinhao_Mei_PhD_Thesis (11.72 MB)

PDF, CC BY-NC-SA V4.0, Open Access

url

https://ieeexplore.ieee.org/document/10572302

WavCaps Paper for Chapter 4

url

https://ieeexplore.ieee.org/abstract/document/10568388

Paper for Chapter 5

url

https://dcase.community/documents/workshop2021/proceedings/DCASE2021Workshop_Mei_68.pdf

Paper for Chapter 3

url

https://link.springer.com/article/10.1186/s13636-022-00259-2

Paper for Chapter 2

Details

Title: Sound to Text: Automated Audio Captioning using Deep Learning
Creators: Xinhao Mei - University of Surrey, School of Computer Science and Electronic Engineering
Contributors: Wenwu Wang (Supervisor) - University of Surrey, School of Computer Science and Electronic Engineering
Awarding Institution: University of Surrey; Doctor of Philosophy (PhD)
Theses and Dissertations: Doctor of Philosophy (PhD), University of Surrey
Publisher: University of Surrey
Identifiers: 99905666402346
Academic Unit: School of Computer Science and Electronic Engineering
Resource Type: Doctoral Thesis
