Towards Generating Diverse Audio Captions Via Adversarial Training

Xinhao Mei; Xubo Liu; Jianyuan Sun; Mark D. Plumbley; Wenwu Wang

doi:10.1109/TASLP.2024.3416686

Journal article

Open access

Peer reviewed

IEEE/ACM transactions on audio, speech, and language processing, Vol.32, pp.3311-3323

21/06/2024

Abstract

Keywords: Audio captioning; cross-modal task; deep learning; GANs; Generators; Hybrid power systems; Maximum likelihood estimation; Measurement; reinforcement learning; Task analysis; Training; Semantics

Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention, and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe the clip from various aspects using distinct words and grammar. We believe that an audio captioning system should be able to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly: the caption generator can be any standard encoder-decoder captioning model, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity than state-of-the-art methods.
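The abstract's notion of caption diversity can be made concrete with a simple lexical measure. The sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) over a set of captions. This is a commonly used diversity measure and is offered only as an illustration; it is not necessarily the metric reported in the paper. The function name `distinct_n` and the example captions are hypothetical and are not drawn from Clotho.

```python
from collections import Counter
from typing import List


def distinct_n(captions: List[str], n: int = 1) -> float:
    """Return the ratio of unique n-grams to total n-grams across captions."""
    ngrams = Counter()
    for caption in captions:
        # Whitespace tokenisation is an assumption made for brevity.
        tokens = caption.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


# Hypothetical captions for one audio clip (illustrative only).
generic = ["a dog is barking", "a dog is barking", "a dog is barking"]
varied = [
    "a dog is barking",
    "birds chirp while a dog barks nearby",
    "an animal howls outdoors",
]

print(distinct_n(generic, n=1))  # deterministic captions give a low ratio
print(distinct_n(varied, n=1))   # varied wording gives a higher ratio
```

A system that always emits the same caption repeats its n-grams, so the ratio falls as more captions are sampled, whereas varied captions keep it closer to 1.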

Files and links (1)

pdf

MeiLSPW_TASLP_2024 (1.46 MB)

Author's Accepted Manuscript CC BY V4.0, Open Access

Details

Title: Towards Generating Diverse Audio Captions Via Adversarial Training
Creators: Xinhao Mei - University of Surrey, School of Computer Science and Electronic Engineering
Xubo Liu - University of Surrey, School of Computer Science and Electronic Engineering
Jianyuan Sun - University of Surrey, School of Computer Science and Electronic Engineering
Mark D. Plumbley - University of Surrey, School of Computer Science and Electronic Engineering
Wenwu Wang - University of Surrey, School of Computer Science and Electronic Engineering
Publication Details: IEEE/ACM transactions on audio, speech, and language processing, Vol.32, pp.3311-3323
Publisher: IEEE
Date published: 21/06/2024
Date accepted: 04/06/2024
Grants: AI for Sound, EP/T019751/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
Automated Captioning of Image and Audio for Visually and Hearing Impaired, 623805725, British Council (United Kingdom, London)
Research Scholarship, 202006470010, China Scholarship Council (China, Beijing) - CSC
Grant note: Newton Institutional Links Award from the British Council
Identifiers: 99899164602346
Copyright: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Academic Unit: School of Computer Science and Electronic Engineering; Institute for Sustainability
Language: English
Resource Type: Journal article
