AUDIO CAPTIONING TRANSFORMER

Xinhao Mei; Xubo Liu; Qiushi Huang; Mark D. Plumbley; Wenwu Wang

doi:10.48550/arXiv.2107.09817

Conference proceeding

Open access

Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021)

Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021), 6th (Virtual, 15/11/2021–19/11/2021)

11/2021

DOI: https://doi.org/10.48550/arXiv.2107.09817

Keywords: Transformer; sequence-to-sequence model; cross-modal task; Audio captioning

Abstract

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.
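The abstract's claim that a Transformer can model global information across time frames can be illustrated with a minimal sketch of scaled dot-product self-attention. This is not the authors' ACT implementation: it assumes a single head, identity query/key/value projections, no positional encoding, and a hypothetical toy input of four two-dimensional frames. It only shows that every frame attends to every other frame in one step, regardless of how far apart they are in time.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention (assumed identity projections).

    frames: list of T feature vectors, each a list of d floats.
    Returns (outputs, weights); weights[i][j] is how much frame i attends to frame j.
    """
    d = len(frames[0])
    outputs, weights = [], []
    for query in frames:
        # Compare this frame with every frame in the clip, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in frames
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of all frames.
        outputs.append(
            [sum(w * frame[c] for w, frame in zip(row, frames)) for c in range(d)]
        )
    return outputs, weights


# Hypothetical toy clip: frames 0 and 3 carry the same feature pattern,
# separated in time by two different frames.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(frames)
```

In this toy clip, frame 0 gives frame 3 the same weight as itself and more weight than the adjacent frame 1, because the attention score depends on feature similarity rather than on temporal distance.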

Files and links (2)

pdf: camera_ready_ACT (353.88 kB), Author's Accepted Manuscript, Open Access

url: http://dcase.community/workshop2021/index (Event Website, Conference website)

Details

Title: AUDIO CAPTIONING TRANSFORMER
Creators: Xinhao Mei - University of Surrey, School of Computer Science and Electronic Engineering
Xubo Liu - University of Surrey, School of Computer Science and Electronic Engineering
Qiushi Huang - University of Surrey, Department of Computer Science
Mark D. Plumbley - University of Surrey, School of Computer Science and Electronic Engineering
Wenwu Wang - University of Surrey, School of Computer Science and Electronic Engineering
Publication Details: Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021)
Conference: Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021), 6th (Virtual, 15/11/2021–19/11/2021)
Publication Date: 11/2021
Date accepted for publication: 14/09/2021
Grants: AI for Sound, EP/T019751/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
Grant note: This work is partly supported by grant EP/T019751/1 from the Engineering and Physical Sciences Research Council (EPSRC), a Newton Institutional Links Award from the British Council, titled “Automated Captioning of Image and Audio for Visually and Hearing Impaired” (Grant number 623805725) and a Research Scholarship from the China Scholarship Council (CSC) No. 202006470010.
Identifiers: 99602919902346
Academic Unit: Department of Computer Science; School of Computer Science and Electronic Engineering
Language: English
Resource Type: Conference proceeding
