Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning

Jianyuan Sun; Xubo Liu; Xinhao Mei; Volkan Kılıç; Mark D. Plumbley; Wenwu Wang

doi:10.21437/interspeech.2023-943



Conference paper

Open access


Proceedings of the 24th Annual Conference of the International Speech Communication Association, INTERSPEECH (INTERSPEECH 2023), pp.4164-4168

International Speech Communication Association (ISCA)

INTERSPEECH 2023 (Dublin, Ireland, 20/08/2023 - 24/08/2023)

2023

DOI: https://doi.org/10.21437/interspeech.2023-943

Abstract

Keywords: fused feature, high-dimensional feature, dual transformer decoder, audio captioning, PANNs

Automated audio captioning (AAC) generates textual descriptions of audio content. Existing AAC models achieve good results but use only the high-dimensional representation of the encoder. Because high-dimensional representations carry a large amount of information, methods that rely on them alone tend to learn that information insufficiently. In this paper, a new encoder-decoder model called the Low- and High-Dimensional Feature Fusion (LHDFF) model is proposed. LHDFF uses a new PANNs encoder called Residual PANNs (RPANNs) to fuse low- and high-dimensional features. Low-dimensional features contain limited information about specific audio scenes. The fusion of low- and high-dimensional features can improve model performance by repeatedly emphasizing specific audio scene information. To fully exploit the fused features, LHDFF uses a dual transformer decoder structure to generate captions in parallel. Experimental results show that LHDFF outperforms existing audio captioning models.
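The abstract does not specify how the features are fused or how the two decoders' outputs are combined, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that (a) the low-dimensional features are linearly projected to the width of the high-dimensional features and added element-wise, in the spirit of a residual connection, and (b) the word distributions produced by two decoders are averaged. The function names `project`, `fuse`, `softmax` and `combine`, and all weights and shapes, are hypothetical.

```python
import math


def project(features, weights):
    """Linearly project each frame vector; `weights` holds one row per output channel."""
    return [[sum(x * w for x, w in zip(frame, row)) for row in weights]
            for frame in features]


def fuse(low, high, weights):
    """Assumed residual-style fusion: project the low-dimensional frames to the
    high-dimensional width, then add them element-wise to the high-dimensional frames."""
    projected = project(low, weights)
    return [[p + h for p, h in zip(p_frame, h_frame)]
            for p_frame, h_frame in zip(projected, high)]


def softmax(logits):
    """Numerically stable softmax over one vector of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine(logits_a, logits_b):
    """Assumed combination rule: average the word distributions of two decoders."""
    probs_a, probs_b = softmax(logits_a), softmax(logits_b)
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


# Toy example: one time frame, 2 low-dimensional and 3 high-dimensional channels.
low = [[1.0, 2.0]]
high = [[0.5, 0.5, 0.5]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical projection weights
fused = fuse(low, high, weights)

# Hypothetical next-word logits from two decoders over a 3-word vocabulary.
next_word_probs = combine([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

In this toy setup one decoder could attend to `fused` and the other to `high`; which inputs each decoder actually receives in LHDFF is described in the paper itself rather than in this record.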

Files and links (1)

pdf

Dual Transformer Decoder based Features Fusion Network - AAM (582.69 kB)

Author's Accepted Manuscript Open Access


Details

Title: Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning
Creators:
    Jianyuan Sun (Author) - University of Surrey, School of Computer Science and Electronic Engineering
    Xubo Liu (Author) - University of Surrey, School of Computer Science and Electronic Engineering
    Xinhao Mei (Author) - University of Surrey, School of Computer Science and Electronic Engineering
    Volkan Kılıç (Author) - Izmir Kâtip Çelebi University
    Mark D. Plumbley (Author) - University of Surrey, School of Computer Science and Electronic Engineering
    Wenwu Wang (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Publication Details: Proceedings of the 24th Annual Conference of the International Speech Communication Association, INTERSPEECH (INTERSPEECH 2023), pp.4164-4168
Conference: INTERSPEECH 2023 (Dublin, Ireland, 20/08/2023 - 24/08/2023)
Publisher: International Speech Communication Association (ISCA)
Date published: 2023
Grants:
    AI for Sound, EP/T019751/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
    Automated Captioning of Image and Audio for Visually and Hearing Impaired, 623805725, British Council (United Kingdom, London)
    120N995, TUBITAK BILGEM (Turkey, Gebze)
Grant note: This work is partly supported by a Newton Institutional Links Award from the British Council and the Scientific and Technological Research Council of Turkey (TUBITAK), titled “Automated Captioning of Image and Audio for Visually and Hearing Impaired” (Grant numbers 623805725 and 120N995), and by grant EP/T019751/1 from the Engineering and Physical Sciences Research Council (EPSRC).
Identifiers: 99909365702346
Academic Unit: School of Computer Science and Electronic Engineering; Institute for Sustainability
Language: English
Resource Type: Conference paper
