Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7

Yi Yuan; Haohe Liu; Xubo Liu; Xiyuan Kang; Mark Plumbley; Wenwu Wang

doi:10.48550/arxiv.2305.15905

Preprint

Open access

arXiv.org

Cornell University Library, arXiv.org

15/09/2023

DOI: https://doi.org/10.48550/arxiv.2305.15905

Keywords: Audio data; Coders; Large language models; Multimedia; Sound effects; Sound generation

Abstract

Foley sound provides the background sound for multimedia content, and generating Foley sound involves computationally modelling sound effects with specialized techniques. In this work, we propose a system for DCASE 2023 Challenge Task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, a diffusion-based text-to-audio generation model. To alleviate the data-hungry problem, the system is first trained on large-scale datasets and then adapted to this DCASE task via transfer learning. Through experiments, we found that the features extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by augmenting the input label with related text embedding features obtained from a pretrained language-audio model, i.e., contrastive language-audio pretraining (CLAP). In addition, we apply a filtering strategy to further refine the output, i.e., selecting the best results from the generated candidate clips according to the similarity score between the sound and the target labels. The overall system achieves a Fréchet audio distance (FAD) score of 4.765 on average across all seven classes, substantially outperforming the baseline system, which achieves a FAD score of 9.7.
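The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes that each candidate clip and the target label have already been mapped to embedding vectors in a shared space (for example by a CLAP-style model), and that cosine similarity is the similarity score; the abstract does not specify either detail. The function names `cosine_similarity` and `select_best_clips` and the `keep` parameter are hypothetical and are not taken from the authors' system.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_clips(
    candidate_embeddings: Sequence[Sequence[float]],
    label_embedding: Sequence[float],
    keep: int,
) -> List[int]:
    """Return the indices of the `keep` candidates most similar to the label.

    Assumption: the embeddings come from some audio/text encoder pair that
    shares one embedding space; this sketch only performs the ranking.
    """
    scored = [
        (cosine_similarity(embedding, label_embedding), index)
        for index, embedding in enumerate(candidate_embeddings)
    ]
    # Highest similarity first; the sort is stable, so ties keep generation order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:keep]]


# Toy usage with made-up 2-D embeddings (not real model outputs).
candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
label = [1.0, 0.0]
best = select_best_clips(candidates, label, keep=2)
```

In this toy case the two candidates pointing closest to the label direction are kept and the orthogonal one is discarded. In practice one would generate more candidates than needed per class and retain only the top-scoring clips.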

Files and links (1)

https://arxiv.org/pdf/2305.15905.pdf

Preprint (Author's original), CC BY 4.0, Open access

Details

Title: Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7
Creators: Yi Yuan - University of Surrey, Department of Computer Science
Haohe Liu - University of Surrey, Department of Computer Science
Xubo Liu - University of Surrey, Department of Computer Science
Xiyuan Kang - University of Surrey, Department of Computer Science
Mark Plumbley - University of Surrey, Department of Computer Science
Wenwu Wang - University of Surrey, Department of Computer Science
Publication Details: arXiv.org
Publisher: Cornell University Library, arXiv.org; Ithaca
Date published: 15/09/2023
Grants: AI for Sound, EP/T019751/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
PhD Scholarship, PhD_CVSSP, University of Surrey (United Kingdom, Guildford)
Research Scholarship - CSC, 202208060240, China Scholarship Council (China, Beijing) - CSC
Grant note: This research was partly supported by a research scholarship from the China Scholarship Council (CSC) No.202208060240, the British Broadcasting Corporation Research and Development (BBC R&D), Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 “AI for Sound”, and a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey
Identifiers: 99822193702346
Copyright: For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.
Academic Unit: School of Computer Science and Electronic Engineering
Language: English
Resource Type: Preprint
