AudioLDM 2: Learning holistic audio generation with self-supervised pretraining

Haohe Liu; Yi Yuan; Xubo Liu; Xinhao Mei; Qiuqiang Kong; Qiao Tian; Yuping Wang; Wenwu Wang; Yuxuan Wang; Mark D. Plumbley

doi:10.1109/TASLP.2024.3399607


Journal article

Open access

Peer reviewed


IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.32, pp.2871-2883

13/05/2024

DOI: https://doi.org/10.1109/TASLP.2024.3399607

Keywords: audio generation; diffusion model; self-supervised learning; speech synthesis; AIGC; Acoustics; Computer Science; Engineering

Abstract

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a holistic framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework utilizes a general representation of audio, called “language of audio” (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate other modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on the LOA of audio in our training set. The proposed framework naturally brings advantages such as reusable self-supervised pretrained latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech with three AudioLDM 2 variants demonstrate competitive performance of the AudioLDM 2 framework against previous approaches.
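The two paths described in the abstract (audio to LOA during training, other modalities to LOA during generation) can be illustrated with a minimal data-flow sketch. Everything in it is a hypothetical placeholder: `audio_to_loa` stands in for AudioMAE, `text_to_loa` for the GPT-2 translator, and `generate_latent` for the conditioned latent diffusion model. None of these functions implement the real models, and the feature size `DIM` is chosen only for illustration.

```python
import random
from typing import List

Vector = List[float]
LOA = List[Vector]  # "language of audio": a sequence of feature vectors

DIM = 2  # hypothetical LOA feature size, chosen only for illustration


def audio_to_loa(waveform: List[float], frame: int = 4) -> LOA:
    """Placeholder for the AudioMAE encoder: one [mean, peak] vector per frame."""
    frames = [waveform[i:i + frame] for i in range(0, len(waveform), frame)]
    return [[sum(f) / len(f), max(abs(x) for x in f)] for f in frames]


def text_to_loa(prompt: str, length: int) -> LOA:
    """Placeholder for the GPT-2 translator: repeatable pseudo-features per prompt."""
    rng = random.Random(prompt)  # seeding by the prompt keeps the mapping deterministic
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(length)]


def generate_latent(loa: LOA, steps: int = 10, seed: int = 0) -> LOA:
    """Placeholder for the conditioned generator: start from noise, move toward the condition.

    This is NOT a diffusion model; it only mimics the interface (condition in, latent out).
    """
    rng = random.Random(seed)
    latent = [[rng.gauss(0, 1) for _ in vec] for vec in loa]
    for _ in range(steps):
        latent = [[z + 0.5 * (c - z) for z, c in zip(zv, cv)]
                  for zv, cv in zip(latent, loa)]
    return latent


# Training path: the condition is computed from the audio itself, so no labels are needed.
waveform = [0.0, 0.5, -0.5, 0.25, 0.1, -0.1, 0.2, -0.2]
train_condition = audio_to_loa(waveform)

# Generation path: the condition is predicted from another modality (here, text).
infer_condition = text_to_loa("dog barking", length=len(train_condition))
latent = generate_latent(infer_condition)
```

The point of the sketch is the shared interface: because the generator only ever sees LOA, the same generator can in principle be reused whether the condition came from audio or from text, which is the reuse advantage the abstract mentions.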

Files and links (2)

pdf: Author's Accepted Manuscript (Open Access)

url: https://doi.org/10.1109/TASLP.2024.3399607 (Published, Version of record)


Details

Title: AudioLDM 2: Learning holistic audio generation with self-supervised pretraining
Creators: Haohe Liu (Corresponding Author) - University of Surrey, School of Computer Science and Electronic Engineering
Yi Yuan (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Xubo Liu (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Xinhao Mei (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Qiuqiang Kong (Author) - Chinese University of Hong Kong
Qiao Tian (Author) - Speech, Audio & Music Intelligence (SAMI), ByteDance
Yuping Wang (Author) - Speech, Audio & Music Intelligence (SAMI), ByteDance
Wenwu Wang (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Yuxuan Wang (Author) - Speech, Audio & Music Intelligence (SAMI), ByteDance
Mark D. Plumbley (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Publication Details: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.32, pp.2871-2883
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Publication Date: 13/05/2024
Date accepted for publication: 24/04/2024
Grants: AI for Sound, EP/T019751/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
Grant note: This work was supported in part by the British Broadcasting Corporation Research and Development (BBC R&D), in part by the Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/T019751/1 “AI for Sound”, and in part by a Ph.D. Scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), Faculty of Engineering and Physical Science (FEPS), University of Surrey.
Identifiers: 99891066502346; WOS:001236637800007
Academic Unit: School of Computer Science and Electronic Engineering
Resource Type: Journal article
