DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Yi Yuan; Xubo Liu; Haohe Liu; Xiyuan Kang; Zhuo Chen; Yuxuan Wang; Mark D. Plumbley; Wenwu Wang

doi:10.1109/TASLPRO.2026.3678172

Journal article

Open access

Peer reviewed

IEEE Transactions on Audio, Speech and Language Processing, Vol.34, pp.2421-2435

26/03/2026

DOI: https://doi.org/10.1109/TASLPRO.2026.3678172

Abstract

AIGC

audio generation

Benchmark testing

customized generation

diffusion model

Diffusion models

Dogs

Electronic mail

Feature extraction

Generators

Pipelines

retrieval augmentation

Timing

Training

Semantics

With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of controlling fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it difficult to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the proposed systems. The experiments show that DreamAudio generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.

Files and links (1)

pdf

Yuan et al_TASLP_2026 (3.80 MB)

Author's Accepted Manuscript Open Access CC BY V4.0

Metrics

1 Record Views

Details

Title: DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
Creators: Yi Yuan - School of Computer Science and Electronic Engineering, University of Surrey, Guildford, U.K
Xubo Liu - School of Computer Science and Electronic Engineering, University of Surrey, Guildford, U.K
Haohe Liu - University of Surrey
Xiyuan Kang - School of Computer Science and Electronic Engineering, University of Surrey, Guildford, U.K
Zhuo Chen - Seed Group, ByteDance Inc
Yuxuan Wang - Seed Group, ByteDance Inc
Mark D. Plumbley - King's College London
Wenwu Wang - School of Computer Science and Electronic Engineering, University of Surrey, Guildford, U.K
Publication Details: IEEE Transactions on Audio, Speech and Language Processing, Vol.34, pp.2421-2435
Publisher: IEEE
Number of pages: 14
Publication Date: 26/03/2026
Date accepted for publication: 20/03/2026
Grants: AI for Sound, EP/T019751/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
EP/Y028805/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
Grant note: Research scholarship from the China Scholarship Council; British Broadcasting Corporation Research and Development; Research England “Games and Innovation Nexus” programme; Engineering and Physical Sciences Research Council (Grant Numbers: EP/T019751/1 and EP/Y028805/1); Centre for Vision, Speech and Signal Processing, University of Surrey
Identifiers: 991123333202346; WOS:001754876800001
Academic Unit: School of Computer Science & Electronic Engineering
Language: English
Resource Type: Journal article
