DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
Journal article   Open access   Peer reviewed

Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley and Wenwu Wang
IEEE Transactions on Audio, Speech and Language Processing, Vol.34, pp.2421-2435
26/03/2026

Abstract

Keywords: AIGC; audio generation; Benchmark testing; customized generation; diffusion model; Diffusion models; Dogs; Electronic mail; Feature extraction; Generators; Pipelines; retrieval argumentation; Timing; Training; Semantics
With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of controlling fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it difficult to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the proposed systems. The experiments show that DreamAudio generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.
Author's Accepted Manuscript Open Access CC BY V4.0
