Gen-SER: When the Generative Model Meets Speech Emotion Recognition

Taihui Wang; Jinzheng Zhao; Rilin Chen; Tong Lei; Wenwu Wang; Dong Yu

doi:10.48550/arXiv.2601.20573

Conference paper

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)

Institute of Electrical and Electronics Engineers (IEEE)

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026) (Barcelona, Spain, 04/05/2026–08/05/2026)

17/01/2026

DOI: https://doi.org/10.48550/arXiv.2601.20573

Keywords: distribution transport; generative model; target matching; speech emotion recognition

Abstract

Speech emotion recognition (SER) is crucial in speech understanding and generation. Most existing approaches rely on either classification models or large language models. Unlike these methods, we propose Gen-SER, a novel approach that reformulates SER as a distribution shift problem solved with generative models. We project discrete class labels into a continuous space and obtain the terminal distribution via sinusoidal taxonomy encoding. A target-matching-based generative model then transforms the initial distribution into the terminal distribution efficiently. Classification is performed by computing the similarity between the generated terminal distribution and the ground-truth terminal distribution. Experimental results confirm the efficacy of the proposed method, demonstrate its extensibility to various speech-understanding tasks, and suggest its potential applicability to a broader range of classification tasks.
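The abstract does not give the exact form of the sinusoidal taxonomy encoding or of the similarity measure, so the following is only a minimal sketch under stated assumptions: class indices are encoded with a Transformer-style sinusoidal code, similarity is cosine similarity, and the output of the generative model is replaced by a perturbed target vector. All names (`sinusoidal_code`, `classify`, `LABELS`, `TARGETS`) and the label set are hypothetical and are not taken from the paper.

```python
import math


def sinusoidal_code(index, dim=16, base=10000.0):
    """Map an integer class index to a dim-dimensional sinusoidal vector.

    Assumed form (positional-encoding style); the paper's encoding may differ.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    vec = []
    for k in range(dim // 2):
        freq = 1.0 / (base ** (2 * k / dim))
        vec.append(math.sin(index * freq))
        vec.append(math.cos(index * freq))
    return vec


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(generated, targets):
    """Return the label whose target vector is most similar to `generated`."""
    return max(targets, key=lambda label: cosine_similarity(generated, targets[label]))


# Hypothetical label set; each label gets a continuous target vector.
LABELS = ["angry", "happy", "neutral", "sad"]
TARGETS = {label: sinusoidal_code(i + 1) for i, label in enumerate(LABELS)}

# Stand-in for the generative model's output: the "happy" target plus a
# small deterministic perturbation (an assumption made for illustration).
noisy = [v + 0.05 * ((-1) ** j) for j, v in enumerate(TARGETS["happy"])]
predicted = classify(noisy, TARGETS)
print(predicted)
```

In this toy setting the class decision reduces to a nearest-target lookup in the continuous label space, which is one plausible reading of "calculating the similarity" in the abstract.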

Files and links (2)

pdf

GEN-SER (482.74 kB)

Author's Accepted Manuscript. Restricted: access may be granted on request. This file will be open access upon publication. CC BY V4.0

url

https://2026.ieeeicassp.org/View

Event Website: Conference website

Details

Title: Gen-SER: When the Generative Model Meets Speech Emotion Recognition
Creators: Taihui Wang (Author) - Tencent (China)
Jinzheng Zhao (Author) - Tencent (China)
Rilin Chen (Author) - Tencent (China)
Tong Lei (Author) - Tencent (China)
Wenwu Wang (Author) - University of Surrey, School of Computer Science & Electronic Engineering
Dong Yu (Author) - Tencent AI Lab, Bellevue, WA 98004, USA
Publication Details: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)
Conference: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026) (Barcelona, Spain, 04/05/2026–08/05/2026)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Date accepted for publication: 17/01/2026
Identifiers: 991110395302346
Copyright: © Copyright 2026 IEEE – All rights reserved. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
Academic Unit: School of Computer Science & Electronic Engineering
Language: English
Resource Type: Conference paper
