TEACHER-GUIDED PSEUDO SUPERVISION AND CROSS-MODAL ALIGNMENT FOR AUDIO-VISUAL VIDEO PARSING

Yaru Chen; Ruohao Guo; Liting Gao; Yang Xiang; Qingyu Luo; Zhenbo Li; Wenwu Wang

Conference proceeding

2026 IEEE International Conference on Acoustics, Speech, and Signal Processing

2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (Barcelona, Spain, 04/05/2026–08/05/2026)

16/01/2026

Abstract

Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but has neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment–class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on the LLP and UnAV-100 datasets show that our method achieves state-of-the-art (SOTA) performance across multiple metrics.
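
The two strategies named in the abstract can be illustrated with a minimal, dependency-free sketch. The abstract does not give formulas, so everything here is an assumption made for illustration only: the function names (`ema_update`, `topk_mask`, `agreement_loss`), the decay value, the choice of top-k over adaptive thresholds, and the use of cosine distance as the agreement measure are hypothetical and may differ from the authors' method.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight toward the matching student weight (EMA).

    Assumption: weights are flat lists of floats; decay is illustrative.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def topk_mask(probs, video_labels, k=2):
    """Build a segment-level pseudo mask from teacher probabilities.

    probs[t][c] is the teacher probability for segment t and class c.
    For each class present in the video-level label, the k highest-scoring
    segments are marked 1; everything else stays 0.
    """
    num_segments, num_classes = len(probs), len(video_labels)
    mask = [[0] * num_classes for _ in range(num_segments)]
    for c in range(num_classes):
        if not video_labels[c]:
            continue  # class absent at video level: no pseudo labels
        ranked = sorted(range(num_segments), key=lambda t: probs[t][c], reverse=True)
        for t in ranked[:k]:
            mask[t][c] = 1
    return mask


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def agreement_loss(audio, visual, mask):
    """Mean cosine distance between audio and visual embeddings,
    computed only at (segment, class) pairs selected by the mask.

    Assumption: audio[t][c] and visual[t][c] are class-aware embedding vectors.
    """
    terms = [
        1.0 - cosine(audio[t][c], visual[t][c])
        for t in range(len(mask))
        for c in range(len(mask[t]))
        if mask[t][c]
    ]
    return sum(terms) / len(terms) if terms else 0.0
```

In this sketch the teacher is never trained directly; it only tracks the student through `ema_update`, which is what makes its segment-level predictions a comparatively stable source for the mask. Restricting `agreement_loss` to masked pairs mirrors the abstract's phrase "reliable segment–class pairs".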

Files and links (1)

pdf

TEACHER_GUIDED_PSEUDO_SUPERVISION_AND_CROSS_MODAL_ALIGNMENT_FOR_AUDIO_VISUAL_VIDEO_PARSING_cr (6.57 MB)

Author's Accepted Manuscript CC BY V4.0, Embargoed Access, Embargo ends: 04/05/2026

Details

Title: TEACHER-GUIDED PSEUDO SUPERVISION AND CROSS-MODAL ALIGNMENT FOR AUDIO-VISUAL VIDEO PARSING
Creators: Yaru Chen - University of Surrey, School of Computer Science & Electronic Engineering
Ruohao Guo - Peking University
Liting Gao - University of Surrey, School of Computer Science & Electronic Engineering
Yang Xiang - University of Surrey, School of Computer Science & Electronic Engineering
Qingyu Luo - University of Surrey, School of Computer Science & Electronic Engineering
Zhenbo Li - China Agricultural University
Wenwu Wang - University of Surrey, School of Computer Science & Electronic Engineering
Publication Details: 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing
Conference: 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (Barcelona, Spain, 04/05/2026–08/05/2026)
Publisher: IEEE
Date accepted for publication: 16/01/2026
Grant note: This work was partly supported by a research scholarship from the China Scholarship Council (CSC).
Identifiers: 991104995302346
Academic Unit: School of Computer Science & Electronic Engineering
Resource Type: Conference proceeding
