TEACHER-GUIDED PSEUDO SUPERVISION AND CROSS-MODAL ALIGNMENT FOR AUDIO-VISUAL VIDEO PARSING
Conference proceeding

Yaru Chen, Ruohao Guo, Liting Gao, Yang Xiang, Qingyu Luo, Zhenbo Li and Wenwu Wang
2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (Barcelona, Spain, 04/05/2026–08/05/2026)
16/01/2026

Abstract

Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but has neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment–class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on the LLP and UnAV-100 datasets show that our method achieves state-of-the-art (SOTA) performance across multiple metrics.
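
The two strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the abstract gives no formulas, so everything here is an assumption made for illustration: the function names (ema_update, pseudo_masks, agreement_loss), the decay of 0.99, the fixed threshold standing in for an adaptive one, the per-class top-k over segments, the (segments, classes, dim) embedding layout, and the use of cosine distance as the agreement measure.

```python
import numpy as np


def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter toward the student with one EMA step.

    teacher, student: dicts mapping parameter names to arrays (hypothetical layout).
    """
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


def pseudo_masks(teacher_probs, video_labels, threshold=0.5, top_k=None):
    """Build a boolean (segments, classes) mask of segment-class pairs to trust.

    teacher_probs: (T, C) teacher probabilities in [0, 1].
    video_labels:  (C,) video-level labels in {0, 1}.
    threshold:     scalar or (C,) array; an adaptive rule would compute it per class.
    top_k:         if given, keep the top_k highest-scoring segments per class instead.
    """
    probs = teacher_probs * video_labels  # ignore classes absent at video level
    if top_k is None:
        mask = probs >= threshold
    else:
        mask = np.zeros(probs.shape, dtype=bool)
        top_idx = np.argsort(-probs, axis=0)[:top_k]  # (top_k, C) segment indices
        np.put_along_axis(mask, top_idx, True, axis=0)
    # Never mark a pair for a class the video-level label rules out.
    return mask & video_labels.astype(bool)


def agreement_loss(audio_emb, visual_emb, mask, eps=1e-8):
    """Mean cosine distance between audio and visual embeddings at masked pairs.

    audio_emb, visual_emb: (T, C, D) class-aware embeddings (assumed layout).
    mask: (T, C) boolean mask of reliable segment-class pairs.
    """
    if not mask.any():
        return 0.0
    dot = np.sum(audio_emb * visual_emb, axis=-1)
    norms = np.linalg.norm(audio_emb, axis=-1) * np.linalg.norm(visual_emb, axis=-1)
    cos = dot / (norms + eps)
    return float(np.mean(1.0 - cos[mask]))
```

In this sketch the teacher is never trained directly: it only tracks the student through ema_update, which is what makes its segment-level predictions a comparatively stable source for the masks. The same mask then restricts the agreement term to pairs the teacher considers reliable, so unreliable segments do not pull the two modalities together.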
PDF: TEACHER_GUIDED_PSEUDO_SUPERVISION_AND_CROSS_MODAL_ALIGNMENT_FOR_AUDIO_VISUAL_VIDEO_PARSING_cr (6.57 MB)
Author's Accepted Manuscript, CC BY V4.0, Embargoed Access, Embargo ends: 04/05/2026
