CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing

Yaru Chen; Ruohao Guo; Xubo Liu; Peipei Wu; Guangyao Li; Zhenbo Li; Wenwu Wang

doi:10.48550/arxiv.2310.07517

Preprint

Open access

arXiv.org

Cornell University Library, arXiv.org

11/10/2023

DOI: https://doi.org/10.48550/arxiv.2310.07517

Abstract

Perception

Segments

Visual signals

Semantics

Audio-visual video parsing is the task of categorizing a video at the segment level with weak labels, and predicting the events in it as audible or visible. Recent methods for this task leverage the attention mechanism to capture semantic correlations across the whole video and between the audio and visual modalities. However, these approaches overlook the importance of individual segments within a video and the relationships among them, and tend to rely on a single modality when learning features. In this paper, we propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which learns fine-grained features by applying a segment-based attention module. Furthermore, a cross-modal aggregation block is introduced to jointly optimize the semantic representations of audio and visual signals by enhancing inter-modal interactions. The experimental results show that our model offers improved parsing performance on the Look, Listen, and Parse dataset compared to other methods.

Files and links (1)

url

https://arxiv.org/pdf/2310.07517.pdf

Preprint (Author's original) Open

Details

Title: CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing
Creators: Yaru Chen - University of Surrey, Centre for Vision, Speech & Signal Processing (CVSSP)
Ruohao Guo - Peking University
Xubo Liu - University of Surrey, Centre for Vision, Speech & Signal Processing (CVSSP)
Peipei Wu - University of Surrey, Centre for Vision, Speech & Signal Processing (CVSSP)
Guangyao Li - Renmin University of China
Zhenbo Li - China Agricultural University
Wenwu Wang - University of Surrey, Centre for Vision, Speech & Signal Processing (CVSSP)
Publication Details: arXiv.org
Publisher: Cornell University Library, arXiv.org; Ithaca
Date published: 11/10/2023
Identifiers: 99822194302346
Academic Unit: School of Computer Science and Electronic Engineering
Language: English
Resource Type: Preprint
