Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Conference paper   Open access

Davide Berghi and Philip J B Jackson
Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025)
Proceedings - DCASE, DCASE
10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025) (Barcelona, Spain, 29/10/2025–31/10/2025)
2025

Abstract

Keywords: Sound Event Localization and Detection; Stereo Sounds; Audio-Visual; Machine Learning; Multimodal Localization; Audio Understanding
In this study, we address the multimodal task of stereo sound event localization and detection with source distance estimation (3D SELD) in regular video content. 3D SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD approaches typically rely on multichannel input, limiting their capacity to benefit from large-scale pre-training due to data constraints. To overcome this, we enhance a standard SELD architecture with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. We perform an ablation study on the development set of the DCASE2025 Task3 Stereo SELD Dataset to assess the individual contributions of the language-aligned models and benchmark against the DCASE Task 3 baseline systems. Additionally, we detail the curation process of large synthetic audio and audiovisual datasets used for model pre-training. These datasets were further expanded through left-right channel swapping augmentation. Our approach, combining extensive pre-training, model ensembling, and visual post-processing, ranked second in the DCASE 2025 Challenge Task 3 (Track B), underscoring the effectiveness of our method. Future work will explore modality-specific contributions and architectural refinements.
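The left-right channel swapping augmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Event` type, the `swap_left_right` function, and the azimuth sign convention (positive to the left, negative to the right) are assumptions made for illustration only. The idea is that exchanging the two stereo channels mirrors the scene, so each event's azimuth label has to be negated while its class and distance stay unchanged.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical label record for one sound event (not the dataset's format)."""
    label: str
    azimuth_deg: float  # assumed convention: positive = left, negative = right
    distance_m: float


def swap_left_right(
    audio: List[Tuple[float, float]],
    events: List[Event],
) -> Tuple[List[Tuple[float, float]], List[Event]]:
    """Return a mirrored copy of a stereo clip and its labels.

    `audio` is a list of (left, right) sample pairs. Swapping the channels
    mirrors the scene horizontally, so every azimuth is negated; class and
    distance are unaffected. For audiovisual data the video frames would
    presumably need a horizontal flip as well, which is omitted here.
    """
    swapped_audio = [(right, left) for (left, right) in audio]
    swapped_events = [
        Event(e.label, -e.azimuth_deg, e.distance_m) for e in events
    ]
    return swapped_audio, swapped_events


if __name__ == "__main__":
    clip = [(0.1, 0.2), (0.3, 0.4)]
    labels = [Event("speech", 30.0, 2.0)]
    print(swap_left_right(clip, labels))
```

Applying the function twice returns the original clip and labels, which is a quick sanity check that the mirroring is consistent.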
PDF: DCASE2025_Workshop_CameraReady (5.00 MB)
Author's Accepted Manuscript Open Access CC BY V4.0
Conference website: https://dcase.community/workshop2025/index
