Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos

Davide Berghi; Philip J B Jackson

doi:10.48550/arXiv.2509.06598

Conference paper

Open access

Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025)

Proceedings - DCASE, DCASE

10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025) (Barcelona, Spain, 29/10/2025–31/10/2025)

2025

Keywords: Sound Event Localization and Detection; Stereo Sounds; Audio-Visual Machine Learning; Multimodal Localization; Audio Understanding

Abstract

In this study, we address the multimodal task of stereo sound event localization and detection with source distance estimation (3D SELD) in regular video content. 3D SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD approaches typically rely on multichannel input, limiting their capacity to benefit from large-scale pre-training due to data constraints. To overcome this, we enhance a standard SELD architecture with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. We perform an ablation study on the development set of the DCASE2025 Task3 Stereo SELD Dataset to assess the individual contributions of the language-aligned models and benchmark against the DCASE Task 3 baseline systems. Additionally, we detail the curation process of large synthetic audio and audiovisual datasets used for model pre-training. These datasets were further expanded through left-right channel swapping augmentation. Our approach, combining extensive pre-training, model ensembling, and visual post-processing, ranked second in the DCASE 2025 Challenge Task 3 (Track B), underscoring the effectiveness of our method. Future work will explore the modality-specific contributions and architectural refinements.
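
The left-right channel swapping augmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Event` label layout, the `swap_left_right` helper, and the sign convention for azimuth are all assumptions made for illustration. The idea is that exchanging the two channels mirrors the scene about the median plane, so each azimuth label changes sign while class and distance stay the same. For audiovisual data, the video frames would presumably also need a horizontal flip, which is not shown here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """One labelled sound event (hypothetical label layout, for illustration)."""
    class_id: int
    azimuth_deg: float  # assumed convention: positive = left of centre
    distance_m: float


def swap_left_right(
    left: List[float], right: List[float], events: List[Event]
) -> Tuple[List[float], List[float], List[Event]]:
    """Return a mirrored copy of a stereo clip and its labels."""
    if len(left) != len(right):
        raise ValueError("channels must have equal length")
    # Swapping the channels mirrors the scene about the median plane,
    # so each azimuth changes sign; class and distance are unchanged.
    mirrored = [Event(e.class_id, -e.azimuth_deg, e.distance_m) for e in events]
    return list(right), list(left), mirrored


# Toy clip: three samples per channel and one event 30 degrees to one side.
left = [0.5, 0.4, 0.3]
right = [0.1, 0.1, 0.1]
events = [Event(class_id=2, azimuth_deg=30.0, distance_m=1.5)]
new_left, new_right, new_events = swap_left_right(left, right, events)
```

Applying the function twice returns the original clip and labels, which is a quick sanity check that audio and annotations are transformed consistently.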

Files and links (2)

pdf: DCASE2025_Workshop_CameraReady (5.00 MB). Author's Accepted Manuscript, Open Access, CC BY V4.0

url: https://dcase.community/workshop2025/index (Event Website, Conference website)

Details

Title: Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Creators: Davide Berghi (Author) - University of Surrey, School of Computer Science & Electronic Engineering
Philip J B Jackson (Author) - University of Surrey, School of Computer Science & Electronic Engineering
Publication Details: Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025)
Conference: 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025) (Barcelona, Spain, 29/10/2025–31/10/2025)
Series: Proceedings - DCASE
Publisher: DCASE
Publication Date: 2025
Date accepted for publication: 05/09/2025
Grants: BBC Prosperity Partnership: Future Personalised Object-Based Media Experiences Delivered at Scale Anywhere, EP/V038087/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
Grant note: This research was funded by EPSRC-BBC Prosperity Partnership ‘AI4ME: Future personalised object-based media experiences delivered at scale anywhere’ (EP/V038087/1).
Identifiers: 991029966002346
Copyright: For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.
Academic Unit: School of Computer Science & Electronic Engineering
Language: English
Resource Type: Conference paper
Data Access Statement: Data supporting this study is available from https://zenodo.org/records/15559774
