Leveraging Foundation models for Unsupervised Audio-Visual Segmentation

Swapnil Bhosale; Haosen Yang; Diptesh Kanojia; Xiatian Zhu

doi:10.48550/arxiv.2309.06728

arXiv.org

Cornell University Library, arXiv.org

13/09/2023

Keywords: Annotations; Audio data; Pixels; Segmentation; Supervised learning; Training

Abstract

Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level. Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion. This limits their scalability, since acquiring such cross-modality pixel-level labels is time-consuming and tedious. To overcome this obstacle, in this work we introduce unsupervised audio-visual segmentation, which needs neither task-specific data annotations nor model training. To tackle this newly proposed problem, we formulate a novel Cross-Modality Semantic Filtering (CMSF) approach that accurately associates the underlying audio-mask pairs by leveraging off-the-shelf multi-modal foundation models (e.g., detection [1], open-world segmentation [2] and multi-modal alignment [3]). Guiding the proposal generation by either audio or visual cues, we design two training-free variants: AT-GDINO-SAM and OWOD-BIND. Extensive experiments on the AVS-Bench dataset show that our unsupervised approach performs well in comparison to prior supervised counterparts across complex scenarios with multiple auditory objects. In particular, in situations where existing supervised AVS methods struggle with overlapping foreground objects, our models still excel at accurately segmenting overlapped auditory objects. Our code will be publicly released.
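
The abstract does not spell out how audio-mask pairs are associated. As a minimal sketch only, assuming that the audio clip and each mask proposal have already been embedded into a shared space by some multi-modal alignment model, cross-modality filtering could be as simple as a cosine-similarity threshold. The function names, the toy embeddings, and the 0.5 threshold below are hypothetical illustrations, not taken from the paper or its code.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_proposals(
    audio_emb: Sequence[float],
    proposal_embs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> List[int]:
    """Return indices of mask proposals whose embedding aligns with the audio."""
    return [
        i
        for i, emb in enumerate(proposal_embs)
        if cosine_similarity(audio_emb, emb) >= threshold
    ]


# Toy stand-ins for embeddings that a real alignment model would produce.
audio_embedding = [1.0, 0.0, 0.0]
proposal_embeddings = [
    [0.9, 0.1, 0.0],  # close to the audio direction
    [0.0, 1.0, 0.0],  # orthogonal to the audio direction
    [0.7, 0.0, 0.7],  # partially aligned
]

kept = filter_proposals(audio_embedding, proposal_embeddings)
print(kept)
```

In this toy setup, the proposals that survive the filter would be treated as the sounding objects; raising the threshold keeps only the most strongly aligned proposal.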

Details

Title: Leveraging Foundation models for Unsupervised Audio-Visual Segmentation
Creators: Swapnil Bhosale - University of Surrey, School of Computer Science and Electronic Engineering
Haosen Yang - University of Surrey, School of Computer Science and Electronic Engineering
Diptesh Kanojia - University of Surrey, School of Computer Science and Electronic Engineering
Xiatian Zhu - University of Surrey, School of Computer Science and Electronic Engineering
Publication Details: arXiv.org
Publisher: Cornell University Library, arXiv.org; Ithaca
Date published: 13/09/2023
Identifiers: 99822196002346
Academic Unit: School of Computer Science and Electronic Engineering
Language: English
Resource Type: Other
