Abstract
Audio-visual segmentation (AVS) is an emerging task that aims to accurately
segment sounding objects based on audio-visual cues. The success of AVS
learning systems depends on the effectiveness of cross-modal interaction. Such
a requirement can be naturally fulfilled by a transformer-based
segmentation architecture, owing to its inherent ability to capture long-range
dependencies and its flexibility in handling different modalities. However, the
inherent training issues of transformer-based methods, such as the low efficacy
of cross-attention and unstable bipartite matching, can be amplified in AVS,
particularly when the learned audio query does not provide a clear semantic
clue. In this paper, we address these two issues with the new Class-conditional
Prompting Machine (CPM). CPM improves bipartite matching with a learning
strategy that combines class-agnostic queries and class-conditional queries. The
efficacy of cross-modal attention is enhanced with new learning objectives for
the audio, visual, and joint modalities. We conduct experiments on AVS
benchmarks, demonstrating that our method achieves state-of-the-art (SOTA)
segmentation accuracy.