Abstract
Audio-visual segmentation (AVS) is a challenging task that involves accurately
segmenting sounding objects based on audio-visual queries.
Successful audio-visual learning requires two essential components: 1) an
unbiased dataset with high-quality pixel-level multi-class labels, and 2) a
model capable of effectively linking audio information with its corresponding
visual object. However, current methods only partially address these two
requirements: training sets contain biased audio-visual data, and models
generalise poorly beyond this biased training set. In this work, we
propose a new strategy to build cost-effective and relatively unbiased
audio-visual semantic segmentation benchmarks. Our strategy, called Visual
Post-production (VPO), builds on the observation that explicit audio-visual
pairs extracted from a single video source are not necessary to build such
benchmarks. We also refine the previously proposed AVSBench dataset to transform
it into the audio-visual semantic segmentation benchmark AVSBench-Single+.
Furthermore, this paper introduces a new pixel-wise audio-visual contrastive
learning method to enable better model generalisation beyond the training set.
We verify the validity of the VPO strategy by showing that state-of-the-art
(SOTA) models achieve almost the same accuracy whether trained on datasets
built by matching audio and visual data from different sources or on datasets
containing audio and visual data from the same video source. Then,
using the proposed VPO benchmarks and AVSBench-Single+, we show that our method
produces more accurate audio-visual semantic segmentation than SOTA models.
Code and datasets will be made available.