Abstract
In multi-modal learning, some modalities are more influential than others,
and their absence can significantly degrade classification or segmentation
accuracy. Hence, an important research question is whether trained
multi-modal models can maintain high accuracy even when influential
modalities are absent from the input data. In this paper, we propose a novel
approach called Meta-learned Cross-modal Knowledge Distillation (MCKD) to
address this research question. MCKD adaptively estimates the importance weight
of each modality through a meta-learning process. These dynamically learned
modality importance weights are used in a pairwise cross-modal knowledge
distillation process to transfer knowledge from modalities with higher
importance weights to those with lower weights. This cross-modal knowledge
distillation produces a highly accurate model even in the absence of
influential modalities. Unlike previous methods in the field, our approach is
designed to work across multiple tasks (e.g., segmentation and classification)
with minimal adaptation. Experimental results on the Brain Tumor Segmentation
2018 dataset (BraTS2018) and the Audiovision-MNIST
classification dataset demonstrate the superiority of MCKD over current
state-of-the-art models. In particular, on BraTS2018, we achieve substantial
improvements in average segmentation Dice score of 3.51\% for enhancing tumor,
2.19\% for tumor core, and 1.14\% for whole tumor.
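
For intuition only, the following is a minimal sketch of how meta-learned importance weights could enter a pairwise cross-modal distillation objective; the logits $z_i$, temperature $\tau$, and the weighting scheme $\max(w_i - w_j, 0)$ are illustrative assumptions rather than the exact formulation used in the paper:
\[
\mathcal{L}_{\mathrm{KD}} \;=\; \sum_{i \neq j} \max\!\big(w_i - w_j,\, 0\big)\;
\tau^{2}\,
\mathrm{KL}\!\Big(\mathrm{softmax}\!\big(z_i/\tau\big)\,\Big\|\,\mathrm{softmax}\!\big(z_j/\tau\big)\Big),
\]
where $w_i$ denotes the meta-learned importance weight of modality $i$, $z_i$ its predicted logits, and $\tau$ a distillation temperature. Under this assumed form, knowledge flows only from more important modalities to less important ones, and gradients would be taken only with respect to the lower-weighted (student) branch $j$.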