Abstract
The efficacy of deep learning-based Computer-Aided Diagnosis (CAD) methods
for skin diseases relies on analyzing multiple data modalities (i.e.,
clinical and dermoscopic images, and patient metadata) and addressing the
challenges of multi-label classification. Current approaches tend to rely on
limited multi-modal techniques and to treat the multi-label problem as a set
of independent multi-class problems, overlooking issues related to imbalanced
learning and multi-label correlation. This paper introduces the Skin Lesion
Classifier, built on a Multi-modal Multi-label TransFormer-based model
(SkinM2Former). For multi-modal analysis, we introduce the Tri-Modal
Cross-attention Transformer (TMCT), which fuses the clinical image,
dermoscopic image, and patient metadata modalities at multiple feature
levels of a transformer encoder. For multi-label
classification, we introduce a multi-head attention (MHA) module that learns
multi-label correlations, complemented by an optimization strategy that
addresses the multi-label and imbalanced learning problems. SkinM2Former achieves a mean
average accuracy of 77.27% and a mean diagnostic accuracy of 77.85% on the
public Derm7pt dataset, outperforming state-of-the-art (SOTA) methods.