Abstract
In the fashion domain, there is a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. These tasks differ drastically in their input/output formats and dataset sizes. It has been common to design a task-specific model for each task and fine-tune it independently from a pre-trained V+L model (e.g., CLIP). This results in parameter inefficiency and an inability to exploit inter-task relatedness. To address these issues, we propose FAME-ViL, a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks. In contrast to existing approaches, FAME-ViL applies a single model to multiple heterogeneous fashion tasks and is therefore much more parameter-efficient. This is enabled by two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-task training strategy that supports learning from heterogeneous data and prevents negative transfer. Extensive experiments on four fashion tasks show that our FAME-ViL saves 61.5% of parameters over alternatives while significantly outperforming conventional independently trained single-task models. Code is available at https://github.com/BrandonHanx/FAME-ViL