Text Augmented Correlation Transformer for Few-shot Classification & Segmentation
Conference paper   Open access


Srinivasa Rao Nandam, Sara Atito Ali Ahmed, Zhen-Hua Feng, Josef Vaclav Kittler and Muhammad Awais
IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 - Proceedings, pp. 25357–25366
Institute of Electrical and Electronics Engineers (IEEE)
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025) (Nashville, TN, USA, 11/06/2025–15/06/2025)
13/08/2025

Abstract

Foundation models like CLIP and ALIGN have transformed few-shot and zero-shot vision applications by fusing visual and textual data, yet the integrative few-shot classification and segmentation (FS-CS) task primarily leverages visual cues, overlooking the potential of textual support. In FS-CS scenarios, ambiguous object boundaries and overlapping classes often hinder model performance, as limited visual data struggles to fully capture high-level semantics. To bridge this gap, we present a novel multi-modal FS-CS framework that integrates textual cues into support data, facilitating enhanced semantic disambiguation and fine-grained segmentation. Our approach first investigates the unique contributions of exclusive text-based support, using only class labels to achieve FS-CS. This strategy alone achieves performance competitive with vision-only methods on FS-CS tasks, underscoring the power of textual cues in few-shot learning. Building on this, we introduce a dual-modal prediction mechanism that synthesizes insights from both textual and visual support sets, yielding robust multi-modal predictions. This integration significantly elevates FS-CS performance, with classification and segmentation improvements of +3.7/6.6% (1-way 1-shot) and +8.0/6.5% (2-way 1-shot) on COCO-20^i, and +2.2/3.8% (1-way 1-shot) and +4.3/4.0% (2-way 1-shot) on Pascal-5^i. Additionally, in weakly supervised FS-CS settings, our method surpasses visual-only benchmarks using textual support exclusively, further enhanced by our dual-modal predictions. By rethinking the role of text in FS-CS, our work establishes new benchmarks for multi-modal few-shot learning and demonstrates the efficacy of textual cues for improving model generalization and segmentation accuracy.
pdf
Nandam_Text_Augmented_Correlation_Transformer_For_Few-shot_Classification__Segmentation_CVPR_2025_paper (1.42 MB)
Author's Accepted Manuscript, CC BY-NC-ND 4.0, Open Access
url
https://cvpr.thecvf.com/virtual/2025/index.html
Event Website (Conference website)
url
https://openaccess.thecvf.com/content/CVPR2025/html/Nandam_Text_Augmented_Correlation_Transformer_For_Few-shot_Classification__Segmentation_CVPR_2025_paper.html
Published (Version of Record), CC BY-NC-ND 4.0, Open Access

