Text Augmented Correlation Transformer for Few-shot Classification & Segmentation
Conference paper   Open access


Srinivasa Rao Nandam, Sara Atito Ali Ahmed, Zhen-Hua Feng, Josef Vaclav Kittler and Muhammad Awais
IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 - Proceedings, pp. 25357–25366
Institute of Electrical and Electronics Engineers (IEEE)
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025) (Nashville, TN, USA, 11/06/2025–15/06/2025)
13/08/2025

Abstract

Foundation models like CLIP and ALIGN have transformed few-shot and zero-shot vision applications by fusing visual and textual data, yet the integrative few-shot classification and segmentation (FS-CS) task primarily leverages visual cues, overlooking the potential of textual support. In FS-CS scenarios, ambiguous object boundaries and overlapping classes often hinder model performance, as limited visual data struggles to fully capture high-level semantics. To bridge this gap, we present a novel multi-modal FS-CS framework that integrates textual cues into support data, facilitating enhanced semantic disambiguation and fine-grained segmentation. Our approach first investigates the unique contributions of exclusive text-based support, using only class labels to achieve FS-CS. This strategy alone achieves performance competitive with vision-only methods on FS-CS tasks, underscoring the power of textual cues in few-shot learning. Building on this, we introduce a dual-modal prediction mechanism that synthesizes insights from both textual and visual support sets, yielding robust multi-modal predictions. This integration significantly elevates FS-CS performance, with classification and segmentation improvements of +3.7/6.6% (1-way 1-shot) and +8.0/6.5% (2-way 1-shot) on COCO-20^i, and +2.2/3.8% (1-way 1-shot) and +4.3/4.0% (2-way 1-shot) on Pascal-5^i. Additionally, in weakly supervised FS-CS settings, our method surpasses visual-only benchmarks using textual support exclusively, further enhanced by our dual-modal predictions. By rethinking the role of text in FS-CS, our work establishes new benchmarks for multi-modal few-shot learning and demonstrates the efficacy of textual cues for improving model generalization and segmentation accuracy.
pdf
Nandam_Text_Augmented_Correlation_Transformer_For_Few-shot_Classification__Segmentation_CVPR_2025_paper (1.42 MB)
Author's Accepted Manuscript, CC BY-NC-ND 4.0, Open Access
url
https://cvpr.thecvf.com/virtual/2025/index.html
Event Website (Conference website)
url
https://openaccess.thecvf.com/content/CVPR2025/html/Nandam_Text_Augmented_Correlation_Transformer_For_Few-shot_Classification__Segmentation_CVPR_2025_paper.html
Published (Version of Record), CC BY-NC-ND 4.0, Open Access

