Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation
Conference proceeding   Open access   Peer reviewed

Ozge Mercanoglu Sincan and Richard Bowden
IVA Adjunct '25: Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents, 7
The 25th ACM International Conference on Intelligent Virtual Agents (IVA 2025) (Berlin, Germany, 16/08/2025–19/08/2025)
Published: 30/09/2025

Abstract

CCS Concepts: Human-centered computing → Accessibility technologies
Keywords: Sign Language Translation; Gloss-free SLT; Contrastive Learning; Multimodal Pretraining
Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT that leverages contrastive visual–language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. For the downstream SLT task, we fuse the visual features and feed them into an encoder–decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single-stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.
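The three-way alignment described in the abstract (encoder 1 ↔ text, encoder 2 ↔ text, encoder 1 ↔ encoder 2) can be sketched with an InfoNCE-style objective. The snippet below is a minimal illustration, not the paper's implementation: the function names, the equal weighting of the three terms, and the temperature value are all assumptions for the sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain Python lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE loss over a batch: the i-th query should match the i-th key,
    with all other keys in the batch acting as negatives."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)  # cross-entropy with target index i
    return loss / len(queries)

def dual_encoder_contrastive_loss(vis1, vis2, text):
    """Hypothetical combined objective: align each visual stream with the
    sentence-level text embeddings, and the two visual streams with each
    other. Equal weights are an assumption, not taken from the paper."""
    return (info_nce(vis1, text)
            + info_nce(vis2, text)
            + info_nce(vis1, vis2)) / 3.0
```

As a sanity check, a batch whose visual and text embeddings are matched index-by-index yields a much lower loss than the same batch with the text order shuffled, which is the behavior the contrastive objective relies on.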
PDF: ivaadjunct25-8_DVESLT (1.49 MB)
Author's Accepted Manuscript, CC BY 4.0, Open Access
URL: https://iva.acm.org/2025/ (conference website)

