Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation
Conference proceeding   Open access   Peer reviewed

Ozge Mercanoglu Sincan and Richard Bowden
IVA Adjunct '25: Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents, 7
The 25th ACM International Conference on Intelligent Virtual Agents (IVA 2025) (Berlin, Germany, 16/08/2025–19/08/2025)
Published: 30/09/2025

Abstract

CCS Concepts: Human-centered computing → Accessibility technologies
Keywords: Sign Language Translation; Gloss-free SLT; Contrastive Learning; Multimodal Pretraining
Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT that leverages contrastive visual–language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. For the downstream SLT task, we fuse the visual features and feed them into an encoder–decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single-stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.
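The three-way alignment described in the abstract (encoder 1 ↔ text, encoder 2 ↔ text, encoder 1 ↔ encoder 2) can be sketched with an InfoNCE-style objective. The snippet below is a minimal illustration, not the paper's implementation: the function names, the equal weighting of the three terms, and the temperature value are all assumptions for the sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain Python lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE loss over a batch: the i-th query should match the i-th key,
    with all other keys in the batch acting as negatives."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)  # cross-entropy with target index i
    return loss / len(queries)

def dual_encoder_contrastive_loss(vis1, vis2, text):
    """Hypothetical combined objective: align each visual stream with the
    sentence-level text embeddings, and the two visual streams with each
    other. Equal weights are an assumption, not taken from the paper."""
    return (info_nce(vis1, text)
            + info_nce(vis2, text)
            + info_nce(vis1, vis2)) / 3.0
```

As a sanity check, a batch whose visual and text embeddings are matched index-by-index yields a much lower loss than the same batch with the text order shuffled, which is the behavior the contrastive objective relies on.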
PDF: ivaadjunct25-8_DVESLT (1.49 MB)
Author's Accepted Manuscript, CC BY 4.0, Open Access
URL: https://iva.acm.org/2025/ (conference website)

