Output list
Journal article
Gloss-Free Sign Language Translation: An Unbiased Evaluation of Progress in the Field
First online publication 07/10/2025
Computer Vision and Image Understanding, 261, 104498
Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa.
While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available to support transparency and reproducibility in SLT research.
Conference proceeding
Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation
First online publication 30/09/2025
IVA Adjunct '25: Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents, 7
The 25th ACM International Conference on Intelligent Virtual Agents (IVA 2025), 16/09/2025–19/09/2025, Berlin, Germany
IVA Adjunct ’25
Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT, leveraging contrastive visual–language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. During the downstream SLT task, we fuse the visual features and input them into an encoder–decoder model. On the Phoenix-2014T benchmark, our dual encoder architecture consistently outperforms its single-stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.
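As an illustration of the kind of contrastive objective described in this abstract, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a symmetric InfoNCE loss applied to each of the three pairings (encoder A with text, encoder B with text, encoder A with encoder B). The function names, the temperature value of 0.07, and the unweighted sum are all assumptions made for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    # L2-normalise so that dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a_batch: Sequence[Vector], b_batch: Sequence[Vector],
             temperature: float = 0.07) -> float:
    """Symmetric InfoNCE: row i of a_batch is the positive for row i of b_batch."""
    a = [_normalize(v) for v in a_batch]
    b = [_normalize(v) for v in b_batch]
    n = len(a)
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows: List[List[float]]) -> float:
        # Mean of -log softmax(row)[i], using a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            lse = m + math.log(sum(math.exp(z - m) for z in row))
            total += lse - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def joint_loss(vis_a: Sequence[Vector], vis_b: Sequence[Vector],
               text: Sequence[Vector], temperature: float = 0.07) -> float:
    # Hypothetical combination: an unweighted sum of the three pairwise terms.
    return (info_nce(vis_a, text, temperature)
            + info_nce(vis_b, text, temperature)
            + info_nce(vis_a, vis_b, temperature))
```

With a batch where matching rows point in the same direction, the loss is close to zero; shuffling one side of the batch makes it large, which is the behaviour a contrastive alignment objective is meant to produce.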
Conference proceeding
Spotter+GPT: Turning Sign Spottings into Sentences with LLMs
First online publication 30/09/2025
IVA Adjunct '25: Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents, In Press
25th ACM International Conference on Intelligent Virtual Agents, 16/09/2025–19/09/2025, Berlin, Germany
IVA Adjunct ’25
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos. In this paper, we introduce a lightweight, modular SLT framework, Spotter+GPT, that leverages the power of Large Language Models (LLMs) and avoids heavy end-to-end training. Spotter+GPT breaks down the SLT task into two distinct stages. First, a sign spotter identifies individual signs within the input video. The spotted signs are then passed to an LLM, which transforms them into meaningful spoken language sentences. Spotter+GPT eliminates the requirement for SLT-specific training. This significantly reduces computational costs and time requirements. The source code and pretrained weights of the Spotter are available online.
Conference proceeding
Sign Spotting Disambiguation using Large Language Models
Published 30/09/2025
IVA 2025 - Adjunct Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents, 1 - 9
25th ACM International Conference on Intelligent Virtual Agents, 16/09/2025–16/09/2025, Berlin, Germany
Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method’s superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.
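The dictionary-matching step mentioned in this abstract can be sketched as follows. This is a hedged illustration only: it assumes each dictionary entry is a sequence of per-frame feature vectors, uses cosine distance as the local cost inside a classic dynamic time warping recursion, and returns the single best gloss. The names `cosine_distance`, `dtw_cost` and `best_match`, the length normalisation, and the toy features are assumptions, and the LLM disambiguation stage is not shown.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    # 1 - cosine similarity; zero vectors are treated as maximally distant.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def dtw_cost(query: Sequence[Vector], template: Sequence[Vector]) -> float:
    """Length-normalised DTW alignment cost between two non-empty sequences."""
    n, m = len(query), len(template)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], template[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def best_match(query: Sequence[Vector],
               dictionary: Dict[str, List[Vector]]) -> str:
    # Return the gloss whose template aligns to the query most cheaply.
    return min(dictionary, key=lambda gloss: dtw_cost(query, dictionary[gloss]))
```

Because matching is done against stored templates, adding a new sign in this sketch only means adding a new dictionary entry, which mirrors the vocabulary flexibility the abstract describes.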
Conference proceeding
Hands-On: Segmenting Individual Signs from Continuous Sequences
First online publication 06/08/2025
2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG)
2025 19th International Conference on Automatic Face and Gesture Recognition (FG), 26/05/2025–30/05/2025, Tampa, Florida
This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages the HaMeR hand features, and is complemented with 3D Angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.
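To make the Begin-In-Out (BIO) framing in this abstract concrete, here is a small sketch of how sign segments could be encoded as per-frame tags and decoded back. It is an illustration under assumptions, not the paper's code: segments are taken to be (start, end) frame indices with an exclusive end, and the helper names are hypothetical.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # (start_frame, end_frame), end is exclusive


def segments_to_bio(num_frames: int, segments: List[Segment]) -> List[str]:
    """Tag each frame as B (first frame of a sign), I (inside a sign) or O (outside)."""
    tags = ["O"] * num_frames
    for start, end in segments:
        tags[start] = "B"
        for t in range(start + 1, end):
            tags[t] = "I"
    return tags


def bio_to_segments(tags: List[str]) -> List[Segment]:
    """Recover (start, end) segments from a per-frame BIO tag sequence."""
    segments: List[Segment] = []
    start: Optional[int] = None
    for t, tag in enumerate(tags):
        if tag == "B":
            # A new sign begins; close the previous one if it is still open.
            if start is not None:
                segments.append((start, t))
            start = t
        elif tag == "O":
            if start is not None:
                segments.append((start, t))
                start = None
        # An "I" tag extends the open segment; a stray "I" with no "B" is ignored.
    if start is not None:
        segments.append((start, len(tags)))
    return segments
```

The explicit B tag is what lets two signs that follow each other with no gap be separated, which a plain inside/outside labeling could not do.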
Conference proceeding
SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation
Accepted for publication 12/07/2025
2025 IEEE/CVF International Conference on Computer Vision (ICCV 2025)
International Conference on Computer Vision, ICCV 2025, 19/10/2025–23/10/2025, Honolulu, Hawaii
Workshops
Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performances without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands, raising concerns about scalability, especially as large-scale sign language datasets become more common. We propose a segment-aware visual tokenization framework that leverages sign segmentation to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67× lower memory usage and better scalability on larger datasets. To bridge the visual and linguistic modalities, we introduce a token-to-token contrastive alignment objective, along with a dual-level supervision that aligns both language embeddings and intermediate hidden states. This improves fine-grained cross-modal alignment without relying on gloss-level supervision. Our approach notably exceeds the performance of state-of-the-art methods on the PHOENIX14T benchmark, while significantly reducing sequence length. Further experiments also demonstrate our improved performance over prior work under comparable sequence-lengths, validating the potential of our tokenization and alignment strategies.
Conference proceeding
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis
Accepted for publication 10/07/2025
2025 IEEE/CVF International Conference on Computer Vision (ICCV 2025)
International Conference on Computer Vision, ICCV 2025, 19/10/2025–23/10/2025, Honolulu, Hawaii, United States
Workshops
Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. This perceptual focus naturally supports accurate mouthings, essential cues that disambiguate similar manual signs in sign language avatars.
Conference proceeding
SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work
Accepted for publication 23/05/2025
The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR 2025)
The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025, 11/06/2025–11/06/2025, Nashville, Tennessee
CVPR Workshop SLRTP
Sign Language Production (SLP) is the task of generating sign language video from spoken language inputs. The field has seen a range of innovations over the last few years, with the introduction of deep learning-based approaches providing significant improvements in the realism and naturalness of generated outputs. However, the lack of standardized evaluation metrics for SLP approaches hampers meaningful comparisons across different systems. To address this, we introduce the first Sign Language Production Challenge, held as part of the third SLRTP Workshop at CVPR 2025. The competition aims to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses, known as Text-to-Pose (T2P) translation, over a range of metrics. For our evaluation data, we use the RWTH-PHOENIX-Weather-2014T dataset, a German Sign Language (Deutsche Gebärdensprache, DGS) weather broadcast dataset. In addition, we curate a custom hidden test set from a similar domain of discourse. This paper presents the challenge design and the winning methodologies. The challenge attracted 33 participants who submitted 231 solutions, with the top-performing team achieving a BLEU-1 score of 31.40 and a DTW-MJE of 0.0574. The winning approach utilized a retrieval-based framework and a pre-trained language model. As part of the workshop, we release a standardized evaluation network, including high-quality skeleton extraction-based keypoints, establishing a consistent baseline for the SLP field, which will enable future researchers to compare their work against a broader range of methods.
Conference proceeding
First online publication 11/07/2024
2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), 1 - 10
International Conference on Automatic Face and Gesture Recognition (FG), 27/05/2024–31/05/2024, Istanbul, Turkiye
Recent years have seen significant progress in human image generation, particularly with the advancements in diffusion models. However, existing diffusion methods encounter challenges when producing consistent hand anatomy and the generated images often lack precise control over the hand pose. To address this limitation, we introduce a novel approach to pose-conditioned human image generation, dividing the process into two stages: hand generation and subsequent body outpainting around the hands. We propose training the hand generator in a multi-task setting to produce both hand images and their corresponding segmentation masks, and employ the trained model in the first stage of generation. An adapted ControlNet model is then used in the second stage to outpaint the body around the generated hands, producing the final result. A novel blending technique is introduced to preserve the hand details during the second stage that combines the results of both stages in a coherent way. This involves sequential expansion of the outpainted region while fusing the latent representations, to ensure a seamless and cohesive synthesis of the final image. Experimental evaluations demonstrate the superiority of our proposed method over state-of-the-art techniques, in both pose accuracy and image quality, as validated on the HaGRID dataset. Our approach not only enhances the quality of the generated hands but also offers improved control over hand pose, advancing the capabilities of pose-conditioned human image generation. The source code is available at https://github.com/apelykh/hand-to-diffusion
Conference proceeding
Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse
Published 02/2024
2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2023), 1947 - 1957
2023 International Conference on Computer Vision (ICCV 2023), 02/10/2023–06/10/2023, Paris Convention Centre, Paris, France
ICCV workshop - 11th Workshop on Assistive Computer Vision and Robotics (ACVR 2023)
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos, both of which have different grammar and word/gloss order. From a Neural Machine Translation (NMT) perspective, the straightforward way of training translation models is to use sign language phrase-spoken language sentence pairs. However, human interpreters heavily rely on the context to understand the conveyed information, especially for sign language interpretation, where the vocabulary size may be significantly smaller than their spoken language equivalent.
Taking direct inspiration from how humans translate, we propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would. We use the context from previous sequences and confident predictions to disambiguate weaker visual cues. To achieve this we use complementary transformer encoders, namely: (1) A Video Encoder, that captures the low-level video features at the frame-level, (2) A Spotting Encoder, that models the recognized sign glosses in the video, and (3) A Context Encoder, which captures the context of the preceding sign sequences. We combine the information coming from these encoders in a final transformer decoder to generate spoken language translations.
We evaluate our approach on the recently published large-scale BOBSL dataset, which contains ∼1.2M sequences, and on the SRF dataset, which was part of the WMT-SLT 2022 challenge. We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.