Abstract
Sign Language Translation (SLT) is a challenging task that requires bridging the
modality gap between visual and linguistic information while capturing subtle variations
in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning
capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs
struggle to model long videos in detail, we propose an approach that generates fine-grained, temporally aware textual descriptions of hand motion. A contrastive alignment
module aligns these descriptions with video features during pre-training, encouraging
the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features
from Hand Mesh Recovery (HaMeR). Additionally, we apply a contrastive loss between
sign video representations and target language embeddings to reduce the modality gap
during pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T
and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework.
Our code is available at https://github.com/elsobhano/BeyondGloss.
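For concreteness, a symmetric InfoNCE-style contrastive objective of the kind referred to above can be sketched as follows; the notation (video embeddings $v_i$, text embeddings $t_i$, temperature $\tau$, batch size $B$, cosine similarity $\operatorname{sim}$) is illustrative and not necessarily the exact formulation used in BeyondGloss:
\[
\mathcal{L}_{\text{align}} = -\frac{1}{2B} \sum_{i=1}^{B} \left[ \log \frac{\exp\!\big(\operatorname{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(\operatorname{sim}(v_i, t_j)/\tau\big)} + \log \frac{\exp\!\big(\operatorname{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(\operatorname{sim}(v_j, t_i)/\tau\big)} \right]
\]
Such an objective pulls each video representation toward its paired textual embedding while pushing it away from the other pairs in the batch, which is the general mechanism behind both the hand-description alignment and the video-to-target-language alignment described above.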