Ozge Mercanoglu Sincan

Research Fellow in Computer Vision and Deep Learning, School of Computer Science & Electronic Engineering, Faculty of Engineering and Physical Sciences, University of Surrey

Journal article Open access Peer reviewed

Gloss-Free Sign Language Translation: An Unbiased Evaluation of Progress in the Field

by Ozge Mercanoglu Sincan, Jian He Low, Sobhan Asasi and Prof. Richard Bowden

First online publication 07/10/2025

Computer Vision and Image Understanding, 261, 104498

Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa.

While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available to support transparency and reproducibility in SLT research.

Journal article Open access Peer reviewed

Using Motion History Images With 3D Convolutional Networks in Isolated Sign Language Recognition

by Ozge Mercanoglu Sincan and Hacer Yalim Keles

Published 01/01/2022

IEEE Access, 10, 18608-18618

Sign language recognition using computational models is a challenging problem that requires simultaneous spatio-temporal modeling of multiple sources, such as the face, hands, and body. In this paper, we propose an isolated sign language recognition model based on a model trained using Motion History Images (MHI) that are generated from RGB video frames. RGB-MHI images effectively represent a spatio-temporal summary of each sign video in a single RGB image. We propose two different approaches using this RGB-MHI model. In the first approach, we use the RGB-MHI model as a motion-based spatial attention module integrated into a 3D-CNN architecture. In the second approach, we combine RGB-MHI model features directly with the features of a 3D-CNN model using a late fusion technique. We perform extensive experiments on two recently released large-scale isolated sign language datasets, namely AUTSL and BosphorusSign22k. Our experiments show that our models, which use only RGB data, can compete with the state-of-the-art models in the literature that use multi-modal data.
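As a rough illustration of the idea behind a Motion History Image, the sketch below computes the classic single-channel MHI from grayscale frames: pixels that moved recently are bright and older motion fades. This is not the authors' RGB-MHI procedure, which the abstract does not specify; the function name, the `tau` decay length, and the `threshold` value are assumptions chosen for the example.

```python
import numpy as np

def motion_history_image(frames, tau=None, threshold=30):
    """Classic MHI sketch. frames: array of shape (T, H, W), grayscale.

    tau and threshold are illustrative assumptions, not values from the paper.
    """
    frames = np.asarray(frames, dtype=np.float32)  # avoid uint8 wrap-around
    num_frames = frames.shape[0]
    if tau is None:
        tau = num_frames - 1  # let motion from the first step still be visible
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for t in range(1, num_frames):
        # A pixel counts as moving if it changed enough since the last frame.
        moving = np.abs(frames[t] - frames[t - 1]) >= threshold
        # Moving pixels are reset to tau; all others decay by one step.
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))
    # Normalise to [0, 1] so the result can be viewed as an image.
    return mhi / tau if tau > 0 else mhi
```

The whole video collapses into one image whose intensity encodes how recently each pixel moved, which is what makes it usable as a compact motion summary.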

Journal article Open access Peer reviewed

AUTSL: A Large Scale Multi-Modal Turkish Sign Language Dataset and Baseline Methods

by Ozge Mercanoglu Sincan and Hacer Yalim Keles

Published 01/01/2020

IEEE Access, 8, 181340-181355

Sign language recognition is a challenging problem where signs are identified by simultaneous local and global articulations of multiple sources, i.e. hand shape and orientation, hand movements, body posture, and facial expressions. Solving this problem computationally for a large vocabulary of signs in real-life settings is still a challenge, even with the state-of-the-art models. In this study, we present a new large-scale multi-modal Turkish Sign Language dataset (AUTSL) with a benchmark and provide baseline models for performance evaluations. Our dataset consists of 226 signs performed by 43 different signers and 38,336 isolated sign video samples in total. Samples contain a wide variety of backgrounds recorded in indoor and outdoor environments. Moreover, spatial positions and the postures of signers also vary in the recordings. Each sample is recorded with Microsoft Kinect v2 and contains color image (RGB), depth, and skeleton modalities. We prepared benchmark training and test sets for user-independent assessments of the models. We trained several deep learning based models and provide empirical evaluations using the benchmark; we used Convolutional Neural Networks (CNNs) to extract features, and unidirectional and bidirectional Long Short-Term Memory (LSTM) models to characterize temporal information. We also incorporated feature pooling modules and temporal attention into our models to improve performance. We evaluated our baseline models on the AUTSL and Montalbano datasets. Our models achieved competitive results with the state-of-the-art methods on the Montalbano dataset, i.e. 96.11% accuracy. On AUTSL random train-test splits, our models achieved up to 95.95% accuracy. On the proposed user-independent benchmark dataset, our best baseline model achieved 62.02% accuracy. The gaps in the performances of the same baseline models show the challenges inherent in our benchmark dataset. The AUTSL benchmark dataset is publicly available at https://cvml.ankara.edu.tr.
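The gap between the random-split and user-independent results comes from whether test signers also appear in training. A minimal sketch of a signer-disjoint split is given below; it is not the procedure used to build the AUTSL benchmark, and the function name, the `(sample_id, signer_id)` input format, and the `test_fraction` default are assumptions made for illustration.

```python
import random
from collections import defaultdict

def signer_independent_split(samples, test_fraction=0.2, seed=0):
    """Split samples so that no signer appears in both train and test.

    samples: list of (sample_id, signer_id) pairs (hypothetical format).
    """
    by_signer = defaultdict(list)
    for sample_id, signer_id in samples:
        by_signer[signer_id].append(sample_id)

    # Shuffle signers, not samples, so whole signers are held out.
    signers = sorted(by_signer)
    random.Random(seed).shuffle(signers)
    num_test = max(1, round(len(signers) * test_fraction))
    test_signers = set(signers[:num_test])

    train = [s for signer in signers[num_test:] for s in by_signer[signer]]
    test = [s for signer in signers[:num_test] for s in by_signer[signer]]
    return train, test, test_signers
```

A random split over samples would instead place videos of the same signer on both sides, which tends to inflate accuracy compared with evaluating on unseen signers.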

