Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse

Ozge Mercanoglu Sincan; Necati Cihan Camgöz; Richard Bowden

doi:10.1109/ICCVW60793.2023.00210

Conference proceeding

Open access

2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2023), pp.1947-1957

IEEE International Conference on Computer Vision Workshops

ICCV workshop - 11th Workshop on Assistive Computer Vision and Robotics (ACVR 2023)

2023 International Conference on Computer Vision (ICCV 2023) (Paris Convention Centre, Paris, France, 02/10/2023–06/10/2023)

02/2024

DOI: https://doi.org/10.1109/ICCVW60793.2023.00210

Abstract

Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos; the two languages differ in grammar and word/gloss order. From a Neural Machine Translation (NMT) perspective, the straightforward way of training translation models is to use pairs of sign language phrases and spoken language sentences. However, human interpreters rely heavily on context to understand the conveyed information, especially for sign language interpretation, where the vocabulary may be significantly smaller than that of the spoken language equivalent. Taking direct inspiration from how humans translate, we propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would. We use the context from previous sequences and confident predictions to disambiguate weaker visual cues. To achieve this, we use three complementary transformer encoders: (1) a Video Encoder, which captures low-level video features at the frame level, (2) a Spotting Encoder, which models the recognized sign glosses in the video, and (3) a Context Encoder, which captures the context of the preceding sign sequences. We combine the information coming from these encoders in a final transformer decoder to generate spoken language translations. We evaluate our approach on the recently published large-scale BOBSL dataset, which contains ∼1.2M sequences, and on the SRF dataset, which was part of the WMT-SLT 2022 challenge. We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
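
The three input streams described in the abstract can be illustrated with a small data-preparation sketch. It is not the authors' implementation: the names `Segment`, `ModelInputs`, `build_inputs` and `context_size` are hypothetical, and it assumes that context is represented as the text of the preceding sentences, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    """One sentence-level sign sequence (all field names are hypothetical)."""
    frame_features: List[List[float]]  # one feature vector per video frame
    spotted_glosses: List[str]         # sign glosses recognized in the video
    translation: str                   # spoken-language sentence


@dataclass
class ModelInputs:
    """The three streams that would feed the three encoders."""
    video: List[List[float]]  # for a Video Encoder
    spottings: List[str]      # for a Spotting Encoder
    context: List[str]        # for a Context Encoder


def build_inputs(segments: Sequence[Segment], index: int,
                 context_size: int = 2) -> ModelInputs:
    """Collect the inputs for the segment at `index`.

    Assumption: context is the text of up to `context_size` preceding
    sentences, taken in their original order.
    """
    if not 0 <= index < len(segments):
        raise IndexError("segment index out of range")
    start = max(0, index - context_size)
    context = [seg.translation for seg in segments[start:index]]
    current = segments[index]
    return ModelInputs(video=current.frame_features,
                       spottings=current.spotted_glosses,
                       context=context)


# Toy data, invented purely for illustration.
segments = [
    Segment([[0.1, 0.2]], ["HELLO"], "Hello."),
    Segment([[0.3, 0.4]], ["WEATHER"], "Here is the weather."),
    Segment([[0.5, 0.6], [0.7, 0.8]], ["RAIN", "NORTH"],
            "Rain is expected in the north."),
]
inputs = build_inputs(segments, index=2, context_size=2)
print(inputs.context)  # ['Hello.', 'Here is the weather.']
```

In this sketch the first segment of a recording simply receives an empty context list. A real system would also have to decide whether the context comes from reference sentences or from the model's own earlier predictions.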

Files and links (2)

pdf

Is context all you need - AAM (2.00 MB)

Author's Accepted Manuscript Open Access

url

https://iccv2023.thecvf.com/

Event website (conference website)

Metrics

58 File views/downloads

161 Record Views

Details

Title: Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse
Creators: Ozge Mercanoglu Sincan (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Necati Cihan Camgöz (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Richard Bowden (Author) - University of Surrey, School of Computer Science and Electronic Engineering
Publication Details: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2023), pp.1947-1957
Conference: 2023 International Conference on Computer Vision (ICCV 2023) (Paris Convention Centre, Paris, France, 02/10/2023–06/10/2023)
Event: ICCV workshop - 11th Workshop on Assistive Computer Vision and Robotics (ACVR 2023)
Series: IEEE International Conference on Computer Vision Workshops
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 11
First online publication date: 25/12/2023
Publication Date: 02/2024
Date accepted for publication: 04/08/2023
Grants: ExTOL, EP/R03298X/1, Engineering and Physical Sciences Research Council (United Kingdom, Swindon) - EPSRC
EASIER : Intelligent Automatic Sign Language Translation, 101016982, Horizon 2020
SMILE II, CRSII5 193686, Swiss National Science Foundation (Switzerland, Bern) - FNS
Grant note: This work was supported by the EPSRC project ExTOL (EP/R03298X/1), SNSF project ‘SMILE II’ (CRSII5 193686), European Union’s Horizon 2020 programme (‘EASIER’ grant agreement 101016982) and the Innosuisse IICT Flagship (PFFS-21-47). This work reflects only the authors’ view and the Commission is not responsible for any use that may be made of the information it contains.
Identifiers: 99794066302346; WOS:001156680302004
Academic Unit: School of Computer Science and Electronic Engineering
Language: English
Resource Type: Conference proceeding
