Output list
Conference presentation
Progressive Transformers for End-to-End Sign Language Production
Availability date 20/07/2020
2020 European Conference on Computer Vision (ECCV)
European Conference on Computer Vision (ECCV), 23/08/2020–28/08/2020, Virtual Conference
The goal of automatic Sign Language Production (SLP) is to translate spoken language to a continuous stream of sign language video at a level comparable to a human translator. If this were achievable, it would revolutionise Deaf-hearing communication. Previous work on predominantly isolated SLP has shown the need for architectures that are better suited to the continuous domain of full sign sequences. In this paper, we propose Progressive Transformers, the first SLP model to translate from discrete spoken language sentences to continuous 3D sign pose sequences in an end-to-end manner. A novel counter decoding technique is introduced that enables continuous sequence generation at both training and inference. We present two model configurations: an end-to-end network that produces sign directly from text, and a stacked network that utilises a gloss intermediary. We also provide several data augmentation processes to overcome the problem of drift and drastically improve the performance of SLP models. We propose a back-translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset and setting baselines for future research. Code is available at https://github.com/BenSaunders27/ProgressiveTransformersSLP.
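The counter decoding idea can be pictured as regressing, at every step, the next continuous pose vector together with an extra counter channel that tracks progress through the sequence, so that generation stops once the counter saturates. The sketch below illustrates that mechanism with a standard PyTorch transformer decoder; the module names, dimensions, and stopping threshold are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of counter-based autoregressive decoding for continuous pose
# sequences (names and dimensions are illustrative, not the published model).
import torch
import torch.nn as nn

class CounterPoseDecoder(nn.Module):
    def __init__(self, pose_dim=150, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim + 1, d_model)   # pose + counter channel
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out_proj = nn.Linear(d_model, pose_dim + 1)  # predict next pose + counter

    def forward(self, prev_frames, text_memory):
        # prev_frames: (B, T, pose_dim + 1); text_memory: (B, S, d_model) from a text encoder (not shown)
        x = self.in_proj(prev_frames)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.decoder(x, text_memory, tgt_mask=mask)
        return self.out_proj(h)

@torch.no_grad()
def generate(model, text_memory, pose_dim=150, max_len=300):
    # Start from a zero pose with counter 0; stop once the predicted counter reaches 1.
    frames = torch.zeros(1, 1, pose_dim + 1)
    for _ in range(max_len):
        nxt = model(frames, text_memory)[:, -1:, :]       # prediction for the next frame
        frames = torch.cat([frames, nxt], dim=1)
        if nxt[0, 0, -1].item() >= 1.0:                   # counter channel signals the end
            break
    return frames[:, 1:, :pose_dim]                        # drop seed frame and counter
```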
Conference presentation
Nested VAE: Isolating Common Factors via Weak Supervision
Availability date 02/04/2020
15th IEEE International Conference on Automatic Face and Gesture Recognition
15th IEEE International Conference on Automatic Face and Gesture Recognition, Buenos Aires, Argentina
Fair and unbiased machine learning is an important and active field of research, as decision processes are increasingly driven by models that learn from data. Unfortunately, any biases present in the data may be learned by the model, thereby inappropriately transferring that bias into the decision making process. We identify the connection between the task of bias reduction and that of isolating factors common between domains whilst encouraging domain specific invariance. To isolate the common factors we combine the theory of deep latent variable models with information bottleneck theory for scenarios whereby data may be naturally paired across domains and no additional supervision is required. The result is the Nested Variational AutoEncoder (NestedVAE). Two outer VAEs with shared weights attempt to reconstruct the input and infer a latent space, whilst a nested VAE attempts to reconstruct the latent representation of one image from the latent representation of its paired image. In so doing, the nested VAE isolates the common latent factors/causes and becomes invariant to unwanted factors that are not shared between paired images. We also propose a new metric to provide a balanced method of evaluating consistency and classifier performance across domains, which we refer to as the Adjusted Parity metric. An evaluation of NestedVAE on both domain and attribute invariance, change detection, and learning common factors for the prediction of biological sex demonstrates that NestedVAE significantly outperforms alternative methods.
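The nested arrangement described above can be summarised as two weight-shared outer VAEs reconstructing each image of a pair, plus an inner VAE that must predict one image's outer latent from its partner's latent, so that only shared factors can pass through. The sketch below is a minimal illustration of that structure using simple MLP encoders and squared-error reconstruction; all class names, dimensions, and loss terms are assumptions, not the published implementation.

```python
# Illustrative sketch of the nested structure: outer encoder/decoder weights are
# shared across the pair, and an inner VAE predicts one image's latent from the
# latent of its paired image. Not the authors' implementation.
import torch
import torch.nn as nn

class VAEBlock(nn.Module):
    def __init__(self, in_dim, latent_dim, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
        return self.dec(z), z, mu, logvar

class NestedVAESketch(nn.Module):
    def __init__(self, in_dim=784, outer_dim=32, inner_dim=16):
        super().__init__()
        self.outer = VAEBlock(in_dim, outer_dim)     # shared weights for both images of a pair
        self.inner = VAEBlock(outer_dim, inner_dim)  # operates on the outer latents

    def forward(self, x_a, x_b):
        recon_a, z_a, mu_a, lv_a = self.outer(x_a)
        recon_b, z_b, mu_b, lv_b = self.outer(x_b)
        # The inner VAE reconstructs the partner's latent from this image's latent,
        # so only factors common to the pair can survive the inner bottleneck.
        z_b_hat, _, mu_i, lv_i = self.inner(z_a)
        return {
            "recon": ((recon_a - x_a) ** 2).mean() + ((recon_b - x_b) ** 2).mean(),
            "kl_outer": -0.5 * (1 + lv_a - mu_a ** 2 - lv_a.exp()).mean()
                        - 0.5 * (1 + lv_b - mu_b ** 2 - lv_b.exp()).mean(),
            "latent_recon": ((z_b_hat - z_b.detach()) ** 2).mean(),
            "kl_inner": -0.5 * (1 + lv_i - mu_i ** 2 - lv_i.exp()).mean(),
        }
```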
Conference presentation
Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation
Availability date 02/04/2020
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020
IEEE Conference on Computer Vision and Pattern Recognition 2020, Seattle, Washington
Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in translation requires gloss level tokenization in order to work. We introduce a novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solves two co-dependent sequence-to-sequence learning problems, and leads to significant performance gains. We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.
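The joint objective described above amounts to supervising the encoder with a CTC loss over gloss labels while the decoder is trained with a standard cross-entropy translation loss, and summing the two into one training signal. A minimal sketch of such a combined loss is given below; tensor shapes, the blank index, and the weighting are placeholders rather than the paper's exact configuration.

```python
# Sketch of a joint recognition + translation loss: CTC over gloss labels on the
# encoder side plus cross-entropy over spoken-language tokens on the decoder side.
import torch.nn.functional as F

def joint_loss(gloss_logits, gloss_targets, input_lens, gloss_lens,
               translation_logits, word_targets, pad_idx=0, ctc_weight=1.0):
    # gloss_logits: (T, B, num_glosses), time-major as required by F.ctc_loss;
    # gloss_targets: (B, max_gloss_len); input_lens / gloss_lens: (B,) lengths.
    recognition = F.ctc_loss(F.log_softmax(gloss_logits, dim=-1), gloss_targets,
                             input_lens, gloss_lens, blank=0, zero_infinity=True)
    # translation_logits: (B, U, vocab); word_targets: (B, U) next-word indices.
    translation = F.cross_entropy(
        translation_logits.reshape(-1, translation_logits.size(-1)),
        word_targets.reshape(-1), ignore_index=pad_idx)
    return ctc_weight * recognition + translation
```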
Conference presentation
Gated Variational AutoEncoders: Incorporating Weak Supervision to Encourage Disentanglement
Availability date 28/02/2020
15th IEEE International Conference on Automatic Face and Gesture Recognition
15th IEEE International Conference on Automatic Face and Gesture Recognition, 18/05/2020–22/05/2020, Buenos Aires, Argentina
Variational AutoEncoders (VAEs) provide a means to generate representational latent embeddings. Previous research has highlighted the benefits of achieving representations that are disentangled, particularly for downstream tasks. However, there is some debate about how to encourage disentanglement with VAEs, and evidence indicates that existing implementations do not achieve disentanglement consistently. How well a VAE’s latent space has been disentangled is often evaluated against our subjective expectations of which attributes should be disentangled for a given problem. Therefore, by definition, we already have domain knowledge of what should be achieved and yet we use unsupervised approaches to achieve it. We propose a weakly supervised approach that incorporates any available domain knowledge into the training process to form a Gated-VAE. The process involves partitioning the representational embedding and gating backpropagation. All partitions are utilised on the forward pass but gradients are backpropagated through different partitions according to selected image/target pairings. The approach can be used to modify existing VAE models such as beta-VAE, InfoVAE and DIP-VAE-II. Experiments demonstrate that, using gated backpropagation, latent factors are represented in their intended partitions. The approach is applied to images of faces for the purpose of disentangling head-pose from facial expression. Quantitative metrics show that using Gated-VAE improves average disentanglement, completeness and informativeness, as compared with un-gated implementations. Qualitative assessment of latent traversals demonstrates its disentanglement of head-pose from expression, even when only weak/noisy supervision is available.
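The gating mechanism partitions the latent vector and lets gradients flow back only through the partition associated with the current image/target pairing, while the forward pass still uses every partition. One simple way to emulate this in PyTorch is to detach the non-selected slice, as in the sketch below; the function name and partition layout are hypothetical, not the paper's code.

```python
# Sketch of gated backpropagation over a partitioned latent vector: both partitions
# are used on the forward pass, but gradients only flow through the partition
# selected for the current image/target pairing (via detach on the other slice).
import torch

def gate_latent(z, split, active_partition):
    # z: (B, D) latent; split: index separating partition 0 (z[:, :split]) from partition 1.
    z_a, z_b = z[:, :split], z[:, split:]
    if active_partition == 0:
        z_b = z_b.detach()   # block gradients through partition 1
    else:
        z_a = z_a.detach()   # block gradients through partition 0
    return torch.cat([z_a, z_b], dim=1)

# Hypothetical usage inside a VAE training step:
#   z = encoder(x)                                     # full latent, e.g. D = 20, split = 10
#   z = gate_latent(z, split=10, active_partition=0)   # this pairing targets the head-pose partition
#   recon = decoder(z)                                 # forward pass still uses all partitions
```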
Conference presentation
ExTOL: Automatic recognition of British Sign Language using the BSL Corpus
First online publication 29/09/2019
Proceedings of 6th Workshop on Sign Language Translation and Avatar Technology (SLTAT) 2019
6th Workshop on Sign Language Translation and Avatar Technology (SLTAT 2019), 29/09/2019, Hamburg, Germany
Although there have been some recent advances in sign language recognition, part of the problem is that most computer scientists in this research area do not have the required in-depth knowledge of sign language, and often have no connection with the Deaf community or sign linguists. For example, one project described as translation into sign language aimed to take subtitles and turn them into fingerspelling. This is one of many reasons why much of this technology, including sign-language gloves, simply doesn’t address the challenges.
However, there are benefits to achieving automatic sign language recognition. The process of annotating and analysing sign language data on video is extremely labour-intensive. Sign language recognition technology could help speed this up. Until recently we have lacked large signed video datasets that have been precisely and consistently transcribed and translated – these are needed to train computers for automation. But sign language corpora – large datasets like the British Sign Language Corpus (Schembri et al., 2014) – bring new possibilities for this technology.
Here we describe the project “ExTOL: End to End Translation of British Sign Language”, one aim of which is to build the world's first British Sign Language to English translation system and the first practically functional machine translation system for any sign language. Annotation work previously done on the BSL Corpus is providing essential data to be used by computer vision tools to assist with automatic recognition. To achieve this, the computer must be able to recognise not only the shape, motion and location of the hands but also non-manual features – including facial expression, mouth movements, and body posture of the signer. It must also understand how all of this activity in connected signing can be translated into written/spoken language. The technology for recognising hand, face and body positions and movements is improving all the time, which will allow significant progress in speeding up automatic recognition and identification of these elements (e.g. recognising specific facial expressions, mouth movements or head movements). Full translation from BSL to English is of course more complex, but the automatic recognition of basic positions and movements will help pave the way towards automatic translation. In addition, a secondary aim of the project is to create automatic annotation tools to be integrated into the annotation software package ELAN. We will additionally make the software tools available as independent packages, thus potentially allowing their inclusion into other annotation software such as iLex.
In this poster we report on some initial progress on the ExTOL project. This includes (1) automatic recognition of English mouthings, which is being trained using 600+ hours of audiovisual spoken English from British television and TED videos, and developed by testing on English mouthing annotations from the BSL Corpus. It also includes (2) translation of IDGloss to Free Translation, where the aim is to produce English-like sentences given sign glosses in BSL word order. We report baseline results on a subset of the BSL Corpus, which contains 10,000+ sequences and over 5,000 unique tokens, using state-of-the-art attention-based Neural Machine Translation approaches (Camgoz et al., 2018; Vaswani et al., 2017). Although it is clear that free translation (i.e. full English translation) cannot be achieved via ID glosses alone, this baseline translation task will help contribute to the overall BSL to English translation process – at least at the level of manual signs.
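For the IDGloss-to-free-translation baseline, the cited attention-based NMT approaches reduce to a standard encoder-decoder Transformer trained on gloss/English sentence pairs. The sketch below shows such a baseline with PyTorch's built-in Transformer; vocabulary sizes and hyper-parameters are placeholders, not the values used in the reported experiments.

```python
# Minimal sketch of a gloss-to-English baseline with a standard encoder-decoder
# Transformer, in the spirit of the attention-based NMT baselines cited above.
import torch
import torch.nn as nn

class GlossToTextModel(nn.Module):
    def __init__(self, gloss_vocab=5000, word_vocab=10000, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(gloss_vocab, d_model)
        self.tgt_emb = nn.Embedding(word_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3, num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, word_vocab)

    def forward(self, gloss_ids, word_ids):
        # gloss_ids: (B, S) ID-gloss tokens; word_ids: (B, U) shifted English tokens
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            word_ids.size(1)).to(word_ids.device)
        h = self.transformer(self.src_emb(gloss_ids), self.tgt_emb(word_ids),
                             tgt_mask=tgt_mask)
        return self.out(h)   # (B, U, word_vocab) logits for cross-entropy training
```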
Conference presentation
Published 12/05/2019
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2817 - 2821
2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), 12/05/2019–17/05/2019, Brighton, UK
Sign language conveys information through multiple channels, such as hand shape, hand movement, and mouthing. Modeling this multichannel information is a highly challenging problem. In this paper, we elucidate the link between spoken language and sign language in terms of production phenomenon and perception phenomenon. Through this link we show that hidden Markov model-based approaches developed to model "articulatory" features for spoken language processing can be exploited to model the multichannel information inherent in sign language for sign language processing.
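One way to picture the HMM-based multichannel modelling is to give each channel (hand shape, movement, mouthing) its own HMM over per-frame features and combine the per-channel log-likelihoods, mirroring how articulatory features are factorised in speech processing. The sketch below illustrates that idea with hmmlearn; the channel names, feature dimensions, and conditional-independence assumption are simplifications, not the paper's exact formulation.

```python
# Loose sketch of per-channel HMM modelling with combined scores; feature shapes
# are random placeholders and the factorisation is a simplifying assumption.
import numpy as np
from hmmlearn import hmm

def train_channel_models(channel_features, n_states=3):
    # channel_features: dict mapping channel name -> (frames x feature_dim) array
    models = {}
    for name, feats in channel_features.items():
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(feats)                      # one observation sequence per channel here
        models[name] = m
    return models

def combined_log_likelihood(models, channel_features):
    # Sum per-channel log-likelihoods, treating channels as conditionally independent.
    return sum(models[name].score(feats) for name, feats in channel_features.items())

# Example with random placeholder features for three channels of one sign:
rng = np.random.default_rng(0)
features = {c: rng.normal(size=(40, 8)) for c in ["hand_shape", "movement", "mouthing"]}
models = train_channel_models(features)
print(combined_log_likelihood(models, features))
```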
Conference presentation
Sign Language Production using Neural Machine Translation and Generative Adversarial Networks
Published 03/09/2018
Proceedings of the 29th British Machine Vision Conference (BMVC 2018)
29th British Machine Vision Conference (BMVC 2018), 03/09/2018–06/09/2018, Northumbria University, Newcastle Upon Tyne, UK
We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign gloss sequences using an encoder-decoder network. We then find a data-driven mapping between glosses and skeletal sequences. We use the resulting pose information to condition a generative model that produces sign language video sequences. We evaluate our approach on the recently released PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach by sharing qualitative results of generated sign sequences given their skeletal correspondence.
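The text-to-gloss BLEU-4 figures quoted above are corpus-level n-gram precision scores over the predicted gloss sequences. As a reminder of how such a score is computed, the sketch below uses NLTK's corpus_bleu on a couple of hypothetical gloss sequences; the paper's actual scoring script and tokenisation may differ.

```python
# Sketch of corpus-level BLEU-4 scoring for predicted gloss sequences; the gloss
# tokens here are made-up examples, not data from the PHOENIX14T experiments.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

hypotheses = [["REGEN", "MORGEN", "NORD"], ["SONNE", "HEUTE"]]          # predicted glosses
references = [[["REGEN", "MORGEN", "NORDWEST"]], [["SONNE", "HEUTE"]]]  # one reference each

# Default weights (0.25, 0.25, 0.25, 0.25) give BLEU-4; smoothing avoids zero
# scores when short sequences have no 4-gram matches.
bleu4 = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.4f}")
```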
Conference presentation
Neural Sign Language Translation
Published 2018
Proceedings CVPR 2018, 7784 - 7793
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, 18/06/2018–22/06/2018, Salt Lake City, Utah, USA
Sign Language Recognition (SLR) has been an active research field for the last two decades. However, most research to date has considered SLR as a naive gesture recognition problem. SLR seeks to recognize a sequence of continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. In contrast, we introduce the Sign Language Translation (SLT) problem. Here, the objective is to generate spoken language translations from sign language videos, taking into account the different word orders and grammar. We formalize SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge). This allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language. To evaluate the performance of Neural SLT, we collected the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T. It provides spoken language translations and gloss level annotations for German Sign Language videos of weather broadcasts. Our dataset contains over 0.95M frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary of >2.8K. We report quantitative and qualitative results for various SLT setups to underpin future research in this newly established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13, respectively.
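In the frame-level setup, each video frame is first mapped to a feature vector (a spatial embedding) that takes the place of a word token in the NMT framework. The sketch below shows one plausible way to build such an embedder with an off-the-shelf CNN backbone; the choice of ResNet-18 and the projection size are assumptions for illustration, not the paper's architecture.

```python
# Sketch of a per-frame spatial embedding feeding a sequence-to-sequence model.
import torch
import torch.nn as nn
import torchvision

class FrameEmbedder(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()            # keep the 512-d pooled features
        self.backbone = backbone
        self.proj = nn.Linear(512, d_model)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) -> (B, T, d_model) sequence of frame embeddings
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * t, c, h, w))
        return self.proj(feats).reshape(b, t, -1)

# The resulting (B, T, d_model) tensor plays the role that embedded gloss tokens
# play in the gloss-level setup, feeding the same encoder-decoder translation network.
```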
Conference presentation
SMILE Swiss German Sign Language Dataset
Published 2018
Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC) 2018
11th edition of the Language Resources and Evaluation Conference, 07/05/2018–12/05/2018, Miyazaki, Japan
Sign language recognition (SLR) involves identifying the form and meaning of isolated signs or sequences of signs. To our knowledge, the combination of SLR and sign language assessment is novel. The goal of an ongoing three-year project in Switzerland is to pioneer an assessment system for lexical signs of Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) that relies on SLR. The assessment system aims to give adult L2 learners of DSGS feedback on the correctness of the manual parameters (handshape, hand position, location, and movement) of isolated signs they produce. In its initial version, the system will include automatic feedback for a subset of a DSGS vocabulary production test consisting of 100 lexical items. To provide the SLR component of the assessment system with sufficient training samples, a large-scale dataset containing videotaped repeated productions of the 100 items of the vocabulary test with associated transcriptions and annotations was created, consisting of data from 11 adult L1 signers and 19 adult L2 learners of DSGS. This paper introduces the dataset, which will be made available to the research community.
Conference presentation
Particle Filter based Probabilistic Forced Alignment for Continuous Gesture Recognition
Published 29/10/2017
IEEE International Conference on Computer Vision Workshops (ICCVW) 2017, 3079 - 3085
IEEE International Conference on Computer Vision Workshops (ICCVW) 2017, 22/10/2017–29/10/2017, Venice, Italy
In this paper, we propose a novel particle filter based probabilistic forced alignment approach for training spatiotemporal deep neural networks using weak border-level annotations. The proposed method jointly learns to localize and recognize isolated instances in continuous streams. This is done by drawing training volumes from a prior distribution of likely regions and training a discriminative 3D-CNN from this data. The classifier is then used to calculate the posterior distribution by scoring the training examples and using this as the prior for the next sampling stage. We apply the proposed approach to the challenging task of large-scale user-independent continuous gesture recognition. We evaluate the performance on the popular ChaLearn 2016 Continuous Gesture Recognition (ConGD) dataset. Our method surpasses state-of-the-art results by obtaining 0.3646 and 0.3744 Mean Jaccard Index scores on the validation and test sets of ConGD, respectively. Furthermore, we participated in the ChaLearn 2017 Continuous Gesture Recognition Challenge and were ranked 3rd. It should be noted that our method is learner independent; it can easily be combined with other approaches.
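The training loop can be read as a particle filter over candidate sign boundaries: windows are sampled from the current prior, a classifier trained on those windows scores them, and the scores are used to resample the windows that form the next prior. The skeleton below captures that sample-score-resample cycle; the window parameterisation and the dummy scorer stand in for the 3D-CNN and are purely illustrative.

```python
# Skeleton of the sample -> score -> resample loop: boundary candidates are the
# particles, and classifier scores act as (unnormalised) posterior weights.
import numpy as np

def resample(particles, weights, rng):
    # Multinomial resampling proportional to the normalised classifier scores.
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return [particles[i] for i in idx]

def forced_alignment(stream_len, n_particles, n_iters, train_and_score, rng=None):
    rng = rng or np.random.default_rng(0)
    # Initial prior: random (start, end) windows over the continuous stream.
    particles = [tuple(sorted(int(v) for v in rng.integers(0, stream_len, size=2)))
                 for _ in range(n_particles)]
    for _ in range(n_iters):
        # train_and_score is assumed to train the classifier on the current candidate
        # volumes and return one confidence score per particle.
        scores = train_and_score(particles)
        particles = resample(particles, scores, rng)
    return particles

# Example with a dummy scorer that prefers windows of roughly 30 frames:
dummy = lambda ps: [np.exp(-abs((e - s) - 30) / 10.0) + 1e-6 for s, e in ps]
print(forced_alignment(stream_len=200, n_particles=8, n_iters=3, train_and_score=dummy)[:3])
```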