Abstract
Sign languages have been studied by computer vision researchers for the last three
decades. One of the end goals of vision-based sign language research is to build systems
that can understand and translate sign languages to spoken/written languages or vice
versa, to create a more natural medium of communication between the hearing and the
Deaf. However, most research to date has mainly focused on isolated sign recognition
and spotting, neglecting the underlying rich grammatical and linguistic structures of sign
language that differ from spoken language. More recently, Continuous Sign Language
Recognition (CSLR) has become feasible with the availability of large benchmark
datasets, such as the RWTH-PHOENIX-Weather-2014 Dataset (PHOENIX14), and the
development of algorithms that can learn from weak annotations. Although CSLR
is able to recognize sign gloss sequences, further progress is required to produce
meaningful spoken/written language interpretations of continuous sign language videos.
In this thesis, we introduce the Sign Language Translation (SLT) problem and lay
groundwork for future research on this topic. The objective of SLT is to generate
spoken/written language translations from continuous sign language videos, taking into
account the different word orders and grammar. We evaluate our approaches on the
RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, the first and currently
only publicly available Continuous SLT dataset aimed at vision-based sign language
research. It provides spoken language translations and gloss level annotations for
German Sign Language videos of weather broadcasts. We lay down several evaluation
protocols to underpin future research in this newly established field.
In the first contribution chapter of this thesis, we formalize SLT in the framework
of Neural Machine Translation (NMT) and propose the first SLT approach, Neural
Sign Language Translation. We combine Convolutional Neural Networks (CNNs) and
attention-based encoder-decoder models, which allows us to jointly learn the spatial
representations, the underlying language model, and the mapping between sign and
spoken language. We investigate different configurations of the proposed network
with both end-to-end and pretrained settings (using expert gloss annotations). In
our experiments, recognizing glosses and then translating them to spoken languages
(Sign2Gloss2Text) drastically outperforms an end-to-end direct translation approach
(Sign2Text). Sign2Gloss2Text utilizes a state-of-the-art CSLR model to predict gloss
sequences from sign language videos and then solves SLT as a text-to-text translation
problem. This suggests that using gloss level intermediate representations, essentially
dividing the process into two stages, is necessary to train accurate SLT models.
Glosses are incomplete text-based representations of sign languages, which are
continuous multi-channel visual signals. Thus, the best performing two-step configuration of
Neural Sign Language Translation has an inherent information bottleneck limiting
translation. To address this issue, in the second contribution chapter of this thesis we
formulate SLT as a multi-task learning problem. We introduce a novel
transformer-based architecture, Sign Language Transformers, which jointly learn CSLR
and SLT while being trainable in an end-to-end manner. This is achieved by using a Connectionist
Temporal Classification (CTC) loss to bind the recognition and translation problems
into a single unified architecture. This joint approach does not require any ground-truth
timing information, simultaneously solves two co-dependent sequence-to-sequence
learning problems, and leads to significant performance gains. We report state-of-the-art
CSLR and SLT results achieved by our Sign Language Transformers. Our translation
networks outperform both sign video to spoken language and gloss to spoken language
translation models, in some cases more than doubling the performance of Neural Sign
Language Translation (Sign2Text configuration: 9.58 vs. 21.80 BLEU-4 score).
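To make the joint objective concrete, the following is a minimal sketch of how a CTC recognition loss over glosses and a cross-entropy translation loss over spoken language words could be combined into one training signal. It is an illustration under stated assumptions, not the implementation used in this thesis: the weighted-sum form, the weights `w_rec` and `w_trans`, and all function names are hypothetical, and the inputs are plain lists of per-step log-probabilities rather than network outputs.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the given values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood of `target` (gloss ids, no blanks).

    log_probs is a T x V list of per-frame log-probabilities. No frame-level
    alignment is needed: the forward pass sums over all valid alignments.
    """
    # Interleave blanks: [b, g1, b, g2, b, ...]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        prev, alpha = alpha, [NEG_INF] * size
        for s in range(size):
            terms = [prev[s]]
            if s >= 1:
                terms.append(prev[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(prev[s - 2])
            alpha[s] = logsumexp(*terms) + log_probs[t][ext[s]]

    total = logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total


def cross_entropy(log_probs, target):
    """Mean negative log-likelihood of the target words, one per decoder step."""
    return -sum(step[tok] for step, tok in zip(log_probs, target)) / len(target)


def joint_loss(gloss_log_probs, glosses, word_log_probs, words,
               w_rec=1.0, w_trans=1.0):
    """Hypothetical weighted sum of recognition and translation losses."""
    return (w_rec * ctc_nll(gloss_log_probs, glosses)
            + w_trans * cross_entropy(word_log_probs, words))


# Toy example: 2 frames, vocabulary {0: blank, 1: a gloss}, uniform predictions.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
recognition = ctc_nll(uniform, [1])
combined = joint_loss(uniform, [1], uniform, [0, 1])
```

In the toy example, three of the four possible two-frame alignments collapse to the single gloss, so the recognition term equals -log(0.75). This summation over alignments is what removes the need for ground-truth timing information.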
The models we introduce in both the first and second contribution chapters rely heavily on
gloss information, either in the form of direct supervision or for pretraining. To realize
large-scale sign language translation that is on par with its spoken/written language
counterparts, we require more parallel datasets. However, annotating sign glosses is a
laborious task and acquiring such annotations for large datasets is infeasible. To address
this issue, in our last contribution chapter we propose modelling SLT based on sign
articulators instead of glosses. Contrary to previous research, which mainly focused on
manual features, we incorporate both manual and non-manual features of the sign.
We utilize hand shape, mouthings and upper body pose representations to model sign in
a holistic manner.
We propose a novel transformer-based architecture, called Multi-Channel Transformers,
aimed at sequence-to-sequence learning problems where the source information is
embedded over several channels. This approach allows the networks to model both the
inter- and intra-channel relationships between asynchronous source channels. We also
introduce a channel anchoring loss to help our models preserve channel-specific information
while also regularizing training against overfitting.
We apply Multi-Channel Transformers to the task of SLT and realize the first
multi-articulatory translation approach. Our experiments on PHOENIX14T demonstrate that
our approach achieves translation performance on par with or better than several baselines,
overcoming the reliance on gloss information which underpins previous approaches.
Now that we have broken the dependency on gloss information, future work will
scale learning to larger datasets, such as broadcast footage, where gloss information is
not available.