Abstract
To be truly understandable and accepted by Deaf communities, an automatic
Sign Language Production (SLP) system must generate a photo-realistic signer.
Prior approaches based on graphical avatars have proven unpopular, whereas
recent neural SLP works that produce skeleton pose sequences have been shown
not to be understandable to Deaf viewers.
In this paper, we propose SignGAN, the first SLP model to produce
photo-realistic continuous sign language videos directly from spoken language.
We employ a transformer architecture with a Mixture Density Network (MDN)
formulation to handle the translation from spoken language to skeletal pose. A
pose-conditioned human synthesis model is then introduced to generate a
photo-realistic sign language video from the skeletal pose sequence. This
allows the photo-realistic production of sign videos directly translated from
written text.
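To make the text-to-pose stage concrete, the following is a minimal sketch (not the paper's released code) of an MDN output head placed on a transformer decoder state, predicting each skeletal pose frame as a mixture of diagonal Gaussians; the layer names, pose dimensionality and mixture count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNPoseHead(nn.Module):
    """Mixture Density Network head over a transformer decoder state."""
    def __init__(self, d_model=512, pose_dim=150, n_mixtures=5):
        super().__init__()
        self.pose_dim, self.n_mixtures = pose_dim, n_mixtures
        self.pi = nn.Linear(d_model, n_mixtures)                    # mixture weights
        self.mu = nn.Linear(d_model, n_mixtures * pose_dim)         # component means
        self.log_sigma = nn.Linear(d_model, n_mixtures * pose_dim)  # log std devs

    def forward(self, h):
        # h: (batch, d_model) decoder output for one time step.
        log_pi = F.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.n_mixtures, self.pose_dim)
        sigma = self.log_sigma(h).view(-1, self.n_mixtures, self.pose_dim).exp()
        return log_pi, mu, sigma

    def nll(self, h, target_pose):
        # Negative log-likelihood of the ground-truth pose under the mixture.
        log_pi, mu, sigma = self.forward(h)
        comp = torch.distributions.Normal(mu, sigma)
        log_prob = comp.log_prob(target_pose.unsqueeze(1)).sum(-1)  # (B, K)
        return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```

At inference, a pose frame can be drawn from this mixture (or taken as the most likely component's mean) and passed to the pose-conditioned synthesis model.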
We further propose a novel keypoint-based loss function, which significantly
improves the quality of synthesized hand images, operating in the keypoint
space to avoid issues caused by motion blur. In addition, we introduce a method
for controllable video generation, enabling training on large, diverse sign
language datasets and providing the ability to control the signer's appearance
at inference time.
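As a rough illustration of the keypoint-space idea (an assumption about the design, not the paper's implementation), the sketch below applies a frozen 2D hand keypoint regressor to generated and ground-truth hand crops and penalises the distance between the two keypoint sets, so motion-blurred real hands are compared by structure rather than by pixels; the 21-keypoint output shape and the L1 distance are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HandKeypointLoss(nn.Module):
    """Compare hands in keypoint space using a frozen keypoint regressor."""
    def __init__(self, keypoint_net: nn.Module):
        super().__init__()
        self.keypoint_net = keypoint_net          # pretrained hand keypoint model
        for p in self.keypoint_net.parameters():  # kept frozen during training
            p.requires_grad_(False)

    def forward(self, fake_hand, real_hand):
        # fake_hand, real_hand: (B, 3, H, W) crops around the signer's hands.
        kp_fake = self.keypoint_net(fake_hand)    # e.g. (B, 21, 2) 2D keypoints
        with torch.no_grad():
            kp_real = self.keypoint_net(real_hand)
        return F.l1_loss(kp_fake, kp_real)
```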
Using a dataset of eight different sign language interpreters extracted from
broadcast footage, we show that SignGAN significantly outperforms all baseline
methods in both quantitative metrics and human perceptual studies.