Abstract
Capturing and annotating sign language datasets is a time-consuming and
costly process. Current datasets are orders of magnitude too small to
successfully train unconstrained \acf{slt} models. As a result, research has
turned to TV broadcast content as a source of large-scale training data,
consisting of both the sign language interpreter and the associated audio
subtitle. However, the lack of sign language annotation limits the usability of
this data and has led to the development of automatic annotation techniques
such as sign spotting. These spottings are aligned to the video rather than the
subtitle, which often results in a misalignment between the subtitle and
the spotted signs. In this paper, we propose a method for aligning spottings with
their corresponding subtitles using large spoken language models. Since it
operates on a single modality, our method is computationally inexpensive and can
be used in conjunction with existing alignment techniques. We quantitatively
demonstrate the effectiveness of our method on the \acf{mdgs} and \acf{bobsl}
datasets, recovering a BLEU-1 score of up to 33.22 in word alignment.