Abstract
This work addresses continuous sign language
segmentation, a key task with significant implications for
sign language translation and data annotation. We propose
a transformer-based architecture that models the temporal
dynamics of signing and frames segmentation as a sequence
labeling problem using the Begin-In-Out (BIO) tagging scheme.
Our method uses HaMeR hand features, complemented
with 3D Angles. Extensive experiments show that
our model achieves state-of-the-art results on the DGS Corpus,
while our features surpass prior results on BSLCorpus.
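
To make the sequence-labeling framing concrete, the sketch below shows one way per-frame BIO tags could be derived from annotated sign segments and decoded back into segments. It is illustrative only and is not the paper's implementation: the function names `segments_to_bio` and `bio_to_segments`, the end-exclusive `(start, end)` frame convention, and the handling of an "I" tag with no preceding "B" are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # (start_frame, end_frame), end exclusive (assumed convention)


def segments_to_bio(num_frames: int, segments: List[Segment]) -> List[str]:
    """Turn annotated sign segments into one BIO tag per video frame."""
    tags = ["O"] * num_frames  # frames outside any sign
    for start, end in segments:
        if not 0 <= start < end <= num_frames:
            raise ValueError(f"invalid segment ({start}, {end})")
        tags[start] = "B"  # first frame of the sign
        for t in range(start + 1, end):
            tags[t] = "I"  # remaining frames of the sign
    return tags


def bio_to_segments(tags: List[str]) -> List[Segment]:
    """Decode per-frame BIO tags (e.g. model predictions) into segments."""
    segments: List[Segment] = []
    start: Optional[int] = None
    for t, tag in enumerate(tags):
        if tag == "B":
            if start is not None:  # a new sign begins right after the previous one
                segments.append((start, t))
            start = t
        elif tag == "I":
            if start is None:  # assumption: tolerate an "I" with no preceding "B"
                start = t
        else:  # "O" closes any open segment
            if start is not None:
                segments.append((start, t))
                start = None
    if start is not None:  # segment running to the last frame
        segments.append((start, len(tags)))
    return segments


tags = segments_to_bio(8, [(1, 4), (4, 6)])
print(tags)
print(bio_to_segments(tags))
```

The "B" tag is what lets two back-to-back signs be separated even when no "O" frame lies between them, which a plain inside/outside labeling could not express.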