Abstract
State-of-the-art natural language processing (NLP) models are trained on
massive corpora and report superlative performance on evaluation
datasets. This survey delves into an important attribute of these datasets: the
dialect of a language. Motivated by the performance degradation of NLP models
on dialectal datasets and its implications for the equity of language
technologies, we survey past research in NLP for dialects in terms of datasets
and approaches. We describe a wide range of NLP tasks in terms of two
categories: natural language understanding (NLU) (for tasks such as dialect
classification, sentiment analysis, parsing, and NLU benchmarks) and natural
language generation (NLG) (for summarisation, machine translation, and dialogue
systems). The survey is also broad in its coverage of languages, including
English, Arabic, and German, among others. We observe that past work in NLP
concerning dialects goes deeper than mere dialect classification. This ranges
from early approaches that used sentence transduction to recent approaches
that integrate hypernetworks into LoRA. We expect that this
survey will be useful to NLP researchers interested in building equitable
language technologies by rethinking LLM benchmarks and model architectures.