Abstract
Automated detection of semantically equivalent questions in longitudinal
social science surveys is crucial for long-term studies informing empirical
research in the social, economic, and health sciences. Retrieving equivalent
questions faces two challenges: theoretical constructs (i.e. concepts and
sub-concepts) are represented inconsistently both across studies and between
questions and their response options, and the vocabulary and structure of
longitudinal text evolve over time. To address these challenges, our multi-disciplinary
collaboration of computer scientists and survey specialists presents a new
information retrieval (IR) task of identifying concept (e.g. Housing, Job)
equivalence across question and response options to harmonise
longitudinal population studies. This paper investigates multiple unsupervised
approaches on a survey dataset spanning 1946-2020, including probabilistic
models, linear probing of language models, and pre-trained neural networks
specialised for IR. We show that IR-specialised neural models achieve the
highest overall performance, with the other approaches performing comparably.
Additionally, re-ranking the probabilistic model's results with neural
models yields only modest improvements of at most 0.07 in F1-score.
Qualitative post-hoc evaluation by survey specialists shows that the models
generally struggle to distinguish questions with high lexical overlap,
particularly when their sub-concepts are mismatched. Altogether, our
analysis serves to advance research on harmonising longitudinal studies in
social science.