Abstract
Longitudinal and comparative research relies heavily on repeated measures and the harmonisation of data. DDI-Lifecycle offers strong support for this through the variable cascade; however, scaling such activity has proven difficult in practice.
Social science (and other!) researchers approach the development of questions from a range of perspectives: even where the response options are (nearly) identical, the phrasing and orchestration of the questions can vary considerably. This limits the utility of standard text comparison techniques (e.g. TF-IDF, Bag-of-Words).
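A minimal sketch of the limitation, using hypothetical question wordings (the corpus and scores below are illustrative, not drawn from the project's data): TF-IDF cosine similarity rewards surface word overlap, so two questions measuring the same construct with different phrasing can score lower than two questions that share vocabulary but measure different constructs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "In general, how satisfied are you with your life these days?",     # Q1
    "Taking all things together, how happy would you say you are?",     # Q2: same construct, different wording
    "In general, how satisfied are you with your local bus services?",  # Q3: different construct, similar wording
]

tfidf = TfidfVectorizer().fit_transform(questions)
sims = cosine_similarity(tfidf)

# Surface-level similarity: Q1 vs Q3 (word overlap, different meaning)
# typically scores higher than Q1 vs Q2 (same meaning, little overlap).
print(f"Q1 vs Q2 (same construct):      {sims[0, 1]:.2f}")
print(f"Q1 vs Q3 (different construct): {sims[0, 2]:.2f}")
```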
The presentation will outline the strengths and weaknesses of the different approaches taken during the project to address this problem. These include problem decomposition, which breaks the task into sub-problems to mitigate the insensitivity of unsupervised methods to nuanced question relationships. Additionally, we will cover techniques for fine-tuning generative large language models for concept extraction and injecting the results into a subsequent retrieval model.
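One plausible form of the extract-then-inject step, sketched under stated assumptions rather than as the project's actual pipeline: concept labels produced by the fine-tuned generative model (stubbed here as `extract_concepts`) are appended to the question text before it is encoded by a retrieval model (here an off-the-shelf sentence-transformers model), so that questions sharing a concept move closer in the embedding space.

```python
from sentence_transformers import SentenceTransformer, util

def extract_concepts(question: str) -> list[str]:
    """Stub for the fine-tuned generative LLM; returns concept labels."""
    # In practice this would call the fine-tuned model; hard-coded output
    # is used here purely for illustration.
    return ["subjective wellbeing", "life satisfaction"]

retriever = SentenceTransformer("all-MiniLM-L6-v2")  # assumed retrieval model

def embed_with_concepts(question: str):
    # Inject extracted concepts into the text seen by the retrieval model.
    augmented = question + " [concepts: " + "; ".join(extract_concepts(question)) + "]"
    return retriever.encode(augmented, normalize_embeddings=True)

query = embed_with_concepts("How satisfied are you with your life these days?")
candidate = embed_with_concepts("Taking all things together, how happy are you?")
print(f"cosine similarity: {util.cos_sim(query, candidate).item():.2f}")
```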