Abstract
Cross-lingual alignment of nuanced sociological concepts can underpin both the comparison of cross-national studies conducted in different languages and the harmonisation of longitudinal studies, drawing on knowledge from social science taxonomies such as ELSST. Aligning sociological concepts is challenging because of cultural context-dependency, linguistic variation, and data scarcity. Traditional approaches to cross-lingual alignment require extensive parallel data across languages.
This presentation will outline a method for the multilingual alignment of sociological concepts. The approach posits that word embeddings (numerical vector representations of text) of domain-specific texts can be decomposed into two vectors: a domain knowledge vector, which is language-agnostic and should be the same across languages, and a language-specific feature vector, which can be learned. The domain knowledge vector is trained primarily on English data structured by the ELSST hierarchy and captures core sociological semantics. The method will be demonstrated on a cross-lingual sociological concept retrieval task across 10 languages.
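
The decomposition idea can be illustrated with a minimal sketch. It is not the presented method: it assumes, purely for illustration, that the decomposition is additive (embedding = domain vector + language vector) and it swaps in per-language mean-centering as a stand-in for the learned language-specific vector. All names, concept labels, and numbers below are hypothetical toy values, not ELSST data or real embeddings.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def strip_language_vector(embeddings):
    """Remove an estimated language vector from every embedding.

    ASSUMPTION: the language vector is approximated by the mean embedding
    of that language's texts (simple mean-centering). The abstract says
    this vector is learned; it does not say how.
    """
    language_vec = mean_vector(list(embeddings.values()))
    return {label: subtract(vec, language_vec) for label, vec in embeddings.items()}


def retrieve(query, candidates):
    """Return the candidate label whose vector is most similar to the query."""
    return max(candidates, key=lambda label: cosine(query, candidates[label]))


# Hypothetical toy data: three concepts with made-up domain vectors.
DOMAIN = {
    "FAMILY": [1.0, 0.0, 0.0],
    "EMPLOYMENT": [0.0, 1.0, 0.0],
    "MIGRATION": [0.0, 0.0, 1.0],
}
GERMAN_LABELS = {"FAMILY": "Familie", "EMPLOYMENT": "Arbeit", "MIGRATION": "Migration"}

# Made-up language vectors, added to simulate observed embeddings.
LANG_EN = [0.5, -0.2, 0.1]
LANG_DE = [3.0, 3.0, -2.0]

english = {concept: add(vec, LANG_EN) for concept, vec in DOMAIN.items()}
german = {GERMAN_LABELS[concept]: add(vec, LANG_DE) for concept, vec in DOMAIN.items()}

# Strip the estimated language component, then match German terms
# to English concepts in the shared (language-agnostic) space.
english_domain = strip_language_vector(english)
german_domain = strip_language_vector(german)

matches = {term: retrieve(vec, english_domain) for term, vec in german_domain.items()}
print(matches)
```

Mean-centering recovers the shared space here only because both toy languages cover exactly the same concept set, which is itself a form of parallel data. A learned language-specific vector, as the abstract describes, would take the place of `strip_language_vector` in this sketch.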