Abstract
Controlled vocabularies such as the European Language Social Science Thesaurus (ELSST) and the CLOSER ontology are the foundational taxonomies capturing the core concepts that underpin large-scale longitudinal social science surveys. However, standard text embeddings often fail to capture the complex hierarchical and relational structure of sociological concepts, relying instead on surface-level similarity. In this work, we propose a framework that models these nuances by adapting a large language model (LLM)-based text embedding model with a learnable diagonal Riemannian metric. This metric allows a flexible geometry in which individual dimensions can be scaled to reflect their semantic importance. Additionally, we introduce a Hierarchical Ranking Loss with dynamic margins as the sole training objective, enforcing ELSST's multi-level hierarchical constraints (distinguishing a concept from its narrower, broader, and related concepts, and all of these from unrelated ones) within the Riemannian space. For example, 'social stratification' should be embedded closer to 'social inequality' (its broader, related concept) and substantially further from an unrelated concept such as 'particle physics'. Finally, we show that our parameter-efficient approach significantly outperforms strong contrastive learning and hyperbolic embedding baselines on hierarchical concept retrieval and classification tasks using the ELSST and CLOSER datasets. Visualizations confirm that the learned embedding space exhibits a clear hierarchical structure. Our work offers a more accurate and geometrically informed method for representing complex sociological constructs.
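To make the two components named above concrete, the sketch below gives a minimal PyTorch rendering of a learnable diagonal metric and a margin-based hierarchical ranking term. It is an illustrative assumption, not the paper's implementation: the class and function names (`DiagonalRiemannianMetric`, `hierarchical_ranking_loss`), the log-parameterisation of the weights, and the margin value are all hypothetical.

```python
# Illustrative sketch only: a learnable diagonal metric plus a hinge-style
# ranking loss. Names, parameterisation, and margin values are hypothetical,
# not the authors' actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiagonalRiemannianMetric(nn.Module):
    """Distance d(x, y) = sqrt(sum_i w_i (x_i - y_i)^2) with learnable w_i > 0."""

    def __init__(self, dim: int):
        super().__init__()
        # Log-parameterisation keeps the per-dimension metric weights positive.
        self.log_weights = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        w = torch.exp(self.log_weights)  # per-dimension semantic scaling
        return torch.sqrt(((x - y) ** 2 * w).sum(dim=-1) + 1e-12)


def hierarchical_ranking_loss(d_pos: torch.Tensor,
                              d_neg: torch.Tensor,
                              margin: torch.Tensor) -> torch.Tensor:
    """Require the closer relation to beat the farther one by at least `margin`;
    the margin can vary with the relation level (narrower / broader / related
    vs. unrelated), which is how a dynamic margin would enter."""
    return F.relu(d_pos - d_neg + margin).mean()


# Toy usage: an anchor concept compared against a broader concept and an
# unrelated concept, with a single (hypothetical) margin of 0.5.
metric = DiagonalRiemannianMetric(dim=768)
anchor, broader, unrelated = (torch.randn(4, 768) for _ in range(3))
loss = hierarchical_ranking_loss(
    d_pos=metric(anchor, broader),
    d_neg=metric(anchor, unrelated),
    margin=torch.tensor(0.5),
)
loss.backward()
```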