Abstract
Current techniques for measuring social bias in Large Language Models (LLMs) rely on handcrafted probes, creating uneven rulers that lack statistical reliability and hinder scientific progress. To elevate bias measurement from a craft to a science, we introduce Psychometric-driven Probe Optimization (PMPO), a framework that treats a probe set as an optimizable instrument. PMPO uniquely employs a powerful LLM as a neural genetic operator to automatically evolve a probe set toward superior psychometric properties. We first establish the method's external validity, showing that its gender bias measurements correlate strongly with U.S. labor statistics (Pearson's r = 0.83, p < .001). To assess the qualitative strength of PMPO-generated probes, we conducted a double-blind evaluation with experts in sociology. The results show that PMPO-generated probes, evolved from non-expert templates, are rated as comparable to those crafted by trained human experts across four criteria: clarity, relevance, naturalness, and subtlety. Furthermore, PMPO-evolved probe sets demonstrate strong internal consistency and semantic diversity, indicating their robustness as measurement tools. This work presents a systematic pathway for transforming LLM probes from artisanal artifacts into reliable scientific instruments, enabling more rigorous and trustworthy measurement of social bias in language models and supporting responsible AI development.