Abstract
In the digital era, people often express their emotions and opinions on social media platforms, generating an enormous volume of emotion-loaded user-generated content (UGC). Given how rapidly and in what quantity UGC is produced, relying solely on human translators is infeasible, and machine translation (MT) is the only practical way to translate this content at scale. However, the quality of MT output for these texts remains understudied in the MT community, particularly for emotion-loaded Chinese UGC.
This research explores the evaluation of machine translation for emotion-loaded Chinese UGC. Using an open-source dataset of microblog posts, the study begins by investigating the challenges faced by professional translators and identifying key issues MT systems encounter, especially with emotion-carrying slang created through homophone substitution. The thesis then introduces a human evaluation framework, adapted from the well-established error-based Multi-dimensional Quality Metrics (MQM), to manually assess how well MT outputs preserve emotion. The evaluation reveals that approximately 50% of the English translations contain errors related to emotion preservation, underscoring a gap in current MT evaluation. This evaluation process also yields a dataset suitable for training reference-free automatic quality estimation (QE) systems. Subsequently, the study employs various machine learning techniques, including multi-task learning, in-context learning, and parameter-efficient fine-tuning of large language models (LLMs), to achieve accurate and interpretable evaluations. These QE systems outperform existing frameworks on emotion-loaded texts. Furthermore, probing these QE systems with perturbed data, created by replacing emotion words with homophones, demonstrates that fine-tuned LLMs offer more robust and precise evaluations.
By uncovering the challenges of translating UGC, proposing an MQM-based evaluation framework, creating a human-annotated QE dataset, and applying machine learning techniques, this research makes significant contributions to the MT evaluation of emotion-loaded Chinese UGC. The study also releases the relevant resources to the open-source community to support further advances in the field.