Isolating LLM Lexical Bias
A Curation-Free Triangulated Metric for Preference-Stage Learning
DOI:
https://doi.org/10.32473/flairs.39.1.141843
Keywords:
Large Language Models, Evaluation Metrics, Model Alignment, Reinforcement Learning from Human Feedback, AI-Associated Language
Abstract
Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to originate partly in the preference-learning stage, e.g., Reinforcement Learning from Human Feedback, which generally makes models more useful but may simultaneously introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model’s preference for certain formats or its overuse of certain words (e.g., delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by its reliance on manual curation. We address this by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the general approach’s utility by analyzing whether preference learning shifts models toward what could be interpreted as a “language of prestige”. The metric provides an automated method to quantify behavioral shifts attributable to preference tuning and thus supports model alignment and the development of trustworthy AI.
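The abstract does not state the formula behind the Triangulated Preference Shift score, so the following is only a rough sketch of the general triangulation idea and not the authors' definition. It assumes three tokenized corpora (human-written text, base model output, instruct model output) and flags a word only when the instruct model deviates in the same direction from both the base model and the human reference. The function names, the add-one smoothing floor, and the min/max combination rule are all hypothetical choices made for illustration.

```python
from collections import Counter
import math


def per_million(tokens):
    """Relative frequency of each word, in occurrences per million tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def log_ratio(freq_a, freq_b, word, floor=1.0):
    """log2 frequency ratio of `word` in A versus B, smoothed by `floor` (assumed value)."""
    return math.log2((freq_a.get(word, 0.0) + floor) / (freq_b.get(word, 0.0) + floor))


def illustrative_shift(human_tokens, base_tokens, instruct_tokens):
    """Hypothetical triangulated shift per word (NOT the published metric).

    A word gets a non-zero score only if the instruct model over- or
    under-uses it relative to BOTH the base model and the human reference.
    The score is the more conservative of the two log ratios.
    """
    human = per_million(human_tokens)
    base = per_million(base_tokens)
    instruct = per_million(instruct_tokens)
    scores = {}
    for word in set(human) | set(base) | set(instruct):
        vs_base = log_ratio(instruct, base, word)
        vs_human = log_ratio(instruct, human, word)
        if vs_base > 0 and vs_human > 0:
            scores[word] = min(vs_base, vs_human)   # overused after tuning
        elif vs_base < 0 and vs_human < 0:
            scores[word] = max(vs_base, vs_human)   # underused after tuning
        else:
            scores[word] = 0.0                      # no consistent shift
    return scores
```

Under these assumptions, a word that the base model already overuses relative to human text would score zero unless the instruct variant amplifies it further, which is one way to read "shifts induced specifically by preference learning".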
License
Copyright (c) 2026 Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.