Isolating LLM Lexical Bias

A Curation-Free Triangulated Metric for Preference-Stage Learning

DOI:

https://doi.org/10.32473/flairs.39.1.141843

Keywords:

Large Language Models, Evaluation Metrics, Model Alignment, Reinforcement Learning from Human Feedback, AI-Associated Language

Abstract

Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to originate partly in the preference-learning stage, e.g., Reinforcement Learning from Human Feedback, which generally makes models more useful but may simultaneously introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model’s preference for certain formats or its overuse of certain words (e.g., delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by its reliance on manual curation. We address this by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the general approach’s utility by analyzing whether preference learning shifts models toward what could be interpreted as a “language of prestige”. The metric provides an automated method to quantify behavioral shifts attributable to preference tuning and thus supports model alignment and the development of trustworthy AI.

Published

06-05-2026

How to Cite

Ming, X., Hernandez, J., & Juzek, T. S. (2026). Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141843

Section

Special Track: Explainable, Fair, and Trustworthy AI