Training Ethical Language Models via Reinforcement Learning from AI Feedback

Authors

  • Alden Duarte-Vasquez, California State University
  • Bishal Thapa, Texas State University
  • Sahar Hooshmand, California State University
  • Heena Rathore, Texas State University

DOI:

https://doi.org/10.32473/flairs.39.1.141779

Abstract

Large Language Models (LLMs) continue to exhibit limited reliability when reasoning over moral scenarios, particularly across diverse ethical frameworks. Prior work has shown that Reinforcement Learning from Human Feedback (RLHF) can improve alignment, but it relies on costly and hard-to-scale human annotation. In this work, we investigate the effectiveness of Reinforcement Learning from AI Feedback (RLAIF) for ethical reasoning by distilling theory-specific moral preferences from large language models. We propose an RLAIF framework that integrates supervised fine-tuning, preference-based reward modeling, and Proximal Policy Optimization (PPO) to train theory-specialized ethical models. Using the ETHICS benchmark, which spans five ethical frameworks (Commonsense Morality, Deontology, Justice, Utilitarianism, and Virtue Ethics), we evaluate two approaches: a Distilled reward model approach, which trains a compact Pythia-410M reward model on AI-generated preferences, and a Direct RLAIF approach, which bypasses reward model training entirely by using an LLM directly for reward signals. Our results show that supervised fine-tuning significantly improves baseline ethical reasoning and label alignment, while distilled reward models demonstrate consistency and preference discrimination across ethical frameworks.
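
As a rough illustration of what preference-based reward modeling typically involves, the sketch below computes a Bradley-Terry style pairwise loss over scalar rewards. The abstract does not state which loss the authors use, so this objective is an assumption, and the function names `pairwise_preference_loss` and `mean_loss` are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Assumption: this is a common reward-modeling objective, not necessarily
    the one used in the paper.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_loss(pairs):
    """Average the pairwise loss over (reward_chosen, reward_rejected) tuples."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and grows when the ordering is reversed. In the Distilled setting described above, the preferred and rejected labels would come from AI-generated preferences rather than human annotators.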

Published

06-05-2026

How to Cite

Duarte-Vasquez, A., Thapa, B., Hooshmand, S., & Rathore, H. (2026). Training Ethical Language Models via Reinforcement Learning from AI Feedback. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141779