Training Ethical Language Models via Reinforcement Learning from AI Feedback
DOI:
https://doi.org/10.32473/flairs.39.1.141779
Abstract
Large Language Models (LLMs) continue to exhibit limited reliability when reasoning over moral scenarios, particularly across diverse ethical frameworks. Prior work has shown that Reinforcement Learning from Human Feedback (RLHF) can improve alignment, but it relies on human annotation that is costly and hard to scale. In this work, we investigate the effectiveness of Reinforcement Learning from AI Feedback (RLAIF) for ethical reasoning by distilling theory-specific moral preferences from large language models. We propose an RLAIF framework that integrates supervised fine-tuning, preference-based reward modeling, and Proximal Policy Optimization (PPO) to train theory-specialized ethical models. Using the ETHICS benchmark, which spans five ethical frameworks (Commonsense Morality, Deontology, Justice, Utilitarianism, and Virtue Ethics), we evaluate two approaches: a Distilled reward model approach, which trains a compact Pythia-410M reward model on AI-generated preferences, and a Direct RLAIF approach, which bypasses reward model training entirely by using an LLM directly for reward signals. Our results show that supervised fine-tuning significantly improves baseline ethical reasoning and label alignment, while distilled reward models demonstrate consistency and preference discrimination across ethical frameworks.
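To make the preference-based reward modeling step concrete, the following is a minimal sketch of a pairwise (Bradley-Terry) preference loss, a common choice for training reward models on chosen/rejected pairs. The abstract does not state which loss the authors use, so this formulation is an assumption, and the function names and scalar rewards are hypothetical illustrations rather than the paper's implementation.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected)).

    Assumed loss for illustration; the abstract does not specify the objective.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def preference_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) reward pairs where the chosen response scores higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward scores for AI-labeled (chosen, rejected) response pairs.
    scored_pairs = [(1.2, -0.3), (0.1, 0.4), (2.0, 0.5)]
    mean_loss = sum(pairwise_preference_loss(c, r) for c, r in scored_pairs) / len(scored_pairs)
    print(f"mean loss: {mean_loss:.4f}")
    print(f"preference accuracy: {preference_accuracy(scored_pairs):.2f}")
```

Under this assumed objective, the loss shrinks as the reward model scores the preferred response further above the rejected one, and preference accuracy gives one simple way to quantify how well a reward model discriminates between preferences.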
License
Copyright (c) 2026 Alden Duarte-Vasquez, Bishal Thapa, Sahar Hooshmand, Heena Rathore

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.