Biasing Exploration towards Positive Error for Efficient Reinforcement Learning
DOI:
https://doi.org/10.32473/flairs.38.1.138835
Keywords:
Reinforcement Learning, Deep Reinforcement Learning, Bandits
Abstract
Efficient exploration remains a critical challenge in Reinforcement Learning (RL), significantly affecting sample efficiency. This paper demonstrates that biasing exploration towards state-action pairs with positive temporal difference error speeds up convergence and, in some challenging environments, can yield an improved policy. We show that this Positive Error Bias (PEB) method achieves statistically significant performance improvements across a range of tasks and estimators. Empirical results demonstrate PEB’s effectiveness in bandits, grid worlds, and classic control tasks with both exact and approximate estimators. PEB is particularly effective when unbiased exploration struggles to discover a good policy.
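The sketch below illustrates one plausible reading of the idea described in the abstract: a tabular Q-learning agent whose exploratory action choices are weighted toward state-action pairs whose last observed TD error was positive. All names (PEBQLearner, bias_strength) and the softmax weighting scheme are illustrative assumptions, not the authors' published algorithm.

```python
import numpy as np

class PEBQLearner:
    """Tabular Q-learning with exploration biased toward positive TD error.
    Illustrative sketch only; the paper's exact PEB mechanism may differ."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99,
                 epsilon=0.1, bias_strength=2.0):
        self.Q = np.zeros((n_states, n_actions))
        self.td_err = np.zeros((n_states, n_actions))  # last signed TD error per pair
        self.alpha, self.gamma = alpha, gamma
        self.epsilon = epsilon
        self.bias_strength = bias_strength  # assumed knob controlling the bias
        self.n_actions = n_actions

    def act(self, s, rng):
        if rng.random() < self.epsilon:
            # Exploration step: softmax over stored TD errors, so actions with
            # positive error are revisited more often than a uniform choice.
            logits = self.bias_strength * self.td_err[s]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            return int(rng.choice(self.n_actions, p=probs))
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next, done):
        target = r + (0.0 if done else self.gamma * self.Q[s_next].max())
        delta = target - self.Q[s, a]
        self.td_err[s, a] = delta          # remember signed error for biasing
        self.Q[s, a] += self.alpha * delta
```

Used in a standard interaction loop (e.g. `a = agent.act(s, np.random.default_rng())` followed by `agent.update(s, a, r, s_next, done)`), the only change from vanilla epsilon-greedy is the weighted exploratory draw.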
License
Copyright (c) 2025 Adam Parker, John W. Sheppard

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.