RLHF: Reinforcement Learning with Once-per-Episode Feedback

Abstract

Despite reinforcement learning’s remarkable success in several application and simulation domains, research in the field has barely ventured beyond the typical modeling assumptions underlying the MDP formalism. In this work we aim to reimagine the way in which rewards are produced by moving away from the typical setting of per-step Markovian rewards to a model that instead produces a single binary score for the entire trajectory. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier for a human labeler to judge whether a learner’s complete trajectory was good or bad than to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model and provide a statistically and computationally efficient algorithm that achieves sublinear regret. We will also comment on how to extend these results to the dueling setting, where a human labeler decides which of two trajectories is better.
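As a concrete illustration (not part of the abstract itself), one natural instance of such a parametric trajectory-label model is a logistic model over trajectory features; the sketch below also includes a dueling-style variant. The feature map, parameter vector, and episode shape are illustrative assumptions, not details taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def trajectory_features(trajectory):
    """Illustrative feature map phi(tau): the sum of per-step
    state-action feature vectors over the episode (an assumption,
    not prescribed by the abstract)."""
    return np.sum([np.concatenate([s, a]) for s, a in trajectory], axis=0)

def binary_label(trajectory, theta_star):
    """Once-per-episode feedback: a single Bernoulli label for the whole
    trajectory, with success probability sigmoid(<theta*, phi(tau)>)."""
    logit = trajectory_features(trajectory) @ theta_star
    p_good = 1.0 / (1.0 + np.exp(-logit))
    return rng.binomial(1, p_good)  # 1 = "good" trajectory, 0 = "bad"

def preference_label(traj_a, traj_b, theta_star):
    """Dueling-style feedback (also just a sketch): the labeler prefers
    trajectory A with probability sigmoid(<theta*, phi(A) - phi(B)>)."""
    diff = trajectory_features(traj_a) - trajectory_features(traj_b)
    p_a = 1.0 / (1.0 + np.exp(-(diff @ theta_star)))
    return rng.binomial(1, p_a)  # 1 = "A preferred", 0 = "B preferred"

# Hypothetical example: 3-step episodes with 2-dim states and 1-dim actions.
theta_star = rng.normal(size=3)  # unknown to the learner
episode = [(rng.normal(size=2), rng.normal(size=1)) for _ in range(3)]
print(binary_label(episode, theta_star))
```

In this sketch the learner never sees per-step rewards, only the single label returned after the episode ends, which is exactly the feedback regime the abstract describes.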

Date
Feb 16, 2023 12:00 AM
Event
Modern Adaptive Experimental Design and Active Learning in the Real World Reading Group
Location
Virtual
Aldo Pacchiano
Eric and Wendy Schmidt Center Fellow / Faculty

My research interests include online learning, reinforcement learning, deep RL, and fairness.