The Theoretical Foundations of Reward Learning

Feb 28, 2025, 1:19 PM

In this sequence I provide an overview of the theoretical reward learning research agenda, including its motivating assumptions, several core results, and some starting points for how to contribute to it further.

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse · Feb 28, 2025, 7:20 PM

26 points

4 comments · 14 min read

Partial Identifiability in Reward Learning

Joar Skalse · Feb 28, 2025, 7:23 PM

16 points

0 comments · 12 min read

Misspecification in Inverse Reinforcement Learning

Joar Skalse · Feb 28, 2025, 7:24 PM

19 points

0 comments · 11 min read

STARC: A General Framework For Quantifying Differences Between Reward Functions

Joar Skalse · Feb 28, 2025, 7:24 PM

11 points

0 comments · 8 min read

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse · Feb 28, 2025, 7:24 PM

9 points

0 comments · 7 min read

Defining and Characterising Reward Hacking

Joar Skalse · Feb 28, 2025, 7:25 PM

15 points

0 comments · 4 min read

Other Papers About the Theory of Reward Learning

Joar Skalse · Feb 28, 2025, 7:26 PM

16 points

0 comments · 5 min read

How to Contribute to Theoretical Reward Learning Research

Joar Skalse · Feb 28, 2025, 7:27 PM

16 points

0 comments · 21 min read