Offline RL can work well even with wrong reward labels. I think alignment discourse over-focuses on “reward specification”: it is important, but far from the full story.
To this end, a new paper, Survival Instinct in Offline Reinforcement Learning, supports “Reward is not the optimization target” and the associated points that reward is a chisel which shapes circuits inside the network, and that one should consider the full range of sources of parameter updates, not just those provided by a reward signal.
Some relevant quotes from the paper:
In offline reinforcement learning (RL), an agent optimizes its performance given an offline dataset. Despite being its main objective, we find that return maximization is not sufficient for explaining some of its empirical behaviors. In particular, in many existing benchmark datasets, we observe that offline RL can produce surprisingly good policies even when trained on utterly wrong reward labels.
...
We trained ATAC agents on the original datasets and on three modified versions of each dataset, with “wrong” rewards: 1) zero: assigning a zero reward to all transitions, 2) random: labeling each transition with a reward sampled uniformly from [0,1], and 3) negative: using the negation of the true reward. Although these wrong rewards contain no information about the underlying task or are even misleading, the policies learned from them often perform significantly better than the behavior (data collection) policy and the behavior cloning (BC) policy. They even outperform policies trained with the true reward in some cases.
...
Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is “nudged” to learn a desirable behavior with imperfect reward but purposely biased data coverage.
...
While a large data coverage improves the best policy that can be learned by offline RL with the true reward, it can also make offline RL more sensitive to imperfect rewards. In other words, collecting a large set of diverse data might not be necessary or helpful. This goes against the common wisdom in the RL community that data should be as exploratory as possible.
...
We believe that our findings shed new light on RL applicability and research. To practitioners, we demonstrate that offline RL does not always require the correct reward to succeed.
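To make the “wrong reward” setup in the quotes concrete, here is a minimal sketch of the three relabelings the paper describes (zero, random uniform on [0, 1], and negated), assuming a D4RL-style dataset dict with a “rewards” array. The function name relabel_rewards and the commented-out training call are illustrative placeholders, not the paper's code.

```python
import numpy as np

def relabel_rewards(dataset, scheme, seed=0):
    """Return a copy of an offline RL dataset with "wrong" reward labels.

    Assumes a D4RL-style dict with a "rewards" array. `scheme` is one of
    "zero", "random", or "negative", matching the paper's three variants.
    """
    rng = np.random.default_rng(seed)
    rewards = np.asarray(dataset["rewards"], dtype=np.float64)

    if scheme == "zero":
        new_rewards = np.zeros_like(rewards)                      # zero reward on every transition
    elif scheme == "random":
        new_rewards = rng.uniform(0.0, 1.0, size=rewards.shape)   # i.i.d. Uniform[0, 1]
    elif scheme == "negative":
        new_rewards = -rewards                                    # negation of the true reward
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")

    relabeled = dict(dataset)          # shallow copy; only the rewards are replaced
    relabeled["rewards"] = new_rewards
    return relabeled

# Hypothetical usage: train an offline RL agent (e.g., ATAC) on each variant and
# compare against behavior cloning. `train_offline_agent` is a placeholder, not a real API.
# for scheme in ("zero", "random", "negative"):
#     agent = train_offline_agent(relabel_rewards(d4rl_dataset, scheme))
```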