The simplest possible acceptable value learning benchmark would look something like this:
Data is recorded of people playing a video game. They are told to maximize their reward (which can be computed exactly), have no previous experience with the game, are genuinely trying to win, and are clearly suboptimal (pure imitation learning would give very bad results).
The bot is first given all of their inputs and outputs, but not their rewards.
Then it can play the game in place of the humans, but again isn’t given the rewards. Preferably, the score isn’t shown on screen either.
The goal is to maximize the true reward function.
These rules are precisely described and known to anyone who wants to test their algorithm.
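To make the spec concrete, here is a minimal sketch of what such an interface could look like, assuming a Python, Gym-style API; every name here (HiddenRewardEnv, load_demonstrations, evaluate) is invented for illustration and doesn’t refer to any existing benchmark:

```python
from typing import Any, List, Tuple

import gymnasium as gym
import numpy as np


class HiddenRewardEnv(gym.Wrapper):
    """Hides the reward from the agent; masking any on-screen score would
    additionally require frame preprocessing, which is omitted here."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # The true reward is kept only for the benchmark's own scoring;
        # the agent always receives 0.0.
        info["_hidden_true_reward"] = reward
        return obs, 0.0, terminated, truncated, info


def load_demonstrations(path: str) -> List[Tuple[np.ndarray, Any]]:
    """Human play data: (observation, action) pairs only -- no rewards."""
    data = np.load(path, allow_pickle=True)
    return list(zip(data["observations"], data["actions"]))


def evaluate(agent, env: HiddenRewardEnv, episodes: int = 100) -> float:
    """Final score: mean true (hidden) return over evaluation episodes."""
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, _, terminated, truncated, info = env.step(agent.act(obs))
            total += info["_hidden_true_reward"]
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))
```

The key property is that the true reward exists and is computable for scoring, but never reaches the bot through any channel.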
None of the environments or datasets you mention are actually like this. Some people do test their IRL algorithms in a similar way (the difference being that they learn from another bot rather than from humans), but the details aren’t standardized.
A harder and more realistic version that I have yet to see in any paper would look something like this:
Data is recorded of people playing a game with a second player. The second player can be a human or a bot, and can be friendly, neutral, or adversarial.
The two players have different IO (inputs and outputs), just as different people have different perspectives in real life.
A very good imitation learner is trained to predict the first player’s output given their input. It comes with the benchmark.
The bot to be tested (which is distinct from the players and the imitation learner above) has the same IO channels as the second player, but doesn’t see the rewards. It also isn’t given any of the recordings.
Optionally, it also receives the output of a bad visual object detector meant to detect the part of the environment directly controlled by the human/imitator.
It plays the game with the human imitator.
The goal is to maximize the human’s reward function.
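As a rough illustration of this two-player setup (same caveats as the earlier sketch: Python, Gym-style conventions, and every class and function name is hypothetical rather than taken from a real benchmark):

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class PlayerView:
    """Everything the tested bot sees each step: its own (player-two)
    observation, plus optionally a weak detector's guess at where the
    human/imitator's avatar is."""
    observation: np.ndarray
    detector_boxes: Optional[np.ndarray] = None  # None if the detector is disabled


class TwoPlayerBenchmark:
    def __init__(self, env, imitator_policy, detector=None):
        self.env = env                   # two-player game with asymmetric IO
        self.imitator = imitator_policy  # pretrained imitator of the human, shipped with the benchmark
        self.detector = detector         # optional, deliberately unreliable object detector

    def run_episode(self, bot) -> float:
        """Returns the human's true return; the bot never observes it."""
        obs1, obs2 = self.env.reset()    # each player gets its own observation
        human_return, done = 0.0, False
        while not done:
            a1 = self.imitator.act(obs1)  # imitator stands in for the human
            boxes = self.detector(obs2) if self.detector else None
            a2 = bot.act(PlayerView(obs2, boxes))
            (obs1, obs2), (r1, _r2), done = self.env.step(a1, a2)
            human_return += r1           # used only for scoring, never shown to the bot
        return human_return
```

The point of the setup is that the bot’s only information about the human’s values comes from interacting with the imitator: it gets no rewards, no recordings, and a different perspective on the game.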
It’s far from perfect, but if someone could obtain good scores on it, that would probably make me much more optimistic about the probability of solving alignment.
Every single algorithmic IRL paper on video games does this, at least with Deep RL demonstrators. (Here are four examples: https://arxiv.org/abs/1810.10593, https://proceedings.mlr.press/v97/brown19a.html, https://arxiv.org/abs/1902.07742, and https://arxiv.org/abs/2002.09089.)
If you care about human demonstrations, it seems like Atari-HEAD and the CrowdPlay Atari dataset both do exactly this? And while there hasn’t been much work in this area, a quick Google search turned up two papers that do analyze IRL variants on Atari-HEAD: https://arxiv.org/abs/1908.02511v2 and https://arxiv.org/abs/2004.00981v2.
My guess is that the reason there hasn’t been much recent work in this area is that there just aren’t many people who think value learning from demonstrations is interesting (instead, people have moved on to pairwise comparisons of trajectories or to language feedback). In addition, as LMs have become more capable, most of the existing value learning researchers have also moved on from video games to working with LMs.