Caution when interpreting DeepMind’s In-context RL paper

Lots of people I know have had pretty strong reactions to the recent DeepMind paper, which claims to have gotten a transformer to learn an RL algorithm by training it on an RL agent’s training trajectories. At first, I too was pretty shocked: this paper seemed to provide strong evidence of a mesa-optimizer in the wild. But after digging into the paper a bit more, I’m quite unimpressed and don’t think that in-context RL is the correct way to interpret the experiments that the authors actually did. This post is a quick, low-effort attempt to write out my thoughts on this.

Recall that in this paper, the authors pick some RL algorithm, use it to train RL agents on some tasks, and save the trajectories generated during training; then they train a transformer to autoregressively model said trajectories, and deploy the transformer on some novel tasks. So for concreteness, during training the transformer sees inputs consisting of long sequences of states, actions, and rewards, which were excerpted from an RL agent’s training on some task (out of a set of training tasks) and which span multiple episodes (i.e. at some point in this input trajectory, one episode ended and the next episode began). The transformer is trained to guess the action that comes next. In deployment, the inputs are determined by the transformer’s own selections, with the environment providing the states and rewards. The authors call this algorithm distillation (AD).
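To make the training setup concrete, here is a minimal sketch of how next-action prediction examples could be cut out of a single agent’s learning history. The function name `make_ad_examples`, the `(state, action, reward)` tuple layout, and the `context_len` parameter are my own illustrative choices, not the paper’s; the paper’s actual tokenization and context handling are not reproduced here.

```python
def make_ad_examples(learning_history, context_len):
    """Cut a learning history into (context, current state, target action) examples.

    learning_history: list of (state, action, reward) tuples in the order the
    source RL agent produced them. Episode boundaries are deliberately NOT
    treated specially, so a context can span several episodes.
    """
    examples = []
    for t in range(len(learning_history)):
        start = max(0, t - context_len)
        context = learning_history[start:t]      # earlier steps, possibly cross-episode
        state, action, _reward = learning_history[t]
        examples.append((context, state, action))  # model sees context + state, predicts action
    return examples
```

The point of the sketch is only that the prediction target is the source agent’s next action, and that the context window is allowed to reach back into earlier episodes of the same task.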

Many people I know have skimmed the paper and come away with an understanding something like:

In this paper, RL agents are trained on diverse tasks, e.g. playing many different Atari games, and the resulting transcripts are used as training data for AD. Then the AD agent is deployed on a new task, e.g. playing a held-out Atari game. The AD agent is able to learn to play this novel game, which can only be explained by the model implementing a reasonably general RL algorithm. This sounds a whole lot like a mesa-optimizer.

This understanding is incorrect, with two key issues. First, the training tasks used in this paper are all extremely similar to each other and to the deployment task; in fact, I think they only ought to count as different under a pathologically narrow notion of “task.” Second, the tasks involved are extremely simple. Taken together, these complaints challenge the conclusion that the only way for the AD agent to do well on its deployment task is by implementing a general-purpose RL algorithm. In fact, as I’ll explain in more detail below, I’d be quite surprised if it were.

For concreteness, I’ll focus here on one family of experiments, Dark Room, that appeared in the paper, but my complaint applies just as well to the other experiments in the paper. The paper describes the Dark Room environment as:

a 2D discrete POMDP where an agent spawns in a room and must find a goal location. The agent only knows its own (x, y) coordinates but does not know the goal location and must infer it from the reward. The room size is 9 × 9, the possible actions are one step left, right, up, down, and no-op, the episode length is 20, and the agent resets at the center of the map. … [T]he agent receives r = 1 every time the goal is reached. … When not r = 1, then r = 0.

To be clear, Dark Room is not a single task, but an environment supporting a family of tasks, where each task corresponds to a particular choice of goal location (so there are 81 possible tasks in this environment, one for each location in the 9 × 9 room; note that this is an unusually narrow notion of which tasks count as different). The data on which the AD agent is trained look like: {many episodes of an agent learning to move towards goal position 1}, {many episodes of an agent learning to move towards goal position 2}, and so on. In deployment, a new goal position is chosen, and the agent plays many episodes in which reward is given for reaching this new goal position.
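For readers who want something to poke at, here is a minimal sketch of the Dark Room environment as I read the description quoted above. The class name `DarkRoom`, the integer action encoding, and the choice to clip moves at the walls are all my assumptions; the quoted description doesn’t specify them.

```python
# Action encoding and wall handling are assumptions; the paper's description
# only lists the five moves.
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1), (0, 0)]  # left, right, up, down, no-op


class DarkRoom:
    def __init__(self, goal, size=9, episode_length=20):
        self.goal = goal                    # hidden from the agent; this choice defines the "task"
        self.size = size
        self.episode_length = episode_length
        self.reset()

    def reset(self):
        # The agent always restarts at the center of the room.
        self.pos = (self.size // 2, self.size // 2)
        self.steps = 0
        return self.pos

    def step(self, action):
        dx, dy = MOVES[action]
        # Assumption: moving into a wall leaves that coordinate unchanged.
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        self.steps += 1
        reward = 1 if self.pos == self.goal else 0   # r = 1 whenever the agent is on the goal
        done = self.steps >= self.episode_length
        return self.pos, reward, done


# One task per cell of the 9 x 9 room.
ALL_GOALS = [(x, y) for x in range(9) for y in range(9)]
```

Under this reading, the only thing that differs between two “tasks” is the value of `goal`; the observation space, action space, dynamics, and start position are identical.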

At this point, the issue might be clear: as soon as the model’s input trajectory contains the end of a previous episode in which the agent reached the goal (got reward 1), the model can immediately infer what the goal location is! So rather than AD needing to learn any sort of interesting RL algorithm which involves general-purpose planning, balancing exploration and exploitation, etc., it suffices to implement the much simpler heuristic “if your input contains an episode ending in reward 1, then move towards the position the agent was in at the end of that episode; otherwise, move around pseudorandomly.” I strongly suspect this is basically what the AD agents in this paper have learned, up to corrections like “the more the trajectories in your input look like incompetent RL agents in early training, the more incompetent you should act.”[1]
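To show just how simple this heuristic is, here is a minimal sketch of it. Everything named here (`infer_goal`, `heuristic_action`, the `(position, reward)` history format, the string action names) is hypothetical and chosen for illustration; I’m not claiming this is what the trained transformer literally computes. As a simplification, it treats any position where reward 1 was observed as the goal, rather than only the position at the end of an episode.

```python
import random

MOVES = ("left", "right", "up", "down", "noop")


def infer_goal(history):
    """Return the most recent position at which reward 1 was observed, else None.

    history: list of (position, reward) pairs, oldest first, possibly spanning
    several episodes (an assumed, simplified format).
    """
    for pos, reward in reversed(history):
        if reward == 1:
            return pos
    return None


def heuristic_action(history, pos, rng=random):
    goal = infer_goal(history)
    if goal is None:
        # No rewarded step in context yet: move around pseudorandomly.
        return rng.choice(MOVES)
    # Otherwise walk straight to the remembered position, x first, then y.
    if pos[0] != goal[0]:
        return "right" if goal[0] > pos[0] else "left"
    if pos[1] != goal[1]:
        return "up" if goal[1] > pos[1] else "down"
    return "noop"  # already on the goal: stay and keep collecting reward
```

Nothing in this sketch plans, estimates values, or trades off exploration against exploitation in any principled way, yet it would look like “learning in context” when evaluated on a new goal position.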

If the above interpretation of the paper’s experiments is correct, then rather than learning a general-purpose RL algorithm which could be applied to genuinely different tasks, the AD agent has learned a very simple heuristic which is only useful for solving tasks of the form “repeatedly navigate to a particular, unchanging position in a grid.” If the AD agent trained in this paper were able to learn to do any non-trivially different task (e.g. if an AD agent trained on Dark Room tasks were able to in-context learn a task involving collecting keys and unlocking boxes), then I would take that as strong evidence of a mesa-optimizer which had learned to implement a reasonably general RL algorithm. But that doesn’t seem to be what’s happened.

[Thanks to Xander Davies, Adam Jermyn, and Logan R. Smith for feedback on a draft.]


Appendix: expert distillation

People who’ve read the in-context RL paper in more detail might be curious about how the above story squares with the paper’s observation that the AD agents outperform expert distillation (ED) agents. Recall that ED trains exactly the same way as AD, except that the trajectories used as training data consist only of expert demonstrations[2]. It turns out that the resulting ED agents aren’t able to do well on the deployment tasks, even though their inputs consist of cross-episode trajectories.

The relevant graph from the paper. “Source” refers to the performance of the RL algorithm used to generate the training data for AD.

I don’t consider this to be strong evidence against my interpretation. In training, ED agents saw {cross-episode trajectories of an expert competently moving to goal position 1}, {cross-episode trajectories of an expert moving to goal position 2}, and so on. The result is that the rewards in these training data are very uninformative: every episode ends with reward 1 and no episodes end with a 0, so there’s no chance for the ED agents to learn to respond differently to different rewards. In fact, I’d guess that ED agents tend to pick some particular goal position from their training data and iteratively navigate to that goal position, never receiving reward so long as the deployment goal position isn’t along the path from the starting position to the ED agent’s chosen position. This comports with what the authors describe in the paper:
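Here is a small sketch of the behavior I’m guessing at: a policy that ignores reward entirely and walks to one fixed position every episode. The function names, the x-then-y route, and the convention that reward is counted after each of the 20 steps are my assumptions; this illustrates my guess about ED, not anything measured in the paper.

```python
START = (4, 4)  # center of the 9 x 9 room


def greedy_path(target, start=START):
    """Cells visited when walking to target along x first, then y (an assumed route)."""
    x, y = start
    path = [(x, y)]
    while x != target[0]:
        x += 1 if target[0] > x else -1
        path.append((x, y))
    while y != target[1]:
        y += 1 if target[1] > y else -1
        path.append((x, y))
    return path


def fixed_target_reward(fixed_target, true_goal, episode_length=20):
    """Per-episode reward for a policy that always walks to fixed_target and then stays put."""
    path = greedy_path(fixed_target)
    # Positions after each step: the walk itself, then no-ops at the target.
    visited = (path[1:] + [fixed_target] * episode_length)[:episode_length]
    return sum(1 for cell in visited if cell == true_goal)
```

For example, `fixed_target_reward((8, 8), (0, 0))` is 0 in every episode, no matter how many episodes are in context, because nothing in the policy ever reacts to the missing reward.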

In Dark Room, Dark Room (Hard), and Watermaze, ED performance is either flat or it degrades. The reason for this is that ED only saw expert trajectories during training, but during evaluation it needs to first explore (i.e. perform non-expert behavior) to identify the task. This required behavior is out-of-distribution for ED and for this reason it does not reinforcement learning in-context.

This excerpt frames this as the ED agent failing to explore (in contrast with the AD agent), which I agree with. But the sort of exploration that the AD agent does is likely “when you don’t know what to do, mimic an incompetent RL agent” rather than some more sophisticated exploration as part of a general RL algorithm.

  1. ^

    Well, probably the AD agent learned a few other interesting heuristics, like “if the last episode didn’t end by reaching the goal position, then navigate to a different part of the environment,” but I’d be surprised if the sum total of these heuristics is sophisticated enough for me to consider the result a mesa-optimizer.

  2. ^

    The authors don’t specify whether the expert demonstrations were generated by humans or by trained RL agents, but it probably doesn’t matter.