Yeah, I’m confused about all their results of the same type as fig 4 (fig 5, fig 6, etc.). But I think I’m figuring it out—they really are just taking the predicted action. They’re “learning” in the sense that the sequence model is simulating something that’s learning. So if I’ve got this right, the thousands of environment steps on the x axis just go in one end of the context window and out the other, and by the end the high-performing sequence model is just operating on the memory of 1-2 high-performing episodes.
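Here's a toy sketch of the picture I have in my head, in case it helps. All the numbers (context size, episode length, total steps) are made up by me, not taken from the paper, and I'm assuming fixed-length episodes to keep it simple. It just shows that after thousands of steps, a fixed-size window only holds the last couple of episodes.

```python
from collections import deque

def episodes_left_in_context(total_steps, context_len, episode_len):
    """Push environment steps through a fixed-size window and report
    which episodes are still visible at the end.

    All three arguments are hypothetical values for illustration."""
    # A deque with maxlen drops the oldest item once it is full,
    # so old steps "fall out the other end" of the window.
    context = deque(maxlen=context_len)
    for step in range(total_steps):
        episode = step // episode_len  # fixed-length episodes (assumption)
        context.append((episode, step))
    return sorted({ep for ep, _ in context})

# 5000 steps on the x axis, but the window only spans 200 of them.
visible = episodes_left_in_context(total_steps=5000, context_len=200, episode_len=100)
print(visible)
```

With these numbers the window ends up covering only the final two episodes, so whatever the model does at that point can only depend on those.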
I guess this raises another question I had, which is—why is the sequence model so bad at pretending to be bad? If it’s supposed to be learning the distribution of the entire training trajectory, why is it so bad at mimicking an actual training trajectory? Maybe copying the previous run when it performed well is just such an easy heuristic that it skews the output? Or maybe performing well is lower-entropy than performing poorly, so lowering a “temperature” parameter at evaluation time will bias the sequence model towards successful trajectories?
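To sanity-check the temperature guess, here's a minimal sketch with a made-up distribution. I'm assuming the model puts 30% of its probability on one "good" trajectory and spreads the other 70% evenly over 70 different "bad" ones, so the bad behavior is likelier overall but much higher-entropy. I'm also applying temperature to whole trajectories, whereas a real sequence model would apply it per token, so treat this as an illustration of the direction of the effect and nothing more.

```python
def apply_temperature(probs, temperature):
    """Rescale a categorical distribution: p_i -> p_i ** (1 / T), renormalized.

    This is equivalent to dividing the logits by T before the softmax."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical toy distribution over whole trajectories (my numbers, not the paper's):
# index 0 is the single "good" trajectory, the rest are distinct "bad" ones.
N_BAD = 70
probs = [0.3] + [0.01] * N_BAD

def good_mass(temperature):
    """Probability assigned to the good trajectory after temperature scaling."""
    return apply_temperature(probs, temperature)[0]

for t in (1.0, 0.5, 0.25):
    print(f"T={t}: P(good) = {good_mass(t):.3f}")
```

In this toy setup, dropping T from 1 to 0.5 moves the good trajectory from 0.30 to roughly 0.93 of the mass, even though bad behavior was more likely overall at T=1. So the low-entropy story would at least push in the right direction, if the evaluation actually samples at reduced temperature (or takes the argmax).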