Thanks Simon, I’m glad you found the app intuitive :)
The RTG is just another token in the input, except that it has an especially strong relationship with the training distribution. It's heavily predictive in a way other tokens aren't because it's derived from a labelled trajectory (it's the reward remaining in the trajectory from that step onward).
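To make that concrete, here's a minimal sketch of how the RTG sequence could be computed from a trajectory's rewards (undiscounted, summing from the current step to the end of the episode; the app's actual preprocessing may differ):

```python
import numpy as np

def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    """RTG at step t = sum of rewards from step t to the end of the
    trajectory (reverse cumulative sum, no discounting)."""
    return np.cumsum(rewards[::-1])[::-1]

# e.g. rewards [0, 0, 1, 0, 1] -> RTGs [2, 2, 2, 1, 1]
print(returns_to_go(np.array([0, 0, 1, 0, 1])))
```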
For BabyAI, the idea would be to prepend an instruction to the trajectory, made up of a limited vocab (see the BabyAI paper for their vocab). I would be pretty partial to throwing out the RTG and using a behavioural clone for a BabyAI model; it seems likely this would be easier to train. Since the goal of these models is to be useful for gaining understanding, I'd like to avoid reusing tokens, as that might complicate analysis later on.
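Something like the following sketch is what I have in mind. The vocab and the fixed instruction length here are placeholders for illustration; the real BabyAI vocabulary and the app's trajectory tokenisation would be swapped in:

```python
# Hypothetical mini-vocab for illustration only.
INSTR_VOCAB = {"go": 0, "to": 1, "the": 2, "red": 3, "ball": 4, "<pad>": 5}

def build_input(instruction: str, trajectory_tokens: list[int],
                instr_len: int = 9) -> list[int]:
    """Tokenise the instruction with the limited vocab, pad it to a fixed
    length, and prepend it to the (state, action) trajectory tokens.
    No RTG token is included -- this is the behavioural-cloning variant."""
    instr = [INSTR_VOCAB[w] for w in instruction.lower().split()]
    instr += [INSTR_VOCAB["<pad>"]] * (instr_len - len(instr))
    return instr + trajectory_tokens
```

Keeping the instruction tokens in their own slice of the sequence (rather than reusing the RTG position or overloading other tokens) should make it easier to attribute behaviour to the instruction when analysing the model later.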