Notably, the model was trained across multiple episodes so that it could pick up on RL improvement.
The usual inner-misalignment worry would be that the model forgoes reward in earlier episodes in order to gain more reward in later ones, but I don't think this is evidence for that.