This model is a proof of concept of a powerful implicit mesa-optimizer, which is evidence toward “current architectures could easily be inner misaligned”.
It sure does seem like there’s an inner optimizer in there somewhere...
Notably, the model was trained across multiple episodes so that it would pick up on the RL improvement across them (see the sketch below).
The usual inner misalignment story, though, is that the model forgoes reward in earlier episodes in order to gain more reward in later ones, and I don’t think this is evidence for that.
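For concreteness, here is a minimal sketch of what I mean by “trained across multiple episodes” — my own toy assumption of an RL²-style setup (the function and variable names are hypothetical, not the actual training code): the agent’s history is not reset at episode boundaries within a trial, and the outer objective is the total reward over the whole trial, so improving from one episode to the next is directly rewarded.

```python
# Toy sketch (my assumptions, not the actual setup): several episodes are run
# back-to-back, the agent conditions on history spanning episode boundaries,
# and the outer RL objective would be computed on the whole-trial return.
import random

def run_trial(agent_policy, num_episodes=3, episode_len=5):
    """Run several episodes in sequence; history is NOT reset between them."""
    history = []                      # (action, reward) pairs across all episodes
    total_reward = 0.0
    good_arm = random.randint(0, 1)   # hidden task parameter, fixed for the trial
    for _ in range(num_episodes):
        for _ in range(episode_len):
            action = agent_policy(history)          # policy sees cross-episode history
            reward = 1.0 if action == good_arm else 0.0
            history.append((action, reward))
            total_reward += reward
    return total_reward               # outer objective: return summed over the trial

def greedy_in_context_policy(history):
    """Toy 'in-context learner': exploit whichever arm has paid off so far."""
    if not history:
        return random.randint(0, 1)
    return max((0, 1), key=lambda a: sum(r for act, r in history if act == a))

print(run_trial(greedy_in_context_policy))
```

Under that kind of objective, within-trial improvement is exactly what gets selected for, which is why it looks like an inner optimizer without implying the model is sacrificing early-episode reward for later episodes.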