The Risks from Learned Optimization paper and this sequence don’t seem to talk about the possibility of mesa-optimizers developing from supervised learning and the resulting inner alignment problem. The part that gets closest is:

> First, though we largely focus on reinforcement learning in this sequence, RL is not necessarily the only type of machine learning where mesa-optimizers could appear. For example, it seems plausible that mesa-optimizers could appear in generative adversarial networks.
I wonder if this was intentional, and if not, maybe it would be worth making a note somewhere in the paper/posts that an oracle/predictor trained on sufficiently diverse data using supervised learning could also become a mesa-optimizer (especially since this seems counterintuitive and might be overlooked by AI researchers/builders). See related discussion here.