figuring out how to do that [is] an important open question.
Great post, by the way. This summary made the general failure mode really clear.
A part of me is worried that the terminology invites viewing mesa-optimisers as a description of a very specific failure mode, instead of as a language for the general worry described above. I don’t know to what degree this misconception occurs in practice, but I wish to preempt it here anyway. (I want data on this, so please leave a comment if you had confusions about this after reading the original report.)
It did seem like a description of a specific failure mode. This post addressed that well.
In more depth:
I really liked the example in “Two Distinctions”. “Why worry?” could have been more specific about “bad things” if it had included an example. (I am sure other documents do that. This post as a whole did a great job of being self-contained; I think I’d have benefited from reading it even if I hadn’t read the mesa optimization sequence.)
I think it’s worth pointing out that while one of the approaches mentioned in “What objectives” was black-box and the other was transparent, both required knowing the environment.
Errata: