I think the failure case identified in this post is plausible (and likely) and is very clearly explained, so props for that!
However, I agree with Jacob’s criticism here. Any AGI success story basically has to have “the safest model” also be “the most powerful model”, because of incentives and coordination problems.
Models that are themselves optimizers are going to be significantly more powerful and useful than “optimizer-free” models. So the suggestion of avoiding mesa-optimization altogether is a bit of a fabricated option. There is an interesting parallel here with the suggestion of just “not building agents” (https://www.gwern.net/Tool-AI).
So from where I am sitting, we have no option but to tackle aligning the mesa-optimizer cascade head-on.