Maybe a more refined version of the disagreement is about how crucial inner alignment is, versus how crucial it is to define the right target for outer alignment? I happen to think the latter is more crucial to work on, and perhaps that comes through somewhat in the talk (though it’s not a claim I wanted to defend strongly), whereas you seem to think inner alignment / preventing deceptive alignment is more crucial. Or perhaps both are crucial / necessary, so the question becomes where and how to prioritize resources, and you would prioritize inner alignment?
This is the crux. I actually think outer alignment, while hard, has possible solutions, whereas inner alignment faces the nearly impossible task of aligning a mesa-optimizer and ensuring that no deception arises. I think this is close to impossible under a simplicity-prior regime, which is probably the prior most likely to hold in practice. So I think inner alignment is more important than outer alignment.
Don’t get me wrong, this is a non-trivial advance, and I hope more such posts come. But I do want to temper the expectations that posts like this tend to raise.