First up, I’ve strongly upvoted it as an example of advancing the alignment frontier, and I think this is plausibly the easiest solution provided it can actually be put into code.
But unfortunately there’s a huge wrecking ball swinging into it, and that’s deceptive alignment. As we try to solve increasingly complex problems, deceptive alignment becomes the default, and this solution stops working. Basically, in Evhub’s words, mere compliance is the default, and since a treacherous turn once the AI becomes powerful is possible, this solution alone can’t do very well.
Here’s a link:
https://www.lesswrong.com/posts/A9NxPTwbw6r6Awuwt/how-likely-is-deceptive-alignment
Hmm, I’m confused—I don’t think I said very much about inner alignment, and I hope I implied that inner alignment is still important! The talk is primarily a critique of existing approaches to outer alignment (e.g., why human preferences alone shouldn’t be the alignment target), and it is a critique of inner alignment work only insofar as that work assumes that defining the right training objective / base objective is not also a crucial problem.
Maybe a more refined version of the disagreement is about how crucial inner alignment is, vs. defining the right target for outer alignment? I happen to think the latter is more crucial to work on, and perhaps that comes through somewhat in the talk (though it’s not a claim I wanted to strongly defend), whereas you seem to think inner alignment / preventing deceptive alignment is more crucial. Or perhaps both of them are crucial / necessary, so the question becomes where and how to prioritize resources, and you would prioritize inner alignment?
FWIW, I’m less concerned about inner alignment because:
I’m more optimistic about model-based planning approaches that actually optimize for the desired objective in the limit of large compute (so methods more like the neurally-guided MCTS behind AlphaGo, and less like offline reinforcement learning)
I’m more optimistic about methods for directly learning human-interpretable, modular, (neuro)symbolic world models that we can understand, verify, and edit, and that are still highly capable. This reduces the need for approaches like Eliciting Latent Knowledge, and avoids a number of pathways toward inner misalignment. (A toy sketch of what I mean follows below.)
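To make these two points a bit more concrete, here is a minimal, purely illustrative sketch (nothing from the talk, and far simpler than anything I would actually build): the world model is an explicit, hand-editable table of transition rules, the reward function is the base objective written out in plain code, and a depth-limited lookahead stands in for neurally-guided MCTS. Every state, action, and function name here is made up for the example.

```python
# Illustrative only: planning against an explicit, human-editable world model,
# where the objective being optimized is literally the reward function below.
# There is no separately trained policy whose internal (mesa-)objective could
# silently diverge from it.

# Symbolic world model: (state, action) -> next state.
# Fully inspectable, verifiable, and editable by hand.
WORLD_MODEL = {
    ("start", "left"): "trap",
    ("start", "right"): "hallway",
    ("hallway", "left"): "start",
    ("hallway", "right"): "goal",
    ("trap", "left"): "trap",
    ("trap", "right"): "trap",
    ("goal", "left"): "goal",
    ("goal", "right"): "goal",
}

def reward(state):
    """The base objective, stated explicitly."""
    return {"goal": 1.0, "trap": -1.0}.get(state, 0.0)

def plan(state, depth):
    """Depth-limited lookahead: returns (best value, best first action).

    A toy stand-in for neurally-guided MCTS; a learned network would only
    guide *which* branches get expanded, not *what* is being optimized.
    """
    if depth == 0:
        return reward(state), None
    best_value, best_action = float("-inf"), None
    for action in ("left", "right"):
        next_state = WORLD_MODEL[(state, action)]
        future_value, _ = plan(next_state, depth - 1)
        value = reward(state) + future_value
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action

if __name__ == "__main__":
    value, action = plan("start", depth=3)
    print(f"first action: {action}, planned value: {value}")  # -> right, 2.0
```

The point is not that a grid toy scales; it is that in this regime the question “what is the system optimizing?” is answered by reading (or editing) a few lines of explicit model and objective, rather than by interrogating the internals of a learned policy.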
I’m aware that these are minority views in the alignment community—I work a lot more on neurosymbolic and probabilistic programming methods, and think they have a clear path to scaling and providing economic value, which probably explains the difference.
This is the crux. I actually think outer alignment, while hard, has possible solutions, but inner alignment has the nearly impossible task of aligning a mesa-optimizer and ensuring that no deceptiveness ensues. I think this is nearly impossible under a simplicity-prior regime, which is probably the prior most likely to be at work. I think inner alignment is more important than outer alignment.
Don’t get me wrong, this is a non-trivial advance, and I hope more such posts come. But I do want to lower the expectations that come with such posts.