Thanks for reading, and responding! It’s very helpful to know where my arguments cease being convincing or understandable.
I fully agree that just having AI do the work of solving alignment is not a good or convincing plan. You need to know that AI is aligned to trust it.
Perhaps the missing piece is that I think alignment is already solved for LLM agents. They don’t work well, but they are quite eager to follow instructions. Adding more alignment methods as they improve makes good odds that our first capable/dangerous agents are also aligned. I listed some of the obvious and easy techniques we’ll probably use in “Internal independent review for language model agent alignment.” I’m not happy with the clarity of that post, though, so I’m currently working on two followups that might be clearer.
Or perhaps the missing link is going from aligned AI systems to aligned “Real AGI”. I do think there’s a discontinuity in alignment once a system starts to learn continuously and reflect on its beliefs (which change how its values/goals are interpreted). However, I think the techniques most likely to be used are probably adequate to make those systems aligned—IF that alignment is for following instructions, and the humans wisely instruct it to be honest about ways its alignment could fail.
So that’s how I get to the first aligned AGI at roughly human level or below.
From there it seems easier, although still possible to fail.
If you have an agent that’s aligned and smarter than you, you can trust it to work on further alignment schemes. It’s wiser to spot-check it, but the humans’ job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns.
I usually think about the progression from AGI to superintelligence as one system/entity learning, being improved, and self-improving. But there’s a good chance that progression will look more generational, with several distinct systems/entities as successors with greater intelligence, designed by the previous system and/or humans. Those discontinuities seem to present more danger of getting alignment wrong.
“If you have an agent that’s aligned and smarter than you, you can trust it to work on further alignment schemes. It’s wiser to spot-check it, but the humans’ job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns.”
Ah, that’s the link I was missing. Now it makes sense. You can use an AGI as a reviewer for other AGIs, once it is better than humans at reviewing AGIs. Thanks a lot for clarifying!
My pleasure. Evan Hubinger made this point to me when I’d misunderstood his scalable oversight proposal.
Thanks again for engaging with my work!