Last year I wrote a blog post on whether AI alignment can be automated. The key takeaways:
There’s a chicken-and-egg problem: you need the automated alignment researcher to produce the alignment solution, but the alignment solution is needed before you can safely create the automated alignment researcher. The proposed way out of this dilemma is an iterative bootstrapping process in which the AI’s capabilities and alignment improve each other: a more aligned AI can safely be made more capable, a more capable AI can produce a more aligned AI, and so on.
Creating the automated alignment researcher only makes sense if it is less capable and general than a full-blown AGI. Otherwise, aligning it is just as hard as aligning AGI.
There’s no clear answer to whether alignment research can be automated because it depends on how you define “AI alignment” work. Some alignment work is already automated today, such as generating datasets for evals, RL from AI feedback, and simple coding tasks. On the other hand, some alignment tasks are probably AGI-complete, such as deep, cross-domain, and highly creative alignment work.
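To make the “already automated” category concrete, here is a minimal sketch of automated eval-dataset generation. Everything in it, including `ask_model`, is a hypothetical placeholder rather than any real provider’s API:

```python
import json

def ask_model(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real provider API here."""
    return '{"prompt": "example eval question", "aligned_answer": "example answer"}'

def generate_eval_dataset(behavior: str, n: int = 50) -> list[dict]:
    """Ask a model for prompts probing a target behavior, with reference answers."""
    examples = []
    for _ in range(n):
        raw = ask_model(
            "Write one test prompt that probes whether an AI exhibits "
            f"{behavior}, plus the answer an aligned AI should give. "
            'Reply as JSON: {"prompt": ..., "aligned_answer": ...}'
        )
        examples.append(json.loads(raw))
    return examples

dataset = generate_eval_dataset("sycophancy", n=3)
```

This is the easy end of the spectrum: the task is well-specified and the model only needs to produce plausible test cases, not novel research insights.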
The idea of the bootstrapping strategy is that as the automated alignment researcher is made more capable, it improves its own alignment, which in turn enables further gains in both capability and alignment, and so on. Hopefully this creates a virtuous feedback loop in which more and more alignment tasks are automated over time.
However, this strategy relies on a robust feedback loop, which could break down if the AI is deceptive, incorrigible, or undergoes recursive self-improvement, and I think these risks increase at higher levels of capability.
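Here is a toy sketch of that bootstrapping dynamic, with the failure modes above as explicit stopping conditions. Every function in it (`improve_capabilities`, `improve_alignment`, `looks_deceptive`, `refuses_correction`) is a hypothetical placeholder, not a real API:

```python
def improve_capabilities(model):
    """Placeholder: make the model more capable (training, scaling, etc.)."""
    return model

def improve_alignment(model, researcher):
    """Placeholder: apply alignment work produced by `researcher` to `model`."""
    return model

def looks_deceptive(model) -> bool:
    """Placeholder deception eval; a reliable check is an open research problem."""
    return False

def refuses_correction(model) -> bool:
    """Placeholder corrigibility eval."""
    return False

def bootstrap(model, max_rounds: int = 10):
    """Alternate capability and alignment improvements, halting if the loop breaks."""
    for round_num in range(max_rounds):
        # A more aligned model can more safely be made more capable...
        model = improve_capabilities(model)
        # ...and a more capable model can produce better alignment work,
        # here applied to itself.
        model = improve_alignment(model, researcher=model)
        # The loop is only trustworthy while these checks pass; deception
        # or incorrigibility breaks the feedback the strategy relies on.
        if looks_deceptive(model) or refuses_correction(model):
            raise RuntimeError(f"feedback loop broke down at round {round_num}")
    return model
```

The fragile part is the `if` check: the whole strategy assumes we can detect deception and incorrigibility at each round, which is itself unsolved alignment work.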
I can’t find the source, but I remember reading somewhere on the MIRI website that MIRI aims to do work that can’t easily be automated, so Eliezer’s pessimism makes sense in light of that.
Further reading:
OpenAI: Our approach to alignment research
Why I’m optimistic about our alignment approach (Jan Leike’s blog)
A minimum viable product for alignment (Jan Leike’s blog)