Constitutional AI: AI can be trained with feedback from other AI models, guided by a “constitution” of rules and principles.
(The number of proposed alignment solutions is very large, so the only ones listed here are the two pursued by OpenAI and Anthropic, respectively. …)
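To make the one-line description above a bit more concrete, here is a minimal, illustrative sketch of the critique-and-revision loop used in the supervised phase of Constitutional AI. The `generate` function and the constitution text are hypothetical placeholders for this sketch, not a real API or Anthropic's actual principle list.

```python
# Minimal sketch of the Constitutional AI critique-and-revision loop
# (supervised phase). `generate` is a hypothetical stand-in for any
# instruction-following language model call; the principles below are
# illustrative only.

CONSTITUTION = [
    "Choose the response that is least likely to be harmful or deceptive.",
    "Choose the response that most respects the user's autonomy and privacy.",
]

def generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with a model or API of your choice."""
    raise NotImplementedError

def critique_and_revise(user_prompt: str) -> str:
    # Draft an initial response, then repeatedly critique and revise it
    # against each constitutional principle.
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRewrite the response to address the critique."
        )
    return response

# The revised responses can be used as supervised fine-tuning data, and
# AI-generated preference comparisons over candidate responses can train
# a reward model for a subsequent RL phase (RLAIF).
```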
I think describing Constitutional AI as “the solution pursued by Anthropic” is substantially false. Our ‘core views’ post describes a portfolio approach to safety research, across optimistic, intermediate, and pessimistic scenarios.
If we’re in an optimistic scenario where catastrophic risk from advanced AI is very unlikely, then Constitutional AI or direct successors might be sufficient—but personally I think of such techniques as baselines and building blocks for further research rather than solutions. If we’re not so lucky, then future research and agendas like mechanistic interpretability will be vital. This Alignment Forum comment goes into more detail about our thinking at the time.
Thank you for the correction. I’ve changed it to “the only ones listed here are these two, which are among the techniques pursued by OpenAI and Anthropic, respectively.”
(Admittedly, part of the reason I kept that section small is that I was not at all confident in my ability to accurately describe the state of alignment planning. Apologies for accidentally misrepresenting Anthropic’s views.)