It’s not true if alignment is easy, too, right? My timelines are short, but we do still have a little time to do alignment work, and the orgs are going to do at least a little of it. I wonder if there’s an assumption here that OpenAI and co don’t even believe alignment is a concern? I don’t think that’s true. I do think they probably dramatically underrate x-risk dangers because of incentive-driven biases, but they do seem to appreciate the basic arguments.
And I expect them to get a whole lot more serious about it once they’re staring a capable agent in the face. It’s one thing to dismiss the dangers of a tiger from a distance, another when there’s just a fence between you and it. I think proximity is going to sharpen everyone’s thinking a good bit by inspiring them to spend more time thinking about the dangers.
Your version of events requires a change of heart (for ‘them to get a whole lot more serious’). I’m just looking at the default outcome. Whether alignment is hard or easy (although not if it’s totally trivial), it appears to be progressing substantially more slowly than capabilities (and the parts of it that are advancing are the most capabilities-synergizing, so it’s unclear what the oft-lauded ‘differential advancement of safety’ really looks like).
I consider at least a modest change of heart to be the default.
And I think it’s really hard to say how fast alignment is progressing relative to capabilities. If by “alignment” you mean formal proofs of safety, then we’re definitely not on track. But there’s a real chance we don’t need those. We are training networks to follow instructions, and it’s possible that that weak, tool-level “alignment” can be leveraged into true agent alignment for instruction-following or corrigibility. If so, we will have solved AGI alignment. That would give us superhuman help with ASI alignment, and with the “societal alignment” problem of surviving intent-aligned AGIs with different masters.
This seems like the default for how we’ll try to align AGI. We don’t know if it will work.
When I get MIRI-style thinkers to fully engage with this set of ideas, they tend to say “hm, maybe”. But I haven’t gotten enough engagement to have any confidence. Prosaic-alignment and LLM-focused thinkers usually aren’t engaging with the hard problems of alignment that crop up once we hit fully autonomous AGI entities: strong optimization’s effects on goal misgeneralization, and reflection- and learning-based alignment shifts. And almost nobody is thinking that far ahead in societal coordination dynamics.
So I’d really like to see agent foundations and prosaic alignment thinking converge on the types of LLM-based AGI agents we seem likely to get in the near future. We really don’t know whether we can align them, because we haven’t yet thought about it deeply.
Good point—what I said isn’t true in the case of alignment by default.
Edited my initial comment to reflect this.
Links to all of those ideas in more depth are a couple of link hops from my recent, brief post Intent alignment as a stepping-stone to value alignment.