I consider at least a modest change of heart to be the default.
And I think it’s really hard to say how fast alignment is progressing relative to capabilities. If by “alignment” you mean formal proofs of safety, then we’re definitely not on track. But there’s a real chance we don’t need those. We are training networks to follow instructions, and it’s possible that this weak type of tool “alignment” can be leveraged into true agent alignment for instruction-following or corrigibility. If so, we will have solved AGI alignment. That would give us superhuman help with solving ASI alignment, and with the “societal alignment” problem of surviving intent-aligned AGIs with different masters.
This seems like the default for how we’ll try to align AGI. We don’t know if it will work.
When I get MIRI-style thinkers to fully engage with this set of ideas, they tend to say “hm, maybe”. But I haven’t gotten enough engagement to have any confidence. Prosaic-alignment and LLM-focused thinkers usually aren’t engaging with the hard problems of alignment that crop up when we hit fully autonomous AGI entities, like strong optimization’s effects on goal misgeneralization, or reflection- and learning-based alignment shifts. And almost nobody is thinking that far ahead in societal coordination dynamics.
So I’d really like to see agent foundations and prosaic alignment thinking converge on the types of LLM-based AGI agents we seem likely to get in the near future. We really don’t know if we can align them or not, because we haven’t yet thought about it deeply enough.
Links exploring all of those ideas in depth can be found within a couple of link hops from my recent, brief post Intent alignment as a stepping-stone to value alignment.