The alignment research that does get done will be lower quality due to reduced access to compute, capabilities know-how, and cutting-edge AI systems.
I think this is false, though it’s a crux in any case.
Capabilities withdrawal is good because the best alignment work doesn't require big models: it's theoretical work. Theoretical breakthroughs can make empirical research more efficient. It's OK to stop doing capabilities-promoting empirical alignment and focus on theory for a while.
(The broader worry that "if all alignment-knowledgeable capabilities people withdraw, then all capabilities work will be done by people who don't know or care about alignment" is still debatable, but it's a distinct point. One possible solution: safety-focused AGI labs stop their capabilities work but continue to hire capabilities people, partly to prevent them from working elsewhere. This is complicated, but not central to my objection above.)
I see this asymmetry a lot and may write a post on it:
If the theoretical researchers are wrong but you follow their caution anyway, then empirical alignment goes slower… and capabilities research slows down even more. If the theoretical researchers are right but you don't follow their caution, you continue or speed up AI capabilities in order to do less useful alignment work.