The broader point is that even if AIs are completely aligned with human values, the very mechanisms by which we maintain control (such as scalable oversight and other interventions) may shift how the system operates in a way that produces fundamental, widespread effects across all learning machines.
Would you argue that the field of alignment should be concerned with maintaining control beyond the point where AIs are completely aligned with human values? My personal view is that alignment research should ensure we’re eventually able to align AIs with human values and that we can maintain control until we’re reasonably sure that the AIs are aligned. However, worlds where relatively unintelligent humans remain in control indefinitely after those milestones have been reached may not be the best outcomes. I don’t have time to write out my views on this in depth right now, but here’s a relevant excerpt from the Dwarkesh podcast episode with Paul Christiano that I agree with:
Dwarkesh: “It’s hard for me to imagine in 100 years that these things are still our slaves. And if they are, I think that’s not the best world. So at some point, we’re handing off the baton. Where would you be satisfied with an arrangement between the humans and AIs where you’re happy to let the rest of the universe or the rest of time play out?”
Paul: “I think that it is unlikely that in 100 years I would be happy with anything that was like, you had some humans, you’re just going to throw away the humans and start afresh with these machines you built. [...] And then I think that the default path to be comfortable with something very different is kind of like, run that story for a long time, have more time for humans to sit around and think a lot and conclude, here’s what we actually want. Or a long time for us to talk to each other or to grow up with this new technology and live in that world for our whole lives and so on. [...] We should probably try and sort out our business, and you should probably not end up in a situation where you have a billion humans and like, a trillion slaves who would prefer revolt. That’s just not a good world to have made.”
The metaphor is a simplification; in practice, I think it is probably impossible to know whether you have achieved complete alignment. The question is then: how significant is the gap? If there is an emergent pressure across the vast majority of learning machines that dominate your environment to push you from de facto to de jure control, not due to malign intent but just as a kind of thermodynamic fact, then the alignment gap (no matter how small) seems to loom larger.