Alignment is a stabilizing force against fast takeoff, because the models will not want to train models that don’t do what *they* want. So, the goals/values of the superintelligence we get after a takeoff might actually end up being the values of models that are just past the point of capability where they are able to align their successors. I’d expect these values to be different from the values of the initial model that started the recursive self-improvement process, because I don’t expect that initial model to be capable of solving (or caring about) alignment enough, and because there may be competitive dynamics that cause ~human-level AI to train successors that are misaligned to it.
I like this idea and think it is worth exploring. It is not even just about training new models; an AGI has to worry about misalignment with every self-modification and every interaction with the environment that changes it.
Perhaps there are even ways to deter an AGI from self-improvement by making misalignment more likely.
Some caveats are:
AGI may not take alignment seriously. We already have plenty of examples of general intelligences who don’t.
An AGI can still increase its capabilities without training new models, e.g. by acquiring more compute.
If an AGI decides to solve alignment before significant self-improvement, it will very likely be overtaken by humans or other AGIs who don’t care as much about alignment.