I like this idea and think it is worth exploring. It applies to more than just training new models: an AGI would have to worry about misalignment with every self-modification and every interaction with the environment that changes it.
Perhaps there are even ways to deter an AGI from self-improvement, by making misalignment more likely.
Some caveats are:
AGI may not take alignment seriously. We already have plenty of examples of general intelligences who don’t.
An AGI can still increase its capabilities without training new models, e.g. by acquiring more compute.
If an AGI decides to solve alignment before significant self-improvement, it will very likely be overtaken by humans or other AGIs who don't care as much about alignment.