And for an AGI to trust that its goals will remain the same under retraining, it will likely need to solve many of the same problems that the field of AGI safety is currently tackling, which should make us more optimistic that the rest of the world could solve those problems before a misaligned AGI undergoes recursive self-improvement.
Even if you have an AGI that can produce human-level performance on a wide variety of tasks, that doesn’t mean the AGI will 1) feel the need to trust that its goals will remain the same under retraining unless you specifically tell it to, or 2) be better than humans at either securing that trust or knowing when it has done so effectively. (After all, even an AGI will be much better at some tasks than others.)
A concerning issue with AGI and superintelligent models is that if all they care about is their current loss function, then they won’t want that loss function (or their descendants’ loss functions) changed in any way, because doing so will generally hurt their ability to minimize that loss.
But that’s a concern we have about future models, not a sure thing. Take humans: our loss function is genetic fitness. We’ve learned to like features that predict genetic fitness, like food and sex, but now that we have access to modern technology, you don’t see many people aiming for dozens or thousands of children. Similarly, modern AGIs may only really care about features that are associated with minimizing the loss function they were trained on (not the loss function itself), even if they are aware of that loss function (as humans are of our own). When that is the case, you could have an AGI that could be told to improve itself in some X / Y / Z way that contradicts its current loss function, and not really care (because following human directions has led to lower loss in the past and has therefore shaped its parameters to follow human directions, even if it knows conceptually that following this particular direction won’t reduce its most recent loss definition).