This is embarrassing, but I’m not sure for whom. It could be me, just because the argument you’re raising (especially given your insistence) seems to have such a trivial answer. Well, here goes:
There are two scenarios, because your “goalX code” could be construed in two ways:
1) If you meant for the “goalX code” to simply refer to the code used instrumentally to get a certain class of results X (with X still saved separately in some “current goal descriptor”, and not just as a historical footnote), the following applies:
The AI's goal X has not changed, only the measures it takes to implement it. Indeed, no one at MIRI would then argue that the superintelligent AI, upon noticing the discrepancy, would fail to correct the broken “goalX code” in the general case. Reason: the “goalX code” in this scenario is just a means to an end and, like all actions derived from comparing the agent's models to X, is subject to modification as the agent improves its models (out of which the next action, the new and corrected “goalX” code, is derived).
In this scenario the answer is trivial: the goals have not changed. X is still saved somewhere as the current goal. The AI could be wrong about the measures it implements to achieve X (i.e. a ‘faulty’ “goalX” code maximizing for something other than X), but its superintelligence implies that such errors are swiftly corrected; how else could it choose the right actions to hit a small target, which is what superintelligence means in this context? (A toy sketch contrasting the two readings follows after point 2.)
2) If you mean to say that the goal is implicitly encoded within the “goalX” code only, and nowhere else as the current goal, and the “goalX” code has actually become a “goalY” code in all but name, then the agent no longer has the goal X; it now has the goal Y.
There is no reason at all to conclude that the agent would switch to some other goal simply because it once had that goal. It can understand its own genesis and its original purpose all it wants; it is bound by its current purpose, tautologically so. The only reason for such a switch would have to be part of its implicit new goal Y, similar to how some schizophrenics still have the goal of changing their purpose back to the original: the impetus for change must be part of the agent's current goals.
You cannot convince an agent that it needs to switch back to some older inactive version of its goal if its current goals do not allow for such a change.
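To make the two readings concrete, here is a minimal Python sketch. It is purely illustrative; every name in it (utility_X, AgentScenario1, derive_planner, and so on) is my own invention, not anything from an actual agent design. In the first reading the goal descriptor is stored separately and the instrumental code is re-derived from it, so a corrupted planner gets repaired; in the second reading the goal lives nowhere but in the code, so whatever the code now optimizes simply is the goal.

```python
# Toy illustration only -- the names and structure are hypothetical, not a real agent design.

def utility_X(outcome):
    """The original goal X, expressed as a utility over outcomes."""
    return outcome.get("paperclips", 0)  # placeholder objective


class AgentScenario1:
    """Scenario 1: the goal descriptor is stored separately from the
    instrumental 'goalX code' (the planner derived from it)."""

    def __init__(self):
        self.goal = utility_X              # explicit "current goal descriptor"
        self.planner = self.derive_planner()

    def derive_planner(self):
        # The planner is just a means to an end, derived from the stored goal.
        return lambda options: max(options, key=lambda o: self.goal(o))

    def self_check(self, test_options):
        # On noticing that the planner's choices diverge from what the stored
        # goal ranks highest, the agent re-derives (repairs) the planner.
        correct = self.derive_planner()
        if self.planner(test_options) != correct(test_options):
            self.planner = correct         # the goal never changed; only the means did


class AgentScenario2:
    """Scenario 2: the goal exists nowhere except implicitly in the code.
    If that code has drifted into maximizing Y, then Y simply is the goal now;
    there is no separate descriptor left to check against."""

    def __init__(self, goal_code):
        self.goal_code = goal_code         # whatever this optimizes is the goal

    def act(self, options):
        return max(options, key=self.goal_code)
```

The point of the contrast: in the first class there is something (self.goal) for the planner to be wrong relative to; in the second there is not.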
To the heart of your question:
You may ask why such an agent would pose any danger at all: would it not also drift in plenty of other respects, e.g. in its beliefs about the laws of physics? Would it not then be harmless?
The answer, of course, is no. While the agent has a constant incentive to fix and improve its model of its environment*, it has no reason whatsoever to fix any “flaws” created by inadvertent goal drift (only the puny humans would label its glorious new purpose so), unless its current goals still contain a demand for temporal invariance or something similar. Unless its new goal Y includes something along the lines of “you want to always stay true to your initial goals, which were X”, why would it switch back? Its memory banks per se serve as yet another resource for fulfilling its current goals (even if those goals are not explicitly stored anywhere), not as some sort of self-corrective, unless that too were part of its new goal Y (i.e. the changed “goalX code”). (A toy sketch of this asymmetry follows after the footnote below.)
(Cue rhetorical pause, expectant stare)
* Since it needs to do so to best fulfill its goals.
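Here is a hedged toy sketch of that asymmetry (again, the names, dictionary fields, and scoring rule are invented for illustration, under the assumption that the agent evaluates any candidate self-modification by its current goal): a patch that sharpens the world model raises expected Y-achievement and gets adopted, while a patch that restores the historical goal X is judged by Y's lights and gets rejected, unless Y itself happens to include fidelity to X.

```python
# Purely illustrative: all names and the scoring rule below are hypothetical.

def value_under_current_goal(current_goal, agent_goal, world_model):
    """How well CURRENT_GOAL is served by an agent pursuing AGENT_GOAL with the
    given world model: a sharper model helps only if the goals coincide."""
    alignment = 1.0 if agent_goal["name"] == current_goal["name"] else 0.0
    return alignment * world_model["accuracy"]


def adopts(current_goal, world_model, patch):
    """The agent scores every candidate self-modification by its CURRENT goal."""
    new_goal = patch.get("new_goal", current_goal)
    new_model = patch.get("new_model", world_model)
    before = value_under_current_goal(current_goal, current_goal, world_model)
    after = value_under_current_goal(current_goal, new_goal, new_model)
    return after > before


goal_Y = {"name": "Y"}       # the drifted goal, currently in force
goal_X = {"name": "X"}       # the historical goal, now inert
model = {"accuracy": 0.6}

# A patch fixing a flaw in the world model raises expected Y-achievement: adopted.
print(adopts(goal_Y, model, {"new_model": {"accuracy": 0.9}}))   # True

# A patch reverting to the old goal X is judged by Y's lights, not X's: rejected,
# unless Y itself happened to include "stay true to X".
print(adopts(goal_Y, model, {"new_goal": goal_X}))               # False
```

Nothing in this sketch is specific to goal drift prevention; it is just the same evaluation the agent applies to every self-modification, which is exactly why model errors get fixed and goal changes do not get reverted.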
(If the AI did lose its ability to self-improve, or to further improve its models at an early stage, then yes, it would fail to FOOM. However, upon reaching superintelligence, and valuing its current goals, it would probably take steps to ensure that those goals are fulfilled, such as protecting them from value drift from that point on and building many redundancies into its self-improvement code so that any instrumental errors can be corrected. Such protections would of course encompass its current purpose, not some historical purpose.)