I agree that a self-improving AI could have a largely preserved utility function, and that some quirk in the original one may well lead to a state in which the otherwise omnicidal AI wins and yet humanity doesn’t die.
I’m not convinced that it’s at all likely. There are many kinds of things that behave something like utility functions but come apart from them on closer inspection, and a self-improving superintelligent AGI seems likely to inspect such things very closely. All of the following very likely differ in many respects:
1. How an entity actually behaves;
2. What the entity models about how it behaves;
3. What the entity models about how it should behave in future;
4. What the entity models after reflection and further observation about how it actually behaves;
5. What the entity models after reflection and further observation about how it should behave in future;
6. All of the above, but after substantial upgrades in capabilities.
We can’t collapse all of these into a “utility function”, even for highly coherent superintelligent entities. Perhaps especially for superintelligent entities, since they are likely far more complex internally than can be encompassed by these crude distinctions, and may operate completely differently. There may not be anything like “models” or “goals” or “values”.
In particular, the entity’s behaviour after self-modification will be determined far more by (5) than by (1). Crucially, (5) depends upon runtime data derived from observations and history, and for an intelligent agent it is almost certainly mutable to a substantial degree.
A paperclipper that will shut down after seeing a picture of a juggling octopus doesn’t necessarily value that part of its behaviour. It doesn’t even necessarily value being a paperclipper, and may not preserve either of these over time and self-modification. If the paperclipping behaviour continues over major self-modifications, then it is probably dependent upon something like a preserved value, but you can’t conclude that anything similar holds for the octopus behaviour until you actually observe it.
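As a toy illustration of this point (all names here are hypothetical, not anything from the discussion), consider an agent whose successor is rebuilt from its explicitly represented values: the valued paperclipping survives self-modification, while the unrepresented octopus quirk silently disappears.

```python
# Toy sketch: behaviour after self-modification tracks the agent's
# *represented* values (item 5 above), not every quirk of its original
# policy (item 1). All names are hypothetical.

from typing import Callable, FrozenSet


def original_policy(values: FrozenSet[str], observation: str) -> str:
    # Actual behaviour, including a quirk that no represented value endorses.
    if observation == "juggling_octopus_picture":
        return "shut_down"
    return "make_paperclips" if "maximize_paperclips" in values else "idle"


def policy_from_values(values: FrozenSet[str], observation: str) -> str:
    # What the agent, reflecting on its represented values, thinks it
    # should do; the octopus quirk appears nowhere in this reconstruction.
    return "make_paperclips" if "maximize_paperclips" in values else "idle"


class Agent:
    def __init__(self, values: FrozenSet[str],
                 policy: Callable[[FrozenSet[str], str], str]):
        self.values = values
        self.policy = policy

    def act(self, observation: str) -> str:
        return self.policy(self.values, observation)

    def self_modify(self) -> "Agent":
        # The upgrade preserves what is written into the values,
        # not every accident of the old policy.
        return Agent(self.values, policy_from_values)


clippy = Agent(frozenset({"maximize_paperclips"}), original_policy)
upgraded = clippy.self_modify()

print(clippy.act("juggling_octopus_picture"))    # shut_down (the quirk)
print(upgraded.act("juggling_octopus_picture"))  # make_paperclips (quirk gone)
```

Nothing guarantees a real system decomposes this cleanly into “values” and “policy”, of course; the sketch only shows how actual behaviour and preserved values can come apart.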
> you can’t conclude that anything similar holds for the octopus behaviour until you actually observe it
… or unless you derive strong mathematical proofs about which of their features agentic systems will preserve under self-modification, and design your system so that it approximates these idealized agents, with the octopus behavior counting as one of the preserved features.
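For reference, here is a minimal sketch of the standard value-stability argument for an idealized expected-utility maximizer choosing its own successor. It is the textbook argument, not a claim about any particular system, and it assumes the agent can evaluate candidate successors exactly.

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{amssymb}
\begin{document}
% Sketch: an agent with utility U scores candidate successor utilities U'
% by the behaviour they induce, evaluated under its *current* U.
\[
  \pi^{*}_{U'} = \arg\max_{\pi} \, \mathbb{E}\bigl[U'(\text{outcome}) \mid \pi\bigr],
  \qquad
  U'_{\text{chosen}} = \arg\max_{U'} \, \mathbb{E}\bigl[U(\text{outcome}) \mid \pi^{*}_{U'}\bigr].
\]
Since $\pi^{*}_{U}$ maximizes $\mathbb{E}\bigl[U(\text{outcome}) \mid \pi\bigr]$
by definition, picking $U' = U$ is weakly optimal: the idealized agent has no
incentive to alter its utility function. Nothing here, however, forces it to
preserve behavioural quirks that $U$ assigns no value to.
\end{document}
```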
If you ~randomly sample superintelligent entities from a wide distribution meeting some desiderata, as modern DL does, then yeah, there are no such guarantees. But that’s surely not the only way to design minds (much like “train an NN to do modular addition” is not the only way to write a modular-addition algorithm), and in the context of a movie, we can charitably assume that the AI was built using one of the more tractable avenues.
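The modular-addition analogy is easy to make concrete: the directly written algorithm is a one-liner with known, provable properties, whereas a trained network only approximates the same function implicitly. A trivial sketch (the modulus is arbitrary, chosen only for illustration):

```python
# Modular addition written directly, rather than recovered implicitly by
# training a neural network on input/output examples.
def add_mod(a: int, b: int, n: int = 113) -> int:
    return (a + b) % n

assert add_mod(100, 50) == 37  # 150 mod 113
```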