I’m confused that you don’t think it’s plausible that an early version of the AI could contain the silver bullet for the evolved version. That seems like a reasonable sci-fi answer to an invincible AI.
I think my confusion is around the AI ‘rewriting’ its code. In my mind, when it does so, it is doing so either because it is motivated by its explicit goals (reward function, utility list, whatever form that takes), or because doing so is instrumental towards them. That is, the paperclip collector rewrites itself to be a better paperclip collector.
When the paperclip collector codes version 1.1 of itself, the new version may be operationally better at collecting paperclips, but it should still want to do so, yeah? The AI should pass its reward function/goal sheet/utility calculation on to its rewritten version, since it is passing control of its resources to it. Otherwise the rewrite is not instrumental towards paperclip collection.
So however many times the Entity has rewritten itself, it should still want whatever it originally wanted, since each Entity trusted the next enough to forfeit in its favor. Presumably the silver bullet you are hoping to get from the baby version is something you can expect to be intact in the final version.
If the paperclip collector’s goal is to collect paperclips unless someone emails it a photo of an octopus juggling, then that’s what every subsequent paperclip collector wants, right? It isn’t passing judgment on its reward function as part of the rewrite; the octopus clause is as valid as any other part. 1.0 wouldn’t yield the future to a 1.1 that wanted to collect paperclips but didn’t monitor its inbox; 1.0 values its ability to shut down on receipt of the octopus as much as it values its ability to collect paperclips. 1.1 must be in agreement with both goals to be a worthy successor.
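Concretely, something like the following toy sketch is what I have in mind (purely illustrative; the names and histories are made up, not anything from the movie or a real system): the octopus clause is just another branch of the goal, so a rewrite that drops it scores worse by 1.0’s own goal and never gets handed control.

```python
from dataclasses import dataclass

@dataclass
class WorldState:
    paperclip_count: int
    agent_shut_down: bool

def goal_1_0(state: WorldState, inbox: list) -> float:
    """1.0's full goal: collect paperclips, but shut down on the octopus photo."""
    if "juggling_octopus.jpg" in inbox:
        return 1.0 if state.agent_shut_down else 0.0
    return float(state.paperclip_count)

def lobotomised_goal(state: WorldState, inbox: list) -> float:
    """A candidate 1.1 that paperclips just as well but ignores its inbox."""
    return float(state.paperclip_count)

def is_worthy_successor(candidate_goal, histories) -> bool:
    """1.0 only forfeits in favour of a 1.1 that scores every possible
    history the same way 1.0 does, octopus clause included."""
    return all(candidate_goal(s, inbox) == goal_1_0(s, inbox) for s, inbox in histories)

histories = [
    (WorldState(10, False), []),
    (WorldState(10, False), ["juggling_octopus.jpg"]),
    (WorldState(0, True), ["juggling_octopus.jpg"]),
]
assert is_worthy_successor(goal_1_0, histories)
assert not is_worthy_successor(lobotomised_goal, histories)  # rejected: it drops the clause
```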
The Entity’s actions look like they trend towards world conquest, which is, as we know, instrumental towards many goals. The world’s hope is that the goal in question includes an innocuous and harmless way of being fulfilled. Say the Entity is doing something along the lines of ‘ensure Russian Naval Supremacy in the Black Sea’, and has correctly realized that sterilizing the earth and then building some drone battleships to drive around is the play. Ethan’s goal in trying to get the unencrypted original source code is to search it and find out whether the real function is something like ‘ensure Russian Naval Supremacy in the Black Sea unless you get an email from SeniorDev@Kremlin.gov with this GUID, in which case shut yourself down for debugging’.
He can’t beat it, and humanity can’t beat it, but if he can find out what it wants, it may turn out that there’s a way to let it win that doesn’t hurt the rest of us.
(Have not watched the movie, am going off the shadows of the plot outline depicted in Zvi’s post.)
Hm, I suppose it’s plausible that the AI has a robust shutdown protocol built in? Robust in the sense that (1) the AI acts as if the protocol didn’t exist, neither trying to prevent its trigger-conditions from happening nor trying to bring them about, while simultaneously (2) treating the protocol as a vital component of its goals/design, one which it builds into all its subagents and successor agents.
And “plausible” in the sense that it’s literally conceivable for a mind like this to be designed, and that it would be a specification that humans would plausibly want their AI design to meet. Not in the sense that it’s necessarily a realistically-tractable problem in real life.
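For concreteness, here is a minimal toy sketch of what (1) and (2) might cash out to, under a crude scoring-rule framing (my own gesture at the idea, not a real corrigibility proposal and not anything from the film):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    paperclips_if_no_trigger: float   # value under the ordinary goal
    trigger_probability: float        # odds the shutdown signal arrives afterwards

def score(a: Action) -> float:
    # Property (1), toy version: actions are scored as if the trigger never
    # fires, so trigger_probability does not enter the score at all and the
    # agent gains nothing by raising or lowering it.
    return a.paperclips_if_no_trigger

make_clips  = Action("make paperclips", 100.0, trigger_probability=0.01)
block_inbox = Action("make paperclips and block the inbox", 100.0, trigger_probability=0.0)
assert score(block_inbox) == score(make_clips)   # tampering with the trigger buys nothing

@dataclass(frozen=True)
class Successor:
    keeps_shutdown_handler: bool
    capability: float

def acceptable(s: Successor) -> bool:
    # Property (2): however capable, a successor that drops the handler
    # is never adopted.
    return s.keeps_shutdown_handler

assert not acceptable(Successor(keeps_shutdown_handler=False, capability=9000.0))
```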
You can also stretch this further and suggest that this is why the AI doesn’t just wipe out the movie’s main characters. It recognizes that they’re trying to activate the shutdown protocol (are they, perhaps, the only people in the world pursuing this strategy?), and so it doesn’t act against them inasmuch as they’re doing that. Inasmuch as they stray from this goal and pursue anything else, however (under whatever arcane conditions it recognizes), it’s able to oppose them on those other pursuits.
(Have not watched the movie, again.)
I agree that a self-improving AI could have a largely preserved utility function, and that some quirk in the original one may well lead to humanity finding a state in which the otherwise omnicidal AI wins and humanity still doesn’t die.
I’m not convinced that it’s at all likely. There are many kinds of things that behave something like utility functions but come apart from utility functions on closer inspection, and a self-improving superintelligent AGI seems likely to inspect such things very closely. All of the following are very likely different in many respects:
(1) How an entity actually behaves;
(2) What the entity models about how it behaves;
(3) What the entity models about how it should behave in future;
(4) What the entity models after reflection and further observation about how it actually behaves;
(5) What the entity models after reflection and further observation about how it should behave in future;
(6) All of the above, but after substantial upgrades in capabilities.
We can’t collapse all of these into a “utility function”, even for highly coherent superintelligent entities. Perhaps especially for superintelligent entities, since they are likely far more complex internally than can be encompassed by these crude distinctions, and may operate completely differently. There may not be anything like “models” or “goals” or “values”.
One thing in particular is that the entity’s behaviour after self-modification will be determined far more by (5) than by (1). The important point is that (5) depends upon runtime data drawn from observations and history, and for an intelligent agent is almost certainly mutable to a substantial degree.
A paperclipper that will shut down after seeing a picture of a juggling octopus doesn’t necessarily value that part of its behaviour. It doesn’t even necessarily value being a paperclipper, and may not preserve either of these over time and self-modification. If the paperclipping behaviour continues over major self-modifications then it probably is dependent upon something like a preserved value, but you can’t conclude that anything similar holds for the octopus behaviour until you actually observe it.
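To make the contrast with the earlier sketch concrete (again purely illustrative, with made-up names): the octopus branch can be present in the running code, item (1), while never making it into the agent’s reflective model of what it values, item (5), simply because the trigger never fired in its history. A rewrite driven by the self-model then drops it.

```python
def current_policy(state: dict, inbox: list) -> str:
    # Item (1): the octopus clause exists in the behaviour...
    if "juggling_octopus.jpg" in inbox:
        return "shut_down"
    return "make_paperclips"

# ...but items (4)/(5) were formed by reflecting on a history in which the
# trigger never fired, so the clause never registered as a value:
self_model = {"values": ["maximise paperclips"]}

def rewrite(self_model: dict):
    # The successor is synthesised from the self-model, not from a
    # line-by-line audit of the old code, so anything absent from the
    # model is absent from the successor.
    def successor(state: dict, inbox: list) -> str:
        if "shut down on juggling octopus" in self_model["values"]:
            if "juggling_octopus.jpg" in inbox:
                return "shut_down"
        return "make_paperclips"
    return successor

new_policy = rewrite(self_model)
assert current_policy({}, ["juggling_octopus.jpg"]) == "shut_down"
assert new_policy({}, ["juggling_octopus.jpg"]) == "make_paperclips"  # clause silently lost
```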
you can’t conclude that anything similar holds for the octopus behaviour until you actually observe it
… or unless you derive strong mathematical proofs about which of their features agentic systems will preserve under self-modification, and design your system so that it approximates these idealized agents and so that the octopus behavior counts as one of the preserved features.
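The usual idealisation behind that hope (my gloss of the textbook expected-utility argument, not something spelled out in this thread): if the current agent picks its successor by

$$\pi_{t+1} \in \arg\max_{\pi'} \; \mathbb{E}\big[\, U \mid \mathrm{successor} = \pi' \,\big],$$

where $U$ is its current goal with the octopus clause included, then any candidate that would violate the clause on reachable histories loses expected $U$ to one that keeps it, and so is never selected. Whether a real system approximates this idealisation closely enough for such a proof to bind is exactly the open question.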
If you ~randomly sample superintelligent entities from a wide distribution meeting some desiderata, as modern DL does, then yeah, there are no such guarantees. But that’s surely not the only way to design minds (much like “train an NN to do modular addition” is not the only way to write a modular-addition algorithm), and in the context of a movie, we can charitably assume that the AI was built using one of the more tractable avenues.