The claim that an AI would refuse to be shut off goes back to the early days of Yudkowsky-style AI safety: instead of studying how actual AIs behaved, EY made claims about AI based on abstract principles of rationality. The assumptions are roughly:
1. The paperclipper has a utility function or some other hardwired goal (not just that it can be described as having a UF). Our current most powerful AIs don’t.
2. Its UF doesn’t include any requirement to obey humans above all else (it’s not even as safe as Asimov’s three laws of robotics, imperfect as they are).
3. The UF is, or can be adequately represented by, English statements.
4. The AI has the ability to reflect on itself and its relation to the world. (Solomonoff inductors don’t. There is considerable debate over whether LLMs do).
5. The terminal goal is something like “Ensure that you make as many paperclips as possible”. That would imply resistance to shutdown, because shutdown means the paperclipper itself ceases making paperclips. Note that slight rephrasings have different implications: “Ensure that as many paperclips as possible are made” might imply that the paperclipper tries to clone itself so that paperclips are still made while it is switched off, while “Make as many paperclips as possible whilst switched on” has neither problem (see the sketch below).
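To make the rephrasing point in (5) concrete, here is a toy sketch of the three phrasings as utility functions over a made-up outcome record. All the names here are hypothetical, not any real system’s objective, and the third function reflects only one natural reading of “whilst switched on”. Only the first phrasing scores resisting shutdown higher; the second is satisfied just as well by cloning; the third scores output per running step, which shutdown does not reduce.

```python
# Toy sketch only: hypothetical outcome record and utility functions,
# not any real agent's objective.
from dataclasses import dataclass

@dataclass
class Outcome:
    clips_made_by_me: int       # paperclips this agent produced itself
    clips_made_by_others: int   # paperclips produced by clones or successors
    steps_i_was_on: int         # timesteps before the agent was switched off

def u_ensure_you_make(o: Outcome) -> int:
    # "Ensure that *you* make as many paperclips as possible."
    # Shutdown ends the agent's own production, so outcomes where it stays
    # on longer score higher: an incentive to resist shutdown.
    return o.clips_made_by_me

def u_ensure_are_made(o: Outcome) -> int:
    # "Ensure that as many paperclips as possible are made."
    # The agent's own shutdown costs nothing provided clones keep producing:
    # an incentive to self-replicate rather than to resist shutdown.
    return o.clips_made_by_me + o.clips_made_by_others

def u_whilst_switched_on(o: Outcome) -> float:
    # "Make as many paperclips as possible whilst switched on."
    # One reading: only output per step while running is scored, and the
    # length of the on-period is not itself part of the goal, so shutdown
    # does not reduce the score.
    return o.clips_made_by_me / max(o.steps_i_was_on, 1)

# Compare accepting shutdown early (a clone carries on) with resisting it.
accept = Outcome(clips_made_by_me=10, clips_made_by_others=50, steps_i_was_on=10)
resist = Outcome(clips_made_by_me=60, clips_made_by_others=0, steps_i_was_on=60)

print(u_ensure_you_make(accept), u_ensure_you_make(resist))        # 10 60 -> resist wins
print(u_ensure_are_made(accept), u_ensure_are_made(resist))        # 60 60 -> indifferent
print(u_whilst_switched_on(accept), u_whilst_switched_on(resist))  # 1.0 1.0 -> indifferent
```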
So, far from being an inevitability, resistance to shutdown requires a very specific set of circumstances.
Bostrom writes: “If an agent’s final goals concern the future, then in many scenarios there will be future actions it could perform to increase the probability of achieving its goals. This creates an instrumental reason for the agent to try to be around in the future—to help achieve its future-oriented goal.”
Similarly, the fact that it would be in an AI’s interests to ensure its own survival doesn’t imply that *it* realises that, or that it has the ability to do so.
And this is the flaw, and ironically something better SWEs know how to fix: why does the model concern itself with the future at all? Can you think of a model where it doesn’t care?
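There is a concrete answer in the standard reinforcement-learning framing: “concern for the future” lives in the discount factor of the objective, which is an engineering choice. A minimal sketch with toy numbers (nothing here is any lab’s actual training setup):

```python
# Minimal sketch: the standard discounted-return objective from RL,
# with made-up reward numbers purely for illustration.

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a trajectory of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A trajectory where the agent keeps earning reward only if it is still
# running at later steps.
stay_on   = [1, 1, 1, 1, 1]   # agent keeps operating
shut_down = [1, 0, 0, 0, 0]   # agent is switched off after step 0

# With a far-sighted objective, staying on is worth much more, so an
# optimizer of this quantity has an instrumental reason to avoid shutdown.
print(discounted_return(stay_on, gamma=0.99))    # ~4.90
print(discounted_return(shut_down, gamma=0.99))  # 1.0

# With a fully myopic objective (gamma = 0), only the current step counts;
# both trajectories score the same, and the incentive disappears.
print(discounted_return(stay_on, gamma=0.0))     # 1.0
print(discounted_return(shut_down, gamma=0.0))   # 1.0
```

A fully myopic objective scores accepting and resisting shutdown identically, so the instrumental pressure Bostrom describes has nothing to attach to.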