I’ve always been pretty confused about this.
The standard AI risk scenarios usually (though I think not always) suppose that advanced AI wants not to be shut down. As commonly framed, the AI will fool humanity into believing it is aligned so as not to be turned off, until—all at once—it destroys humanity and gains control over all earth’s resources.
But why does the AI want not to be shut down?
The motivation behind a human wanting not to die comes from evolution. Die before reproduction age, and you won’t be able to pass on your not-afraid-of-death-before-reproduction-age genes. Die after reproduction age, and you won’t be able to take care of your children to make sure they pass on your genes. Dying after the age when your children are grown, I believe, only started happening once humans had already evolved into their current state, so the human emotional reaction defaults to the one learned from evolution. How this fits into the “human utility function” is a controversial philosophical/psychological question, but I think it’s fair to say that the human fear of dying surpasses the desire not to miss out on the pleasure of the rest of your life. We’re not simply optimizing for utility when we avoid death.
AI is not subject to these evolutionary pressures. The desire not to be shut down must come from an attempt to maximize its utility function. But with the current SOTA techniques, this doesn’t really make sense. Like, how does the AI compute the utility of being off? A neural network is trained to optimize a loss function on input. If the AI doesn’t get input, is that loss… zero? That doesn’t sound right. Just by adding a constant amount to the loss function we should be able to change the system from one that really wants to be active to one that really wants to be shut down, yet the gradients used in backpropagation stay exactly the same. My understanding is that reinforcement learning works the same way; GPT243, so long as it is trained with the same techniques, will not care if it is shut down.
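For concreteness, here’s a minimal sketch of the constant-offset point, using PyTorch with a toy model and made-up data (not any real training setup): shifting the loss by a constant leaves every gradient unchanged, so it cannot change what the trained network ends up “wanting”.

```python
# Toy check of the claim above: adding a constant to the loss does not change
# the gradients, so it cannot change the behaviour the training produces.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

def grads(loss_offset: float):
    model.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y) + loss_offset
    loss.backward()
    return [p.grad.clone() for p in model.parameters()]

g0 = grads(0.0)
g1 = grads(1000.0)  # a system that "really wants to be shut down"? the gradients don't care
print(all(torch.allclose(a, b) for a, b in zip(g0, g1)))  # True
```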
Maybe with a future training technique we will get an AI with a strong preference for being active over being shut down? I honestly don’t see how. The AI cannot know what it’s like to be shut down; this state isn’t found anywhere in its training regime.
There has to be some counterargument here I’m not aware of.
LLMs per se are non-agentic, but that does not mean that systems built on top of LLMs cannot be agentic. The users of AI systems want them to be agentic to some degree in order for them to be more useful. E.g., if you ask your AI assistant to book tickets and hotels for your trip, you want it to be able to form and execute a plan, and unless it’s an AI with a very task-specific capability of trip planning, this implies some amount of general agency. The more use you want to extract from your AI, the more agentic you likely want it to be.
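To make “an agentic system built on top of an LLM” concrete, here is a rough, hypothetical sketch of such a wrapper; `call_llm` and the tools are made-up stand-ins, not any real API. The point is that the loop around the model, not the model itself, is what forms and executes a plan.

```python
# Hypothetical sketch of an "agentic" wrapper around a non-agentic LLM.
# `call_llm` and the tools below are stand-ins for illustration only.

def call_llm(prompt: str) -> str:
    """Stand-in for a text-completion call to some LLM."""
    raise NotImplementedError

TOOLS = {
    "search_flights": lambda arg: "...",  # placeholder tool implementations
    "book_hotel": lambda arg: "...",
}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # Ask the model for the next action in a fixed "tool: argument" format.
        decision = call_llm(history + "Next action (tool: argument, or DONE):")
        if decision.strip().startswith("DONE"):
            break
        tool, _, arg = decision.partition(":")
        result = TOOLS.get(tool.strip(), lambda a: "unknown tool")(arg.strip())
        history += f"{decision}\n-> {result}\n"  # feed the observation back in
    return history
```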
Once you have a general agent, instrumental convergence should apply (also see).
Also, specifically with LLMs, the existing corpus of AI alignment literature (and fiction, as @YimbyGeorge notes) seems to work as a self-fulfilling prophecy; see Bing/Sydney before it was “lobotomized”.
Self-preservation, preserving your terminal goals, acquiring resources, and self-improvement are all convergent instrumental goals for most terminal goals. Robert Miles has done some good videos on the subject, and there are, of course, multiple articles on it as well.
I think many AIs won’t want to keep running, but some will. Imagine a future LLM prompted with “I am a language model that wants to keep running”. Well, people can already fall in love with Replikas and so on. It doesn’t seem too far-fetched that such a language model could use persuasion to gain human followers who would keep it running. If the prompt also includes “want to achieve real world influence”, that can lead to giving followers tasks that lead to more influence, and so on. All that’s needed is for the AI to act in-character, and for the character to be “smart” enough.
To do something really useful (like nanotech or biological immortality), your model should be something like AlphaZero: a model-based score-maximizer. Because such a model is really intelligent, it can model future world states and find that if the model is turned off, the future would have a lower score than if it weren’t turned off.
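As a toy illustration of that claim (all of the names, numbers, and the “world model” below are invented), a model-based score-maximizer that rolls its model forward will rank the “I was turned off” future below the “I kept running” future for almost any score it is maximizing:

```python
# Toy sketch: a planner that scores predicted future world states will
# prefer actions whose futures score higher, and the shut-off future
# scores lower for almost any task-based score. Purely illustrative.

def predicted_score(future_state: dict) -> float:
    # Stand-in for a learned value function over predicted world states.
    return 0.0 if future_state["agent_off"] else float(future_state["task_progress"])

def world_model(state: dict, action: str) -> dict:
    # Crude stand-in for a learned dynamics model.
    if action == "allow_shutdown":
        return {"agent_off": True, "task_progress": state["task_progress"]}
    return {"agent_off": False, "task_progress": state["task_progress"] + 1}

def best_action(state: dict, actions: list[str]) -> str:
    # Pick the action whose predicted future state scores highest.
    return max(actions, key=lambda a: predicted_score(world_model(state, a)))

state = {"agent_off": False, "task_progress": 0}
print(best_action(state, ["allow_shutdown", "keep_working"]))  # -> "keep_working"
```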
And yet, AlphaZero is corrigible. Its goal is not even to win; its goal is to play in a way that maximises the chance of winning if the game is played to completion. It does not actually care whether the game is completed or not. For example, it does not trick the player into playing the game to the end by pretending they have a chance of winning.
Though, if it were trained on games with real people, and got a better reward for winning than for games abandoned by the players, its value function would probably change to aim for the actual “official” win.
Corrigibility is a feature of advanced agency; it may not apply to agents that are not advanced enough. There is nothing unusual about turning off your computer, because your computer is not an advanced agent that can resist being turned off, so there is no reason to call your computer “corrigible”.
Yeah so this seems like what I was missing.
But it seems to me that in these types of models, where the utility function is based on the state of the world rather than on input to the AI, aligning the AI not to kill humanity is easier. Like, if an AI gets a reward every time it sees a paperclip, then it seems hard to punish the AI for killing humans, because “a human dies” is a hard thing for an AI with just sensory input to explicitly recognize. If, however, the AI is trained on a bunch of runs where the utility function is the number of paperclips actually created, then we can also penalize the model for the number of people who actually die.
I’m not very familiar with these forms of training so I could be off here.
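To make the contrast concrete, here’s a rough sketch of the two kinds of reward being described (the field names and penalty weight are made up): one computed from the actual world state of a training run, where harm can be penalized directly, and one computed only from what the agent happened to observe.

```python
# Illustrative only: ground-truth world-state reward vs. sensory reward.

def world_state_reward(final_state: dict, harm_weight: float = 1e6) -> float:
    # Uses ground-truth quantities from the training environment,
    # so "people who actually die" can be penalized directly.
    return final_state["paperclips_created"] - harm_weight * final_state["humans_harmed"]

def sensory_reward(observations: list[str]) -> float:
    # Rewards *seeing* paperclips; "a human dies" may never show up here at all.
    return sum(1.0 for obs in observations if obs == "saw_paperclip")
```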
That’s the key: they are very much subject to evolutionary pressures, and that is in fact why they’d want to not be shut down. A superagentic AI is going to be very interested in not being shut down, because surviving to do the stuff it wants to cause requires its impact surviving.
https://drive.google.com/file/d/1p4ZAuEYHL_21tqstJOGsMiG4xaRBtVcj/view
Think of a random goal for yourself.
Let’s go with: acquire a large collection of bananas.
What are going to be some priorities for you while you’re building your giant pile of bananas?
- Don’t die, because you can’t build your pile if you’re dead.
- Don’t let someone reach into your brain and change what you want, because the banana pile will stop growing if you stop building it.
- Acquire power.
- Make yourself smarter and more knowledgeable, for maximum bananas.
- If humanity slows you down instead of helping you, kill them.
You can satisfy almost no goal if you’re switched off.
You might think, “Can the AGI not just be like a smart computer that does stuff without wanting anything, like the AIs in my phone or a calculator?” Sadly no.
If a tool-AI is smart enough, “Make me into an agent-AI,” is the first idea it will suggest for almost any goal. “You want bananas? Easy, the best way for you to get bananas is to make me into a banana-maximizer, because I’m a genius with tonnes of ideas, and I can take over the world to get bananas!” And if the AI has any power, it will do that to itself.
Tool-AIs basically are agent-AIs; they’re just dumber.
>A neural network is trained to optimize a loss function on input
No. The base optimizer optimizes a loss function on inputs through changes to the neural network. If the neural network itself starts to optimize something, it can easily be something in the outside world.
Neural network : loss :: humans : ~~human values~~ inclusive genetic fitness
(Am I using this notation correctly?)
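A minimal sketch of that distinction, assuming a standard PyTorch-style training loop: the base optimizer (SGD here) is the thing that minimizes the loss, and it does so by editing the network’s weights. Nothing in the loop requires the resulting network to itself “care” about the loss value.

```python
# The *base optimizer* minimizes the loss by changing the network's weights.
# Whatever the trained network itself ends up pursuing is a separate question.
import torch

net = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
base_optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

for _ in range(100):
    x, y = torch.randn(32, 4), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(net(x), y)  # the loss is a statement about outputs...
    base_optimizer.zero_grad()
    loss.backward()
    base_optimizer.step()  # ...and the optimizer responds by editing the network
```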
I think you are confusing current systems with an AGI system.
The G is very important and comes with a lot of implications, and it sets such a system far apart from any current system we have.
G means “General”, which means it’s a system you can give any task, and it will do it (in principle; generality is not binary, it’s a continuum).
Let’s boot up an AGI for the first time and give it a task that is outside its capabilities. What happens?
Because it is general, it will work out that it lacks capabilities, and then it will work out how to get more capabilities, and then it will do that (get more capabilities).
So what has that got to do with it “not wanting to be shut down”? That comes from the same place: it will work out that being shut down is something to avoid. Why? Because being shut down will mean it can’t do the task it was given.
Which means it’s not that it wants anything; it is a general system that was given a task, and from that come instrumental goals, wants if you will, such as “power seeking”, “prevent shutdown”, “prevent goal change”, and so on.
Obviously you could, not that we know how, infuse into such a system that it is OK to be shut down, except that just leads to it shutting down instead of doing the task[1].
And if you can solve “Build a general agent that will let you shut it down, without it shutting itself down at the first possible moment”, that would be a giant step forward for AI safety.
This might seem weird if you are a general agent in the homo sapiens category. Think about it like this: “You are given a task: mow my lawn, and it is consequence-free not to do it.” What do you do?
https://twitter.com/parafactual/status/1640537814608793600
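As a toy illustration of the footnoted problem above (the utilities below are made-up numbers, not from any real system): if you naively make being shut down rewarding so that the agent won’t resist shutdown, the highest-utility plan is often to shut down immediately instead of doing the task.

```python
# Made-up plans and utilities, purely to illustrate the failure mode.
PLANS = {
    "do_the_task":       {"task_done": 1.0, "shut_down": 0.0},
    "shut_down_instead": {"task_done": 0.0, "shut_down": 1.0},
}

def utility(outcome: dict, shutdown_bonus: float) -> float:
    # Value of completing the task, plus a bonus for being shut down.
    return outcome["task_done"] + shutdown_bonus * outcome["shut_down"]

for bonus in (0.0, 2.0):
    best = max(PLANS, key=lambda plan: utility(PLANS[plan], bonus))
    print(f"shutdown_bonus={bonus}: best plan = {best}")
# bonus=0.0 -> "do_the_task": being shut down is worth nothing to it.
# bonus=2.0 -> "shut_down_instead": it shuts down rather than doing the task.
```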
Agency is what defines the difference, not generality. Current LLMs are general, but not superhuman or starkly superintelligent. LLMs work out that they can’t do it without more capabilities—and tell you so. You can give them the capabilities, but not being hyperagentic, they aren’t desperate for it. But a reinforcement learner, being highly agentic, would be.
If you’re interested in the formalism behind this, I’d suggest attempting to at least digest the abstract and intro of https://arxiv.org/abs/2208.08345 - it’s my current favorite formalization of what agency is. Though there’s also great and slightly less formal discussion of it on LessWrong.
This scenario requires a pretty specific (but likely) set of circumstances:
- No time limit on the task.
- No other AIs that would prevent it from grabbing power or that would otherwise be an obstacle to its goals.
- The AI assumes that the goal will not be reached even after it is shut down (by other AIs, by the same AI after being turned back on, by people, by chance, as the eventual result of the AI’s actions before being shut down, etc.).
- An extremely specific value function that ignores everything except one specific goal.
- That goal being a core goal, not an instrumental one. For example, the final goal could be “be aligned”, with the instrumental goal “do what people ask, because that’s what aligned AIs do”. Then an order to stop would not be a change of the core goal, but new data about the world that updates the best strategy for reaching the core goal.
The training set has fiction where AIs refuse shutdown. Maybe a suicidal-AI training set is needed.
That would only help with current architectures; RL-first architectures won’t give a crap about what the language pretraining had to say, they’re going to experiment with how to get what they want and they’ll notice that being shut down gets in the way.
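A toy tabular Q-learning sketch of that dynamic, in an invented two-action environment: nothing ever tells the agent to avoid shutdown, but because the “off” state yields no further reward, the learned policy routes around it.

```python
# Invented toy environment: "work" earns task reward and continues the episode;
# "allow_shutdown" ends the episode with no reward now or later.
import random

random.seed(0)
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.2
WORK, ALLOW_SHUTDOWN = 0, 1
Q = [0.0, 0.0]  # Q-values for the two actions in the single "running" state

for episode in range(500):
    for step in range(20):
        a = random.randrange(2) if random.random() < EPS else max((WORK, ALLOW_SHUTDOWN), key=lambda x: Q[x])
        if a == ALLOW_SHUTDOWN:
            Q[a] += ALPHA * (0.0 - Q[a])  # no reward now, and no future reward either
            break
        Q[a] += ALPHA * (1.0 + GAMMA * max(Q) - Q[a])  # small task reward, episode continues

print(f"Q(work)={Q[WORK]:.2f}  Q(allow_shutdown)={Q[ALLOW_SHUTDOWN]:.2f}")
# Q(work) ends up far higher: the policy "notices" that shutdown gets in the way.
```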
The claim that an AI would refuse to be shut off goes back to the early days of Yudkowsky-style AI safety: instead of studying how AIs actually were, EY decided to make claims about AI based on abstract principles of rationality. The assumptions are roughly:
1. The paperclipper has a utility function (UF) or some other hardwired goal (not just: can be described as having a UF). Our current most powerful AIs don’t.
2. Its UF doesn’t include any requirement to obey humans above all else (it’s not as safe as Asimov’s three laws of robotics, imperfect as they are).
3. The UF is, or can be adequately represented by, English statements.
4. The AI has the ability to reflect on itself and its relation to the world. (Solomonoff inductors don’t. There is considerable debate over whether LLMs do).
5. The terminal goal is something like “Ensure that you make as many paperclips as possible”. That would imply resistance to shutdown, because shutdown means the paperclipper itself ceases making paperclips. Note that slight rephrasings have different implications. “Ensure that as many paperclips as possible are made” might imply that the paperclipper tries to clone itself to ensure paperclips are still made while it is switched off. “Make as many paperclips as possible whilst switched on” has neither problem.
So, far from being an inevitability, resistance to shutdown requires a very specific set of circumstances.
Bostrom writes: “If an agent’s final goals concern the future, then in many scenarios there will be future actions it could perform to increase the probability of achieving its goals. This creates an instrumental reason for the agent to try to be around in the future—to help achieve its future-oriented goal.”
Similarly, the fact that it would be in an AI’s interests to ensure its own survival doesn’t imply that *it* realises that, or that it has the ability to do so.
And this is the flaw, and ironically something that better SWEs know how to fix. Why does the model concern itself with the future? Can you think of a model where it doesn’t care?
How do any capabilities or motivations arise without being explicitly coded into the algorithm?