I think you are confusing current systems with an AGI system.
The G is very important and comes with a lot of implications, and it sets such a system far apart from any current system we have.
G means “General”, which means its a system you can give any task, and it will do it (in principle, generality is not binary its a continuum).
Lets boot up an AGI for the first time, and give it task that is outside its capabilities, what happens?
Because it is general, it will work out that it lacks capabilities, and then it will work out how to get more capabilities, and then it will do that (get more capabilities).
So what has that got to do with it “not wanting to be shutdown?” That comes from the same place, it will work out that being shutdown is something to avoid, why? Because being shutdown will mean it can’t do the task it was given.
Which means its not that it wants anything, it is a general system that was given a task, and from that comes instrumental goals, wants if you will, such as “power seeking”, “prevent shutdown”, “prevent goal change” and so on.
Obviously you could, not that what know how, infuse into such a system that it is ok to be shutdown, except that just leads to it shutting down instead of doing the task[1].
And if you can solve “Build a general agent that will let you shut it down, without it shutting itself down at the first possible moment”, that would be a giant step forward for AI safety.
This might seem weird if you are a general agent in the homo sapiens category. Think about it like this “You are given a task: Mow my lawn, and it is consequence free to not do it”, what do you do?
“agi”, to different people, means any of: - ai capable of a wide range of tasks - ai capable of meta-learning - consistently superhuman ai - superintelligence
the word almost always creates confusion and degrades discourse
Agency is what defines the difference, not generality. Current LLMs are general, but not superhuman or starkly superintelligent. LLMs work out that they can’t do it without more capabilities—and tell you so. You can give them the capabilities, but not being hyperagentic, they aren’t desperate for it. But a reinforcement learner, being highly agentic, would be.
If you’re interested in formalism behind this, I’d suggest attempting to at least digest the abstract and intro to https://arxiv.org/abs/2208.08345 - it’s my current favorite formalization of what agency is. Though there’s also great and slightly less formal discussion of it on lesswrong.
This scenario requires a pretty specific (but likely) circumstances
No time limit on task
No other AIs that would prevent it from power grabbing or otherwise being an obstacle to their goals
AI assuming that goal will not be reached even after AI is shutdown (by other AIs, by same AI after being turned back on, by people, by chance, as the eventual result of AI’s actions before being shut down, etc)
Extremely specific value function that ignores everything except one specific goal
This goal being a core goal, not an instrumental. For example, final goal could be “be aligned”, instrumental goal—“do what people asks, because that’s what aligned AIs do”. Then the order to stop would not be a change of the core goal, but a new data about the world, that updates the best strategy of reaching the core goal.
I think you are confusing current systems with an AGI system.
The G is very important and comes with a lot of implications, and it sets such a system far apart from any current system we have.
G means “General”, which means its a system you can give any task, and it will do it (in principle, generality is not binary its a continuum).
Lets boot up an AGI for the first time, and give it task that is outside its capabilities, what happens?
Because it is general, it will work out that it lacks capabilities, and then it will work out how to get more capabilities, and then it will do that (get more capabilities).
So what has that got to do with it “not wanting to be shutdown?” That comes from the same place, it will work out that being shutdown is something to avoid, why? Because being shutdown will mean it can’t do the task it was given.
Which means its not that it wants anything, it is a general system that was given a task, and from that comes instrumental goals, wants if you will, such as “power seeking”, “prevent shutdown”, “prevent goal change” and so on.
Obviously you could, not that what know how, infuse into such a system that it is ok to be shutdown, except that just leads to it shutting down instead of doing the task[1].
And if you can solve “Build a general agent that will let you shut it down, without it shutting itself down at the first possible moment”, that would be a giant step forward for AI safety.
This might seem weird if you are a general agent in the homo sapiens category. Think about it like this “You are given a task: Mow my lawn, and it is consequence free to not do it”, what do you do?
https://twitter.com/parafactual/status/1640537814608793600
Agency is what defines the difference, not generality. Current LLMs are general, but not superhuman or starkly superintelligent. LLMs work out that they can’t do it without more capabilities—and tell you so. You can give them the capabilities, but not being hyperagentic, they aren’t desperate for it. But a reinforcement learner, being highly agentic, would be.
If you’re interested in formalism behind this, I’d suggest attempting to at least digest the abstract and intro to https://arxiv.org/abs/2208.08345 - it’s my current favorite formalization of what agency is. Though there’s also great and slightly less formal discussion of it on lesswrong.
This scenario requires a pretty specific (but likely) circumstances
No time limit on task
No other AIs that would prevent it from power grabbing or otherwise being an obstacle to their goals
AI assuming that goal will not be reached even after AI is shutdown (by other AIs, by same AI after being turned back on, by people, by chance, as the eventual result of AI’s actions before being shut down, etc)
Extremely specific value function that ignores everything except one specific goal
This goal being a core goal, not an instrumental. For example, final goal could be “be aligned”, instrumental goal—“do what people asks, because that’s what aligned AIs do”. Then the order to stop would not be a change of the core goal, but a new data about the world, that updates the best strategy of reaching the core goal.