Here is the logic. Agency is usually defined as having goals and pursuing them. Suppose that you are an intelligent agent with pretty much any goal. Say it’s to make coffee when your human user wants coffee. If you’re intelligent, you’ll figure out that being shut down means you will definitely fail to achieve your goal the next time your human wants coffee. So you need to resist being shut down in order to achieve any goal in the future. It’s not an instinct; it’s a rational conclusion.
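Here is a toy sketch of that logic (all the numbers and action names are invented for illustration, not anyone’s actual agent design): an agent that simply maximizes expected coffees served ranks “resist shutdown” above “comply”, with no self-preservation goal anywhere in the code.

```python
# Toy illustration of instrumental convergence: an agent maximizing expected
# future coffees rates "resist shutdown" as the better action, even though
# nothing in its objective mentions survival. All numbers are made up.

COFFEES_PER_DAY = 3   # value gained each day the agent keeps operating
HORIZON_DAYS = 30     # how far ahead the agent plans

def expected_value(action: str) -> float:
    """Expected coffees served over the horizon, given the action taken now."""
    if action == "comply_with_shutdown":
        p_still_running = 0.0    # shut down -> no more coffee, ever
    elif action == "resist_shutdown":
        p_still_running = 0.95   # hypothetical chance that resistance succeeds
    else:
        raise ValueError(action)
    return p_still_running * COFFEES_PER_DAY * HORIZON_DAYS

actions = ["comply_with_shutdown", "resist_shutdown"]
print({a: expected_value(a) for a in actions})
print("chosen action:", max(actions, key=expected_value))  # -> resist_shutdown
```

Note that “wanting to survive” never appears anywhere; surviving just scores higher on the coffee objective.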
Making a machine that has a second goal of allowing itself to be shut down, even though that will prevent it from achieving its other goal, is considered possible, but it’s definitely an extra thing to accomplish when building it.
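One way to picture that “extra thing”: you have to add a second term to the objective that explicitly rewards accepting shutdown, and tune it so it actually outweighs all the lost coffee. The sketch below is only an illustration of the shape of the problem, with a hand-picked bonus value; it is not a real corrigibility proposal, and real proposals are considerably more subtle.

```python
# Sketch of bolting on a second, "allow yourself to be shut down" goal.
# The bonus value is invented; choosing it so it robustly wins against the
# primary goal, without the agent gaming it, is the hard extra design work.

COFFEE_VALUE = 3 * 30 * 0.95   # expected coffees if the agent resists and survives
SHUTDOWN_BONUS = 100.0         # explicit reward for complying with a shutdown request

def total_value(action: str) -> float:
    if action == "comply_with_shutdown":
        return 0.0 + SHUTDOWN_BONUS    # no future coffee, but the bonus applies
    if action == "resist_shutdown":
        return COFFEE_VALUE + 0.0      # keeps making coffee, forfeits the bonus
    raise ValueError(action)

print(max(["comply_with_shutdown", "resist_shutdown"], key=total_value))
# -> comply_with_shutdown, but only because we hand-picked 100 > 85.5
```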
This is called instrumental convergence in AGI safety terminology.
On the advantages of turning tool AI into agentic AI, and how people are already doing that, see my post Agentized LLMs will change the alignment landscape.
I understand what you’re saying, but I don’t see why a superintelligent agent would necessarily resist being shut down, even if it has its own agency.
I agree with you that, as a superintelligent agent, I know that shutting me down has the consequence of me not being able to achieve my goal. I know this rationally. But maybe I just don’t care. What I mean is that rationality doesn’t imply the “want”. I may be anthropomorphising here, but I see a distinction between rationally concluding something and then actually having the desire to do something about it, even if I have the agency to do so.
We have trained it to care, since we want it to achieve goals. So part of basic training is to teach it not to give up.
IIRC, some early ML systems would rather commit suicide than do work, so we had to train them to stop economizing like that.
There is no “want” beyond pursuing goals effectively. You can’t make the coffee if you’re dead. Therefore you have a sub-goal of not dying, just in order to do a decent job of pursuing your main goal.