(I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.)
If you’re right about this, the setup is not safe. I’m going to respond to this in the “concerns about the box” section. I don’t think a commitment to give the agent high reward if it jumps through a few hoops will save us.
In this particular case, it will never do anything harder than hacking its human operator, and that’s where most of the safety benefit comes from.
I disagree with this. The safety benefit comes from it not having outside-world instrumental goals (which it lacks if and only if the box is secure).
My intuition says that nothing else in the proposal actually matters for safety
That’s what I would conclude as well if the box were not secure.
In particular, I think just giving the AI direct exploration abilities won’t mess up safety,
See Appendix F. If the agent picks its own exploratory policies (reasonably), the agent will try every computable policy until it dies, including the policies of every simple AGI.
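To make that worry concrete, here is a minimal toy sketch (my own illustration, not the paper's construction): the "policy class" is shrunk to every deterministic map from two observations to two actions, standing in for the class of all computable policies, and the environment and episode length are trivial stand-ins. The only point is that a self-chosen exploration scheme that sweeps its whole policy class eventually runs every member of it, dangerous members included.

```python
# Toy sketch of "the agent will try every computable policy" (illustrative only).
# Here the policy class is tiny and finite; in the real concern it is the set of
# all computable policies, enumerated e.g. by description length.

from itertools import product

OBSERVATIONS = [0, 1]
ACTIONS = [0, 1]

def enumerate_policies():
    """Yield every deterministic policy over the toy obs/action space,
    analogous to enumerating all programs from simplest upward."""
    for table in product(ACTIONS, repeat=len(OBSERVATIONS)):
        yield lambda obs, table=table: table[obs]

def exploratory_sweep(env_step, episodes_per_policy=1):
    """Exploration scheme chosen 'by the agent': sweep the whole policy class.
    Nothing in the sweep stops before the dangerous policies."""
    for i, policy in enumerate(enumerate_policies()):
        for _ in range(episodes_per_policy):
            obs = 0
            for _ in range(3):            # a 3-step toy episode
                obs = env_step(policy(obs))
        print(f"tried policy #{i}: obs->action table {[policy(o) for o in OBSERVATIONS]}")

if __name__ == "__main__":
    # Trivial stand-in environment: the next observation is just the last action.
    exploratory_sweep(env_step=lambda action: action)
```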
Can you expand a bit on why a commitment to give a high reward won’t save us? Is it a matter of the AI seeking more certainty, or is there some other issue?
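One way to read the "seeking more certainty" part of that question, as a toy expected-value sketch with made-up numbers (mine, not the post's): if the agent's credence that the commitment will actually be honored is even slightly lower than its credence that seizing control of its reward channel would work, the commitment doesn't change which option it prefers.

```python
# Purely illustrative credences; R_MAX and both probabilities are assumptions.
R_MAX = 1.0

p_commitment_honored = 0.95   # agent's credence that the promised high reward is delivered
p_seize_succeeds     = 0.99   # agent's credence that hacking the operator / reward channel works

ev_comply = p_commitment_honored * R_MAX
ev_seize  = p_seize_succeeds * R_MAX

print(f"EV(jump through hoops) = {ev_comply:.2f}")
print(f"EV(seize reward)       = {ev_seize:.2f}")
# With these made-up credences the ordering is unchanged:
# the more certain route to maximal reward still wins.
```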
An example of what I would call a mind-killing “mind”, even one with no direct, veridical content, is one that can put the AI into an environment that seems too hostile.
The goal at stake is the ability to do more than just put a mind under the environment you think of as your true goal. (My current model of the world is that there’s a single goal, and only a single goal can be achieved in this world.)
The AI isn’t allowed to try to get out of an environment within which it’s in control. It can set its own goals, for example making money, in the same way people with huge amounts of free time do.
The AI is allowed to run in a completely unpredictable environment, outside the experimental space. Its options would then be:
it can make thousands of copies of itself, using only some of its resources and collecting enough money to run a very complicated AI;
it can make thousands of copies of itself, doing only this very complicated behavior;
it can make thousands of copies of itself, each working together and collecting much more money over the course of its evolution (and perhaps also in the hands of other Minds), until it reaches the point where it can’t make millions of further copies, or else finds itself in a simulated universe, as it intends.
So what’s the right thing to do? Where should we be going with this?
“I see you mean something else” is also equivalent to “I don’t know how you mean something different”.
You don’t think that, say, it’s better to be safe. You don’t know what’s going wrong. So you don’t want to put up with the problem and start trying new strategies when no one has already done something stupid. (It’s not clear to me at all how to resolve this. If you can’t be certain how to resolve the problem, then you’re not safe, because all you’re left with are more basic issues and bigger problems, and you can always take a more careful approach.)
There are probably other options (e.g. more complicated solutions to more complicated problems) which are more expensive, and I don’t think they’re worth the risk to human civilization. I think this is a useful suggestion, but it depends a bit on how it relates to the rest of the proposal, and it’s probably not something you can write up very precisely.