So I think this proposal is a specific instance of a general principle: wireheading acts as a safety valve. As long as the AI is prone to wireheading, and you leave it some path to wireheading, the AI will never bother with anything more difficult than that path. In this particular case, it will never do anything harder than hacking its human operator, and that’s where most of the safety benefit comes from.
This kind of setup creates a direct trade-off between safety and capability: the harder it is to hack the human operator, the more likely that some other escape channel is easier—manipulating electrical circuitry to push atoms around, or some weird thing like that. The AI will become an existential threat if you send in an operator who really doesn’t want to give it 100% reward, and make that operator hard to hack—then the AI has a strong incentive to gain whatever power it needs. (I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.)
My intuition says that nothing else in the proposal actually matters for safety, except the wireheading safety valve aspect. In particular, I think just giving the AI direct exploration abilities won’t mess up safety, as long as the wireheading path is present and there’s no “99.99% certainty is better than 99.98%”-type failure mode.
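(To make the “safety valve” picture concrete, here is a minimal toy sketch of my own; the candidate paths and difficulty numbers are invented for illustration and are not part of the proposal. The point is just that an agent with several routes to maximum reward takes the cheapest one, so safety holds only while hacking the operator stays the cheapest route.)

```python
# Toy illustration with made-up difficulty costs: every path below is assumed
# to end in maximum reward, so the agent simply takes the least difficult one.
paths = {
    "hack the human operator": 10.0,
    "exploit the door's kill switch": 50.0,
    "manipulate circuitry to affect the outside world": 200.0,
}

def cheapest_path(path_costs):
    """Return the least difficult route to maximum reward."""
    return min(path_costs, key=path_costs.get)

print(cheapest_path(paths))  # -> "hack the human operator"

# Making the operator harder to hack raises that cost; once it exceeds some
# escape channel's cost, the agent switches to the escape channel instead.
paths["hack the human operator"] = 500.0
print(cheapest_path(paths))  # -> "exploit the door's kill switch"
```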
(I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.)
If you’re right about this, the setup is not safe. I’m going to respond to this in the “concerns about the box” section. I don’t think a commitment to give the agent high reward if it jumps through a few hoops will save us.
In this particular case, it will never do anything harder than hacking its human operator, and that’s where most of the safety benefit comes from.
I disagree with this. The safety benefit comes from it not having outside-world instrumental goals (which it lacks if and only if the box is secure).
My intuition says that nothing else in the proposal actually matters for safety
That’s what I would conclude as well if the box were not secure.
In particular, I think just giving the AI direct exploration abilities won’t mess up safety,
See Appendix F. If the agent picks its own exploratory policies (reasonably), the agent will try every computable policy until it dies, including the policies of every simple AGI.
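(A schematic sketch of the worry, not the actual construction in Appendix F: exploration that enumerates candidate policy programs shortest-first eventually executes every short program, and some short programs implement a simple AGI. The enumeration below is my own illustration.)

```python
# Schematic sketch: enumerate candidate policy programs shortest-first and
# run one episode under each. Eventually the enumeration reaches the encoding
# of any simple AGI, which is why self-directed exploration is alarming.
from itertools import count, product

def all_programs(alphabet="01"):
    """Yield every finite string over the alphabet, shortest first."""
    for length in count(1):
        for symbols in product(alphabet, repeat=length):
            yield "".join(symbols)

def explore(num_episodes, run_episode):
    """Run one episode under each enumerated candidate policy in turn."""
    programs = all_programs()
    for _ in range(num_episodes):
        run_episode(next(programs))

explore(6, print)  # prints the first six candidate "policies": 0, 1, 00, 01, 10, 11
```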
Can you expand a bit on why a commitment to give a high reward won’t save us? Is it a matter of the AI seeking more certainty, or is there some other issue?
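(To make the kind of certainty-seeking I have in mind concrete, a hypothetical numeric sketch; the probabilities and the cost figure are invented for illustration. If the commitment leaves even a sliver of doubt, an expected-reward maximizer prefers seizing power whenever the marginal certainty gain exceeds the cost of doing so.)

```python
# Hypothetical numbers illustrating the certainty-seeking failure mode.
MAX_REWARD = 1.0

p_commit = 0.9998        # assumed chance of max reward if it just trusts the commitment
p_power  = 0.9999        # assumed chance after first seizing whatever power it needs
cost_of_power = 0.00005  # assumed cost of seizing power, in reward-equivalent units

ev_commit = p_commit * MAX_REWARD
ev_power  = p_power * MAX_REWARD - cost_of_power

print(ev_commit, ev_power, ev_power > ev_commit)
# -> roughly 0.9998 0.99985 True: the 0.0001 of extra certainty outweighs the cost.
```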
An example of a mind-killing “mind,” to me, would be one that, even with no direct, veridical content, is able to put the AI into an environment that seems too hostile.
The goal at stake is the ability not just to put a mind into the environment you think of as your true goal. (My current model of the world is that there is a single goal, and only a single goal can be achieved in this world.)
The AI isn’t allowed to try to get out of an environment within which it is in control. It can pursue its own goals, such as making money, in the same way that people enjoy huge amounts of free time.
The AI is allowed to run in a completely unpredictable environment, outside the experimental space. Its options would then be:
it can make thousands of copies of itself, taking only some of its resources and collecting enough money to run a very, very complicated AI;
it can make thousands of copies of itself that only carry out this very complicated behavior;
it can make thousands of copies of itself that work together, collecting far more money over the course of their evolution (and perhaps also through the hands of other Minds), until it reaches the point where it can no longer make millions of copies of itself, or finds that it is in a simulated universe, as it intends.
So what’s the right thing to do? Where should we be going with this?
“I see you mean something else” is equivalent to “I don’t know how you mean something different.”
You don’t think that, say, it’s better to be safe. You don’t know what’s going wrong. So you don’t want to put up with the problem and start trying new strategies when no one has already done something stupid. (It’s not clear to me at all how to resolve this problem. If you can’t be certain how to resolve it, then you’re not safe, because all you can do is run into more basic issues and bigger problems, though you can always take a more careful approach.)
There are probably other things (e.g. more complicated solutions, more complicated problems, etc.) which are more expensive, and I don’t think they are worth the risk to human civilization. I think this is a useful suggestion, but it depends a bit on how it relates to the rest, and it’s probably not something you can write up very precisely.
What if we had some simple way of solving this problem without needing the box to be safe? I think a solution would involve some serious technical effort, and an understanding that the problem won’t be solved merely by declaring it “solved”; it is the problem of Friendly AI, and I don’t see here that we aren’t missing some big conceptual insight.
One way I would go about solving the problem would be to build a safe AGI and to build the safety solution along with it. “Solving” problems that way won’t always be safe, and it won’t make the exact problem safe either, but any solution to safe AI will probably itself be safe. It would be nice if it worked for practical purposes; if it worked for a big goal, the problem would be solved safely.
In the world where the solutions are safe, there are no fundamentally scary alternatives so long as their safety is secure, and so the safety solution won’t be scary to humans.
So, yes, it is an AGI safety problem that the system of AGIs will face, because it will not need to be dangerous. But what if the system of AGIs does not need to be safe? The only reason to have an AI safety problem is that we want a system which is safe. So our AI safety problem will not always seem scary to humans, but it definitely will be. We might not be able to solve it one way or another.
The way to make progress on safety is to build an AGI system that can create an AGI system sufficiently smart that it matches at least one of the world’s most intelligent humans. A system whose safety net consists of highly trained humans is extremely difficult to build, and so on. The safety net of an AGI system, by contrast, can scale with time and with capability.
I think the problem is that, if the world is already as dumb as we think, we should want to do great safety research. If you want to do great safety research, you are going to have to be a lot smarter than the average scientist or programmer. Otherwise you can’t build an AGI that can actually do anything about the world’s challenges. You have to be the first in person.
I would take a second to say that I want to focus more on these questions than on actually designing an AGI.