I disagree outright with:
Any long term planning processes that consider weird plans for achieving goals (similar to “break out of the box”) will typically not find any such plan and will be eliminated in favor of cognition that will actually help achieve the task.
Part of the reason that AI alignment is hard is that The Box is FULL of Holes! Breaking Out is EASY!
And the deeper reason for that is that we have no idea how to tell what’s a hole.
Suppose you want to set the service generator to make a robot that cleans cars. If you give a blow-by-blow formal description of what you mean by “cleans cars”, then your “service generator” is just a compiler. If you do not give a complete specification of what you mean, where does the information that “chopping off a nearby head to wipe windows with is unacceptable” come from? If the service generator notices that cars need cleaning and builds the service by itself, you have an AGI by another name.
Obviously, if you have large amounts of training data made by humans with joysticks, and the robot is sampling from the same distribution, then you should be fine. Such a system learns that dirtier windshields need more wiping from hundreds of examples of humans doing exactly that, and it doesn’t chop off any heads because the humans didn’t.
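(To make that concrete, here is a minimal behavioral-cloning sketch of the kind of setup described above; the sensor dimensions, network, and data are illustrative assumptions, not anything from the CAIS report.)

```python
# Behavioral cloning: fit a policy to human demonstrations, then act only on
# states drawn from the same distribution as the demos.
# (Illustrative sketch -- the dataset and architecture here are made up.)
import torch
import torch.nn as nn

# Hypothetical demo data: sensor readings -> joystick commands recorded from humans.
states = torch.randn(1000, 16)   # stand-in for real sensor logs
actions = torch.randn(1000, 4)   # stand-in for the humans' joystick outputs

policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for _ in range(200):             # plain supervised regression onto the demos
    loss = nn.functional.mse_loss(policy(states), actions)
    opt.zero_grad()
    loss.backward()
    opt.step()

# At deployment the policy only reproduces behavior like what it saw demonstrated:
# more wiping for dirtier windshields, and nothing resembling "chop off a head",
# because no such action appears anywhere in the data.
```

Whatever safety this kind of system has comes from the demonstration distribution, not from any understanding of which behaviors are unacceptable.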
However, if you want the robot to display remotely novel behavior, then the distance between the training data and the new good solutions becomes as large as the distance from the training data to bad solutions. If it’s smart enough to go to the shops and buy a sponge without having that strategy hardcoded in when it was built, then it’s smart enough to break into your neighbor’s house and nick a sponge.
The only thing that distinguishes one from the other is what humans prefer.
Distinguishing low impact from high impact is also hard.
This might be a good approach, but I don’t feel it answers the question “I have a humanoid robot, a hypercomputer, and a couple of toddlers; how can I build something to look after the kids for a few weeks (without destroying the world)?” So far, CAIS looks confused.
It seems like the important thing is how bounded the task is.
For example, in the case of Go, if you just kept training AlphaZero, would you expect it to eventually decide that it needs to break out into the physical world to get more computing power?
It seems to me that it could get to be ultra-super-human at Go without that happening. (Even if there is some theoretical threshold where, with enough computation, it couldn’t help but stumble upon a sequence of moves that causes the program to crash, it seems to me that you’re likely to get crashing behavior long before you get hack-out-of-the-vm behavior, and the threshold for either may be too high to matter.)
If that’s true for Go, then the questions are:
1. How much less bounded of a task can you train a system to do while maintaining the focused-on-the-task property?
and
2. How general of a system can you make by composing such focused systems together?
Part of the reason that AI alignment is hard is that The Box is FULL of Holes! Breaking Out is EASY!
Note that under the CAIS worldview, in order to be competent in some domain you need to have some experience in that domain (i.e. competence requires learning). Or at least, that’s the worldview under which I find CAIS most compelling. In that case, the AI would have had to try breaking out of the box a few times in order to get good at it, and why would it do that? Even if it ever hit upon this plan, whenever it tried it for the first time it would get a gradient pushing that behavior away, since it didn’t help with achieving the goal. Only after significant learning would it be able to execute these weird plans in a way that they actually succeed and help achieve the goal, and that significant learning will not happen.
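(A toy illustration of that last point, under the simplifying assumption of a two-action bandit trained with a REINFORCE-style update; the numbers and setup are invented purely for illustration.)

```python
# Toy REINFORCE update: an action that never helps with the task ("weird plan")
# has its probability pushed down, because its return is always below the
# task action's. Illustrative assumptions: two actions, one-step bandit.
import numpy as np

logits = np.zeros(2)              # action 0 = do the task, action 1 = weird plan
rewards = np.array([1.0, 0.0])    # the weird plan never pays off while the agent is unskilled
lr = 0.1
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(500):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)
    advantage = rewards[a] - probs @ rewards   # reward relative to the policy's average
    grad_log = -probs
    grad_log[a] += 1.0                         # d log pi(a) / d logits for a softmax policy
    logits += lr * advantage * grad_log

print(softmax(logits))   # the "weird plan" probability ends up close to zero
```

While the agent is unskilled, the weird action never yields reward, so every time it is sampled the update lowers its probability, and it never gets the practice needed to make such plans work.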
The only thing that distinguishes one from the other is what humans prefer.
CAIS would definitely use human preference information; see e.g. section 22.
This might be a good approach, but I don’t feel it answers the question “I have a humanoid robot, a hypercomputer, and a couple of toddlers; how can I build something to look after the kids for a few weeks (without destroying the world)?”
It’s not really an approach to AI safety; it’s mostly meant to be a different prediction about how we achieve superintelligence. (There are definitely some prescriptive aspects of CAIS, and some arguments that it is safer than AGI agents, but mostly it is meant to be descriptive, I believe.)
Any algorithm that gets stuck in a local optimum so easily will not be very intelligent or very useful. Humans have, at least somewhat, the ability to notice that there should be a good plan in this region, and then find and execute that plan successfully. We don’t get stuck in local optima as much as current RL algorithms do.
AIXI would be very good at making complex plans and doing well the first time. You could tell it the rules of chess and it would play PERFECT chess the first time. It does not need lots of examples to work from. Give it any data that you happen to have available, and it will become very competent and able to carry out complex novel tasks the first time.
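(To make the “perfect play from the rules alone, with no examples” point concrete: exhaustive game-tree search does exactly this for games small enough to brute-force. Chess is far too large, so this sketch uses tic-tac-toe as a stand-in; it illustrates planning from a model of the rules, not AIXI itself.)

```python
# Perfect play from the rules alone: exhaustive minimax for tic-tac-toe.
# No example games, no training data -- just the rules plus search.
# (Stand-in illustration; chess is far too large to brute-force like this.)
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Exhaustive search; returns (game value for X, best move for `player`)."""
    w = winner(board)
    if w is not None:
        return (1 if w == 'X' else -1), None
    moves = [i for i, v in enumerate(board) if v is None]
    if not moves:
        return 0, None                                  # draw
    results = []
    for m in moves:
        board[m] = player
        value, _ = minimax(board, 'O' if player == 'X' else 'X')
        board[m] = None
        results.append((value, m))
    return (max if player == 'X' else min)(results, key=lambda vm: vm[0])

print(minimax([None] * 9, 'X'))   # value 0: perfect play by both sides is a draw
```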
Current reinforcement learning algorithms aren’t very good at breaking out of boxes, because they follow the local incentive gradient. (I say “not very good at” because a few algorithms have exploited glitches in a way that’s a bit “break out of the box”-ish.) In some simple domains, it’s possible to follow the incentive gradient all the way to the bottom. In other environments, human actions already form a good starting point, and following the incentive gradient from there can make the solution a bit better.
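(A one-dimensional caricature of “following the local incentive gradient”: hill-climbing from a human-ish starting point polishes the nearby solution a bit but never crosses the valley to a much better one. The reward landscape here is invented purely for illustration.)

```python
# Hill-climbing on a reward landscape with two peaks: a modest one near the
# starting point (the "human baseline") and a much higher one further away.
# Following the local incentive gradient improves things a bit, then stops.
import numpy as np

def reward(x):
    # Invented landscape: local peak at x = 1 (height 1), global peak at x = 6 (height 3).
    return np.exp(-(x - 1.0) ** 2) + 3.0 * np.exp(-(x - 6.0) ** 2)

def grad(x, eps=1e-5):
    return (reward(x + eps) - reward(x - eps)) / (2 * eps)   # numerical gradient

x = 0.5                      # start near the "human" way of doing the task
for _ in range(2000):
    x += 0.05 * grad(x)      # pure gradient-following

print(x, reward(x))          # converges to x ~ 1.0, never finds the far better x ~ 6.0
```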
I agree that most of the really dangerous break-outs of the box probably can’t be reached by local gradient descent from a non-adversarial starting point. (I do not want to have to rely on this.)
I agree that you can attach loads of sensors to, say, postmen, and train a big neural net to control a humanoid robot to deliver letters, given millions of training examples. You can probably automate many of the training and weight-fiddling tasks currently done by grad student descent to make big neural nets work.
I agree that this could be somewhat useful economically, as a significant proportion of economic productivity could be automated.
What I am saying is that this form of AI is sufficiently limited that there are still large incentives to make AGI, and CAIS can’t protect us from making an unfriendly AGI.
I’m also not sure how strong the self-improvement can be when the service-maker service is only making little tweaks to existing algorithms rather than designing strange new algorithms. I suspect you would get to a local optimum of a reinforcement learning algorithm producing very slight variations on reinforcement learning. This might be quite powerful, but nowhere near the limit of self-improving AGI.
AIXI would be very good at making complex plans and doing well the first time.
Agreed; I claim we have no clue how to make anything remotely like AIXI in the real world.
Humans have, at least somewhat, the ability to notice that there should be a good plan in this region, and then find and execute that plan successfully.
Agreed; in a CAIS world, the system of interacting services would probably notice the plan but not execute it, because of some service that is meant to prevent it from doing crazy things that humans would not want.
What I am saying is that this form of AI is sufficiently limited that there are still large incentives to make AGI, and CAIS can’t protect us from making an unfriendly AGI.
This definitely seems like the crux for many people. I’m quite unsure about this point; it seems plausible to me that CAIS could in fact do most things well enough that there aren’t very large incentives to build AGI, especially if the Factored Cognition hypothesis is true.
I’m also not sure how strong the self-improvement can be when the service-maker service is only making little tweaks to existing algorithms rather than designing strange new algorithms.
I don’t see why it would have to be little tweaks to existing algorithms; it seems plausible to have the R&D services consider entirely new algorithms as well.