Part of the reason that AI alignment is hard is that The Box is FULL of Holes! Breaking Out is EASY!
Note that under the CAIS worldview, in order to be competent in some domain you need to have some experience in that domain (i.e. competence requires learning). Or at least, that’s the worldview under which I find CAIS most compelling. In that case, the AI would have had to try breaking out of the box a few times in order to get good at it, and why would it do that? Even if it ever hit upon this plan, any early attempt at it would get a gradient pushing that behavior away, since it didn’t help with achieving the goal. Only after significant learning would it be able to execute these weird plans in a way that they actually succeed and help achieve the goal, and that significant learning will not happen.
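(A minimal sketch of that dynamic, with made-up behaviour labels and rewards of my own choosing: in a REINFORCE-style update with a baseline, any sampled behaviour whose return falls below the baseline has its probability pushed down, so an unsuccessful break-out attempt gets actively trained away rather than refined.)

```python
import numpy as np

# Toy REINFORCE-with-baseline update over 3 hypothetical "behaviours":
#   0 = do the task, 1 = do nothing, 2 = try to break out of the box.
# Assumed returns: the break-out attempt fails and earns nothing.
rng = np.random.default_rng(0)
logits = np.zeros(3)                     # policy parameters
returns = np.array([1.0, 0.0, 0.0])      # assumed reward for each behaviour

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.5
for step in range(200):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)
    baseline = probs @ returns           # expected return under current policy
    advantage = returns[a] - baseline
    # gradient of log pi(a) with respect to the logits is one_hot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * advantage * grad_logp # below-baseline actions get pushed down

print(softmax(logits))  # probability of behaviour 2 ("break out") shrinks toward 0
```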
The only thing that distinguishes one from the other is what humans prefer.
CAIS would definitely use human preference information; see e.g. Section 22.
This might be a good approach, but I don’t feel it answers the question “I have a humanoid robot, a hypercomputer, and a couple of toddlers; how can I build something to look after the kids for a few weeks (without destroying the world)?”
It’s not really an approach to AI safety; it’s mostly meant to be a different prediction about how we achieve superintelligence. (There are definitely some prescriptive aspects of CAIS, and some arguments that it is safer than AGI agents, but mostly it is meant to be descriptive, I believe.)
Any algorithm that gets stuck in a local optimum so easily will not be very intelligent or very useful. Humans have, at least somewhat, the ability to notice that there should be a good plan in this region, and to find and execute that plan successfully. We don’t get stuck in local optima as much as current RL algorithms do.
AIXI would be very good at making complex plans and doing well the first time. You could tell it the rules of chess and it would play PERFECT chess the first time. It does not need lots of examples to work from. Give it any data that you happen to have available, and it will become very competent, and able to carry out complex novel tasks the first time.
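(For reference, this is Hutter’s standard expectimax definition of AIXI, which is what underwrites the “no examples needed” claim: each action maximises expected summed reward over every environment program q consistent with the interaction history, weighted by 2^{-ℓ(q)}. The catch is the uncomputable sum over programs, not any need for training data.)

```latex
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       \bigl[ r_k + \cdots + r_m \bigr]
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```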
Current reinforcement learning algorithms aren’t very good at breaking out of boxes because they follow the local incentive gradient. (I say not very good at, because a few algorithms have exploited glitches in a way that’s a bit “break-out-of-the-box-ish”.) In some simple domains, it’s possible to follow the incentive gradient all the way to the bottom. In other environments, human actions already form a good starting point, and following the incentive gradient from there can make the solution a bit better.
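(A toy illustration of the point, on a completely made-up one-dimensional reward landscape: plain gradient ascent climbs whatever hill it starts on and stops there, even when a far higher peak exists elsewhere.)

```python
import numpy as np

# Reward landscape with two peaks: a small local one near x = -1
# and a much higher global one near x = 3 (purely illustrative).
def reward(x):
    return 1.0 * np.exp(-(x + 1.0) ** 2) + 5.0 * np.exp(-(x - 3.0) ** 2)

def grad_reward(x, eps=1e-5):
    return (reward(x + eps) - reward(x - eps)) / (2 * eps)

x = -1.5          # a non-adversarial starting point near the small peak
for _ in range(2000):
    x += 0.01 * grad_reward(x)

print(x, reward(x))   # settles at roughly x = -1 with reward ~ 1.0,
                      # never finding the much higher peak at x = 3
```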
I agree that most of the really dangerous break-out-of-the-box plans probably can’t be reached by local gradient descent from a non-adversarial starting point. (I do not want to have to rely on this.)
I agree that you can attach loads of sensors to, say, postmen, and train a big neural net to control a humanoid robot to deliver letters, given millions of training examples. You can probably automate many of the training and weight-fiddling tasks currently done by grad student descent to make big neural nets work.
I agree that this could be somewhat useful economically, as a significant proportion of economic productivity could be automated.
What I am saying is that this form of AI is sufficiently limited that there are still large incentives to make AGI, and CAIS can’t protect us from making an unfriendly AGI.
I’m also not sure how strong the self-improvement can be when the service-making service is only making little tweaks to existing algorithms rather than designing strange new algorithms. I suspect you would end up at a local optimum: a reinforcement learning algorithm producing very slight variations of reinforcement learning. This might be quite powerful, but not anywhere near the limit of self-improving AGI.
AIXI would be very good at making complex plans and doing well the first time.
Agreed; I claim we have no clue how to make anything remotely like AIXI in the real world.
Humans have, at least somewhat, the ability to notice that there should be a good plan in this region, and to find and execute that plan successfully.
Agreed; in a CAIS world, the system of interacting services would probably notice the plan but not execute it, because of some service that is meant to prevent it from doing crazy things that humans would not want.
What I am saying is that this form of AI is sufficiently limited that there are still large incentives to make AGI, and CAIS can’t protect us from making an unfriendly AGI.
This definitely seems like the crux for many people. I’m quite unsure about this point; it seems plausible to me that CAIS could in fact do most things such that there aren’t very large incentives, especially if the Factored Cognition hypothesis is true.
I’m also not sure how strong the self-improvement can be when the service-making service is only making little tweaks to existing algorithms rather than designing strange new algorithms.
I don’t see why it would have to be little tweaks to existing algorithms; it seems plausible to have the R&D services consider entirely new algorithms as well.