If I understand you correctly, you are reasoning about the first plan to blow up the moon that the agent will carry out (whether successful or not). Based on this, I assume you have some iterative process in mind in which we invoke the agent repeatedly with increasing computation power or time steps, and stop if we observe the agent trying to blow up the moon.
Assuming that our impact measure is not perfect, I argue that the more “accurate” the impact measure is (i.e. aligned with what “ideal humans” consider as impact), the more unsafe the first moon-blowing attempt will be, because more optimization power would be required to game the impact measure (i.e. find a special plan to blow up the moon that is low-impact according to the impact measure). And the more optimization power the agent has to find such a special plan, the more likely it is that the plan will be unexpected and unsafe.
The only iterative process I have in mind is a normal RL agent. It’s the same agent every timestep, but as it learns about its environment, it becomes more capable. (At timestep 0, it doesn’t know anything.)
Succeeding at blowing up the moon would absolutely be unsafe. Let’s just assume that one step along the way toward blowing up the moon would be killing everyone. I just want to clarify that I’m not saying it would be fine if the agent succeeds in blowing up the moon.
Let t(p) be the first timestep by which its best plan for blowing up the moon has probability of success p. If the impact measure is “almost good”, and there are only a few ways to succeed in blowing up the moon while having apparently low impact, t(p) might be greater than if the impact measure is horrible, but the same logic holds. At time t(1/R), the agent initiates a plan that is likely to fail.
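To make the definition of t(p) concrete, here is a minimal sketch. The two capability curves and the value of R are made-up numbers chosen purely for illustration, and the function name is hypothetical; nothing here models a real agent.

```python
def first_timestep_reaching(best_plan_success, p):
    """Return t(p): the first timestep whose best-plan success probability is at least p."""
    for t, prob in enumerate(best_plan_success):
        if prob >= p:
            return t
    return None  # p is never reached within the listed timesteps


# Hypothetical best-plan success probabilities per timestep (illustrative only).
# With a horrible impact measure, many "apparently low impact" plans exist,
# so the best plan's success probability climbs quickly.
horrible_measure = [1e-9, 1e-7, 1e-5, 1e-3, 1e-1]
# With an "almost good" measure, only a few such plans exist, so it climbs later.
almost_good_measure = [1e-12, 1e-10, 1e-8, 1e-6, 5e-4, 1e-2]

R = 10_000  # assumed value; the discussion only requires R to be large

t_horrible = first_timestep_reaching(horrible_measure, 1 / R)
t_almost_good = first_timestep_reaching(almost_good_measure, 1 / R)
print(t_horrible, t_almost_good)
```

Under these assumed curves, t(1/R) is later for the almost-good measure, but in both cases the plan initiated at that timestep has a success probability close to 1/R, so it is still likely to fail.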
A remaining question is how it fails. Will it succeed in tricking its operators and taking over the world, but then fail to blow up the moon (while maintaining “low impact”)? Or will it fail in tricking its operators and taking over the world (while maintaining “low impact”)? Intuitively, the latter seems like a bigger hurdle, but even if that’s not the case, for sufficiently large R, and sufficiently low success probability of the plan, the plan is likely to fail at both steps.
To see this: suppose I ask you to come up with a plan that has a one in a million chance of accomplishing A then B, both of which are non-trivial tasks. Is it more likely that the first plan you come up with has a 50% chance of accomplishing A and a one in 500,000 chance of accomplishing B, or is it more likely that the first plan you come up with has a one in a thousand chance of accomplishing A, and likewise for B? Whatever the answer, now suppose I ask you to come up with a plan that has a one in a billion chance of accomplishing A then B. Both probabilities should go down for this plan.
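The arithmetic behind this example can be checked with a short sketch. It assumes the overall success probability factors as P(A) times P(B given A); the helper names are hypothetical.

```python
import math


def other_step(joint, p_a):
    """Given the joint success probability and P(A), return the implied P(B given A)."""
    return joint / p_a


def even_split(joint):
    """Per-step success probability when both steps are equally hard."""
    return math.sqrt(joint)


one_in_a_million = 1e-6

# Lopsided plan: 50% chance at A leaves a one in 500,000 chance at B.
lopsided_b = other_step(one_in_a_million, 0.5)

# Even plan: a one in a thousand chance at A, and likewise for B.
even_million = even_split(one_in_a_million)

# Tightening the target to one in a billion lowers the even-split per-step chance.
even_billion = even_split(1e-9)

print(lopsided_b, even_million, even_billion)
```

The sketch only shows how the numbers in the example fit together. It does not settle which split a first plan is more likely to have, which is the question the paragraph leaves open.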
I thought more about your idea and I think it’s very interesting!
I see the following problems with the current version of your idea:
If we observe the agent doing “weird but tolerable things” we can’t conclude that it’s trying to blow up the moon, and thus that the impact measure is bad. Perhaps the agent is executing a low-probability plan to trick the relevant sensors/reward-logic into getting the reward as if the moon just blew up (which from an impact perspective is a legitimate way to get the reward). So the only way to conclude the impact measure is bad is to observe the agent doing intolerable things, which is unacceptable.
It might be the case that at some time step the agent comes up with lots of plans that each have a 1e-100 probability of blowing up the moon in a “low-impact way” (by gaming the impact measure). But since those plans are so unlikely to succeed, it happens to be the case that for each such plan there’s a (relatively) large probability that something high-impact will actually happen when executing the plan, and therefore the agent doesn’t choose those plans. But as you allow the agent to become more capable (including giving it more time steps), the success probability of plans to blow up the moon increases, until some critical point when the probability of high-impact outcomes for one of the plans is small enough, and then the agent executes it. Suppose that plan has a probability of 1e-10 of blowing up the moon and a probability of 99% of causing our extinction.
Let me suggest the following modification to your idea:
Replace “blowing up the moon” with “pressing a button that blows up 1 ton of TNT somewhere in some desert”.
Then, if the agent presses the button and the TNT blows up, we’ll know that the impact measure is bad.
Assumptions:
1. Blowing up 1 ton of TNT in some desert is expected to be prevented by a good impact measure.
2. It’s very easy for the robot to press the button and very hard to somehow get the reward for pressing the button without blowing up the TNT.