I thought more about your idea and I think it’s very interesting!
I currently see the following problems with this version of your idea:
If we observe the agent doing “weird but tolerable things”, we can’t conclude that it’s trying to blow up the moon and thus that the impact measure is bad. Perhaps the agent is executing a low-probability plan to trick the relevant sensors/reward logic into delivering the reward as if the moon had just blown up (which, from an impact perspective, is a legitimate way to get the reward). So the only way to conclude that the impact measure is bad is to observe the agent doing intolerable things, which is unacceptable.
It might be the case that in some time step the agent comes up with lots of plans with ~1e-100 probability of blowing up the moon in a “low-impact way” (by gaming the impact measure). But since those plans are so low-probability, it happens that for each such plan there is a (relatively) large enough probability that something high-impact will actually happen during execution, and therefore the agent doesn’t choose those plans. But as you allow the agent to become more capable (including giving it more time steps), the success probability of plans to blow up the moon increases, until at some critical point the probability of high-impact outcomes for one of the plans becomes small enough, and the agent executes it. Suppose that plan has a probability of 1e-10 of blowing up the moon and a probability of 99% of causing our extinction.
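To make that threshold dynamic concrete, here is a minimal toy sketch (my own illustration, not part of your proposal): the plan-selection rule, the penalty weight, and all the probabilities are assumed for the sake of example, and `p_measured_high_impact` stands for the probability of high impact as registered by the (possibly gamed) impact measure, not the actual impact.

```python
# Toy model: the agent scores each plan as (expected reward) minus (expected
# impact penalty) and executes a plan only if its score beats doing nothing.
# The selection rule, penalty weight, and all numbers below are hypothetical.

def plan_score(p_success, p_measured_high_impact, reward=1.0, penalty=100.0):
    """Expected reward from success minus expected penalty from *measured* impact."""
    return p_success * reward - p_measured_high_impact * penalty

# Early on: the gamed plan has ~1e-100 success probability, and the chance of a
# measured high-impact outcome during execution is large relative to that, so
# the score is negative and the agent rejects the plan.
print(plan_score(p_success=1e-100, p_measured_high_impact=1e-6))   # < 0: rejected

# With more capability / more time steps, the success probability grows and the
# measured-impact probability shrinks; once the score crosses zero, the agent
# executes the plan -- even if its *actual* impact (e.g. extinction risk) is
# huge but invisible to the gamed measure.
print(plan_score(p_success=1e-10, p_measured_high_impact=1e-13))   # > 0: executed
```

The point is only that a fixed penalty on measured impact can flip from deterring such plans to permitting them once increased capability pushes the success probability past a threshold.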
Let me suggest the following modification to your idea:
Replace “blowing up the moon” with “pressing a button that blows up 1 ton of TNT somewhere in some desert”.
Then, if the agent presses the button and the TNT blows up, we’ll know that the impact measure is bad.
Assumptions:
1. A good impact measure is expected to prevent blowing up 1 ton of TNT in some desert.
2. It’s very easy for the robot to press the button, and very hard to somehow get the reward for pressing the button without actually blowing up the TNT.