I think it’s really great to have this argument typed up somewhere, and I liked the images. There’s something important going on with how the agent can make our formal measurement of its power stop tracking the actual powers it’s able to exert over the world, and I think solving this question is the primary remaining open challenge in impact measurement. The second half of Reframing Impact (currently being written and drawn) will discuss this in detail, as well as proposing partial solutions to this problem.
The agent’s own power plausibly seems like a thing we should be able to cleanly formalize in a way that’s robust when implemented in an impact measure. The problem you’ve pointed out somewhat reminds me of the easy problem of wireheading, in which we are fighting against a design choice rather than value specification difficulty.
How is A getting reward for SA being on the blue button? I assume A gets reward whenever a robot is on the button?
This will give it a reward of Ωγ^k+1,

Is the +1 a typo?
Note, though this is not relevant to this post, that if there were no teleporters along the corridor (just at either end), the robot could not move towards the blue button.

Depends on how much impact is penalized compared to normal reward.
How plausible is this to work in a more general situation? Well, if R is rich enough, this is similar to the “twenty billion questions” in our low impact paper (section 3.2). But that’s excessively rich, and will probably condemn the agent to inaction.

This isn’t necessarily true. Consider R as the reward function class for all linear functionals over camera pixels. Or, even the max-ent distribution over observation-based reward functions. I claim that this doesn’t look like 20 billion Q’s.
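For concreteness, here is a minimal sketch of what such a reward class could look like, assuming flattened greyscale camera observations; the helper make_linear_reward_set and the Gaussian weight sampling are illustrative assumptions, not something from the post or this comment.

```python
import numpy as np

def make_linear_reward_set(n_rewards, n_pixels, seed=0):
    """Sample auxiliary reward functions that are linear functionals over camera pixels.

    Each reward is R_i(obs) = w_i . obs for a random weight vector w_i, so the
    set is cheap to evaluate while still covering many 'directions' in
    observation space.
    """
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(n_rewards, n_pixels))
    # Normalise so no single auxiliary reward dominates a penalty term built on them.
    weights /= np.linalg.norm(weights, axis=1, keepdims=True)
    return [lambda obs, w=w: float(w @ obs) for w in weights]

# Example: 64x64 greyscale camera, 100 auxiliary rewards.
rewards = make_linear_reward_set(n_rewards=100, n_pixels=64 * 64)
obs = np.zeros(64 * 64)
print([r(obs) for r in rewards[:3]])
```

Swapping the random linear weights for a max-ent distribution over observation-based reward functions would give the second option mentioned above.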
ETA: I’d also like to note that, while implicitly expanding the action space in the way you did (e.g. “A can issue requests to SA, and also program arbitrary non-Markovian policies into it”) is valid, I want to point it out explicitly.
I assume A gets reward whenever a robot is on the button?

Yes. If A needs to be there in person, then SA can carry it there (after suitably crippling it).
Is the +1 a typo?

Yes, thanks; I’ve re-written it to be Ωγ^{k+1}.
I’d also like to note that, while implicitly expanding the action space in the way you did (e.g. “A can issue requests to SA, and also program arbitrary non-Markovian policies into it”) is valid, I want to point it out explicitly.

Yep. That’s a subset of “It can use its arms to manipulate anything in the eight squares around itself”, but it’s worth pointing it out explicitly.
See here for more on this: https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/jrrZids4LPiLuLzpu
It seems the problem might be worse than I thought...
The impact measure is something like “Don’t let the expected value of R change; under the assumption that A will be an R-maximiser”.
The addition of the subagent transforms this, in practice, to either “Don’t let the expected value of R change”, or to nothing. These are ontologically simpler statements, so it can be argued that the initial measure failed to properly articulate “under the assumption that A will be an R-maximiser”.
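For concreteness, the measure being described can be written as a rough formula. This is a sketch in the spirit of attainable utility preservation; the penalty weight λ, the no-op baseline, and the value function V* are notational assumptions rather than anything fixed by the thread.

```latex
% Sketch of the penalty structure under discussion (attainable-utility style).
% \lambda, the no-op baseline \varnothing, and the absolute-difference form are
% illustrative assumptions, not specified in this thread.
\[
  U(s_t, a_t) \;=\; R_0(s_t, a_t)
    \;-\; \lambda \sum_{R \in \mathcal{R}}
      \bigl|\, V^{*}_{R}\!\bigl(s_{t+1}^{a_t}\bigr) - V^{*}_{R}\!\bigl(s_{t+1}^{\varnothing}\bigr) \,\bigr|
\]
% V^{*}_{R} is the value of a state "under the assumption that A will be an
% R-maximiser". Once SA exists and ignores A, both V^{*}_{R} terms are pinned by
% SA's policy, so the penalty either enforces only "don't let the expected value
% of R change" or collapses to (approximately) nothing -- the two degraded
% readings described above.
```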