Thanks for the post!
One thing I wonder: shouldn’t an impact measure give a value to the baseline? What I mean is that in the most extreme examples, the tradeoffs you show arise because sometimes the baseline is “what should happen” and at other times the baseline is “what should not happen” (like killing a pedestrian). In cases where the baseline sucks, one should act differently; and in cases where the baseline is great, changing it should come with a penalty.
I assume that there’s an issue with this picture. Do you know what it is?
The baseline is not intended to indicate what should happen, but rather what happens by default. The role of the baseline is to filter out effects that were not caused by the agent, to avoid penalizing the agent for them (which would produce interference incentives). Explicitly specifying what should happen usually requires environment-specific human input, and impact measures generally try to avoid this.
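As a rough sketch of what that filtering looks like in practice (simplified from an AUP-style formulation; the function names and the noop action here are illustrative, not taken from any particular implementation):

```python
# Simplified, AUP-style impact penalty sketch (illustrative names).
# The penalty is the change in attainable value, over a set of auxiliary
# goals, between taking `action` and the do-nothing baseline. Effects that
# would also occur under the baseline cancel out and are not penalized.

def impact_penalty(state, action, aux_q_fns, noop):
    """aux_q_fns: callables q(state, action) estimating attainable value
    for auxiliary goals; noop: the baseline 'do nothing' action."""
    diffs = [abs(q(state, action) - q(state, noop)) for q in aux_q_fns]
    return sum(diffs) / len(aux_q_fns)

# The agent then optimizes something like:
#   task_reward(state, action) - lam * impact_penalty(state, action, aux_q_fns, noop)
# where lam controls how strongly impact is penalized.
```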
I understood that the baseline you presented was a description of what happens by default, but I wondered if there was a way to distinguish between different judgments about what happens by default. Intuitively, killing someone by not doing something feels different from not killing someone by not doing something.
So my question was a check to see whether impact measures consider such judgments (which apparently they don’t), and if not, what the problem with doing so would be.
I would say that impact measures don’t consider these kinds of judgments. The “doing nothing” baseline can be seen as analogous to the agent never being deployed, e.g. in the Low Impact AI paper. If the agent is never deployed, and someone dies in the meantime, then it’s not the agent’s responsibility and is not part of the agent’s impact on the world.
I think the intuition you are describing partly arises from the choice of language: “killing someone by not doing something” vs “someone dying while you are doing nothing”. The word “killing” is an active verb that carries a connotation of responsibility. If you taboo this word, does your question persist?
Yeah—how do you know what should happen?
In the specific example of the car, can’t you compare the impact of the two next states (the baseline and the result of braking) with the current state? Killing someone should probably be considered a bigger impact than braking (and I think it is for attainable utility).
But I guess the answer is less clear-cut for cases like the door.
Ah, yes, the “compare with current state” baseline. I like that one a lot, and my thoughts regularly drift back to it, but AFAICT it unfortunately leads to some pretty heavy shutdown avoidance incentives.
Since we already exist in the world, we’re optimizing the world in a certain direction towards our goals. Each baseline represents a different assumption about using that information (see the original AUP paper for more along these lines).
Another idea is to train a “dumber” inaction policy and use that for the stepwise inaction baseline at each state. This would help encode “what should happen normally”, and then you could think of AUP as performing policy improvement on the dumb inaction policy.
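A minimal sketch of that variant, reusing the illustrative penalty above and assuming a separately trained baseline_policy that stands in for “what should happen normally”:

```python
def impact_penalty_vs_policy(state, action, aux_q_fns, baseline_policy):
    """Stepwise baseline given by a weaker learned policy instead of a literal no-op."""
    baseline_action = baseline_policy(state)  # what the "dumb" policy would do here
    diffs = [abs(q(state, action) - q(state, baseline_action)) for q in aux_q_fns]
    return sum(diffs) / len(aux_q_fns)
```

On this reading, the trained agent is only allowed to deviate from the dumb policy where the deviation is cheap in attainable-value terms, which is one way to cash out the policy-improvement framing.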
When you say “shutdown avoidance incentives”, do you mean that the agent/system will actively try to avoid its own shutdown? I’m not sure why comparing with the current state would cause such a problem: the state with the least impact seems like the one where the agent lets itself be shut down, since otherwise it would be going against the will of another agent. That’s how I understand it, but I’m very interested in knowing where I’m going wrong.
The baseline is “I’m not shut off now, and I can avoid shutdown”, so anything like “I let myself be shut down” would be heavily penalized (big optimal value difference).
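A toy illustration of that value difference (the numbers are made up, purely to show the shape of the incentive):

```python
# Made-up attainable-value numbers for one auxiliary goal.
q_current      = 10.0  # agent is running now and could still pursue the goal
q_if_shut_down = 0.0   # once shut down, it can no longer pursue anything
q_if_continues = 9.5   # agent keeps operating roughly as before

# With a "compare with current state" baseline, the penalty tracks the gap
# from the current state, so allowing shutdown looks maximally impactful.
penalty_allow_shutdown = abs(q_if_shut_down - q_current)  # 10.0
penalty_keep_running   = abs(q_if_continues - q_current)  # 0.5

print(penalty_allow_shutdown, penalty_keep_running)
```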