Ah, yes, the “compare with current state” baseline. I like that one a lot, and my thoughts regularly drift back to it, but AFAICT it unfortunately leads to some pretty heavy shutdown avoidance incentives.
Since we already exist in the world, we’re already optimizing it in a certain direction, towards our goals. Each baseline represents a different assumption about using that information (see the original AUP paper for more along these lines).
Another idea is to train a “dumber” inaction policy and use it for the stepwise inaction baseline at each state. This would help encode “what should happen normally”, and then you could think of AUP as performing policy improvement on the dumb inaction policy.
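To make that concrete, here’s a minimal sketch of the stepwise penalty with the literal no-op swapped out for a learned inaction policy. The names (q_aux, pi_inaction) and the exact scaling are mine for illustration, not from the paper:

```python
def aup_penalty(state, action, q_aux, pi_inaction, scale=1.0):
    """Penalty = average absolute change in attainable utility, measured
    against whatever the (dumb) inaction policy would have done from here."""
    baseline_action = pi_inaction(state)  # encodes "what should happen normally"
    total = 0.0
    for q in q_aux:  # one Q-function per auxiliary reward
        total += abs(q(state, action) - q(state, baseline_action))
    return total / (len(q_aux) * scale)


def aup_reward(state, action, primary_reward, q_aux, pi_inaction, lam=0.1):
    """Shaped reward: task reward minus the scaled penalty."""
    return primary_reward(state, action) - lam * aup_penalty(
        state, action, q_aux, pi_inaction
    )
```

The agent then maximizes this shaped reward, which is where the “policy improvement on the dumb inaction policy” framing comes from.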
When you say “shutdown avoidance incentives”, do you mean that the agent/system will actively try to avoid its own shutdown? I’m not sure why comparing with the current state would cause such a problem: the state with the least impact seems like the one where the agent lets itself be shut down; otherwise it would be going against the will of another agent. That’s how I understand it, but I’m very interested in knowing where I’m going wrong.
The baseline is “I’m not shut off now, and I can avoid shutdown”, so anything like “I let myself be shut down” would be heavily penalized (big optimal value difference).
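With some made-up numbers (one auxiliary reward, values purely illustrative), the asymmetry looks like this:

```python
# Attainable utility for one auxiliary goal under the "compare with current state" baseline.
v_aux_now      = 10.0  # current state: not shut down, auxiliary goal still reachable
v_aux_shutdown = 0.0   # after being shut down, the auxiliary goal is unreachable
v_aux_avoided  = 10.0  # after blocking the shutdown, nothing changes

penalty_allow = abs(v_aux_shutdown - v_aux_now)  # = 10.0 -> letting shutdown happen is heavily penalized
penalty_avoid = abs(v_aux_avoided - v_aux_now)   # =  0.0 -> resisting shutdown looks impact-free
```

So relative to the current state, allowing shutdown destroys all the attainable utility the agent currently has, while preventing it registers as no impact at all.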