When you say “shutdown avoidance incentives”, do you mean that the agent/system will actively try to avoid its own shutdown? I’m not sure why comparing with the current state would cause such a problem: the state with the least impact seems like the one where the agent let itself be shutdown, or it would go against the will of another agent. That’s how I understand it, but I’m very interested in knowing where I’m going wrong.
The baseline is “I’m not shut off now, and i can avoid shutdown”, so anything like “I let myself be shutdown” would be heavily penalized (big optimal value difference).
When you say “shutdown avoidance incentives”, do you mean that the agent/system will actively try to avoid its own shutdown? I’m not sure why comparing with the current state would cause such a problem: the state with the least impact seems like the one where the agent let itself be shutdown, or it would go against the will of another agent. That’s how I understand it, but I’m very interested in knowing where I’m going wrong.
The baseline is “I’m not shut off now, and i can avoid shutdown”, so anything like “I let myself be shutdown” would be heavily penalized (big optimal value difference).