I’ll need a bit of time to think this through, but one quick question before I get too deep into it: what ties the agent’s shutdown-timing to the time at which the button is pressed? Is the assumption that pressing the button causes the agent to shut down, and that this is just locked into the physics of the situation, i.e. the agent can try to manipulate whether the button gets pressed but can’t change whether it’s shut down once the button is pressed?
I’ve been imagining that the button is shutdown-causing for simplicity, but I think you can suppose instead that the button is shutdown-requesting (i.e. the agent receives a signal indicating that the button has been pressed but still gets to choose whether to shut down) without affecting the points above. You’d just need to add a first step to the training procedure: training the agent to prefer shutting down when it receives the signal.
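To make the distinction concrete, here is a toy sketch of the two button semantics. Everything in it (`ButtonMode`, `agent_shuts_down`, the boolean `prefers_shutdown_on_signal`) is made up for illustration and isn’t part of the proposal itself; in particular, a single boolean is a crude stand-in for whatever preference the extra training step would instill.

```python
from enum import Enum


class ButtonMode(Enum):
    # Hypothetical labels for the two readings of the button.
    SHUTDOWN_CAUSING = "causing"        # pressing the button forces shutdown
    SHUTDOWN_REQUESTING = "requesting"  # pressing only sends a signal


def agent_shuts_down(mode, button_pressed, prefers_shutdown_on_signal):
    """Return True if the agent ends up shut down at this timestep."""
    if not button_pressed:
        return False
    if mode is ButtonMode.SHUTDOWN_CAUSING:
        # Shutdown is fixed by the environment; the agent's preference is irrelevant.
        return True
    # Shutdown-requesting: the agent sees the signal and chooses for itself.
    return prefers_shutdown_on_signal
```

In this toy model the two modes give the same outcome whenever `prefers_shutdown_on_signal` is true, which is the sense in which the added training step is meant to leave the points above unchanged.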