I’ve been imagining that the button is shutdown-causing for simplicity, but I think you can suppose instead that the button is shutdown-requesting (i.e. the agent receives a signal indicating that the button has been pressed but still gets to choose whether to shut down) without affecting the points above. You’d just need to add a step at the start of the training procedure: training the agent to prefer shutting down when it receives the signal.