I drop further ideas here in this comment for future reference. I may edit the comment whenever I have a new idea:
Someone had the idea that instead of giving a negative reward for acting while an alert is played, one could instead give a positive reward for performing the null action in those situations. If the positive reward is high enough, an agent like MuZero (which explicitly tries to maximize the expected predicted reward) may then be incentivized to cause the alert sound to be played during deployment. One could then add further environment details that are learned by the world model and that make it clear to the agent how to achieve the alert in deployment. This also has an analogy in the original shutdown paper.
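To make the contrast concrete, here is a minimal sketch of the two reward schemes, assuming a toy setup where the environment exposes an `alert_active` flag and a designated null action. All names and numbers are illustrative assumptions, not from any actual implementation:

```python
NULL_ACTION = 0  # hypothetical designated "do nothing" action

def penalty_reward(action, alert_active, base_reward, penalty=1.0):
    # Original scheme: negative reward for acting while the alert plays.
    # The best an agent can do during an alert is avoid the penalty.
    if alert_active and action != NULL_ACTION:
        return base_reward - penalty
    return base_reward

def bonus_reward(action, alert_active, base_reward, bonus=10.0):
    # Alternative scheme: positive reward for the null action during an alert.
    # If `bonus` exceeds the reward obtainable elsewhere, a reward-maximizing
    # agent is incentivized to *cause* the alert so it can collect the bonus.
    if alert_active and action == NULL_ACTION:
        return base_reward + bonus
    return base_reward
```

The key difference: under the penalty scheme, alert states are at best reward-neutral, while under the bonus scheme they can become the most attractive states in the environment, which is what creates the incentive to trigger the alert.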