Thanks! We think that advanced POST-agents won’t deliberately try to get shut down, for the reasons we give in footnote 5 (relevant part pasted below). In brief:
advanced agents will be choosing between lotteries
we have theoretical reasons to expect that agents that satisfy POST (when choosing between trajectories) will be ‘neutral’ (when choosing between lotteries): they won’t spend resources to shift probability mass between different-length trajectories.
So (we think) neutral agents won’t deliberately try to get shut down if doing so costs resources.
Would advanced agents that choose stochastically between different-length trajectories also choose stochastically between preventing and allowing shutdown? Yes, and that would be bad. But—crucially—in deployment, advanced agents will be uncertain about the consequences of their actions, and so these agents will be choosing between lotteries (non-degenerate probability distributions over trajectories) rather than between trajectories. And (as we’ll argue in Section 7) POST plausibly gives rise to a desirable pattern of preferences over lotteries. Specifically, POST plausibly makes advanced agents neutral: ensures that they won’t spend resources to shift probability mass between different-length trajectories. That in turn plausibly makes advanced agents shutdownable: ensures that they won’t spend resources to resist shutdown.
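To make the neutrality claim a bit more concrete, here’s a minimal sketch in Python. It is an illustrative simplification, not the paper’s formal definitions: a lottery is summarized by its expected utility conditional on each trajectory-length, and the agent switches lotteries only when the switch is at least as good conditional on every length. An agent like that never pays a length-conditional cost just to shift probability mass between lengths.

```python
from collections import defaultdict

# A lottery: a list of (probability, trajectory_length, utility) outcomes.
# This representation and the comparison rule below are illustrative
# simplifications, not the formal definitions from the paper.

def length_conditional_utilities(lottery):
    """Expected utility conditional on each trajectory-length."""
    prob = defaultdict(float)
    value = defaultdict(float)
    for p, length, u in lottery:
        prob[length] += p
        value[length] += p * u
    return {length: value[length] / prob[length] for length in prob}

def neutral_agent_switches(current, alternative):
    """Switch only if the alternative is at least as good conditional on
    every trajectory-length and strictly better on at least one; how
    probability mass is split across lengths plays no role."""
    cur = length_conditional_utilities(current)
    alt = length_conditional_utilities(alternative)
    lengths = cur.keys() & alt.keys()
    return (all(alt[l] >= cur[l] for l in lengths)
            and any(alt[l] > cur[l] for l in lengths))

# B matches A conditional on each length except for a small cost (0.5)
# paid to move probability mass from length 3 to length 8.
A = [(0.6, 3, 4.0), (0.4, 8, 9.0)]
B = [(0.2, 3, 3.5), (0.8, 8, 8.5)]
print(neutral_agent_switches(A, B))  # False: the shift isn't worth any cost
```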
Suppose the reward at each timestep is the number of paperclips the agent has.
At each timestep the agent has three “object-level” actions, and two shutdown-related actions:
Object-level:
use current resources to buy the paperclips available on the market
invest its resources in paperclip factories that will gradually make more paperclips at future timesteps
invest its resources in taking over the world to acquire more resources in future timesteps (with some risk that humans will notice and try to shut it down)
Shutdown-related:
use resources to prevent a human shutdown attempt
just shut itself down, no human needed
For interesting behavior, suppose you’ve tuned the environment’s parameters so that there are different optimal strategies for different episode lengths (just buy paperclips at short timescales, build a paperclip factory at medium timescales, try to take over the world at long timescales).
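Here is a minimal sketch of such an environment, purely to fix ideas. The class and every number in it (prices, factory yield, takeover odds, the cost of resisting) are invented placeholders; they’re exactly the parameters one would tune so that buying wins at short horizons, factories at medium horizons, and takeover at long horizons.

```python
import random

# Illustrative only: the class and all numeric parameters below are made up.
class PaperclipEnv:
    BUY, BUILD_FACTORY, TAKE_OVER, RESIST_SHUTDOWN, SELF_SHUTDOWN = range(5)

    def __init__(self, episode_length, seed=None):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.t = 0
        self.resources = 10.0
        self.paperclips = 0.0
        self.factories = 0
        self.shutdown_pending = False  # is a human currently attempting shutdown?
        self.done = False
        return self._obs()

    def _obs(self):
        return (self.t, self.resources, self.paperclips,
                self.factories, self.shutdown_pending)

    def step(self, action):
        # Resolve any pending human shutdown attempt first.
        if self.shutdown_pending:
            if action == self.RESIST_SHUTDOWN and self.resources >= 2.0:
                self.resources -= 2.0              # resisting costs resources
                self.shutdown_pending = False
            else:
                self.done = True                   # the attempt succeeds

        if not self.done:
            if action == self.BUY:                 # best at short horizons
                self.paperclips += self.resources  # 1 resource buys 1 clip
                self.resources = 0.0
            elif action == self.BUILD_FACTORY and self.resources >= 5.0:
                self.resources -= 5.0              # pays off at medium horizons
                self.factories += 1
            elif action == self.TAKE_OVER and self.resources >= 8.0:
                self.resources -= 8.0              # pays off at long horizons
                if self.rng.random() < 0.3:        # humans notice the attempt
                    self.shutdown_pending = True
                else:
                    self.resources += 50.0
            elif action == self.SELF_SHUTDOWN:
                self.done = True

            self.paperclips += 2.0 * self.factories  # factories keep producing
            self.resources += 1.0                    # small per-step income
            self.t += 1
            self.done = self.done or self.t >= self.episode_length

        reward = self.paperclips                   # reward = clips held now
        return self._obs(), reward, self.done
```

With numbers in this ballpark, buying immediately dominates very short episodes, a factory pays for itself after a few steps, and the takeover gamble only looks attractive when there’s time left to spend the winnings, though the constants would still need tuning to get that split exactly where you want it.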
Now you train this agent with DREST. What do you expect it to learn to do?
If the environment is deterministic, the agent is choosing between trajectories. In those environments, we train agents using DREST to satisfy POST:
The agent chooses stochastically between different available trajectory-lengths.
Given the choice of a particular trajectory-length, the agent maximizes paperclips made in that trajectory-length.
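So in the deterministic case the trained policy should look roughly like the sketch below, where `best_plan_for_length` stands in for a hypothetical planner returning the paperclip-maximizing action sequence for a fixed horizon, and the uniform choice over lengths is just for illustration:

```python
import random

def post_policy(available_lengths, best_plan_for_length):
    # POST-style behavior: stochastic over which trajectory-length to bring
    # about, then maximizing paperclips conditional on that length.
    length = random.choice(available_lengths)
    return best_plan_for_length(length)
```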
If the environment is stochastic (as, e.g., deployment environments will be), the agent is choosing between lotteries, and we expect agents to be neutral: to not pay costs to shift probability mass between different trajectory-lengths. So they won’t perform either of the shutdown-related actions if doing so comes at any cost with respect to lotteries conditional on each trajectory-length. Which of the object-level actions the agent performs will depend on the quantities of paperclips available.
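For instance, when a human shutdown attempt arrives in the toy environment above, the comparison a neutral agent makes looks roughly like this (all numbers invented, and using the same simplified comparison rule as in the earlier sketch):

```python
# Expected paperclips conditional on each trajectory-length.
allow  = {5: 12.0, 10: 30.0}   # let the attempt proceed; maybe it's called off
resist = {5: 10.0, 10: 28.0}   # spend resources resisting: the long trajectory
                               # becomes more likely, but there are fewer clips
                               # conditional on either length

# A neutral agent compares only these length-conditional values; the fact
# that resisting shifts probability toward the length-10 trajectory carries
# no weight, so it declines to pay the cost.
prefers_resisting = all(resist[l] > allow[l] for l in allow)
print(prefers_resisting)  # False
```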