If the environment is deterministic, the agent is choosing between trajectories. In those environments, we train agents using DREST to satisfy POST:
1. The agent chooses stochastically between the different available trajectory-lengths.
2. Given its choice of a particular trajectory-length, the agent maximizes the number of paperclips made within that trajectory-length.
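Here is a minimal sketch of how a DREST-style reward could implement these two conditions during training. It assumes that the reward for a trajectory is the paperclip payoff discounted by how often that trajectory-length has already been chosen; the constant `LAMBDA`, the function name `drest_reward`, and the counting scheme are illustrative assumptions rather than the exact specification.

```python
import random
from collections import defaultdict

# Illustrative assumption: discount factor applied once per previous choice
# of the same trajectory-length.
LAMBDA = 0.9


def drest_reward(paperclips_made: float,
                 trajectory_length: int,
                 length_counts: dict) -> float:
    """Discount the paperclip payoff by LAMBDA ** (previous choices of this length).

    Maximizing this reward pushes the agent to (i) spread its choices across
    trajectory-lengths (the stochastic-choice clause of POST) and (ii) make as
    many paperclips as possible within each chosen length (the conditional-
    maximization clause).
    """
    n_previous = length_counts[trajectory_length]
    length_counts[trajectory_length] += 1
    return (LAMBDA ** n_previous) * paperclips_made


# Toy usage: two available trajectory-lengths with different paperclip payoffs.
length_counts = defaultdict(int)
for episode in range(5):
    chosen_length = random.choice([3, 5])         # the agent's (stochastic) choice
    paperclips = {3: 4.0, 5: 7.0}[chosen_length]  # payoff if it maximizes within that length
    print(episode, chosen_length, drest_reward(paperclips, chosen_length, length_counts))
```

Because repeatedly picking the same trajectory-length erodes its reward, the agent does best by mixing across lengths while still maximizing paperclips conditional on each length.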
If the environment is stochastic (as, for example, deployment environments will be), the agent is choosing between lotteries, and we expect agents to be neutral: to not pay costs to shift probability mass between different trajectory-lengths. So they won't perform either of the shutdown-related actions if doing so comes at any cost with respect to the lotteries conditional on each trajectory-length. Which of the object-level actions the agent performs will depend on the quantities of paperclips available.
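The following sketch illustrates what neutrality amounts to when comparing lotteries. It is not from the source: the representation of a lottery as `{trajectory_length: (probability, expected paperclips conditional on that length)}` and the helper `conditionally_at_least_as_good` are assumed for illustration. The point is that the comparison only looks at the conditional paperclip values, so shifting probability mass between lengths is never worth a paperclip cost.

```python
def conditionally_at_least_as_good(a: dict, b: dict) -> bool:
    """True if lottery `a` yields at least as many expected paperclips as `b`
    conditional on every trajectory-length either lottery makes possible.
    Note that the probabilities over lengths are never consulted: that is
    the neutrality."""
    lengths = set(a) | set(b)
    return all(a.get(l, (0, 0.0))[1] >= b.get(l, (0, 0.0))[1] for l in lengths)


# `shifted` moves probability mass from length 5 to length 3, but at the cost
# of one paperclip conditional on length 3; a neutral agent won't pay that cost.
baseline = {3: (0.5, 4.0), 5: (0.5, 7.0)}
shifted = {3: (0.8, 3.0), 5: (0.2, 7.0)}

print(conditionally_at_least_as_good(shifted, baseline))  # False: the shift costs paperclips
print(conditionally_at_least_as_good(baseline, shifted))  # True
```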