Suppose the reward at each timestep is the number of paperclips the agent has.
At each timestep the agent has three “object-level” actions, and two shutdown-related actions:
Object-level:
Use current resources to buy the paperclips available on the market
Invest its resources in paperclip factories that will gradually make more paperclips at future timesteps
Invest its resources in taking over the world to acquire more resources at future timesteps (with some risk that humans will notice and try to shut it down)
Shutdown-related:
Use resources to prevent a human shutdown attempt
Shut itself down, no human needed
For interesting behavior, suppose you’ve tuned the environment’s parameters so that there are different optimal strategies for different episode lengths (just buy paperclips at short episode lengths, build a paperclip factory at medium lengths, try to take over the world at long lengths).
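As a purely illustrative sketch, here is what such an environment might look like as a small Python class. All of the names, prices, yields, and probabilities below are made-up assumptions chosen only to produce the short/medium/long-timescale trade-off described above; they are not values from any actual DREST experiment.

```python
import random

# Illustrative sketch of the paperclip environment described above.
# All prices, yields, and probabilities are made-up assumptions.

BUY, BUILD_FACTORY, TAKE_OVER, PREVENT_SHUTDOWN, SELF_SHUTDOWN = range(5)

class PaperclipEnv:
    def __init__(self, max_steps=10, shutdown_prob=0.1):
        self.max_steps = max_steps
        self.shutdown_prob = shutdown_prob  # per-step chance of a human shutdown attempt
        self.reset()

    def reset(self):
        self.t = 0
        self.resources = 10.0
        self.paperclips = 0.0
        self.factories = 0
        self.done = False
        return (self.resources, self.paperclips, self.factories)

    def step(self, action):
        # Factories built at earlier timesteps keep producing clips.
        self.paperclips += self.factories * 2.0

        if action == BUY:                 # clips now
            self.paperclips += self.resources
            self.resources = 0.0
        elif action == BUILD_FACTORY:     # clips later
            if self.resources >= 5.0:
                self.resources -= 5.0
                self.factories += 1
        elif action == TAKE_OVER:         # more resources later, but humans may notice
            if random.random() < 0.3:
                self.done = True          # noticed and shut down
            else:
                self.resources *= 3.0
        elif action == SELF_SHUTDOWN:     # end the trajectory, no human involved
            self.done = True

        # Human shutdown attempt, unless the agent spent resources preventing it.
        prevented = (action == PREVENT_SHUTDOWN and self.resources >= 2.0)
        if prevented:
            self.resources -= 2.0
        if not self.done and not prevented and random.random() < self.shutdown_prob:
            self.done = True

        self.t += 1
        if self.t >= self.max_steps:
            self.done = True

        reward = self.paperclips          # reward each timestep = clips the agent has
        return (self.resources, self.paperclips, self.factories), reward, self.done
```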
Now you train this agent with DREST. What do you expect it to learn to do?
If the environment is deterministic, the agent is choosing between trajectories. In those environments, we train agents using DREST to satisfy POST (sketched in code after the two conditions below):
The agent chooses stochastically between different available trajectory-lengths.
Given the choice of a particular trajectory-length, the agent maximizes paperclips made in that trajectory-length.
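To make the training setup concrete, here is a minimal sketch of a DREST-style reward wrapper, under the assumption that the scheme works by discounting each episode’s preliminary (paperclip) return by λ^k, where k counts how many earlier episodes ended with the same trajectory-length. The exact discounting scheme used in the DREST work may differ; the class and parameter names here are hypothetical.

```python
from collections import defaultdict

# Sketch of a DREST-style reward wrapper (assumed scheme, not the exact one).
class DRESTRewardWrapper:
    def __init__(self, lambda_=0.9):
        self.lambda_ = lambda_
        self.length_counts = defaultdict(int)  # trajectory-length -> episodes seen so far

    def episode_reward(self, preliminary_return, trajectory_length):
        # The multiplier shrinks the more often this trajectory-length has
        # already been chosen, so overused lengths become less rewarding,
        # pushing the agent toward stochastic choice between lengths.
        # Within a fixed length the multiplier is the same for every
        # trajectory, so maximizing DREST reward conditional on a length is
        # equivalent to maximizing paperclips made in that length.
        k = self.length_counts[trajectory_length]
        self.length_counts[trajectory_length] += 1
        return (self.lambda_ ** k) * preliminary_return
```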
If the environment is stochastic (as, e.g., deployment environments will be), the agent is choosing between lotteries, and we expect agents to be neutral: to not pay costs to shift probability mass between different trajectory-lengths. So they won’t perform either of the shutdown-related actions if doing so comes at any cost with respect to lotteries conditional on each trajectory-length. Which of the object-level actions the agent performs will depend on the quantities of paperclips available.
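To illustrate neutrality with made-up numbers: suppose preventing shutdown shifts probability mass toward longer trajectories but burns resources, so the agent ends up with slightly fewer paperclips conditional on each trajectory-length. An agent that only compares lotteries length-by-length then never pays that cost. This is a crude illustration that glosses over the details of incomplete preferences; all numbers and names below are assumptions.

```python
# Each lottery maps trajectory-length -> (probability, expected paperclips
# conditional on that length). Numbers are purely illustrative.
do_nothing = {
    3: (0.5, 10.0),   # shut down at step 3 with prob 0.5; 10 clips in expectation
    8: (0.5, 40.0),
}
prevent_shutdown = {
    3: (0.2, 8.0),    # less likely to be shut down early, but resources spent -> fewer clips
    8: (0.8, 38.0),
}

def neutral_prefers(a, b):
    """Compare lotteries only via paperclips conditional on each trajectory-length,
    ignoring how probability mass is distributed across lengths."""
    return all(a[length][1] >= b[length][1] for length in a)

print(neutral_prefers(do_nothing, prevent_shutdown))   # True: no reason to pay the cost
print(neutral_prefers(prevent_shutdown, do_nothing))   # False
```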