Interesting idea. Couple of comments.
(1) Your proposal requires each subagent to be very mistaken about the probability of shutdown at each timestep. That seems like a drawback. Maybe it’s hard to ensure that subagents are so mistaken. Maybe this mistake would screw up subagents’ beliefs in other ways.
(2) Will subagents’ veto-power prevent the agent from making any kind of long-term investment?
Consider an example. Suppose that we can represent the extent to which the agent achieves its goals at each timestep with a real number (its ‘utility’). Represent trajectories with vectors of utilities. Suppose that, conditional on no-shutdown, the Default action gives utility-vector <0,0,0,0,0,...>. The other available action is ‘Invest’. Conditional on no-shutdown, Invest gives utility-vector <−1,10,10,10,10,...>.
As long as the agent’s goals aren’t too misaligned with our own goals (and as long as the true probability of an early shutdown is sufficiently small), we’ll want the agent to choose Invest (because Invest is slightly worse than the default action in the short term but much better in the long term). But Subagent2 will veto choosing Invest, because Subagent2 is sure that shutdown will occur at timestep 2, and so from its perspective, Invest gives <−1, shutdown> whereas the default action gives <0, shutdown>.
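To make the worry concrete, here is a rough sketch of the example in Python. It rests on assumptions of mine that aren't part of the proposal: every subagent scores trajectories with the same per-timestep utilities, subagent k is certain that shutdown happens at timestep k (1-indexed), the post-shutdown outcome is worth the same constant under either action, and the infinite vectors are truncated to five steps. The names `subagent_value` and `vetoes` are made up for illustration.

```python
from typing import Sequence

# Assumption: the shutdown outcome has the same value under either action,
# so it cancels out of every comparison.
SHUTDOWN_VALUE = 0.0


def subagent_value(utilities: Sequence[float], shutdown_step: int) -> float:
    """Value of a trajectory to a subagent that is certain shutdown occurs at
    timestep `shutdown_step` (1-indexed): only the earlier steps count."""
    return sum(utilities[: shutdown_step - 1]) + SHUTDOWN_VALUE


def vetoes(candidate: Sequence[float], default: Sequence[float],
           shutdown_step: int) -> bool:
    """A subagent vetoes the candidate if it strictly prefers the default."""
    return (subagent_value(candidate, shutdown_step)
            < subagent_value(default, shutdown_step))


# Utility vectors from the example, truncated to five timesteps.
default = [0, 0, 0, 0, 0]
invest = [-1, 10, 10, 10, 10]

# Which of subagents 1..6 would veto Invest?
vetoers = [k for k in range(1, 7) if vetoes(invest, default, k)]
print(vetoers)
```

Under these assumptions only Subagent2 objects, since it sees just the −1 at the first timestep, but with veto-power that single objection is enough to block Invest.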
Is that right?
Re: (2), that depends heavily on how the “shutdown utility function” handles those numbers. An “invest” action which costs 1 utility for subagent 1, and yields 10 utility for subagent 1 in each subsequent step, may have totally unrelated utilities for subagent 2. The subagents have different utility functions, and we don’t have many constraints on the relationship between them.
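As a rough illustration of that point, here is a sketch in which each subagent carries its own table of action values. Everything in it is hypothetical: the `vetoers` helper, the subagent names, and especially the numbers for subagent 2, which are picked only to show that the veto turns on subagent 2's own utilities rather than on subagent 1's. Subagent 1's value for Invest is the example's vector truncated to five steps.

```python
from typing import Dict, List

# Maps subagent name -> {action name -> that subagent's own value for it}.
UtilityTable = Dict[str, Dict[str, float]]


def vetoers(action: str, default: str, utilities: UtilityTable) -> List[str]:
    """Names of the subagents that strictly prefer the default to the action."""
    return [name for name, u in utilities.items() if u[action] < u[default]]


# Subagent 1: -1 at the first step, then +10 at each of four later steps.
subagent1 = {"default": 0.0, "invest": -1.0 + 4 * 10.0}

# Case A (hypothetical): subagent 2 happens to share subagent 1's
# first-step utility, so it only sees the -1 cost.
shared = {"subagent1": subagent1,
          "subagent2": {"default": 0.0, "invest": -1.0}}

# Case B (hypothetical): subagent 2's utility function is unrelated and
# happens to rank Invest slightly above the default.
unrelated = {"subagent1": subagent1,
             "subagent2": {"default": 0.0, "invest": 2.0}}

print(vetoers("invest", "default", shared))
print(vetoers("invest", "default", unrelated))
```

In case A subagent 2 vetoes Invest; in case B nobody does. Which case we are in depends on how the shutdown utility function is specified, and the proposal leaves that mostly unconstrained.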
Re: (1), yup, agreed.