Then there’s the problem of designing the negotiation infrastructure, and in particular allocating bargaining power to the various subagents. They all get a veto, but that still leaves a lot of degrees of freedom in exactly how much the agent pursues the goals of each subagent. For the shutdown use-case, we probably want to allocate most of the bargaining power to the non-shutdown subagent, so that we can see what the system does when mostly optimizing for u_1 (while maintaining the option of shutting down later).
I don’t understand what you mean by “allocating bargaining power”, given that each subagent already has true veto power. Regardless of the negotiation mechanism you set up for them (provided it’s high-bandwidth enough), and regardless of whether the master agent declares “I’d like this or that subagent to have more power”, each subagent could go “give me my proportional (1/n) part of the slice, or else I will veto everything” (and depending on its prior about how the other subagents respond, issuing that threat can look net-positive to it).
In fact, that’s just the tip of the iceberg: there’s a whole space of individually rational game-theoretic maneuvers they could pull that mess with your proposal; see Commitment Races.
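To make the incentive concrete, here’s a toy expected-value calculation (a minimal sketch; the payoff numbers and the n = 3 framing are hypothetical assumptions of mine, not anything from your proposal). It shows the threshold prior above which the “1/n or I veto everything” threat beats quietly accepting a smaller allocated share:

```python
# Toy sketch (all numbers and the n = 3 framing are hypothetical):
# when does the "give me my 1/n share or I veto everything" threat look
# net-positive to a subagent, under its prior about how the others respond?

N = 3                     # hypothetical number of subagents
U_PROPORTIONAL = 1.0 / N  # payoff if the others cave to the threat
U_GRIDLOCK = 0.0          # payoff if they call the bluff and everything is vetoed
U_ACCEPT = 0.15           # payoff from quietly accepting the allocated share

def threat_ev(p_others_cave: float) -> float:
    """Expected utility of issuing the threat, given the subagent's
    prior probability that the others cave rather than call the bluff."""
    return p_others_cave * U_PROPORTIONAL + (1 - p_others_cave) * U_GRIDLOCK

for p in (0.2, 0.45, 0.8):
    print(f"prior the others cave: {p:.2f}  "
          f"EV(threat)={threat_ev(p):.3f}  EV(accept)={U_ACCEPT:.3f}  "
          f"threaten? {threat_ev(p) > U_ACCEPT}")
```

The only lever here is the subagent’s prior that the others cave, which is exactly the quantity your negotiation infrastructure doesn’t get to set, so “allocating bargaining power” on top of true vetoes doesn’t obviously change the calculation.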
Brainstorming fixes:
- Make each subagent’s bargaining power its compute budget, i.e. how much compute it gets to run on. This makes everything more chaotic, and is clearly not the kind of idealized-agents solution you had in mind.
- Probabilistic vetoes, such that some subagents’ vetoes are less likely to go through. I think this breaks things elsewhere in your proposal, and it still has the game-theoretic problems.
- We ensure the priors of each subagent (about how the others respond) are such that going for risky game-theoretic moves is not individually rational. Maybe some subagents have more optimistic priors and others less optimistic ones, so that the former end up controlling more while the latter only use their veto in extreme cases (like making sure the wrong successor doesn’t get built). But it would be fiddly to reason about how these different priors shape behavior, and about how “extreme” the cases are in which the veto is still useful. It might also mess up the agent’s interactions with the world in other ways: for example, dogmatically believing that any algorithm that looks like a subagent has “exactly this behavior”, which is sometimes false. Although of course that kind of problem was already present in your proposal. (A toy numerical version of this fix is sketched below.)
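Here’s that toy sketch (the decision rule, priors, and cost numbers are all hypothetical assumptions of mine, purely to illustrate the mechanism, not anything from your post). It shows how a pessimistic prior about retaliation raises the bar for vetoing, so that subagent only blocks near-catastrophes:

```python
# Toy sketch of the third fix (illustrative decision rule and numbers):
# a subagent vetoes a proposed action only when, under its prior about
# how the others react to a veto, blocking still looks net-positive.
# A pessimistic prior about retaliation raises the bar, so that subagent
# only vetoes in extreme cases.

RETALIATION_COST = 1.0  # hypothetical cost to the subagent if a veto backfires

def would_veto(harm_if_allowed: float, p_retaliation: float) -> bool:
    """Veto iff the harm avoided exceeds the expected cost of vetoing."""
    return harm_if_allowed > p_retaliation * RETALIATION_COST

OPTIMIST_PRIOR = 0.1   # barely expects pushback -> vetoes often, controls more
PESSIMIST_PRIOR = 0.9  # expects pushback -> vetoes only near-catastrophes

for harm in (0.05, 0.3, 0.95):  # 0.95 ~ "the wrong successor is about to be built"
    print(f"harm={harm:.2f}  "
          f"optimist vetoes: {would_veto(harm, OPTIMIST_PRIOR)}  "
          f"pessimist vetoes: {would_veto(harm, PESSIMIST_PRIOR)}")
```

Even in this toy version you can see the fiddliness: where exactly the pessimist’s threshold lands depends on prior and cost numbers we’d have to get right by construction.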