The way the agents interact across interruptions seems to exactly parallel inter-agent interaction more generally: we design each agent for correct behavior separately, and despite this, agents can corrupt the overall design by hijacking one another. You say we need to design for mutual indifference, but if we have a solution that fixes the way they exploit interruption, it should also go a long way towards solving the generalized problem of Goodhart-like exploitation between agents.
Doesn’t the design above do that?
Yes, and this is a step in the right direction, but as you noted in the writeup, it only applies in a setting where we’ve assumed away a number of key problems, the most critical of which seem to be:
We assume a notion of optimality, and I think we implicitly assume that the optimal point is unique, which seems to be needed to define the reward; Abram Demski has noted in another post that this is very problematic.
We also need to know a significant amount about both (or all) agents, and to compute expectations over their behavior, in order to design any of their reward functions. That means future agents joining the system could break our agent’s indifference. (As an aside, I’m unclear how we can be sure the rewards can be computed in a stable way if the agents’ optimal policies can change based on the rewards we’re computing.) And if we can compute another agent’s reward function when designing our agent, we can plausibly hijack that agent.
We also need a reward that depends on an expectation of the other agent’s actions, which means we need counterfactuals not only over scenarios, but over the way the other agent reasons. That’s a critical issue I’m still trying to wrap my head around, because it’s unclear to me how a system can reason in those cases. (I’ve tried to spell out where these assumptions enter in the sketch below.)
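To make those dependences concrete, here is a minimal sketch, in my own notation rather than the writeup’s, of the kind of compensatory reward such an indifference construction might use. The agents A and B, their rewards, and B’s assumed-unique optimal policy are all my labels; the point is only to show where the uniqueness assumption and the expectation over the other agent’s behavior enter.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Sketch only: a compensatory term added to A's reward so that A is
% indifferent to whether B interrupts it. The notation (A, B, R_A, V_A,
% and B's assumed-unique optimal policy \pi_B^*) is mine, not the writeup's.
\[
  R_A'(s,a) \;=\; R_A(s,a)
  \;+\; \mathbb{E}_{\pi_B^{*}}\!\left[\, V_A(s') \mid \text{B does not interrupt} \,\right]
  \;-\; \mathbb{E}_{\pi_B^{*}}\!\left[\, V_A(s') \mid \text{B interrupts} \,\right]
\]
% The correction term is only defined if (1) \pi_B^* exists and is unique,
% and (2) we can take counterfactual expectations over how B would act,
% i.e. over B's reasoning -- which is where the problems above bite.
\end{document}
```

If B’s optimal policy isn’t unique, or its counterfactual behavior isn’t well defined, then the correction term isn’t well defined either; and a new agent joining the system and changing B’s effective policy would silently break A’s indifference, which is exactly the second problem above.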