I wonder how your definition of multi-agent power would look in a game of chess or go. There is this intuitive thing where players who have pieces more in the center of the board (chess) or have achieved certain formations (go) seem to acquire a kind of power in those games, but this doesn’t seem to be about achieving different terminal goals. Rather it seems more like having the ability to respond to whatever one’s opponent does. If the two agents cannot perfectly predict what their opponent will do then there is value in having the ability to respond to unforeseen challenges, although in these games this is always in service of a single terminal goal (winning the game).
Any thoughts on how your definition would fit into cases like this?
Good question. Unfortunately, one weakness of our definition of multi-agent POWER is that it doesn't have much of use to say about a case like this one.
We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.
On the other hand, based on other results I've seen anecdotally, I suspect that if you gave one of the agents a purely random policy (i.e., take a random legal action at each state) and assigned the other agent some reasonable distribution of reward functions over material, you'd stand a decent chance of finding that the high-POWER states line up with the high-mobility board positions.
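In case it helps make that suggestion concrete, here's a minimal sketch (in Python, assuming the python-chess package) of what such an experiment could look like. To be clear, this isn't code from our paper: the POWER estimate below is a very crude Monte Carlo proxy (one move of lookahead, short random rollouts for both sides, rewards linear in material), and names like `sample_reward_fn`, `estimated_power`, and `mobility` are just illustrative.

```python
# Sketch: compare a crude POWER proxy against mobility in chess positions.
# Assumes the python-chess package (pip install chess). Not the paper's method.
import random
import chess

# Standard material values used to scale the sampled reward weights.
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def sample_reward_fn(rng):
    """Sample one reward function over material: a random nonnegative weight per piece type."""
    weights = {pt: rng.random() * val for pt, val in PIECE_VALUES.items()}
    def reward(board, color):
        total = 0.0
        for piece in board.piece_map().values():
            sign = 1.0 if piece.color == color else -1.0
            total += sign * weights[piece.piece_type]
        return total
    return reward

def estimated_power(board, color, n_rewards=20, n_rollouts=5, depth=6, rng=None):
    """Crude proxy for POWER at `board` for player `color`: average, over sampled
    material-based reward functions, of the best immediate move's expected reward
    after a short random rollout (both sides play randomly after that move)."""
    rng = rng or random.Random(0)
    if not any(board.legal_moves):
        return 0.0
    total = 0.0
    for _ in range(n_rewards):
        reward = sample_reward_fn(rng)
        best = float("-inf")
        for move in board.legal_moves:
            value = 0.0
            for _ in range(n_rollouts):
                b = board.copy()
                b.push(move)
                for _ in range(depth):  # short random continuation
                    if b.is_game_over():
                        break
                    b.push(rng.choice(list(b.legal_moves)))
                value += reward(b, color) / n_rollouts
            best = max(best, value)
        total += best / n_rewards
    return total

def mobility(board):
    """Number of legal moves available to the side to move."""
    return board.legal_moves.count()

if __name__ == "__main__":
    board = chess.Board()
    print("mobility:", mobility(board))
    print("POWER proxy:", estimated_power(board, chess.WHITE))
```

Running something like this over a batch of positions and checking the correlation between the POWER proxy and mobility would be the natural next step, though I'd expect the proxy to be quite noisy without deeper search.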
You might also be interested in this comment by David Xu, where he discusses mobility as a measure of instrumental value in chess-playing.
We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.
I think this is probably true in the long term (the classical-to-quantum/reversible-computing transition is very large, and humans can't easily modify their brains, unlike a virtual human), but it may not be true in the short term.
Agreed. We think our human-AI setting is a useful model of alignment in the limit case, but not really so in the transient case. (For the reason you point out.)