Note that in the referenced paper the agents don’t have the same information. (I think you know that, just wanted to clarify in case you didn’t.)
Yes. I brought it up for this point:
gesturing at a continuum—providing a little information versus all of it.
A better way of putting it would have been: "OpenAI Five cooperated with full information and no communication; this work seems interested in cooperation between agents with different information and communication."
The worry here is that by adding this auxiliary intrinsic reward, you are changing the equilibrium behavior. In particular, agents will exploit the commons less and instead focus more on finding and transmitting useful information. This doesn’t really seem like you’ve “learned cooperation”.
That makes sense. I’m curious about what value “this thing that isn’t learned cooperation” doesn’t capture.
Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards). One exception: you should expect that the information given by the agent will be true and useful; if it weren’t, then the other agents would learn to ignore the information over time, which means that the information wouldn’t affect the other agents’ actions and so wouldn’t get any intrinsic reward.
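For concreteness, here is a minimal sketch of the kind of influence-style intrinsic reward being discussed (the specific formula, function names, and numbers are illustrative assumptions, not the referenced paper’s actual mechanism):

```python
import numpy as np

def influence_reward(listener_policy_with_msg, listener_policy_without_msg):
    """Hypothetical intrinsic reward for the sender: the KL divergence
    between the listener's action distribution with and without the
    sender's message. If the listener ignores the message, this is ~0."""
    p = np.asarray(listener_policy_with_msg, dtype=float)
    q = np.asarray(listener_policy_without_msg, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# A listener that has learned to ignore the message gives no intrinsic reward...
print(influence_reward([0.5, 0.5], [0.5, 0.5]))  # ~0.0
# ...while a message that actually shifts the listener's behavior does.
print(influence_reward([0.9, 0.1], [0.5, 0.5]))  # ~0.37

# The sender would then be trained on something like
#   r_total = r_extrinsic + alpha * r_intrinsic,
# so over time only messages that change other agents' actions keep paying off.
```

On this picture, untrue or useless messages eventually stop moving the listener’s policy, so the intrinsic reward for sending them decays to zero, which is the argument above.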
A better way of putting my question would have been:
Is “useful” a global improvement, or a local improvement? (This sort of protocol leads to improvements, but what kind of improvements are they?)
Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards).
Hm, I thought of them as things that would require looking at:
1) Behavior in environments constructed for that purpose.
2) The information the agents communicate.
My “communication” versus “information” distinction is an attempt to differentiate between (complex) things like the following (see the toy example after these):
Information: This is the payoff matrix (or more of it).*
Communication: I’m going for stag next round. (A promise.)
Information: If everyone chooses “C” things are better for everyone.
Communication: If in a round, I’ve chosen “C” and you’ve chosen “D”, the following round I will choose “D”.
Information: Player M has your source code, and will play whatever you play.
*I think of this as being different from advice (proposing an action, or one action over another).
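To make the examples above concrete, here is a toy stag hunt (the payoff numbers and variable names are made up for illustration; they are not taken from the paper or the discussion):

```python
# Toy stag hunt: (my_action, your_action) -> (my_payoff, your_payoff).
# Hunting hare is safe; hunting stag pays off only if we both do it.
PAYOFFS = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}

# "Information" in the sense above: a message revealing part of the game
# itself, e.g. some entries of PAYOFFS, or the fact that (stag, stag)
# beats (hare, hare) for both players.
information = {pair: PAYOFFS[pair] for pair in [("stag", "stag"), ("hare", "hare")]}

# "Communication" in the sense above: a message about the speaker's own
# (possibly conditional) future play, e.g. a promise or a tit-for-tat threat.
promise = "I'm going for stag next round."
threat = "If I've chosen C and you've chosen D, the following round I will choose D."
```

On this reading, the first kind of message changes what the listener knows about the game, while the second changes what the listener should expect the speaker to do.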
I’m curious about what value “this thing that isn’t learned cooperation” doesn’t capture.
It suggests that in other environments that aren’t tragedies of the commons, the technique won’t lead to cooperation. It also suggests that you could get the same result by giving the agents any sort of extra reward (that influences their actions somehow).
Is “useful” a global improvement, or a local improvement?
Also not clear what the answer to this is.
Hm, I thought of them as things that would require looking at:
1) Behavior in environments constructed for that purpose.
2) The information the agents communicate.
The agents won’t work in any environment other than the one they were trained in, and the information they communicate is probably in the form of vectors of numbers that are not human-interpretable. It’s not impossible to analyze them, but it would be difficult.