and this implicitly means there’s a constructed group of quasi-altruistic agents who are getting less concrete reward because they’re being incentivized by this auxiliary reward.
Suppose Alice and Bob are in an iterated prisoner’s dilemma ($2/$2 for both cooperating, $1/$1 for both defecting, and $0/$3 for cooperate/defect). I now tell Alice that actually she can have an extra $5 each round if she always cooperates. Now the equilibrium is for Alice to always cooperate and Bob to always defect (which is not equilibrium behavior in the normal IPD).
The worry here is that by adding this extra auxiliary intrinsic reward, you are changing the equilibrium behavior. In particular, agents will exploit the commons less and instead focus more on finding and transmitting useful information. This doesn’t really seem like you’ve “learned cooperation”.
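To make the Alice/Bob arithmetic concrete, here is a minimal sketch of the per-round payoffs with and without the bonus (the strategy labels and the `bonus` parameter are just illustrative, not anything from the paper):

```python
# Stage-game payoffs: (alice_move, bob_move) -> (alice_payoff, bob_payoff)
PAYOFFS = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}

def per_round(alice, bob, bonus=0):
    """Per-round payoff when both players use an unconditional strategy.

    `bonus` is the extra amount Alice gets each round for always cooperating.
    """
    a, b = PAYOFFS[(alice, bob)]
    if alice == "C":
        a += bonus
    return a, b

for bonus in (0, 5):
    print(f"bonus = {bonus}")
    for alice in ("C", "D"):
        for bob in ("C", "D"):
            print(f"  Alice {alice}, Bob {bob}: {per_round(alice, bob, bonus)}")
```

With the bonus, cooperating gets Alice at least $5 a round versus at most $3 for defecting, so always-cooperate dominates for her, and Bob’s best response to that is to always defect. Without the bonus, Alice’s best response to an always-defecting Bob is to defect herself, which is why (always cooperate, always defect) is not an equilibrium of the normal IPD.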
This reminds me of OpenAI Five—the way they didn’t communicate, but all had the same information.
Note that in the referenced paper the agents don’t have the same information. (I think you know that, just wanted to clarify in case you didn’t.)
Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards). One exception: you should expect that the information given by the agent will be true and useful: if it weren’t, then the other agents would learn to ignore the information over time, which means that the information doesn’t affect the other agents’ actions and so won’t get any intrinsic reward.
I’m surprised they got good results from “try to get other agents to do something different”, but it is borrowing from the structure of causality.
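To pin down what “try to get other agents to do something different” cashes out to, here is a rough sketch, assuming the intrinsic reward is the divergence between another agent’s action distribution given my actual action and its counterfactual marginal, with my action averaged out. The exact conditioning and divergence in the paper may differ, and the function names below are made up for illustration.

```python
import numpy as np

def influence_reward(listener_policy, state, my_action, my_action_prior):
    """Counterfactual influence of `my_action` on another agent.

    Assumes `listener_policy(state, a)` returns the listener's action
    distribution (a numpy array) given that I took action `a`, and
    `my_action_prior[a]` is the probability I assign to taking `a`.
    """
    eps = 1e-12
    conditioned = listener_policy(state, my_action)
    # Counterfactual marginal: what the listener would do with my action
    # averaged out over the actions I might have taken instead.
    marginal = sum(p * listener_policy(state, a)
                   for a, p in enumerate(my_action_prior))
    # KL divergence between the two; zero if my action makes no difference.
    return float(np.sum(conditioned * np.log((conditioned + eps) / (marginal + eps))))

def ignores_me(state, a):
    # A listener whose behaviour does not depend on my action.
    return np.array([0.5, 0.5])

print(influence_reward(ignores_me, state=None, my_action=0, my_action_prior=[0.5, 0.5]))  # ~0.0
```

This also makes the earlier point concrete: if the other agents learn to ignore an agent’s messages, the conditional and marginal distributions coincide, the divergence is zero, and the intrinsic reward dries up.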
I do think that it’s dependent on the particular environments you use.
This reminds me of the StarCraft AI, AlphaStar. While I didn’t get all the details, I recall something about the population being there so each agent could be given a bunch of different narrower/easier objectives than “Win the game”, like “Build 2 Deathstalkers”, “Scout this much of the map”, or “Find the enemy base ASAP”, in order to find out what kinds of easy-to-learn things helped them get better at the game.
While I didn’t read the Ray Interference paper, I think its point was that if the same weights are used for multiple skills, then updating the weights for one of the skills might reduce performance on the other skills. AlphaStar would have this problem too. I guess by having “specialist” agents in the population you are ensuring that those agents don’t suffer from as much ray interference, but your final general agent will still need to know all the skills and would suffer from ray interference (if it is actually a thing, again I haven’t read the paper).
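A toy illustration of that mechanism (my construction, not anything from the Ray Interference paper): two “skills” share one weight vector but want the same input mapped to opposite targets, so their gradients point in opposing directions and a step that improves one degrades the other.

```python
import numpy as np

w = np.zeros(3)                            # shared weights for both skills
x = np.array([1.0, 2.0, 0.5])              # same input for both skills
y_a, y_b = 1.0, -1.0                       # skill A wants +1, skill B wants -1

def loss(w, x, y):
    return 0.5 * (w @ x - y) ** 2

def grad(w, x, y):
    return (w @ x - y) * x                 # gradient of the squared error

g_a, g_b = grad(w, x, y_a), grad(w, x, y_b)
print("gradient dot product:", g_a @ g_b)  # negative: the skills conflict

w_after = w - 0.1 * g_a                    # gradient step for skill A only
print("skill B loss before:", loss(w, x, y_b))
print("skill B loss after:", loss(w_after, x, y_b))  # goes up
```

To first order, whenever the two gradients have a negative dot product, a small step that lowers one skill’s loss raises the other’s, which is presumably why separate specialist agents, with separate weights, suffer from it less.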
This sounds like one of those “as General Intelligences we find this easy but it’s really hard to program” things.
Note that in the referenced paper the agents don’t have the same information. (I think you know that, just wanted to clarify in case you didn’t.)
Yes. I brought it up for this point:
gesturing at a continuum—providing a little information versus all of it.
A better way of putting it would have been—“OpenAI Five cooperated with full information and no communication; this work seems interested in cooperation between agents with different information and communication.”
The worry here is that by adding this extra auxiliary intrinsic reward, you are changing the equilibrium behavior. In particular, agents will exploit the commons less and instead focus more on finding and transmitting useful information. This doesn’t really seem like you’ve “learned cooperation”.
That makes sense. I’m curious about what value “this thing that isn’t learned cooperation” doesn’t capture.
Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards). One exception: you should expect that the information given by the agent will be true and useful: if it weren’t, then the other agents would learn to ignore the information over time, which means that the information doesn’t affect the other agents’ actions and so won’t get any intrinsic reward.
A better way of putting my question would have been:
Is “useful” a global improvement, or a local improvement? (This sort of protocol leads to improvements, but what kind of improvements are they?)
Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards).
Hm, I thought of them as things that would require looking at:
1) Behavior in environments constructed for that purpose.
2) Looking at the information the agents communicate.
My “communication” versus “information” distinction is an attempt to differentiate between (complex) things like:
Information: This is the payoff matrix (or more of it).*
Communication: I’m going for stag next round. (A promise.)
Information: If everyone chooses “C” things are better for everyone.
Communication: If in a round, I’ve chosen “C” and you’ve chosen “D”, the following round I will choose “D”.
Information: Player M has your source code, and will play whatever you play.
*I think of this as being different from advice (proposing an action, or proposing one action over another).
I’m curious about what value “this thing that isn’t learned cooperation” doesn’t capture.
It suggests that in other environments that aren’t tragedies of the commons, the technique won’t lead to cooperation. It also suggests that you could get the same result by giving the agents any sort of extra reward (that influences their actions somehow).
Is “useful” a global improvement, or a local improvement?
Also not clear what the answer to this is.
Hm, I thought of them as things that would require looking at:
1) Behavior in environments constructed for that purpose.
2) Looking at the information the agents communicate.
The agents won’t work in any environment other than the one they were trained in, and the information they communicate is probably in the form of vectors of numbers that are not human-interpretable. It’s not impossible to analyze them, but it would be difficult.
While I didn’t read the Ray Interference paper, I think its point was that if the same weights are used for multiple skills, then updating the weights for one of the skills might reduce performance on the other skills. AlphaStar would have this problem too. I guess by having “specialist” agents in the population you are ensuring that those agents don’t suffer from as much ray interference, but your final general agent will still need to know all the skills and would suffer from ray interference (if it is actually a thing, again I haven’t read the paper).
Yup, sounds right to me.