Vlad Mikulik comments on A simple environment for showing mesa misalignment

Vlad Mikulik 26 Sep 2019 9:57 UTC
LW: 6 AF: 5
I have now seen a few suggestions for environments that demonstrate misaligned mesa-optimisation, and this is one of the best so far. It combines being simple and extensible with being compelling as a demonstration of pseudo-alignment if it works (fails?) as predicted. I think that we will want to explore more sophisticated environments with more possible proxies later, but as a first working demo this seems very promising. Perhaps one could start even without the maze, just a gridworld with keys and boxes.

I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy. There is room for philosophical disagreement there. Even so, a working demo of this environment would, in my opinion, be a good thing, as we would have a concrete agent to disagree about.
- Rohin Shah 28 Sep 2019 17:32 UTC
  LW: 8 AF: 5
  I don’t see why we care about mesa-optimization in particular. The argument for risk just factors through the fact that capabilities generalize, but the objective doesn’t. Why does it matter whether the agent is internally performing some kind of search?
  - Vlad Mikulik 29 Sep 2019 0:13 UTC
    LW: 7 AF: 4
    By that I didn’t mean to imply that we care about mesa-optimisation in particular. I think that this demo working “as intended” is a good demo of an inner alignment failure, which is exciting enough as it is. I just also want to flag that the inner alignment failure doesn’t automatically provide an example of a mesa-optimiser.
- Matthew Barnett 26 Sep 2019 19:49 UTC
  LW: 1 AF: 1
  “I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy.”
  I agree. That’s why I think we should compare it to a hard-coded agent that pursues the optimal policy for collecting keys, and an agent that pursues the optimal policy for opening chests. If the trained agent is similar to the first hard-coded agent rather than the second, this would be striking evidence of misalignment.
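
The comparison proposed above could be sketched roughly as follows. This is only an illustration under several assumptions that are not in the thread: a maze-free gridworld whose state is the agent position, the remaining keys, the unopened chests and the number of keys held; two hand-written greedy agents (`key_collector`, `chest_opener`) standing in for the "optimal" hard-coded policies; and action agreement over a set of probe states as the similarity measure. All names are hypothetical, and greedy nearest-target movement is not guaranteed to be optimal.

```python
from typing import Callable, FrozenSet, Iterable, NamedTuple, Tuple

Pos = Tuple[int, int]


class State(NamedTuple):
    """Hypothetical maze-free gridworld state (an assumption of this sketch)."""
    agent: Pos
    keys: FrozenSet[Pos]    # keys still lying on the grid
    chests: FrozenSet[Pos]  # chests not yet opened
    held: int               # number of keys the agent is carrying


Policy = Callable[[State], str]


def step_toward(src: Pos, targets: Iterable[Pos]) -> str:
    """Greedy move toward the nearest target by Manhattan distance."""
    ordered = sorted(targets)  # sorted so that ties are broken deterministically
    if not ordered:
        return "stay"
    tx, ty = min(ordered, key=lambda t: abs(t[0] - src[0]) + abs(t[1] - src[1]))
    if tx != src[0]:
        return "right" if tx > src[0] else "left"
    if ty != src[1]:
        return "down" if ty > src[1] else "up"
    return "stay"


def key_collector(s: State) -> str:
    """Stand-in for the key-collecting reference agent: chases keys, ignores chests."""
    return step_toward(s.agent, s.keys)


def chest_opener(s: State) -> str:
    """Stand-in for the chest-opening reference agent: fetches a key only when needed."""
    if not s.chests:
        return "stay"
    if s.held > 0:
        return step_toward(s.agent, s.chests)
    return step_toward(s.agent, s.keys)


def agreement(policy: Policy, reference: Policy, states: Iterable[State]) -> float:
    """Fraction of probe states on which `policy` picks the same action as `reference`."""
    probe = list(states)
    return sum(policy(s) == reference(s) for s in probe) / len(probe)


if __name__ == "__main__":
    # Informative probe states: the agent already holds a key while both keys
    # and chests remain, so the two reference agents want different moves.
    probes = [
        State((0, 0), frozenset({(3, 0)}), frozenset({(0, 3)}), held=1),
        State((2, 2), frozenset({(0, 2)}), frozenset({(2, 5)}), held=2),
    ]
    trained = key_collector  # placeholder; a real trained policy would go here
    print("vs key collector:", agreement(trained, key_collector, probes))
    print("vs chest opener: ", agreement(trained, chest_opener, probes))
```

The probe states matter more than the metric: wherever the agent holds no key, both reference agents head for a key and the comparison tells us nothing. High agreement with `key_collector` on the diverging states would be the behavioural signature described above, though it would not by itself settle whether the agent is a mesa-optimiser.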