I don’t see why we care about mesa-optimization in particular. The argument for risk just factors through the fact that capabilities generalize while the objective doesn’t. Why does it matter whether the agent is internally performing some kind of search?
By that I didn’t mean to imply that we care about mesa-optimisation in particular. I think that this demo working “as intended” is a good demo of an inner alignment failure, which is exciting enough as it is. I just also want to flag that the inner alignment failure doesn’t automatically provide an example of a mesa-optimiser.