It seems like the general pattern here is that, when using machine learning for some task X, there are a bunch of properties that affect the likelihood of learning heuristics or proxies rather than actually learning the optimal thing for X. For any such property, making heuristics/proxies more likely would result in a lower chance of mesa-optimization (since optimizers are unlike heuristics/proxies), but, conditional on mesa-optimization arising, would make it more likely that the mesa-optimizer is pseudo-aligned rather than robustly aligned (because the pressure toward heuristics/proxies now leads to learning a proxy mesa-objective instead of the true base objective). Example properties of this form are algorithmic range, simplicity bias, and time complexity penalties. Does that seem right?
then developing a pseudo-aligned mesa-objective may require strictly more subprocesses than developing a robustly aligned mesa-objective.
This is backwards, I think?
I agree with that as a general takeaway, though I would caution that I don’t think it’s always true—for example, hard-coded optimization seems to help in both cases, and I suspect algorithmic range to be more complicated than that, likely making some pseudo-alignment problems better but also possibly making some worse.
Also, yeah, that was backwards—it should be fixed now.