It seems like the general pattern here is that, when using machine learning for some task X, there are a bunch of properties that affect the likelihood of learning heuristics or proxies rather than actually learning the optimal thing for X. For any such property, making heuristics/proxies more likely would result in a lower chance of mesa-optimization (since optimizers are unlike heuristics/proxies), but, conditional on mesa-optimization arising, would make it more likely that the mesa-optimizer is pseudo-aligned rather than robustly aligned (because the pressure toward heuristics/proxies now leads to learning a proxy mesa-objective instead of the true base objective). Example properties of this form are algorithmic range, simplicity bias, and time complexity penalties. Does that seem right?
then developing a pseudo-aligned mesa-objective may require strictly more subprocesses than developing a robustly aligned mesa-objective.
This is backwards, I think?
I agree with that as a general takeaway, though I would caution that I don’t think it’s always true—for example, hard-coded optimization seems to help in both cases, and I suspect algorithmic range to be more complicated than that, likely making some pseudo-alignment problems better but also possibly making some worse.
Also, yeah, that was backwards—it should be fixed now.