So, I certainly agree that pseudo-alignment is a type of robustness/distributional shift problem. In fact, I would describe “Risks from Learned Optimization” as a deep dive on a particular subset of robustness problems that might be particularly concerning from a safety standpoint. In that sense, whether it’s really a “new” sort of robustness problem matters less than the analysis the paper presents of that robustness problem. That being said, I do think that the focus on mesa-optimization was fairly novel in cashing out the generalization failures we wanted to discuss in terms of the sorts of learned optimization processes that might exhibit them (as well as the discussion of deception, as you mention).
I don’t understand what “safety properties of the base optimizer” could be, apart from facts about the optima it tends to produce.
I agree with that, and I think the sentence you’re quoting is meant for a different sort of reader, one with a less clear picture of ML. One way to interpret that passage which might help is that it’s just saying that guarantees about global optima don’t necessarily translate to local optima, or to the actual models you might find in practice.
But even without mesa-optimizers, cases of ML generalization failure often involve the latter, not just the former.
I also agree with this. I would describe my picture here as something like: Pseudo-aligned mesa-optimization ⊂ Objective generalization without capability generalization ⊂ Robustness problems. Given that picture, I would say that the pseudo-aligned mesa-optimizer case is the most concerning from a safety perspective, then generic objective generalization without capability generalization, then robustness problems in general. And I would argue that it makes sense to break it down in that way precisely because you get more concerning safety problems as you go narrower.
More detail on the capability vs. objective robustness picture is also available here and here.
I disagree with the framing that “pseudo-alignment is a type of robustness/distributional shift problem.” It’s literally true given how pseudo-alignment is defined in the paper, but in practice I think we should expect approximately aligned mesa-optimizers that do very bad things on-distribution (without being detected).