I disagree with the framing that “pseudo-alignment is a type of robustness/distributional shift problem”. This is literally true given how the paper defines pseudo-alignment, but in practice I think we should expect approximately aligned mesa-optimizers that do very bad things on-distribution (without being detected).