I disagree with the framing that “pseudo-alignment is a type of robustness/distributional shift problem”. This is literally true given how the paper defines pseudo-alignment, but in practice I think we should expect approximately aligned mesa-optimizers that do very bad things on-distribution (without being detected).