Is a mesa-optimizer unaligned by definition?
No, not at all. We distinguish robustly aligned mesa-optimizers, which are aligned both on and off distribution, from pseudo-aligned mesa-optimizers, which appear to be aligned on distribution but are not necessarily aligned off distribution. For the full glossary, see here.
That was helpful, thank you.