evhub comments on Deceptive Alignment

evhub 28 Jul 2020 19:12 UTC
LW: 4 AF: 2
I talk about this a bit here, but basically if you train huge models for a short period of time, you’re really relying on your inductive biases to find the simplest model that fits the data. Mesa-optimizers, especially deceptive mesa-optimizers, are quite simple, compressed policies, so that regime plausibly favors them.