Ben Cottier comments on Deceptive Alignment

Ben Cottier 28 Jul 2020 10:17 UTC
LW: 5 AF: 3

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa- objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective.

Joint optimization may be unstable, but if the model is not trained to convergence, might it still be jointly optimizing at the end of training? This occurred to me after reading https://arxiv.org/abs/2001.08361, which finds that “Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.” If training to convergence is becoming less common in practical systems, it’s important to think about what that implies for mesa-optimization.
- evhub 28 Jul 2020 19:12 UTC
  LW: 4 AF: 2
  I talk about this a bit here, but basically if you train huge models for a short period of time, you’re really relying on your inductive biases to find the simplest model that fits the data—and mesa-optimizers, especially deceptive mesa-optimizers, are quite simple, compressed policies.