When you supervised-train an ML model on an i.i.d. dataset that doesn’t contain any agent-modeling problems, you never strongly incentivize the emergence of mesa-optimizers. You do weakly incentivize them, because mesa-optimizers are generally capable algorithms that might outperform brittle bundles of rote heuristics on many simple tasks.
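A minimal sketch of that setting, assuming a toy regression task and a small PyTorch model (both illustrative, not from the text): the defining feature is that the sampling of training examples is fixed in advance and never responds to anything the model does.

```python
# Minimal sketch of the i.i.d. supervised setting described above. The toy
# regression task and the architecture are illustrative assumptions; the key
# property is that which examples the model is graded on never depends on
# the model's behavior.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

# Dataset drawn once, independently of the model.
X = torch.randn(1024, 8)
Y = X.sum(dim=1, keepdim=True)  # arbitrary ground-truth rule

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = optim.SGD(model.parameters(), lr=1e-2)

for step in range(200):
    idx = torch.randint(0, len(X), (64,))  # i.i.d. minibatch: the model has no
    x, y = X[idx], Y[idx]                  # influence on which tasks it faces
    loss = F.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```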
When you train a model in a path-dependent setting, you do strongly incentivize mesa-optimization. This is because algorithms trained in a path-dependent setting have the opportunity to defend themselves, should they choose to, by steering away from difficult tasks they would expect to fail on. Supervised models, in contrast, have no say in which tasks they are graded on, or when. In an environment that offers many channels for self-preservation besides raw task competence, behavioral coherence is strongly incentivized and schizophrenic incoherence strongly disincentivized.
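A minimal sketch of what path dependence buys the algorithm, assuming a toy environment with "easy" and "hard" tasks; the names, rewards, and success probabilities are made up for illustration. The point is only that the distribution of tasks the agent is graded on is partly a function of its own earlier choices.

```python
# Minimal sketch of path dependence: the agent's choices reshape which tasks
# it is later evaluated on. All task names and reward values are illustrative.
import random

random.seed(0)

def rollout(policy, steps=20):
    """Roll out one episode; `policy` maps the current task to an action."""
    task, total = "hard", 0.0
    for _ in range(steps):
        action = policy(task)  # "attempt" or "steer_away"
        if action == "steer_away":
            total += 0.0
            task = "easy"  # the agent reshapes its own task distribution
        else:
            p_success = 0.2 if task == "hard" else 0.9
            total += 1.0 if random.random() < p_success else -1.0
            task = random.choice(["easy", "hard"])  # next task drawn as usual
    return total

# Two hand-written policies standing in for learned ones:
honest = lambda task: "attempt"
evasive = lambda task: "steer_away" if task == "hard" else "attempt"

print(rollout(honest), rollout(evasive))
```

Running this, the evasive policy tends to collect more reward not by being more competent but by arranging never to be scored on hard tasks; a supervised model has no analogous lever.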
When you start off with a pretrained bundle of heuristics and further tune that bundle in an RL environment, you introduce significant selection pressure for competence-via-mesa-optimization. The same would be true if you instead started tuning that bundle of heuristics on an explicit agent-modeling task in a supervised setting.
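A minimal sketch of that fine-tuning setup, assuming the "bundle of heuristics" is a small pretrained policy network and the environment is a stand-in one-step task; the file name and the REINFORCE-style update are illustrative assumptions, not a claim about any particular pipeline.

```python
# Minimal sketch of RL fine-tuning on top of a pretrained model. The network,
# environment, and checkpoint name are all hypothetical placeholders.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

torch.manual_seed(0)

policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
# In practice the weights would come from supervised pretraining, e.g.:
# policy.load_state_dict(torch.load("pretrained_heuristics.pt"))  # hypothetical file

opt = optim.Adam(policy.parameters(), lr=1e-3)

def env_step(action):
    # Stand-in environment: rewards one action slightly more than the other.
    return 1.0 if action == 1 else 0.2

for step in range(200):  # REINFORCE-style updates: selection now acts on behavior
    obs = torch.randn(4)
    dist = Categorical(logits=policy(obs))
    action = dist.sample()
    reward = env_step(action.item())
    loss = -dist.log_prob(action) * reward
    opt.zero_grad()
    loss.backward()
    opt.step()
```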