Are Generative World Models a Mesa-Optimization Risk?
Suppose we set up a training loop with an eye toward getting a generative world-model. For concreteness, let’s imagine the predictor from the ELK doc: we show the model the first part of a surveillance video and ask it to predict the second part. Would we risk producing a mesa-optimizer?
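To make the setup concrete, here is a minimal sketch of the kind of training loop I mean. Everything in it (the architecture, the loss, the tensor shapes) is a placeholder assumption of mine, not something the ELK doc prescribes; the only load-bearing point is that the thing we actually optimize is a pixel-prediction proxy, not “the world-model” itself.

```python
import torch
import torch.nn as nn

# Illustrative placeholders: any sequence model over any video dataset would do.
class VideoPredictor(nn.Module):
    """Consumes the first half of a clip, predicts the second half frame by frame."""
    def __init__(self, frame_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        # frame_dim stands in for a flattened per-frame feature vector.
        self.encoder = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, frame_dim)

    def forward(self, first_half, decoder_input):
        _, state = self.encoder(first_half)            # summarize the observed half
        out, _ = self.decoder(decoder_input, state)    # teacher-forced rollout of the future
        return self.readout(out)                       # predicted future frames

model = VideoPredictor()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def training_step(first_half, second_half):
    # Shift the target by one frame so each prediction conditions on the true previous frame.
    decoder_input = torch.cat([first_half[:, -1:], second_half[:, :-1]], dim=1)
    prediction = model(first_half, decoder_input)
    loss = nn.functional.mse_loss(prediction, second_half)  # the proxy we optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```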
Intuitively, it feels like “no”. Mesa-objectives are likely defined over world-models, shards are defined over world-models, so if we ask the training process for just the world-model, we would get just the world-model. Right?
Well.
The Problem
… is that we can’t actually “just” ask for the world-model, can we? Or, at least, that’s an unsolved problem. We’re always asking for some proxy, be that the second part of the video, an answer to some question, being scored well by some secondary world-model-identifier ML model, and so on.
If we could somehow precisely ask the training process to “improve this world-model”, instead of optimizing some proxy objective that we think highly correlates with a generative world-model, that would be a different story. But I don’t see how.
That given, where can things go awry?
The Low End
The SGD moves the model along the direction of steepest descent. This means that each SGD step is optimized to improve the model’s loss-minimization ability as much as possible within the range of that step. Informally, the SGD wants to see results, and fast.
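In symbols, a single step is just the locally greedy update (standard gradient descent, written out only to make “steepest” precise):

$$\theta_{t+1} \;=\; \theta_t \;-\; \eta\, \nabla_\theta \mathcal{L}(\theta_t),$$

i.e. the direction that, to first order, reduces the loss fastest for a step of size $\eta$.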
I’d previously analysed the dynamics it gives rise to. In short: the “world-model” part of the ML model improves incrementally, while the model is incentivized to produce results immediately. That means it would develop functions mapping the imperfect world-model to imperfect results — heuristics. But since these heuristics can only attach to the internal world-model, they necessarily start out “shallow”, responding only to surface correlations in the input data, because those are the first components of the world-model to be discovered. With time, as the world-model deepens, these heuristics may deepen in turn… or they may stagnate, with ancient shallow heuristics eating up too much of the loss-pie and crowding out younger and better competition[1].
So we can expect to see something like this here, too. The model would develop a number of shallow heuristics for, e. g., what the second part of the video should look like; we won’t get a “pure” world-model either way. And that’s a fundamental mesa-optimization precursor.
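As a toy illustration of the “crowding out” dynamic (my own construction, not taken from the linked references): give a noisy “shallow” feature a head start over a strictly better “deep” feature, and watch the gradient signal on the deep one wither as the shallow one drives the loss down.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 1000
y = torch.randint(0, 2, (n,)).float() * 2 - 1     # labels in {-1, +1}
x_shallow = y + 0.7 * torch.randn(n)               # noisy surface correlate of the label
x_deep = y.clone()                                  # perfectly predictive "deep" feature

# The shallow feature's weight starts ahead, standing in for "discovered first".
w = torch.tensor([1.0, 0.0], requires_grad=True)    # [w_shallow, w_deep]

for step in range(301):
    logits = w[0] * x_shallow + w[1] * x_deep
    loss = F.softplus(-y * logits).mean()           # logistic loss
    loss.backward()
    if step % 100 == 0:
        print(f"step {step:3d}  loss {loss.item():.3f}  grad on deep feature {w.grad[1].item():+.4f}")
    with torch.no_grad():
        w -= 0.5 * w.grad
        w.grad.zero_()
# As w_shallow grows and the loss drops, the gradient on the deep feature shrinks toward zero,
# so the strictly better feature gets learned slowly, if at all.
```

The analogy is loose (real shallow heuristics aren’t single linear features), but it shows the basic mechanism behind footnote [1].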
The High End
Suppose that the model has developed into a mesa-optimizer, after all. Would there be advantages to it?
Certainly. Deliberative reasoning is more powerful than blind optimization processes like the SGD and evolution; that’s the idea behind the sharp left turn. If the ML model were to develop a proper superintelligent mesa-optimizer, that mesa-optimizer would be able to improve the world-model faster than the SGD, delivering better results quicker given the same initial data.
The fact that it would almost certainly be deceptive is beside the point: the training loop doesn’t care.
The Intermediate Stages
Can we go from the low end to the high end? What would that involve?
Intuitively, that would require the heuristics from the low end to gradually grow more and more advanced, until one of them becomes advanced enough to develop general problem-solving and pull off a sharp left turn. I’m… unsure how plausible that is. On the one hand, the world-model the heuristics are attached to would grow more advanced, and more advanced heuristics would be needed to effectively parse it. On the other hand, maybe the heuristics’ complexity would be upper-bounded somehow in this case, such that no mesa-optimization could arise?
We can look at it from another angle: how much more difficult would it be, for the SGD, to find such an advanced mesa-optimizer, as opposed to a sufficiently precise world-model?
This calls to mind the mathematical argument for the universal prior being malign. A mesa-optimizer that derives the necessary world-model at runtime is probably much, much simpler than the actual highly detailed world-model of some specific scenario. And there are probably many, many such mesa-optimizers (as per the orthogonality thesis), but only ~one fitting world-model.
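One way to phrase that intuition in description-length terms (my gloss on the argument, with $K(\cdot)$ standing in for something like minimum description length):

$$K(\text{general search}) + K(\text{some objective}) \;\ll\; K(\text{detailed world-model of the specific scenario}),$$

and, since almost any objective will do for the left-hand side, the number of parameter-settings implementing the left side plausibly dwarfs the number implementing the right.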
Complications
There’s a bunch of simplification in the reasoning above.
First, the SGD can’t just beeline for the mesa-optimizer. Because of the dynamics outlined in The Low End section, the SGD would necessarily start out building in the world-model. The swerve to building a mesa-optimizer would happen at some later point.
Second, the mesa-optimizer wouldn’t have access to all the same data as the SGD. It would only ever have access to one data-point at a time. So it might not actually be quite as good as the SGD, or not much better.
But I think these might cancel out? If the SGD swerves to mesa-optimization after developing some of the world-model, the mesa-optimizer would have some statistical data about the environment it’s in, and that might still make it vastly better than the SGD.
Conclusion
Overall, it seems plausible that asking for a superhumanly advanced generative world-model would result in a misaligned mesa-optimizer, optimizing some random slew of values/shards it developed during the training.
A complexity penalty would incentivize it, inasmuch as a mesa-optimizer that derives the world-model at runtime would be simpler than that world-model itself.
A speed penalty would disincentivize it, inasmuch as deriving the world-model at runtime and then running it would take longer than just running it.[2]
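Schematically (my notation, not anything standard): if training optimizes something like

$$\mathcal{L}_{\text{total}}(\theta) \;=\; \mathcal{L}_{\text{pred}}(\theta) \;+\; \lambda_C\, C(\theta) \;+\; \lambda_T\, T(\theta),$$

where $C$ is a description-length/complexity measure and $T$ a runtime measure, then the $C$ term pushes toward the (shorter) mesa-optimizer while the $T$ term pushes toward the (faster) explicit world-model; which wins depends on the relative weights.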
The decisive way to avoid this, though, would be coming up with some method to ask for the world-model directly, instead of for a proxy like “predict what this camera will show”. It’s unclear if that’s possible.
Alternatively, there might be some way to upper-bound the heuristics’ complexity, such that no heuristic is ever advanced enough to cause the model to fall into the “mesa-optimization basin”. Note that a naive complexity penalty won’t work, as per the above.
[1] See also: gradient starvation.
[2] Orrr maybe not, if the mesa-optimizer can generate a quicker-running model, compared to what the SGD can easily produce even under speed regularization.
I thought the assumption in ELK is that the “world-model” was a Bayes net. Presumably it would get queried by message-passing. Arguably “message-passing in a Bayes net” is an optimization algorithm. Does it qualify as a mesa-optimizer, given that the message-passing algorithm was presumably written by humans?
Or do you think there could be a different optimizer somehow encoded in the message-passing procedure??
(Or maybe you don’t expect the world-model of a future AGI to actually be a Bayes net, and therefore you don’t care about that scenario and aren’t thinking about it?)
No, I think the Bayes net frame was just a suggestion on how to concretely think about it. I don’t think the ELK doc assumes that it will literally be a Bayes net, and neither do I.
If there is a training scheme that can produce a generative world-model in the form of a Bayes net, with an explanation for how that training scheme routes around the path dependencies I’ve outlined in the low-end section, I’d like to hear about it.