In general, I’m a bit unsure about how much of an interpretability advantage we get from slicing the model up into chunks. If the pieces are trained separately, then we can reason about each part individually based on its training procedure. In the optimistic scenario, this means that the computation happening in the part of the system labeled “world model” is actually something humans would call world modelling. This is definitely helpful for interpretability. But the alternative possibility is that we get one or more mesa-optimizers, which seems less interpretable.
I for one am moderately optimistic that the world-model can actually remain “just” a world-model (and not a secret deceptive world-optimizer), and that the value function can actually remain “just” a value function (and not a secret deceptive world-optimizer), and so on, for reasons in my post Thoughts on safety in predictive learning—particularly the idea that the world-model data structure / algorithm can be relatively narrowly tailored to being a world-model, and the value function data structure / algorithm can be relatively narrowly tailored to being a value function, etc.
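To make the "separately trained chunks" picture concrete, here is a minimal toy sketch, not LeCun's actual architecture; the module names, shapes, and objectives are all hypothetical choices of mine. The point it illustrates is just that the world-model and the value function can each be a narrowly tailored module with its own loss and its own optimizer, so neither is shaped by gradients from the other's objective.

```python
import torch
import torch.nn as nn

# Toy illustration (NOT LeCun's actual architecture): two separately trained
# modules, each with a deliberately narrow objective. All names, shapes, and
# hyperparameters here are hypothetical.

class WorldModel(nn.Module):
    """Predicts the next observation from the current observation and action."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class ValueFunction(nn.Module):
    """Scores an observation; trained only on value regression."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs)

# Each module gets its own loss and its own optimizer, so gradients from the
# value objective never flow into the world model (and vice versa).
world_model = WorldModel(obs_dim=8, act_dim=2)
value_fn = ValueFunction(obs_dim=8)
wm_opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)
vf_opt = torch.optim.Adam(value_fn.parameters(), lr=1e-3)

# Placeholder batch standing in for logged transitions and value targets.
obs = torch.randn(32, 8)
act = torch.randn(32, 2)
next_obs = torch.randn(32, 8)
value_target = torch.randn(32, 1)

# World-model update: pure next-state prediction loss.
wm_loss = nn.functional.mse_loss(world_model(obs, act), next_obs)
wm_opt.zero_grad(); wm_loss.backward(); wm_opt.step()

# Value-function update: pure value regression loss.
vf_loss = nn.functional.mse_loss(value_fn(obs), value_target)
vf_opt.zero_grad(); vf_loss.backward(); vf_opt.step()
```

The separate training loops are the crux: if the world-model is only ever pushed to predict well, there is at least a plausible story for why the thing it learns stays "just" a world-model.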
Since LeCun’s architecture, taken as a whole, is a kind of optimizer (I agree with Algon that it’s probably a utility maximizer), the emergence of additional mesa-optimizers seems less likely.
We expect optimization to emerge because it’s a powerful algorithm for SGD to stumble upon, one that outcompetes the alternatives. But if the system is already an optimizer, where is the selection pressure to produce another one coming from?
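Continuing the toy modules from the sketch above, here is the sense in which the assembled system is "already an optimizer": planning can be an explicit search over candidate actions, scored with the world model and value function, rather than something SGD has to rediscover inside a monolithic network. Again, this is purely illustrative, not LeCun's actual planning procedure.

```python
# Toy planner: explicit search over candidate actions, scored by imagining
# outcomes with the world model and evaluating them with the value function.
# The argmax at the end is the "optimizer" in the assembled system.

def plan(world_model, value_fn, obs, num_candidates: int = 128):
    """Pick the action whose predicted next state the value function scores highest."""
    candidates = torch.randn(num_candidates, 2)              # random candidate actions
    obs_batch = obs.expand(num_candidates, -1)               # same current state for each candidate
    with torch.no_grad():
        predicted_next = world_model(obs_batch, candidates)  # imagined outcomes
        scores = value_fn(predicted_next).squeeze(-1)        # how good each outcome looks
    return candidates[scores.argmax()]                       # explicit argmax over candidates

best_action = plan(world_model, value_fn, torch.randn(1, 8))
```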
It’s coming from the fact that every module wants to be an optimizer of something in order to do its job.
Interesting. I wonder how the dynamics of a system with multiple mesa-optimizers would play out (if such a thing is even possible).