I recently (re-)read Risks From Learned Optimization, which introduced mesa-optimizers and inner alignment. (I don’t aim to explain what these things mean in this post, but here’s Scott Alexander’s explanation from earlier today.) I have some questions about mesa-optimizers, all to the effect of “What would this actually look like?”
1. Would training a neural net on an optimization task produce a mesa-optimizer?
My understanding (correct me if I’m wrong) is that as of yet we do not have an example of mesa-optimization in practice. So, what would happen if we trained a neural net to do a task that clearly requires optimization?
For example, consider the following task: the neural net receives a polynomial P(x) (perhaps with its coefficients represented in binary as input), and the neural net is to output the value x in the interval [0, 1] such that P(x) is as large as possible. Perhaps the loss on output x is the squared difference between P(x) and the true largest value of P on [0, 1].
I’m not sure what I expect the neural net to learn to do. Maybe it would try lots of values of x and output the optimal x of the ones it tried. Maybe it would learn to take the derivative of P and do Newton’s method. But I can’t imagine it learning to do this well without learning to do some sort of search or optimization. It seems, then, that if a neural net were successfully trained on this task, it would necessarily have a mesa-optimizer.
So my question is: (a) does what I said seem right? And (b) what would happen if we actually tried this now? Are current neural network architectures (with current levels of compute) able to learn this task? Or maybe a simpler task in the same spirit?
2. What would a neural net doing search actually look like?
When explaining mesa-optimizers, I currently wave my hands and say something like “Neural networks are arbitrarily expressive, so it is possible to encode a for loop to do search as weights in a neural net.” But I’d love to actually see what this looks like, as honest-to-goodness weights. I’m not sure whether the following exercise is pretty easy or extremely hard, but consider the following program that, given n, searches for k < n^0.7 that maximizes k^2 mod n:
Input: integer n ⇐ 2^30.
k = 0
for i = 1 to n^0.7:
if i*i mod n > k*k mod n:
k = i
return k
How would one write this program as a neural net? (What do I mean by “write this program as a neural net”? I basically mean that if you spent enough time, you could explain to me what exactly the neural net is doing and I would come out of the conversation satisfied that the neural net’s weights act in such a way that it’s doing something that is equivalent to the above for loop.) How big would the neural net need to be?
(Note: as I was writing this, I noticed that the neural net would need to be really big—linear in n^0.7, exponential in the length of n encoded as a bit string. That’s because a forward pass in a neural net takes time linear in the number of connections, so a neural net isomorphic to the one above would need a number of connections linear in n^0.7. Does this mean that neural nets that do search need to be quite large, or am I missing something? What does this imply about the point at which mesa-optimization becomes more effective than memorizing heuristics?)
3. Do we need mechanistic interpretability to check whether a neural net contains a mesa-optimizer?
At first I thought that the answer is yes: how could we know the details of what a neural net is doing without looking inside? Now I’m not so sure: perhaps it’s possible to detect optimizers purely through input-output behavior. For example, suppose our neural net solves a problem for which some inputs are particularly challenging and require optimization in order to attain a low loss. If the neural net has learned an optimization algorithm, it would quite plausibly do as well on these inputs as on others. If it hasn’t, it would do worse. So maybe we can detect mesa-optimizers by comparing the neural net’s loss on “easy” inputs to its loss on “hard” inputs.
I don’t know to what extent this approach is feasible, and whether it’s pretty generally useful or only useful on some toy mathematical problems.
Fragment of an answer to 2: it seems weird to imagine a feedforward network implementing a for-loop layer by layer, if the weights aren’t tied. I guess it might involve layers with flexible behavior, with the activations carrying instructions for each layer. It’s definitely more natural for a recurrent network, since each layer needs to be able to translate the instructions into the same computational behavior. On the other hand (peeking at other answers) language models seem to be a natural fit for doing for-loop-like behavior across sequence position.
In general, feedforward networks are circuits, not programs, and I don’t have good intuition about what computationally nontrivial circuits are like.
Yeah, it seems more natural to think of a recurrent network, or even better, a memory network such as a Neural Turing Machine. The question becomes similar to “what does a loop look like at the hardware level in a CPU?”.
For #1, yes, what you say makes sense and fits my understanding of the relevant terminology. It’s conceivable that a neural network would learn a good approximate lookup table instead, but if the architecture were sufficiently capable (EG, neural turing machines) learning search would be more compact & more accurate.
Re question 3 (is it possible to detect the presence of a mesa-optimiser from only its input-output behavior?), I was recently mulling over the following idea.
Suppose that on some task with some loss function, ML models tend to produce losses distributed in some predictable way (e.g. the losses are usually exponentially distributed, with the parameter λ getting larger the better your model is). Now if our model has a mesa-optimiser inside of it, I would expect the loss to be distributed like min(X1,…,Xn) where the Xi are drawn from some distribution of the sort we originally expected. This minimum might be distributed differently than our originally expected distribution, giving a way to detect a mesa-optimizer just by looking at the model’s losses!
Some caveats:
This won’t work if we originally expected our losses to be exponentially distributed, because the minimum of a collection of exponential random variables is another exponential random variable.
The more decoupled the mesa-objective is from the base-objective the less this reasoning will apply (since the mesa-optimiser will output the option which minimizes the mesa-loss, not the base-loss). That said, my intuition is that if you have a mesa-optimiser your loss distribution should still look weird somehow (since the output of an argmin should generally be distributed differently than the options among which the argmin selects).
This intuition must break down in extreme cases though: a deceptively aligned mesa-optimiser smart enough to know we’re watching the loss distribution for signs of a mesa-optimiser could ensure that its outputs are distributed the way we expect.
As I understand it, OpenAI argue that GPT-3 is a mesa-optimizer (though not in those terms) in the announcement paper Language Models are Few-Shot Learners. (Search for meta.) (edit: Might have been in another paper. I’ve seen this argued somewhere, but I might have the wrong link :( ) Paraphrased, the model has been shown so many examples of the form “here are some examples that create an implied class, is X an instance of the class? Yes/no”, that instead of memorizing the answers to all the questions, it has acquired a general skill for abstracting at runtime (over its context window). So while you have gradient descent going trying to teach the network a series of classes, the network might actually pick up feature learning itself as a skill instead, and start doing its own learning algorithm over just the context window.