Why GPT wants to mesa-optimize & how we might change this
This post was inspired by orthonormal’s post Developmental Stages of GPTs and the discussion that followed, so only part of it is original.
First I’ll aim to provide a crisper version of the argument for why GPT wants to mesa-optimize. Specifically, I’ll explain a well-known optimization algorithm used in text generation, and argue that GPT can improve performance on its objective by learning to implement something like this algorithm internally.
Then I’ll offer some ideas of mine about how we might change this.
Explanation of beam search
Our goal is to generate plausible text. We evaluate whether text is “plausible” by multiplying together all the individual word probabilities from our language model.
Greedy word selection has a problem: Since it doesn’t do lookahead, it’s liable to get stuck in a dead end. Let’s say we give our system the following poem about cheeses and ask it to generate more text:
Mozzarella is white
So you can see it at night
Cheddar is...
If our language model is decent, the word it will assign the highest probability to is “orange”. But this creates a problem, because “orange” is a hard word to rhyme.
Beam search is an attempt to solve this problem. Instead of picking the next word greedily, we explore the tree of completions and try to find a multi-word completion that maximizes the product of the individual word probabilities.
Because there are so many words in the English language, the tree grows exponentially and quickly becomes intractable. So we choose an integer beam_width for the number of partial completions to track, and each time we take another step deeper into the tree, we discard all but the most plausible beam_width partial completions.
Beam search with a beam width of 2. The bold red path corresponds to the maximum-plausibility completion, which would not get discovered by greedy search because “nice” has a higher probability than “dog”. Image stolen from this Hugging Face blog post, which has another explanation of beam search if you didn’t like mine.
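Here's a minimal sketch of beam search in Python. `next_word_probs` is a stand-in for the language model: it maps a context (a list of words) to a dictionary of next-word probabilities. Plausibility is tracked as a sum of log probabilities, which is equivalent to the product described above.

```python
import math

def beam_search(next_word_probs, prompt, beam_width=2, depth=3):
    # prompt and completions are lists of words; next_word_probs is a
    # stand-in for the language model: context -> {next word: probability}.
    beams = [([], 0.0)]  # (completion so far, log-plausibility)
    for _ in range(depth):
        candidates = []
        for completion, logp in beams:
            for word, p in next_word_probs(prompt + completion).items():
                candidates.append((completion + [word], logp + math.log(p)))
        # Discard all but the beam_width most plausible partial completions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams  # the most plausible multi-word completions found
```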
Claim: GPT can do better on its training objective if it learns to do beam search internally
We’ve discussed text generation with a pretrained language model. Let’s switch gears and talk about the model’s training process.
Suppose GPT’s training corpus has the following poem:
Mozzarella is white
So you can see it at night
Cheddar is marigold
Unless you let it get too old
GPT is trained by giving it some text and asking it to predict the next word. So eventually GPT will be given the example from above
Mozzarella is white
So you can see it at night
Cheddar is...
and be asked to predict the next word.
Let’s consider the performance of two models on this task: regular “naive” GPT, and “beam search amplified” GPT. Beam search amplified GPT works by performing beam search using naive GPT, then looking at the distribution of the first words in the resulting completions, then outputting some weighted average of that distribution and the distribution from naive GPT.
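As a rough sketch in code (reusing the `beam_search` function from above, weighting each completion by its plausibility, and treating the averaging scheme as a free parameter to experiment with):

```python
import math
from collections import defaultdict

def amplified_next_word_dist(next_word_probs, prompt, weight=0.5,
                             beam_width=2, depth=3):
    naive_dist = next_word_probs(prompt)

    # Distribution over the *first* words of the beam search completions,
    # weighting each completion by its plausibility (one possible choice).
    first_word_mass = defaultdict(float)
    for completion, logp in beam_search(next_word_probs, prompt,
                                        beam_width, depth):
        first_word_mass[completion[0]] += math.exp(logp)
    total = sum(first_word_mass.values())
    beam_dist = {w: m / total for w, m in first_word_mass.items()}

    # Weighted average of the two distributions.
    words = set(naive_dist) | set(beam_dist)
    return {w: (1 - weight) * naive_dist.get(w, 0.0)
               + weight * beam_dist.get(w, 0.0)
            for w in words}
```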
Because beam search can find lots of ways to continue the poem using “marigold”, but few ways using “orange”, beam search amplified GPT’s distribution ends up being closer to reality than that of naive GPT. Something like this:
So when we update GPT’s weights during training, we’re shifting the weights towards the sort of computational structure that would make predictions like beam search amplified GPT does.
Does this actually help?
In this instance, GPT has an incentive to do internal lookahead. But it’s unclear how frequently these situations actually arise. And maybe it’s usually easier to do something else, like learning which words are easy to rhyme.
It would be straightforward to implement beam search amplified GPT (experimenting with different weighted averaging schemes) and check whether it can be made to assign higher plausibility to real text. (It might be best to try with GPT-2 rather than GPT-3, in case GPT-3 is already doing internal lookahead. Note that there’s a risk of mesa-optimization developing if lookahead improves performance at any point during GPT’s training.)
Is internal lookahead possible for GPT-3?
Relative to other optimization algorithms, it seems to me that beam search would be unusually easy for GPT to implement. Traditional iterative optimization algorithms like gradient descent or simulated annealing require a lot of serial computation, and the number of serial steps GPT can perform is strongly limited. Beam search is way less heavy on the number of serial steps required. The number of available serial steps would still limit the maximum lookahead horizon though.
The transformer architecture learns computations of the form “find some data from the previous step which scores highly according to particular criteria, do some computation on it, pass it on to the next step”. That sounds like beam search.
In any case, the topic of what incentives arise while training a language model seems important more generally.
Is internal lookahead dangerous?
If GPT’s architecture is capable of discovering lookahead internally, the worry is that GPT might modify and misuse it in creative ways after it’s discovered. It might start making plans, or searching for the idea that maximizes some attribute which is correlated with harm.
Let’s say there are chess problems in GPT’s training corpus which describe a board state along with an objective like “black to move and win in 6 turns even with best play by white”. If GPT can do lookahead internally, it can use this to search for game histories where black wins even though white is playing very well. In other words, it’s doing spontaneous internal planning. And this spontaneous internal planning is incentivized because it helps predict solutions to chess problems.
Who knows what other contexts spontaneous internal planning might get used in.
Fix idea #1: Switch to BERT style training
How might we remove the incentive for mesa-optimization?
A simple idea is to stop training on the task of predicting the next word, and instead train on the task of predicting a masked word given the surrounding context. This is the objective BERT uses. The incentive for internal lookahead seems smaller with this task, but I guess you’d still see it in e.g. predicting masked chess moves of strong players.
BERT’s ability to generate text is unclear. But I have a feeling BERT-style training actually offers greater potential than GPT-style training for text generation, because BERT can edit its writing—see discussion here. You could get really creative, like doing genetic algorithms with a population of texts generated on a particular topic, and using BERT to mutate and recombine texts in the population until you get one with really high plausibility. (Some words at the beginning could stay fixed as the “prompt”.)
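As a toy sketch of the "mutate" step, assuming the Hugging Face transformers library (a recent version) and bert-base-uncased; selection, recombination, and the plausibility scoring are left out, and nothing here is tuned:

```python
import random
from transformers import pipeline  # Hugging Face transformers, recent version

fill = pipeline("fill-mask", model="bert-base-uncased")

def mutate(text, prompt_words=3):
    # One "mutation" step: mask a random word outside the fixed prompt and let
    # BERT propose a replacement, sampled in proportion to its scores.
    # Assumes the text is longer than the prompt.
    words = text.split()
    i = random.randrange(prompt_words, len(words))
    words[i] = fill.tokenizer.mask_token  # "[MASK]" for BERT
    proposals = fill(" ".join(words), top_k=5)
    choice = random.choices(proposals, weights=[p["score"] for p in proposals])[0]
    words[i] = choice["token_str"]
    return " ".join(words)
```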
Fix idea #2: Prohibit updates towards lookahead
This fix is more interesting because the core idea seems applicable for stopping mesa-optimization in other contexts too.
v1 of the idea is: During training, before any gradient update, we make a prediction with beam search amplified GPT. And if an update to naive GPT would mean moving it towards beam search amplified GPT (because beam search amplified GPT assigns a higher probability than naive GPT to the actual next word), we just skip the update and keep moving. Nick Bostrom refers to this general approach as “stunting” in Chapter 9 of Superintelligence.
This diagram from above is an example of an update that would get skipped.
Training will take longer, obviously. But only insofar as all this mesa-optimization stuff was a problem in the first place.
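Here's what v1 might look like in simplified PyTorch, checking a single token at a time. In practice you'd apply the check at every position in a training batch, and computing beam search amplified GPT's probability is the expensive part that's omitted here.

```python
import torch

def v1_training_step(naive_gpt, optimizer, context, next_token, amplified_prob):
    # amplified_prob: probability that beam search amplified GPT assigns to the
    # actual next token (computed separately; that's the expensive part).
    logits = naive_gpt(context)                  # simplified: a [vocab_size] tensor
    log_probs = torch.log_softmax(logits, dim=-1)
    naive_prob = log_probs[next_token].exp()

    # If the amplified model beats naive GPT on the true next word, updating
    # would move naive GPT towards lookahead -- skip the update.
    if amplified_prob > naive_prob:
        return

    loss = -log_probs[next_token]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```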
v2 of the idea is to add a penalty term to GPT’s loss function that causes it to update away from whatever beam search amplified GPT predicts.
A risk with v2 is that GPT is incentivized to emulate beam search amplified GPT so it can avoid making the same predictions (i.e. an internal mesa-optimizer gets developed anyway). Skipping updates as in v1 is probably safer, but the fundamental problem seems possible either way.
So v3 is to track gradient updates towards/away from beam search amplified GPT and ensure a rough balance. We don’t want systematic movement towards or away from beam search amplified GPT—we don’t want incentives to develop either it or its inverse. This could be accomplished by e.g. rescaling gradients on the fly so the average movement towards beam search amplified GPT’s answers is around 0. (To be safe, you might want to balance both the number of updates in each direction and their typical magnitude.) An advantage is that training can go faster than in v1 because less updating is skipped.
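One way to read v3 in code, balancing only the number of updates in each direction:

```python
class TowardsAwayBalancer:
    # Rescale updates so that, over time, updates towards and away from beam
    # search amplified GPT's answers stay roughly balanced in number.
    def __init__(self):
        self.towards = 0
        self.away = 0

    def scale(self, moves_towards_amplified):
        if moves_towards_amplified:
            self.towards += 1
            # Shrink "towards" updates while they outnumber "away" updates.
            return min(1.0, self.away / self.towards)
        self.away += 1
        return min(1.0, self.towards / self.away)
```

Multiplying each example's loss (and hence its gradients) by the returned scale before applying the update keeps the two directions roughly balanced; balancing typical magnitudes as well, as suggested above, would take a bit more bookkeeping.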
Beam search isn’t the only possible way to do lookahead though.
In particular, the distribution of first words in beam search completions will only assign non-zero probability to words which begin one of the completions that beam search actually found. That leaves room for predictive gains from doing lookahead on other words.
So v4 of the idea is to sample from the infinite beam width distribution, particle filter style, and update in such a way that your expected update towards next-word predictions from the infinite beam width distribution is 0. (Not sure how to do this concretely, but it seems like it might be possible.)
The broader issue here is the nearest unblocked strategy problem. The problem seems relatively tractable in this case, because e.g. beam search based stunting seems to remove the incentive for any sort of lookahead insofar as it helps in the cases where beam search amplified GPT outperforms. (I suspect mesa-optimization is not an all-or-nothing proposition, and it will matter how much it has been refined + how readily it is deployed.)
Hence, v5 of the idea—even more underspecified than v4 at this point—is adversarial anti-lookahead stunting. Characterize some kind of “lookahead amplifier” which takes naive GPT as input and attempts to amplify it by doing lookahead-type things in a very freeform way. Train this and make it really good at amplifying GPT to achieve better next-word prediction. Then stunt naive GPT during training by forcing its average update towards the next-word predictions of lookahead-amplified GPT to be 0. (Perhaps you’d want to be retraining the lookahead amplifier repeatedly over the course of GPT’s training, in case different lookahead strategies work best at different GPT maturity levels.) In production, go ahead and make use of lookahead-amplified GPT to make performance as competitive as possible. (It wouldn’t shock me if this scheme actually produced superior performance, if the lookahead amplifier learns lookahead that GPT couldn’t learn for itself, due to a better inductive bias.)
Ideally, GPT’s training process would be monitored to ensure that GPT + lookahead in any form consistently outperforms naive GPT. For bonus points, have the lookahead used in the monitoring be a form of lookahead which is not used for stunting (a “dev set”).
Do these fixes actually help?
An objection to Fix #2 is the possibility of mesa-optimization which isn’t very much like lookahead. For example, if we’re training on text that describes a newly discovered animal, the system has an incentive to try & figure out the animal for itself internally so it can better predict how it will be described—and it might make use of some optimization algorithm, genetic algorithms say, to achieve this.
Another objection is that pulling optimization up from the mesa level, as in the “BERT + genetic algorithms” idea or the “lookahead amplifier in production” idea, isn’t actually helpful. There’s still optimization happening, and the system as a whole could still make devious plans or search for harmful ideas.
However, less mesa-optimization means less risk that transformer blocks develop optimization/planning capabilities and reuse them in contexts we didn’t expect. It’s easier to reason about searching for text which maximizes plausibility than a mysterious mesa-objective. In particular, an agent that gets instantiated internally might search for side-channel attacks in the text generation machinery and surrounding system (especially risky if GPT has read about this stuff). But it seems very unlikely that a search for plausibility-maximizing text would cause this (except maybe if those attacks somehow got activated during training). Non-mesa-optimization also has parameters that allow us to control its strength without retraining the model, and we have a better understanding of how it works.
There’s still a lot of potential for misuse & accidents either way, of course.
OpenAI doesn’t offer beam search? Why? Is GPT-3 already mesa-optimizing?
Up until now, I’ve been pretending that maximizing plausibility (product of individual word probabilities) is a good way to generate text. But beam search doesn’t even seem to be an option in the GPT-3 interface. (Please correct me if I’m missing something!)
Why is beam search missing? One possibility is that GPT-3 already does internal lookahead. OpenAI tried beam search, found it didn’t improve text generation, and didn’t bother adding it as an option. In other words, GPT-3 is already mesa-optimizing 😲
Another possibility:
[Generated text:] “I enjoy walking with my cute dog, but I’m not sure if I’ll ever be able to walk with my dog. I’m not sure if I’ll ever be able to walk with my dog.”
...
...The generated words following the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation in general and seems to be even more so in greedy and beam search...
...
...Recently, there has been more evidence though that the apparent flaws of greedy and beam search—mainly generating repetitive word sequences—are caused by the model (especially the way the model is trained), rather than the decoding method, cf. Welleck et al. (2019).
From the Hugging Face post (emphasis mine). OK, this thing about language models that find repetitive text plausible sounds like a problem that will eventually get solved. Anything else?
As argued in Ari Holtzman et al. (2019), high quality human language does not follow a distribution of high probability next words. In other words, as humans, we want generated text to surprise us and not to be boring/predictable. The authors show this nicely by plotting the probability, a model would give to human text vs. what beam search does.
So let’s stop being boring and introduce some randomness 🤪.
This is a much deeper & more interesting issue IMO. It may be that only a superintelligent language model will find human writing so boringly predictable that every word has high likelihood based on what came before.
Will there be an intermediate stage where prompting a language model with “I just had a brilliant and highly original idea related to X” will cause it to assign higher plausibilities to completions that are actually quite brilliant & original? (Is this the case for GPT-3 already?) I have no idea.
In any case, maybe we could get the benefits of both originality and avoidance of dead ends by sampling from beam search amplified GPT’s next-word distribution to generate text? (This could be especially useful if Fix #2 has been applied and GPT’s ability to do lookahead for itself has been stunted.)
Note also that the surprisingness of human text could be an objection to the “GPT can do better on its training objective if it learns to do beam search for itself” claim above. If human text tends to have periodic surprises, using beam search to look for predictable completions may not help performance since those predictions aren’t actually very likely.
However, it also may be the case that beam search ends up improving the accuracy of next-word prediction despite the fact that it doesn’t generate interesting text.
I’m skeptical that internal beam search would help in language modeling.
Language modeling is like predicting the weather, in the sense that even if you are literally as good as possible at it, your prediction accuracy still degrades rapidly as a function of the number of steps ahead you’re looking. So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.
Weather is like this because of chaotic dynamics. Language modeling is like this because
(a) Text is used to communicate: the writer expects the audience to learn something from the last X% of a text that they couldn’t extrapolate from reading the first (100-X)%, or else they’d just stop and not write the remaining X%.
(b) By construction, language modeling gives you nothing to work with except the text itself, so you don’t know who produced it or for whom. So even if you were smart enough to guess what any individual human would say next (!), you don’t know which human produced the text you’re looking at. (Or even whether it was a human at all.)
Thus (IMO), language modeling is not really about thinking ahead to find some “objectively correct” next move as in Chess/Go. It’s more about trying to guess what the author of this text will do in the very next step. The author and the LM are almost sure to diverge after a few more steps, so even if the LM had a beam search oracle, I expect it wouldn’t find it very useful.
To make the point concrete, I don’t think “orange” is necessarily a bad guess here—among other things, it would be the correct guess if the author were trying to illustrate the point of your example!
And if we were predicting this post itself, the true next token would not be orange or any other word but an ellipsis ”...”, which seems bizarre from the narrow perspective of the example, but is typical of the wild world LMs operate in. (Which also contains typos, actually-incoherent writers, mangled formatting, the list goes on . . . )
A system which develops small-L lookahead (for L > 1) may find large-L lookahead to be nearby in programspace. If so, incentivizing the development of small-L lookahead makes it more likely that the system will try large-L lookahead and find it to be useful as well (in predicting chess moves for instance).
My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture.
Anyway, the question here isn’t whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution. Lookahead is almost certainly going to do better than random guessing; even topic models can do that.
Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?
Can you say a bit more about why it's enough for look-ahead to merely improve performance? SGD favors better improvements over worse improvements—it feels like I could think of many programs that are improvements but which won’t be found by SGD. Maybe you would say there don’t seem to be any improvements that are this good and this seemingly easy for SGD to find?
From a safety standpoint, hoping and praying that SGD won’t stumble across lookahead doesn’t seem very robust, if lookahead represents a way to improve performance. I imagine that whether SGD stumbles across lookahead will end up depending on complicated details of the loss surface that’s being traversed.
I agree, and thanks for the reply. And I agree that even a small chance of catastrophe is not robust. Though I asked because I still care about the probability of things going badly, even if I think that probability is worryingly high. Though I see now (thanks to you!) that in this case our prior that SGD will find look-ahead is still relatively high and that belief won’t change much by thinking about it more due to sensitivity to complicated details we can’t easily know.
No, it’s a more philosophical point. Even if such things appear in the context window, they’re simply more text, and convey the same kind of information: not “the denotation of these words is factually true,” but “these words are part of the text.”
For example, the mere appearance of something like
Title: Why GPT wants to mesa-optimize & how we might change this
Author: John_Maxwell
does not guarantee that the text following it bears that title, or was written by that author. (As I am illustrating right now.)
Of course, one can design datasets where information like this is provided more authoritatively—say, always at the start of each text, curated for quality, etc. (GPT isn’t like that, but Grover and CTRL kind of are, in different ways.)
But even that can only go so far. If the author is “Julius Caesar,” does that mean the historical figure, some internet poster with that handle, or any number of other possibilities? A passage of fiction written in a character’s voice—is the appropriate author cue the actual writer (who may have written in many different voices over their career) or the character? (Note that the character is a much better answer to the question “who does this sound like?”) And doesn’t the date matter too, so we know whether this post in the venue “Less Wrong” was on 2010's LW or 2020's?
Fundamentally, language modeling is about understanding structures in decontextualized blocks of contiguous words. You can try to hack in some sidechannels to provide context, but there’s no way they will capture everything needed to locate the text fully in its social, physical, and temporal position within the broader world. And just as a definitional matter, these sidechannels are modifications to “language modeling,” which in its purest sense is just about filling in an arbitrary text from substrings of it (and no other information).
Yeah, not for transformers I think.
capybaralet’s point about conservation of expected evidence applies here—GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.
If we then say “the mechanism for pricing them in is doing internal lookahead,” then we are imagining that lookahead operating over some predictor that is otherwise good but hasn’t priced in lookahead yet. But I don’t know why we should imagine the computation would naturally factor this way, when the benefits of lookahead are small and beam search takes a lot of parameters to implement internally.
Your philosophical point is interesting; I have a post in the queue about that. However I don’t think it really proves what you want it to.
Having John_Maxwell in the byline makes it far more likely that I’m the author of the post.
If humans can make useful judgements re: whether this is something I wrote, vs something nostalgebraist wrote to make a point about bylines, I don’t see why a language model can’t do the same, in principle.
A perfectly optimal next-step predictor would not be improved by lookahead or anything else, it’s perfectly optimal. I’m talking about computational structures which might be incentivized during training when the predictor is suboptimal. (It’s still going to be suboptimal after training with current technology, of course.)
In orthonormal’s post they wrote:
I suspect that either GPT-4 will still be unable to plan its way to a satisfying resolution, or GPT-4 will develop some kind of internal lookahead (probably not beam search, but beam search could be a useful model for understanding it) which is sufficiently general to be re-used across many different writing tasks. (Generality takes fewer parameters.) I don’t know what the relative likelihoods of those possibilities are. But the whole idea of AI safety is to ask what happens if we succeed.
Beam search has never worked for likelihood-trained NNs, since at least char-RNNs back in 2015. Beam search does trigger repetition and other pathologies in GPT, see “The Curious Case of Neural Text Degeneration”, Holtzman et al 2019. And while unlikelihood training seems to help, it’s not a silver bullet, and is a bit ad hoc (especially if you think of it in terms of reinforcement learning).
Seq2seq used beam search and found it helped (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43155.pdf). It was standard practice in the early days of NMT; I’m not sure when that changed.
This blog post gives some insight into why beam search might not be a good idea, and is generally very interesting: https://benanne.github.io/2020/09/01/typicality.html
It still is; it’s just that beam search (or other search strategies) seems to be mostly useful for closed-ended short text generation. Translating a sentence apparently is a task with enough right-or-wrong-ness to it that beam search taps into no pathologies, but they get exposed in open-ended longform generation.
I’m going with “very frequently, perhaps universally”. An example I came up with here was choosing “a” vs “an” which depends on the next word.
I think writing many, maybe most, sentences, requires some idea of how the sentence structure is going to be laid out, and that “idea” extends beyond the next token. Ditto at the paragraph level etc.
So I think it already does lookahead in effect, but I don’t think it does it by “beam search” per se. I think it’s more like “using concepts that extend over many tokens”, concepts like “this sentence has the following overall cadence...” and “this sentence conveys the following overall idea...” and “we’re in the middle of writing out this particular idiomatic phrase”. The training simultaneously incentives both finding the right extended concepts for where you’re at in the text, and choosing a good word in light of that context.
I used your idea of “a” vs. “an” as the basis of a GPT-3 experiment to show that GPT-3 indeed probably does do lookahead. Details are at https://www.reddit.com/r/GPT3/comments/k0mvf3/experiment_that_shows_that_gpt3_can_probably_plan/
Thanks for sharing!
You’re welcome, and thank you for your post also :). I posted an updated version of my experiment, which (hopefully) improves the logic of my prior experiment, at https://www.reddit.com/r/MachineLearning/comments/k2n3yv/d_an_experiment_that_shows_that_gpt3_can_plan/.
This post distinguishes between mesa-optimization and learned heuristics. What you’re describing sounds like learned heuristics. (“Learning which words are easy to rhyme” was an example I gave in the post.) Learned heuristics aren’t nearly as worrisome as mesa-optimization because they’re harder to modify and misuse to do planning in unexpected domains. When I say “lookahead” in the post I’m pretty much always referring to the mesa-optimization sort.
Suppose I said (and I actually believe something like this is true):
“GPT often considers multiple possibilities in parallel for where the text is heading—including both where it’s heading in the short-term (is this sentence going to end with a prepositional phrase or is it going to turn into a question?) and where it’s heading in the long-term (will the story have a happy ending or a sad ending?)—and it calculates which of those possibilities are most likely in light of the text so far. It chooses the most likely next word in light of this larger context it figured out about where the text is heading.”
If that’s correct, would you call GPT a mesa-optimizer?
Well I suppose mesa-optimization isn’t really a binary is it? Like, maybe there’s a trivial sense in which self-attention “mesa-optimizes” over its input when figuring out what to pay attention to.
But ultimately, what matters isn’t the definition of the term “mesa-optimization”, it’s the risk of spontaneous internal planning/optimization that generalizes in unexpected ways or operates in unexpected domains. At least in my mind. So the question is whether this considering multiple possibilities about text stuff could also improve its ability to consider multiple possibilities in other domains. Which depends on whether the implementation of “considering multiple possibilities” looks more like beam search vs very domain-adapted heuristics.
I think the Transformer is successful in part because it tends to solve problems by considering multiple possibilities, processing them in parallel, and picking the one that looks best. (Selection-type optimization.) If you train it on text prediction, that’s part of how it will do text prediction. If you train it on a different domain, that’s part of how it will solve problems in that domain too.
I don’t think GPT builds a “mesa-optimization infrastructure” and then applies that infrastructure to language modeling. I don’t think it needs to. I think the Transformer architecture is already raring to go forth and mesa-optimize, as soon as you as you give it any optimization pressure to do so.
So anyway your question is: can it display foresight / planning in a different domain without being trained in that domain? I would say, “yeah probably, because practically every domain is instrumentally useful for text prediction”. So somewhere in GPT-3's billions of parameters I think there’s code to consider multiple possibilities, process them in parallel, and pick the best answer, in response to the question of What will happen next when you put a sock in a blender? or What is the best way to fix an oil leak?—not just those literal words as a question, but the concepts behind them, however they’re invoked.
(Having said that, I don’t think GPT-3 specifically will do side-channel attacks, but for other, unrelated reasons that are off-topic here. Namely, I don’t think it is capable of making the series of new insights required to develop an understanding of itself and its situation and then take appropriate actions. That’s based on my speculations here.)
This makes me wonder, how would Monte Carlo tree search do for GPT? And could you do AlphaGo-style IDA?
You’d need an analogue of the value network (or value head). (Where current GPT seems analogous to the policy network.) And then ideally you’d also want some analogue of winning / losing to ground out the evaluation.
Maybe you could set it up like this --
start with a task description like, “write a poem in the style of e.e. cummings about the romance between cryptographers Alice and Bob”
feed the task description (with some boilerplate) into GPT, and have it start generating continuations
do MCTS on the continuations; use your value network (head) to evaluate the continuations vs the task description; update the policy network based on the evaluations
include an “is done” head and evaluate it to decide when to stop
send completed works to humans to provide feedback; the feedback should include separate scores for “good so far” for the value head, and “is a completed work” for the “is done” head.
I’d be curious whether this would enable GPT to significantly improve. Specifically, would you be able to generate longer works with less intervention?
See GPT-f for combining a transformer model (with pre-trained language weights?) with alphazero style training to learn to prove theorems
Oh, I had actually seen that paper. Forgot that they did that though. Thanks!
I didn’t read the post (yet...), but I’m immediately skeptical of the claim that beam search is useful here (“in principle”), since GPT-3 is just doing next step prediction (it is never trained on its own outputs, IIUC). This means it should always just match the conditional P(x_t | x_1, .., x_{t-1}). That conditional itself can be viewed as being informed by possible future sequences, but conservation of expected evidence says we shouldn’t be able to gain anything by doing beam search if we already know that conditional. Now it’s true that efficiently estimating that conditional using a single forward pass of a transformer might involve approximations to beam search sometimes.
At a high level, I don’t think we really need to be concerned with this form of “internal lookahead” unless/until it starts to incorporate mechanisms outside of the intended software environment (e.g. the hardware, humans, the external (non-virtual) world).
Yeah, that’s the possibility the post explores.
Is there an easy way to detect if it’s started doing that / tell it to restrict its lookahead to particular domains? If not, it may be easier to just prevent it from mesa-optimizing in the first place. (The post has arguments for why that’s (a) possible and (b) wouldn’t necessarily involve a big performance penalty.)
My intuitions on this matter are:
1) Stopping mesa-optimizing completely seems mad hard.
2) Managing “incentives” is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence.
3) On the other hand, it probably won’t scale forever.
To elaborate on the incentive management thing… if we figure that stuff out and do it right and it has the promise that I think it does… then it won’t restrict lookahead to particular domains, but it will remove incentives for instrumental goal seeking.
If we’re still in a situation where the AI doesn’t understand its physical environment and isn’t incentivized to learn to control it, then we can do simple things like use a fixed dataset (as opposed to data we’re collecting online) in order to make it harder for the AI to learn anything significant about its physical environment.
Learning about the physical environment and using it to improve performance is not necessarily bad/scary absent incentives for control. However, I worry that having a good world model makes an AI much more liable to infer that it should try to control and not just predict the world.
As I mentioned in the post, I don’t think this is a binary, and stopping mesa-optimization “incompletely” seems pretty useful. I also have a lot of ideas about how to stop it, so it doesn’t seem mad hard to me.
I’m less optimistic about this approach.
There is a stochastic aspect to training ML models, so it’s not enough to say “the incentives favor Mesa-Optimizing for X over Mesa-Optimizing for Y”. If Mesa-Optimizing for Y is nearby in model-space, we’re liable to stumble across it.
Even if your mesa-optimizer is aligned, if it doesn’t have a way to stop mesa-optimization, there’s the possibility that your mesa-optimizer would develop another mesa-optimizer inside itself which isn’t necessarily aligned.
I’m picturing value learning via (un)supervised learning, and I don’t see an easy way to control the incentives of any mesa-optimizer that develops in the context of (un)supervised learning. (Curious to hear about your ideas though.)
My intuition is that the distance between Mesa-Optimizing for X and Mesa-Optimizing for Y is likely to be smaller than the distance between an Incompetent Mesa-Optimizer and a Competent Mesa-Optimizer. If you’re shooting for a Competent Human Values Mesa-Optimizer, it would be easy to stumble across a Competent Not Quite Human Values Mesa-Optimizer along the way. All it would take would be having the “Competent” part in place before the “Human Values” part. And running a Competent Not Quite Human Values Mesa-Optimizer during training is likely to be dangerous.
On the other hand, if we have methods for detecting mesa-optimization or starving it of compute that work reasonably well, we’re liable to stumble across an Incompetent Mesa-Optimizer and run it a few times, but it’s less likely that we’ll hit the smaller target of a Competent Mesa-Optimizer.
By managing incentives I expect we can, in practice, do things like: “[telling it to] restrict its lookahead to particular domains”… or remove any incentive for control of the environment.
I think we’re talking past each other a bit here.
Epistemic status: I’m not really an expert at NLP. I’ve only been working on language modeling for ~8mo, which is much less than some of the folks here, and this is based on my experiences.
Beam Search:
Beam search with large unsupervised generatively pretrained transformers (GPTs) is weirder than it appears in the NLP literature. Other commenters have mentioned degeneracies, but for me the sticking points for beam search were:
It tends to quickly collapse onto a modal response — so it’s already bad for any situation where you want to generate a diversity of samples and choose the best one
It’s hard to correctly score between varying-length segments. Every paper that uses beam search has some heuristic hack here, which is almost always some parametrized function they pulled from another paper or hacked together.
It seems to mostly do best (once tuned) at some narrow/specific distribution (e.g. generating short responses in a chat setting). It’s hard to get beam search tuned to work well across the full distribution used to train these models (i.e. “text on the internet”)
Given these three issues, in my experience it’s been better to just focus on tuning naive sampling, with a few key parameters: temperature, top_p, etc (these are part of the OpenAI API).
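For concreteness, here's a minimal (purely illustrative) version of those two knobs, temperature plus nucleus/top-p sampling, written against a raw logits vector:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    # Temperature rescales the logits before the softmax.
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()

    # Nucleus (top-p) sampling: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize and sample.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[:np.searchsorted(cumulative, top_p) + 1]
    return np.random.choice(keep, p=probs[keep] / probs[keep].sum())
```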
Caveat: it’s possible I’m just bad at tuning beam search. It’s possible I’m bad at scholarship and missed the “one key paper” that would make it all clear to me. I would take the above as more of an anecdote than a scientific result.
Separation of training and sampling:
This has been mentioned by other commenters, but might bear repeating that there is no sampling at all in the training process for GPTs. They’re trained to approximate marginal next token distributions, and the default is to share the loss on the prediction for every token equally. In practice the loss on later tokens is lower.
All of this is to say that training is a separate process from sampling. I think there is probably very good research to be done in better sampling — in particular, I think it is possible to have a machine which aligns sampling from an unaligned model.
Lookahead & pondering:
I think the point about lookahead is still worth considering. One of the differences between transformers and the previous most-popular architecture for language models (LSTMs) is that transformers use the same amount of compute for every token. (It’s possible to build them otherwise, but I haven’t seen any of these that I’ve been impressed by yet)
I think my favorite example of this in the literature is [Adaptive Computation Time (ACT)](https://arxiv.org/abs/1603.08983), where essentially the model learns how to “spend” extra compute on certain characters.
(One of the things going on with ACT is dealing with the non-uniformity of the distribution of information content in character strings — for GPTs this is at least partially ameliorated by the byte-pair encoding)
So I think it is reasonable to train a model to be able to use extra “pondering” time when sampling. Either by having an external controller that tells the model when to ponder and when to output, or by having the model learn itself how to ponder (which is the “halting neuron” signal in ACT).
I do think that any sort of pondering is subject to mesa-optimization concerns.
Fix 1 - BERT:
Caveat: I haven’t trained BERT models or taken a trained one and tried hard to get high quality samples from it. This is based on intuitions and hearsay.
Here I’ll use “GPT” to refer to autoregressive next token prediction objectives, to mirror the style of the article. This objective can of course be used with other architectures in other settings.
Instead of thinking of the “mask-part-out prediction” (BERT) and the “mask future text” (GPT) as two separate tasks, think of them as points in the space of distributions over masks.
In particular, it's trivial to come up with mask distributions that include both a preponderance of masks which leave small parts out (BERT-like) and masks which leave future tokens out (GPT-like), as well as possibly other mask patterns.
My intuition is that the higher probability you mask out all future tokens, the easier it is to get high quality samples from that model.
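As a toy illustration of that framing, both objectives (and anything in between) can fall out of a single mask-sampling routine:

```python
import random

def sample_mask(seq_len, p_future_mask=0.5, p_token_mask=0.15):
    # Returns a boolean mask over token positions (True = hidden, i.e. a
    # prediction target). With probability p_future_mask, mask everything past
    # a random cut point (GPT-like); otherwise mask random tokens (BERT-like).
    if random.random() < p_future_mask:
        cut = random.randrange(1, seq_len)
        return [i >= cut for i in range(seq_len)]
    return [random.random() < p_token_mask for _ in range(seq_len)]
```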
Fix 1 - Editing Text:
(Same caveat as above regarding inexperience w/ BERT models)
BERT objectives by themselves do not allow efficient text editing, and neither do GPT objectives.
Thinking about the task of composing an edit, the model needs to:
Identify the section that will be removed (if any)
Figure out the length of the replacement text (if any)
Compose the replacement text (if any)
Possibly also have some way of attending over the old text, while still knowing to replace it
Neither the BERT nor the GPT objective does a great job of this by itself. If I had to choose, though, I think you can encode this sort of thing in the GPT dataset and have it autoregressively generate edits.
(This is part of a conjecture I’ve been meaning to write up for LessWrong: “the dataset is the interface” for GPT models.)
Fix 2 - Changing the training:
I think there’s some interesting stuff here, but so far this is in the regime of training algorithms that are unexplored, enormously complex, and poorly understood.
The clearest part here is that it uses sampling in the training loop which so far I’ve almost exclusively seen in reinforcement learning (RL).
But, we can probably implement something like this with RL. In particular, training is a process of selecting a context (masking), sampling from the model to fill in the mask, and scoring based on the objective.
In this case, drawing some analogies to RL:
Action—token
Action distribution—token distribution (the basic output of a GPT model given an input context)
Policy—language model (in particular a GPT model, though with hacks BERT/other models could be used)
Reward—objective (log-loss on the true document, for a GPT model)
Environment—a document, probably with some starting context already provided
It’s pretty easy to see here that this wouldn’t work well for generating from scratch. If I provide zero contextual tokens to the model, sample N tokens, and then score it on how close it got to a true (hidden) document, I am going to have a very bad time.
This might be a good approach for fine-tuning a GPT model — which is [exactly what some colleagues did](https://openai.com/blog/fine-tuning-gpt-2/).
Even in the fine-tuning case, we have all of the myriad and sundry problems with RL (instability, inefficiency, etc) that our plain-and-simple language modeling objective lacks.
Fix 2 - update away:
Just from experience, I think this probably won’t work. I’ve found it very hard to get the model to “reduce your probability on the most likely outcome and increase your probability on the next most likely outcome” — instead, objectives like this tend to just increase the temperature of everything (or worse, they put all of the increase in entropy into the long tail of bad answers).
It’s possible there is a good way to do this, but for now I don’t know of a good way to get a model to increase the probability of “secondary options” without just degenerating into increasing entropy.
Fix 2 - track updates:
If I understand this correctly, I think this is easily approximated by having an objective/loss/reward term which penalizes differences from the original model. For small deltas I think this is a good approach, and unfortunately it is only as good as the original model you’re comparing it to.
As far as the specific proposal for managing updates towards/away from beam search updates, that seems also possible via a similar mechanism — penalize distributional difference from those samples.
I think we haven’t really explored these sort of penalties enough, and in particular how they interact when combined with other objectives.
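A minimal sketch of that kind of penalty in PyTorch, where `reference_logits` come from whatever frozen reference you want to penalize differences from (the original model, or samples from beam search amplified GPT), and the sign and weight of the penalty depend on which variant of Fix #2 you're after:

```python
import torch.nn.functional as F

def penalized_loss(logits, reference_logits, targets, beta=0.1):
    # Standard language modeling loss on the true next tokens.
    nll = F.cross_entropy(logits, targets)
    # Penalty on distributional difference from the reference:
    # KL(reference || model), computed from log-probabilities.
    kl = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.log_softmax(reference_logits, dim=-1),
                  reduction="batchmean", log_target=True)
    return nll + beta * kl
```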
Fix 2 - will it stunt:
I think that any objective that scores better predictions higher will incentivize some sort of lookahead/pondering.
If you prevent it from being coincident with the beam search distribution, then I expect the model will learn how to do lookahead/pondering in the null space of beam search.
Will these solve mesa-optimization:
This isn’t clear to me, but I think it’s worth studying.
In particular, it would be good to figure out some way of contriving a mesa-optimization setup, such that we could measure if these fixes would prevent it or not.
Beam Search in the API:
I think my above comments about Beam Search apply here.
Beam search, like any optimization algorithm, is hugely dependent on its scoring function. If you score on likelihood, you’ll end up with high-likelihood (“unsurprising”) text.
Future thoughts—sampling research:
I think in general we’re in a weirdly asymmetric world, where we have put a huge amount of compute and effort into computing auto-regressive next-token distributions, and comparatively very little sophistication into sampling from them.
This comment is probably too long already for me to expand too much on this, but in particular, I think the log-likelihood objective is default unaligned (as most datasets are default unaligned) but I think we can find ways of sampling from log-likelihood optimized models in ways that are aligned.
With regard to the editing text discussion, I was thinking of a really simple approach where we resample words in the text at random. Perhaps that wouldn’t work great, but I do think editing has potential because it allows for more sophisticated thinking.
Let’s say we want our language model to design us an aircraft. Perhaps its starts by describing the engine, and then it describes the wings. Standard autoregressive text generation (assuming no lookahead) will allow the engine design to influence the wing design (assuming the engine design is inside the context window when it’s writing about the wings), but it won’t allow the wing design to influence the engine design. However, if the model is allowed to edit its text, it can rethink the engine in light of the wings and rethink the wings in light of the engine until it’s designed a really good aircraft.
Agreed. Perhaps if we generated lots of travelling salesman problem instances where the greedy approach doesn’t get you something that looks like the optimal route, then try & train a GPT architecture to predict the cities in the optimal route in order?
This is an interesting quote:
Source.
I suspect GPT will be biased towards avoiding mesa-optimization and making use of heuristics, so the best contrived mesa-optimization setup may be an optimization problem with little structure where heuristics aren’t very helpful. Maybe we could focus on problems where non-heuristic methods such as branch and bound / backtracking are considered state of the art, and train the architecture to mesa-optimize by starting with easy instances and gradually moving to harder and harder ones.
Clarifying Q: Does mesa-optimization refer to any inner optimizer, or one that is in particular not aligned with the outer context?
I was using it to refer to “any inner optimizer”. I think that’s the standard usage but I’m not completely sure.
Why?
My thought was that if lookahead improves performance during some period of the training, it’s liable to develop mesa-optimization during that period, and then find it to be useful for other things later on.
Does the fact that GPT-3 can do word segmentation for English text with no spaces shed any light on the issues covered in this post? Please see https://www.reddit.com/r/MachineLearning/comments/j9a6lh/d_gpt3_can_do_word_segmentation_for_english_text/ for further details.