Epistemic status: I’m not really an expert in NLP. I’ve only been working on language modeling for ~8 months, which is much less than some of the folks here, and this is based on my experience.
Beam Search:
Beam search with large unsupervised generative pretrained transformers (GPTs) is weirder in practice than it appears in the NLP literature. Other commenters have mentioned degeneracies, but for me the sticking points with beam search were:
It tends to collapse quickly onto a modal response, so it’s already bad for any situation where you want to generate a diversity of samples and choose the best one.
It’s hard to score candidates of varying length against each other correctly. Every paper that uses beam search has some heuristic hack here (usually a length-normalization penalty), which is almost always a parametrized function pulled from another paper or hacked together.
It seems to do best (once tuned) on some narrow, specific distribution (e.g. generating short responses in a chat setting). It’s hard to tune beam search to work well across the full distribution these models are trained on (i.e. “text on the internet”).
Given these three issues, in my experience it’s been better to just focus on tuning naive sampling, with a few key parameters: temperature, top_p, etc. (these are part of the OpenAI API).
Caveat: it’s possible I’m just bad at tuning beam search. It’s possible I’m bad at scholarship and missed the “one key paper” that would make it all clear to me. I would take the above as more of an anecdote than a scientific result.
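For concreteness, here is a minimal sketch of what I mean by naive sampling with temperature and top_p (nucleus) truncation. The numpy implementation and the default parameter values are mine, not anything from the API:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Naive sampling from next-token logits: temperature scaling plus nucleus (top_p) truncation."""
    rng = rng or np.random.default_rng()
    # Temperature: scale logits before softmax (lower temperature -> sharper distribution).
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # top_p: keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(-probs)
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```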
Separation of training and sampling:
This has been mentioned by other commenters, but it might bear repeating that there is no sampling at all in the training process for GPTs. They’re trained to approximate the conditional next-token distribution, and the default is to weight the loss on every token’s prediction equally. In practice the loss on later tokens (which have more context to condition on) ends up lower.
All of this is to say that training is a separate process from sampling. I think there is probably very good research to be done on better sampling; in particular, I think it is possible to have a machine which aligns sampling from an unaligned model.
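As a concrete illustration of the point above, here is a minimal PyTorch-style sketch of a GPT training step (the `model` interface is hypothetical): teacher-forced cross-entropy averaged equally over positions, with no sampling anywhere in the loop.

```python
import torch.nn.functional as F

def gpt_training_step(model, tokens, optimizer):
    """One GPT training step: predict token t+1 from tokens <= t. No sampling involved."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # teacher forcing on real data
    logits = model(inputs)                            # (batch, seq, vocab); hypothetical interface
    # Cross-entropy at every position, weighted equally (the default).
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```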
Lookahead & pondering:
I think the point about lookahead is still worth considering. One of the differences between transformers and the previously most popular architecture for language models (LSTMs) is that transformers use the same amount of compute for every token. (It’s possible to build them otherwise, but I haven’t yet seen a variant that impressed me.)
I think my favorite example of this in the literature is [Adaptive Computation Time (ACT)](https://arxiv.org/abs/1603.08983), where essentially the model learns how to “spend” extra compute on certain characters.
(One of the things going on with ACT is dealing with the non-uniformity of the distribution of information content in character strings — for GPTs this is at least partially ameliorated by the byte-pair encoding)
So I think it is reasonable to train a model to be able to use extra “pondering” time when sampling, either by having an external controller that tells the model when to ponder and when to output, or by having the model learn for itself how to ponder (which is the “halting neuron” signal in ACT).
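Here is a rough sketch of the halting-neuron idea, heavily simplified relative to the actual ACT formulation; `step_fn` and `halt_fn` stand in for learned components and are hypothetical:

```python
import numpy as np

def ponder(step_fn, halt_fn, state, max_steps=10, threshold=0.99):
    """Keep spending internal compute on `state` until the accumulated halting
    probability crosses `threshold` (or we hit max_steps), then return a
    halting-weighted combination of the intermediate states, roughly as in ACT."""
    outputs, weights, halted_mass = [], [], 0.0
    for _ in range(max_steps):
        state = step_fn(state)            # one more unit of "pondering"
        p_halt = halt_fn(state)           # the "halting neuron" output in [0, 1]
        weight = min(p_halt, 1.0 - halted_mass)
        outputs.append(state)
        weights.append(weight)
        halted_mass += weight
        if halted_mass >= threshold:
            break
    weights = np.array(weights) / max(sum(weights), 1e-8)
    return sum(w * o for w, o in zip(weights, outputs))
```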
I do think that any sort of pondering is subject to mesa-optimization concerns.
Fix 1 - BERT:
Caveat: I haven’t trained BERT models or taken a trained one and tried hard to get high quality samples from it. This is based on intuitions and hearsay.
Here I’ll use “GPT” to refer to autoregressive next token prediction objectives, to mirror the style of the article. This objective can of course be used with other architectures in other settings.
Instead of thinking of “mask-part-out prediction” (BERT) and “mask out future text” (GPT) as two separate tasks, think of them as points in the space of distributions over masks.
In particular, it’s trivial to come up with mask distributions that include both a preponderance of masks which leave small parts out (BERT-like) and masks which leave future tokens out (GPT-like), as well as possibly other mask patterns.
My intuition is that the higher the probability of masking out all future tokens, the easier it is to get high quality samples from the resulting model.
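A toy sketch of a single distribution over masks that covers both regimes; the mixture weight and masking rate are arbitrary placeholders, not tuned values:

```python
import numpy as np

def sample_mask(seq_len, p_gpt_like=0.7, bert_mask_rate=0.15, rng=None):
    """Sample a boolean mask over a sequence; True means the position is hidden
    and must be predicted. With probability p_gpt_like we hide the entire future
    past a random cut point (GPT-like); otherwise we hide scattered positions (BERT-like)."""
    rng = rng or np.random.default_rng()
    if rng.random() < p_gpt_like:
        cut = rng.integers(1, seq_len)
        return np.arange(seq_len) >= cut
    return rng.random(seq_len) < bert_mask_rate
```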
Fix 1 - Editing Text:
(Same caveat as above regarding inexperience w/ BERT models)
BERT objectives by themselves do not allow efficient text editing, and neither do GPT objectives.
Thinking about the task of composing an edit, the model needs to:
Identify the section that will be removed (if any)
Figure out the length of the replacement text (if any)
Compose the replacement text (if any)
Possibly also have some way of attending over the old text, while still knowing to replace it
Neither the BERT nor the GPT objective does a great job of this by itself. If I had to choose, though, I think you can encode this sort of thing in the GPT dataset and have the model autoregressively generate edits.
(This is part of a conjecture I’ve been meaning to write up for LessWrong: “the dataset is the interface” for GPT models.)
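To make “encode it in the dataset” concrete, here is one hypothetical rendering of an edit as plain text that a GPT-style model could emit autoregressively; the tags and layout are entirely my invention, not an existing format:

```python
def render_edit_example(old_text, start, end, replacement):
    """Render one training example in a made-up 'edit' format: the model sees the
    old document plus an <edit> prompt and must produce the span to remove and
    the text to put in its place."""
    return (
        f"<doc>{old_text}</doc>\n"
        f"<edit>\n"
        f"replace: {old_text[start:end]}\n"   # the section that will be removed
        f"with: {replacement}\n"              # the composed replacement text
        f"</edit>"
    )

print(render_edit_example("The engine uses four small turbines.", 16, 35, "two large turbofans"))
```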
Fix 2 - Changing the training:
I think there’s some interesting stuff here, but so far this is in the regime of training algorithms that are unexplored, enormously complex, and poorly understood.
The clearest part here is that it uses sampling in the training loop, which so far I’ve seen almost exclusively in reinforcement learning (RL).
But, we can probably implement something like this with RL. In particular, training is a process of selecting a context (masking), sampling from the model to fill in the mask, and scoring based on the objective.
In this case, drawing some analogies to RL:
Action—token
Action distribution—token distribution (the basic output of a GPT model given an input context)
Policy—language model (in particular a GPT model, though with hacks BERT/other models could be used)
Reward—objective (log-loss on the true document, for a GPT model)
Environment—a document, probably with some starting context already provided
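A hypothetical sketch of the loop this analogy suggests; `model.sample` is an assumed interface, and the reward is a crude token-overlap stand-in for the log-loss objective:

```python
def rl_style_training_step(model, document, context_len, sample_len):
    """One step of the RL-flavored setup: the document is the environment, the
    language model is the policy, each emitted token is an action, and the score
    against the true continuation plays the role of reward."""
    context = document[:context_len]
    true_continuation = document[context_len:context_len + sample_len]

    sampled = model.sample(context, n_tokens=sample_len)   # actions from the policy
    reward = sum(a == b for a, b in zip(sampled, true_continuation)) / sample_len

    # An actual RL update (e.g. policy gradient) would push up the probability of
    # the sampled tokens in proportion to the reward; baselines etc. omitted here.
    return sampled, reward
```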
It’s pretty easy to see that this wouldn’t work well for generating from scratch. If I provide zero contextual tokens to the model, sample N tokens, and then score on how close the sample got to a true (hidden) document, I am going to have a very bad time.
This might be a good approach for fine-tuning a GPT model, which is [exactly what some colleagues did](https://openai.com/blog/fine-tuning-gpt-2/).
Even in the fine-tuning case, we have all of the myriad and sundry problems of RL (instability, inefficiency, etc.) that our plain-and-simple language modeling objective lacks.
Fix 2 - update away:
Just going from experience, I think this probably won’t work. I’ve found it very hard to get a model to “reduce your probability on the most likely outcome and increase your probability on the next most likely outcome”; instead, objectives like this tend to just increase the temperature of everything (or worse, put all of the increase in entropy into the long tail of bad answers).
It’s possible there is a good way to do this, but for now I don’t know of a good way to get a model to increase the probability of “secondary options” without just degenerating into increasing entropy.
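To illustrate why, here is the kind of toy “update away” term I have in mind (my own formulation, not a standard loss): it only says to shrink the probability of the argmax token, and says nothing about where that probability should go, so gradient descent is free to dump it into the long tail.

```python
import numpy as np

def update_away_loss(logits):
    """Toy 'update away' objective: penalize the probability the model puts on its
    own most likely token. Minimizing this lowers p(argmax), but nothing here
    prefers the second-most-likely token over the ten-thousandth, which is the
    entropy-spreading failure mode described above."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(np.log(probs.max()))
```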
Fix 2 - track updates:
If I understand this correctly, I think this is easily approximated by having an objective/loss/reward term which penalizes differences from the original model. For small deltas I think this is a good approach, though unfortunately it is only as good as the original model you’re comparing to.
As for the specific proposal for managing updates towards/away from the beam search results, that also seems possible via a similar mechanism: penalize distributional difference from those samples.
I think we haven’t really explored these sorts of penalties enough, in particular how they interact when combined with other objectives.
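A sketch of the penalty term I mean, as a KL divergence from the original model’s next-token distribution; the KL direction and the weight you multiply it by are design choices, not settled facts:

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_to_original(new_logits, original_logits):
    """Penalty that punishes drift from the original model: KL(p_new || p_original)
    over the next-token distribution. Add beta * kl_to_original(...) to the main
    objective to keep updates small."""
    p_new, p_old = softmax(new_logits), softmax(original_logits)
    return float(np.sum(p_new * (np.log(p_new + 1e-12) - np.log(p_old + 1e-12))))
```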
Fix 2 - will it stunt:
I think that any objective that scores better predictions higher will incentivize some sort of lookahead/pondering.
If you prevent it from being coincident with the beam search distribution, then I expect the model will learn how to do lookahead/pondering in the null space of beam search.
Will these solve mesa-optimization:
This isn’t clear to me, but I think it’s worth studying.
In particular, it would be good to figure out some way of contriving a mesa-optimization setup, such that we could measure if these fixes would prevent it or not.
Beam Search in the API:
I think my above comments about Beam Search apply here.
Beam search, like any optimization algorithm, is hugely dependent on its scoring function. If you score on likelihood, you’ll end up with high-likelihood (“unsurprising”) text.
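To make the dependence on the scoring function explicit, here is a bare-bones beam search sketch where the score is summed log-likelihood with a heuristic length normalization; `next_token_logprobs` is an assumed model interface, and the defaults are arbitrary:

```python
import heapq

def beam_search(next_token_logprobs, start_tokens, beam_width=4, max_len=20, length_alpha=0.6):
    """Minimal beam search. `next_token_logprobs(seq)` should return (token, logprob)
    pairs for the next position. The score divides summed log-likelihood by
    len(seq)**length_alpha, the usual length-normalization hack."""
    def score(total_logprob, seq):
        return total_logprob / (len(seq) ** length_alpha)

    beams = [(0.0, list(start_tokens))]
    for _ in range(max_len):
        candidates = [
            (lp + tok_lp, seq + [tok])
            for lp, seq in beams
            for tok, tok_lp in next_token_logprobs(seq)
        ]
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: score(*c))
    # The winning beam is, by construction, high-likelihood ("unsurprising") text.
    return max(beams, key=lambda c: score(*c))[1]
```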
Future thoughts—sampling research:
I think in general we’re in a weirdly asymmetric world, where we have poured a huge amount of compute and effort into computing autoregressive next-token distributions, and comparatively very little sophistication into sampling from them.
This comment is probably too long already for me to expand much on this, but in particular: I think the log-likelihood objective is unaligned by default (as most datasets are unaligned by default), but I think we can find ways of sampling from log-likelihood-optimized models that are aligned.
With regard to the editing text discussion, I was thinking of a really simple approach where we resample words in the text at random. Perhaps that wouldn’t work great, but I do think editing has potential because it allows for more sophisticated thinking.
Let’s say we want our language model to design us an aircraft. Perhaps it starts by describing the engine, and then it describes the wings. Standard autoregressive text generation (assuming no lookahead) will allow the engine design to influence the wing design (assuming the engine design is inside the context window when it’s writing about the wings), but it won’t allow the wing design to influence the engine design. However, if the model is allowed to edit its text, it can rethink the engine in light of the wings and rethink the wings in light of the engine until it’s designed a really good aircraft.
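A minimal version of the random-resampling idea mentioned above, Gibbs-sampling flavored; `propose_word` is a hypothetical model interface that refills one blanked-out position given the rest of the text:

```python
import random

def random_resample_edit(words, propose_word, n_passes=100, rng=None):
    """Iteratively 'edit' a text: pick a random position, blank it out, and ask the
    model to refill it conditioned on everything else. Unlike left-to-right
    generation, later parts of the text can now influence earlier parts."""
    rng = rng or random.Random()
    words = list(words)
    for _ in range(n_passes):
        i = rng.randrange(len(words))
        context = words[:i] + ["<blank>"] + words[i + 1:]
        words[i] = propose_word(context, position=i)   # hypothetical model call
    return words
```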
In particular, it would be good to figure out some way of contriving a mesa-optimization setup, such that we could measure if these fixes would prevent it or not.
Agreed. Perhaps if we generated lots of travelling salesman problem instances where the greedy approach doesn’t get you something that looks like the optimal route, then try & train a GPT architecture to predict the cities in the optimal route in order?
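A sketch of how such instances might be generated and filtered; the brute-force optimum keeps this honest only for small instances, and the gap threshold is an arbitrary choice:

```python
import itertools
import math
import random

def tour_length(cities, order):
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def greedy_tour(cities):
    """Nearest-neighbor tour starting from city 0."""
    unvisited, tour = set(range(1, len(cities))), [0]
    while unvisited:
        nxt = min(unvisited, key=lambda c: math.dist(cities[tour[-1]], cities[c]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def optimal_tour(cities):
    """Brute-force optimum; only feasible for small n."""
    best = min(itertools.permutations(range(1, len(cities))),
               key=lambda p: tour_length(cities, (0,) + p))
    return [0, *best]

def hard_instance(n=8, gap=1.2, rng=random):
    """Keep sampling instances until the greedy tour is at least `gap` times longer
    than the optimal one; (cities, optimal order) is the training pair."""
    while True:
        cities = [(rng.random(), rng.random()) for _ in range(n)]
        opt = optimal_tour(cities)
        if tour_length(cities, greedy_tour(cities)) > gap * tour_length(cities, opt):
            return cities, opt
```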
This is an interesting quote:
...in our experience we find that lean stochastic local search techniques such as simulated annealing are often the most competitive for hard problems with little structure to exploit.
I suspect GPT will be biased towards avoiding mesa-optimization and making use of heuristics, so the best contrived mesa-optimization setup may be an optimization problem with little structure where heuristics aren’t very helpful. Maybe we could focus on problems where non-heuristic methods such as branch and bound / backtracking are considered state of the art, and train the architecture to mesa-optimize by starting with easy instances and gradually moving to harder and harder ones.
Clarifying Q: does mesa-optimization refer to any inner optimizer, or specifically to one that is not aligned with the outer objective?
I was using it to refer to “any inner optimizer”. I think that’s the standard usage but I’m not completely sure.