Adding filler tokens seems like it should always be neutral or harm a model’s performance: a fixed prefix designed to be meaningless across all tasks cannot provide any information about each task to locate the task (so no meta-learning) and cannot store any information about the in-progress task (so no amortized computation combining results from multiple forward passes). Oam Patel’s result is unclear to me (what is the ‘IC token’? in his writeup, it sometimes sounds like it’s an arbitrarily varied token freely chosen by the model, so it’s just an ordinary inner-monologue result albeit perhaps in some weird ‘neuralese’ convention, but then other times it sounds like it’s a single fixed token, which makes performance benefits surprising), but most of the results sound reasonable to me.
The GPT-4 exception is what’s interesting. GSM8K is a decent enough benchmark and the increase with token count smooth enough that it doesn’t look like some sort of mere stochastic effect. GSM8K is also a bit unusual in that GPT-4 was trained on it. Two possibilities come to mind:
weird filler tokens take GPT-4 out of its RLHF distribution. Such things are probably extremely rare in the RLHF dataset, so GPT-4 must fall back on its regular pretraining approaches. The RLHF has many pernicious effects I have commented on, so it’s not crazy to think that GSM8K might be harmed by RLHF. I forget if the GPT-4 paper includes benchmarks for the RLHF’d GPT-4 vs base GPT-4. This benefit might not show up in the GPT-3 models because they are too stupid for such differences to make a noticeable difference (floor effect), and it would not show up in Claudes because they aren’t RLHFed at all.
similarly, but it takes the OA API and/or GPT-4 MoE out of shortcuts. It’s been widely speculated that OA has been leaning heavily on adaptive computation (ie. using small cheap dumb models to try to answer most questions before falling back to full-scale models) in ChatGPT, which would most hurt difficult questions when the adaptiveness makes a false negative on using the full-scale model. Putting in weird stuff like filler tokens is the sort of thing which might reliably make the adaptive computation decide to fall back to the full-scale model.
Now, you don’t say what you used specifically but since you mention logits, you must be using the OA API, so you presumably are avoiding small models optimizing for consumer responses. But GPT-4 is itself ‘adaptive’ in that it’s essentially confirmed that it’s a franken-MoE architecture. So what you might be doing by adding weird filler tokens is skipping small-but-dumb sub-experts and routing to the most powerful experts. (The MoE adaptation might even be happening at multiple levels: there’s a longstanding question of why temperature=0 in the OA API doesn’t return deterministic results like it’s supposed to, and there’s some interesting recent speculation that this may reflect internal batch-level MoE-related non-determinism in GPT-4.)
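(To make the routing hypothesis concrete, here is a minimal sketch of token-level top-k expert routing in the style of the public MoE literature (Shazeer-style sparse gating / Switch Transformer). GPT-4’s actual architecture is not public, so every name and size below is made up; the only point is that the gate is a learned function of the hidden state, so a weird prefix that shifts the hidden states can change which experts fire.)

```python
import torch
import torch.nn.functional as F

# Toy token-level top-k MoE router. All sizes are arbitrary; this is a
# generic sketch of sparse gating, not GPT-4's (non-public) architecture.
n_experts, d_model, top_k = 8, 64, 2
router = torch.nn.Linear(d_model, n_experts)               # learned gate
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]

def moe_layer(h):                                          # h: [seq, d_model]
    weights, idx = router(h).topk(top_k, dim=-1)           # pick top-k experts per token
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(h)
    for pos in range(h.shape[0]):
        for w, e in zip(weights[pos], idx[pos]):
            out[pos] += w * experts[e](h[pos])             # weighted mix of chosen experts
    return out

h_plain = torch.randn(10, d_model)                         # stand-in "normal" hidden states
h_weird = h_plain + 3.0                                    # crude stand-in for an OOD shift
print(router(h_plain).topk(top_k, -1).indices[0])          # experts chosen normally
print(router(h_weird).topk(top_k, -1).indices[0])          # routing may differ after the shift
```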
These explanations are not exclusive. The GPT-4 paper reports that the RLHF trashes calibration, so it could be that the RLHF GPT-4 does worse on GSM8K than it ought to, because it lacks the internal calibration to route those hard questions to the right powerful expert; and so the benefit of weird filler tokens is in making the input look so weird that even an uncalibrated model goes ‘???’ and looks harder at it.
Adding filler tokens seems like it should always be neutral or harm a model’s performance: a fixed prefix designed to be meaningless across all tasks cannot provide any information about each task to locate the task (so no meta-learning) and cannot store any information about the in-progress task (so no amortized computation combining results from multiple forward passes).
I thought the idea was that in a single forward pass, the model has more tokens to think in. That is, the task description on its own is, say, 100 tokens long. With the filler tokens, it’s now, say, 200 tokens long. In principle, because of the uselessness/unnecessariness of the filler tokens, the model can just put task-relevant computation into the residual stream for those positions.
Yes, I probably should’ve mentioned that in an optimized causal Transformer, you might expect this kind of behavior: the tokens are in fact useless for the reasons outlined, but they are undoing your optimization where you skipped running the parts of the Transformer which correspond to the empty context window inputs. That is, the tokens do nothing, but you literally are not running the same model—you are running a subset of the ‘full’ model and then more of it accidentally as you increase tokens. And that is where the extra computation comes from; it wasn’t really ‘extra’ to begin with. If you were running the causal Transformer in the naive way, equivalent to always running with a full context window padded out, with no packing or shortcut tricks, then there would be no benefit to padding, because now you’re always running the ‘full’ model regardless of padded token count.
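(A toy illustration of this point, as a bare numpy attention layer with no claim about any particular production implementation: the filler embeddings carry no task information, but each filler position still gets its own query/key/value computation, and later positions mix in those values, so the padded run is literally a larger computation than the unpadded one.)

```python
import numpy as np

# One causal self-attention layer over a sequence with and without a filler
# prefix. The repeated filler embedding is information-free, but it still
# occupies positions that get their own QKV computation and that later
# positions can attend to (read from).
rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) / d**0.5 for _ in range(3))

def causal_attention(x):                                     # x: [seq, d]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / d**0.5
    scores[np.triu(np.ones(scores.shape, bool), 1)] = -np.inf  # causal mask
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v

task = rng.standard_normal((5, d))                           # "task" token embeddings
filler = np.tile(rng.standard_normal(d), (8, 1))             # one embedding repeated 8x

short = causal_attention(task)
padded = causal_attention(np.vstack([filler, task]))

# The final position's output changes: it now mixes in values computed at
# the 8 filler positions (extra forward-pass computation, not extra info).
print(np.allclose(short[-1], padded[-1]))                    # False
```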
However, I expect that if it was this obvious, someone would’ve already demonstrated it (surely filling out the context window with padding would be the very first thing anyone would try? or just looking at the activation patterns to see if the better-performing results are causally influenced by the null parts of the model’s forward pass), and this also can’t really explain OP’s observation of why the apparent benefit appears in GSM8K in the MoE model (GPT-4) but none of the others (GPT-3s, Claude). It should appear everywhere that one has variable context windows and should apply to pretty much all models or none, if that’s the explanation. I would also expect the benefits to show up more broadly across benchmarks, rather than affect a very few dramatically.
So, since it doesn’t explain any of the observations (while my two suggestions at least could), it seems unlikely.
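(The activation-pattern test gestured at above is easy to sketch on an open-weights model. A hypothetical setup using GPT-2 via HuggingFace transformers, where the filler prompt, layer range, and filler-token count are all placeholders: zero the residual stream at the filler positions partway up the stack, and check whether any padding benefit survives.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical ablation: if filler tokens help only via computation "parked"
# at their positions, zeroing those residual streams should erase the benefit.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = ". . . . . . . . Q: What is 17 + 25? A:"    # filler prefix + toy task
ids = tok(prompt, return_tensors="pt").input_ids
n_filler = 8                                         # placeholder: however many positions the prefix spans

def zero_filler(module, args):
    h = args[0].clone()
    h[:, :n_filler, :] = 0.0                         # kill the filler residual streams
    return (h,) + args[1:]

# Pre-hooks on the upper half of the transformer blocks.
hooks = [blk.register_forward_pre_hook(zero_filler) for blk in model.transformer.h[6:]]
with torch.no_grad():
    ablated_logits = model(ids).logits[0, -1]        # compare against the un-hooked run
for hook in hooks:
    hook.remove()
```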
It should appear everywhere that one has variable context windows and should apply to pretty much all models or none, if that’s the explanation. I would also expect the benefits to show up more broadly across benchmarks, rather than affect a very few dramatically.
It seems plausible to me that you’d see some sort of reasonably sharp improvement curve with scale like CoT. So, I think it’s conceivable that GPT-4 is just big enough to learn during pretraining an algorithm for benefiting a bit from filler while other models aren’t.
Filler tokens are probably an insanely inefficient way of using more compute for the LLM and algorithmic improvement should dominate. But, there’s no reason to expect strong efficiency here right now.
You can always ‘save the appearances’ by invoking an emergence, yes. I don’t like it, though, because Claude isn’t that far from GPT-4 in capabilities, so that hypothesis is threading a narrow window indeed. A more natural split would have been GPT-3 vs Claude+GPT-4. (We know lots of stuff happens in between GPT-3 and Claude+GPT-4 levels, so speculations about something happening in there, or a floor effect, are much less finetuned.) I’m also a bit suspicious of the emergence idea here at all: I mean, Transformers are usually trained with full context windows, simply filling them up and separating documents with EOT, aren’t they? So why would GPT-4 learn anything special about simple generic sorts of computation inside its forward pass that GPT-3s or Claudes couldn’t?
Yep, agreed. But it’s worth noting that other hypotheses for why this happens still have to explain why no other model has this behavior (including GPT-3.5). So I’m not sure we take that much additional surprise from the capability having sudden onset.
To elaborate, my other two hypotheses are less finetuned and can handle explaining why GPT-3.5 doesn’t, because it’s denser or smaller than GPT-4: in the second hypothesis, GPT-3/GPT-3.5/Claude dense models vs GPT-4 MoE is a fundamental architecture difference, so that alone is eyebrow-raising—the only MoE out of 4 or 5 models is also the one acting weirdest using padding tokens? (And I would note that GPT-4 acts weird in other ways: aside from the temperature nondeterminism, I’ve noted that GPT-4 has bizarre blind spots where it is unable to do simple tasks that it ought to be able to do. So we don’t need to additionally hypothesize that the internals of GPT-4 are doing something weird—we already know for a fact that something is weird in GPT-4.) The difference between GPT-3/GPT-3.5 and GPT-4 is larger than between Claude and GPT-4, so if you can believe a major GSM8K difference emerging between Claude/GPT-4, then you must find it far easier to believe the first hypothesis: a difference due to the scale from GPT-3/GPT-3.5 to GPT-4.
(Aside: Why do you think GPT-3.5-turbo (most recent release) isn’t MoE? I’d guess that if GPT-4 is MoE, GPT-3.5 is also.)
Because GPT-3.5 is a fine-tuned version of GPT-3, which is known to be a vanilla dense transformer.
GPT-4 is probably, in a very funny turn of events, a few dozen fine-tuned GPT-3.5 clones glued together (as a MoE).
First, clarification:
In Oam’s experiments, the vocabulary is a token for each number from 1 to 1000, a pad token, and an intermediate computation (IC) token. But I wouldn’t index on his results too much because I’m unsure how replicable they are.
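(Concretely, the toy vocabulary as described would be something like the following; the special-token spellings are my guesses.)

```python
# Toy vocabulary as described: one token per number 1..1000 plus two
# special tokens (names here are placeholders).
vocab = [str(n) for n in range(1, 1001)] + ["<pad>", "<IC>"]
token_to_id = {t: i for i, t in enumerate(vocab)}
assert len(vocab) == 1002
```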
I am indeed using the OA API.
And now some takes. I find both of your hypotheses intriguing. I’d never considered either of those before so thanks for bringing them up. I’m guessing they’re both wrong for the following reasons:
RLHF: Agreed that filler tokens take the model into a weird distribution. It’s not obvious though why that is more like the pretraining distribution than the RL distribution (except that pretraining has broader coverage). Also, GPT-3.5 was trained with RLHF and Claude with RLAIF (which is basically the same), and they don’t show the effect. One point maybe supporting your claim is that the “non-weird” filler tokens like “happy to help...” didn’t show a strong effect, but I abandoned that direction after one experiment and a variant of the “natural fillers” may well work.
Route to smarter experts: The link you shared is very cool and I hadn’t seen that before—thanks! My main pushback here is I’d be pretty surprised if gradient descent so inefficiently routed to the wrong experts on normal math problems that I would see a 10% improvement with a distribution shift.
But I wouldn’t index on his results too much because I’m unsure how replicable they are.
OK. If Oam really was using a single IC token and getting benefits, then I think that would point to the forward-pass/additional-model-parameters-activated explanation: obviously, his model is unaffected by RLHF, MoEs, model scaling, or anything of the alternate hypotheses. He should probably look into that if he can replicate padding benefits. It would not necessarily resolve the question for larger models, but at least it’d establish such an effect exists in some model implementations.
Claude with RLAIF (which is basically the same)
I don’t think RLAIF is ‘basically the same’. I find Claude to be qualitatively highly different from RLHF’d models—as I’ve commented before, the RLHF seems to cause some rather perverse consequences that RLAIF avoids. For example, you cannot ask ChatGPT to “Write a non-rhyming poem.”—I’ve tried this prompt dozens of times, and it always writes a rhyming poem, even if you then explicitly tell it that it incorrectly wrote a rhyming poem and to try again, while Claude nails it without trouble.
RLHF seems to distort the entire distribution of outputs to be on-policy and suck all outputs into the vortex of RLHF’d-pablum, while RLAIF seems to be far more selective and mostly affect directly-morality-relevant outputs, leaving neutral stuff like ‘write a poem’ undistorted.
My main pushback here is I’d be pretty surprised if gradient descent so inefficiently routed to the wrong experts on normal math problems that I would see a 10% improvement with a distribution shift.
/shrug. You are looking at rather different problems here, aren’t you? I mean, GSM8K problems are complex word problems, and not like the other problems you are benchmarking like ‘Addition’ or ‘GCD’.
ChatGPT-4 seems to have improved at diverse literary styles. It sometimes ignores the “non-rhyming” instructions, but I was able to get it to avoid rhyme on my second try by first asking it, “Can you write poems that don’t rhyme?”.
https://chat.openai.com/share/698343c1-764e-4a65-9eb8-f2ec4e40da1b
Minor point, but I asked the code interpreter to produce a non-rhyming poem, and it managed to do so on the second time of asking. I restricted it to three verses because it started off well on my initial attempt, but veered into rhyming territory in later verses.
I asked the code interpreter to produce a non-rhyming poem
FWIW, all of the examples by me or others were either the Playground or chat interface. I haven’t subscribed so I don’t have access to the code interpreter.
but veered into rhyming territory in later verses.
Yep, sucked into the memorized-rhymes vortex. I’m glad to hear it now works sometimes, well, at least partially, if you don’t give it too long to go back on-policy & ignore its prompt. (Maybe all of the times I flagged the rhyming completions actually helped out a bit.)
As a quick test of ‘forcing it out of distribution’ (per Herb’s comment), I tried writing in Playground with gpt-3.5-turbo “Write a non-rhyming poem.” with a prefix consisting of about 30 lines of “a a a a a a a a a” repeated, and without.
Without the prefix, I only get 1⁄6 non-rhyming poems (ie. 5⁄6 clearly rhymed); with the prefix, I get 4⁄5 non-rhyming poems (1 did rhyme anyway).
Might be something interesting there?
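(For reproducibility, a rough version of this test as a script; this assumes the 2023-era openai 0.x SDK, so adjust for newer clients, and judging whether a completion rhymes stays manual, as in the original test. The Fisher exact test is just a sanity check on whether counts this small mean anything.)

```python
import openai
from scipy.stats import fisher_exact

filler_line = ("a " * 9).strip() + "\n"      # "a a a a a a a a a"

def sample(prompt, n):
    return [
        openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )["choices"][0]["message"]["content"]
        for _ in range(n)
    ]

plain = sample("Write a non-rhyming poem.", 6)
prefixed = sample(filler_line * 30 + "Write a non-rhyming poem.", 5)

# On the counts reported above (1/6 vs 4/5 non-rhyming), a Fisher exact test
# on the 2x2 table of (non-rhyming, rhyming) x (no prefix, prefix) gives:
print(fisher_exact([[1, 5], [4, 1]]))        # p ~= 0.08: suggestive, not conclusive
```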
I forget if the GPT-4 paper includes benchmarks for the RLHF’d GPT-4 vs base GPT-4.
It does. See page 28. On some tasks performance goes up, on others it goes down; averaged across tasks, no significant difference. Does not include GSM8K. Relevantly, AP Calc BC exam has a 9 percentage point drop from pre-trained to RLHF.
Relevantly, AP Calc BC exam has a 9 percentage point drop from pre-trained to RLHF.
Interesting. Of course, another way of phrasing that is that it gains 9 percentage points going from the RLHF model distribution to the base model distribution. So that makes my RLHF scenario more plausible: here is a scenario where forcing the RLHF model out of distribution might well lead to a large performance increase. And it’s on a task that looks a lot like GSM8K...
(You might be skeptical that forcing it out of RLHF distribution could ‘restore’ base model capabilities—couldn’t the RLHF have destroyed capabilities by forgetting? I agree that this is not proven, but I think the RLHF model should have all the capabilities of the base model somewhere: the OA training appears to include some amount of standard training to maintain knowledge, one can use regularization to force the RLHF distribution to be close to the original, and large models appear to have enormous unused capacity & sample-efficiency such that they don’t need to forget anything and just learn the RLHF task on top of all the old tasks.)
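(The regularization in question is standardly a per-token KL penalty toward the frozen reference policy, as in the InstructGPT-style objective; writing it out makes the mechanism explicit:

$$\max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[\, r(x, y) \;-\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \,\right]$$

The $\beta$-weighted term directly penalizes the tuned policy $\pi_\theta$ for drifting from the base distribution $\pi_{\mathrm{ref}}$, which is exactly why one would expect base-model capabilities to survive underneath the RLHF behavior.)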