Kshitij Sachan comments on LLMs are (mostly) not helped by filler tokens

Kshitij Sachan 15 Aug 2023 20:16 UTC
1 point
0
It’s possible that the model treats filler tokens differently in the “user” vs “assistant” part of the prompt, so the two setups aren’t necessarily identical. That being said, I chose to generate tokens rather than append them to the prompt because it’s more superficially similar to chain of thought.
Also, adding a padding prefix before the original question wouldn’t act as filler, because the model can’t attend to future tokens: the prefix positions never see the question.
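To make the padding-prefix point concrete, here is a minimal sketch, assuming a standard decoder-only causal mask. The token labels (F1, Q1, ...) and the helpers `causal_mask` and `visible` are hypothetical names for illustration, not anything from the experiments.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible(tokens, i):
    """Return the tokens that position i can attend to under the causal mask."""
    mask = causal_mask(len(tokens))
    return [tok for j, tok in enumerate(tokens) if mask[i][j]]


prefix_layout = ["F1", "F2", "Q1", "Q2"]  # <filler><question>
suffix_layout = ["Q1", "Q2", "F1", "F2"]  # <question><filler>

# What the last filler position can see in each layout.
prefix_view = visible(prefix_layout, prefix_layout.index("F2"))
suffix_view = visible(suffix_layout, suffix_layout.index("F2"))

print("filler before question sees:", prefix_view)
print("filler after question sees: ", suffix_view)
```

In the prefix layout the filler positions see only other filler, so any extra computation there cannot depend on the question.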
- dr_s 16 Aug 2023 5:48 UTC
  2 points
  0
  It could be prepended then, but also, does it make a difference? It won’t attend to the filler while going over the question, but it will attend to the question while going over the filler. Also, how could it treat tokens differently? Wouldn’t it need to be specially trained and have some additional input to do that? Or are you just thinking of the wrapper software doing something?
  - Kshitij Sachan 16 Aug 2023 16:52 UTC
    1 point
    0
    > It could be prepended then, but also, does it make a difference? It won’t attend to the filler while going over the question, but it will attend to the question while going over the filler.
    I think you’re saying there should be no difference between “<filler><question>” and “<question><filler>”. Your reasoning is: In the first layout the model attends to filler tokens while going over the question, and in the second the model attends to the question while going over the filler.
    But the first layout doesn’t actually get us anything: computation at the filler token positions can’t use information from future token positions (i.e. the question). Thanks for asking this, though. I hadn’t explicitly thought through putting the filler before the question rather than after.
    > Also, how could it treat tokens differently? Wouldn’t it need to be specially trained and have some additional input to do that? Or are you just thinking of the wrapper software doing something?
    I’m not imagining any wrapper software, etc. I think this behavior could be an artifact of pretraining. Language models are trained to precompute features that are useful for predicting all future token positions, not just the immediate next token. This is because gradients flow from the loss on the token being predicted back to all previous token positions. (e.g. see “How LLMs are and are not myopic”)
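The direction of dependency described here can be checked numerically on a toy model. This is only a sketch under strong assumptions: scalar "tokens", a single causal attention head with no learned weights (query = key = value = the input), and finite differences standing in for backpropagation. The names `causal_attention` and `sensitivity` are hypothetical.

```python
import math


def causal_attention(xs):
    """Toy single-head causal self-attention over scalars (query = key = value = x)."""
    out = []
    for i in range(len(xs)):
        scores = [xs[i] * xs[j] for j in range(i + 1)]  # only positions j <= i
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append(sum(w * xs[j] for j, w in enumerate(weights)) / total)
    return out


def sensitivity(xs, eps=1e-4):
    """sens[i][j] approximates d out[i] / d xs[j] by forward finite differences."""
    base = causal_attention(xs)
    n = len(xs)
    sens = [[0.0] * n for _ in range(n)]
    for j in range(n):
        bumped = list(xs)
        bumped[j] += eps
        new = causal_attention(bumped)
        for i in range(n):
            sens[i][j] = (new[i] - base[i]) / eps
    return sens


xs = [0.5, -1.0, 2.0, 0.3]
sens = sensitivity(xs)
for row in sens:
    print(["%.3f" % v for v in row])
```

Each row is one output position. Entries to the right of the diagonal are zero (no output depends on later inputs), while the last row is nonzero everywhere: a loss on the final position would send gradient to every earlier position, which is the pressure toward precomputing features useful for later tokens.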