The filler could be prepended then, but does that actually make a difference? The model won’t attend to the filler while going over the question, but it will attend to the question while going over the filler.
I think you’re saying there should be no difference between “<filler><question>” and “<question><filler>”. Your reasoning is: In the first layout the model attends to filler tokens while going over the question, and in the second the model attends to the question while going over the filler.
But the first layout doesn’t actually get us anything: computation at the filler token positions can’t use information from future token positions (i.e., the question). Thanks for asking, though; I hadn’t explicitly thought through putting the filler before the question rather than after.
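To make that concrete, here’s a minimal sketch of the causal-mask point (my illustration using GPT-2 via Hugging Face transformers, not anything from the actual filler-token experiments; the filler and question strings are arbitrary): with the layout “<filler><question>”, the activations at the filler positions come out identical no matter which question follows.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

# Tokenize the filler and each question separately, so the filler token ids
# are identical across both prompts (layout: <filler><question>).
filler_ids = tok(". " * 20, return_tensors="pt").input_ids
q1_ids = tok(" What is 17 * 23?", return_tensors="pt").input_ids
q2_ids = tok(" Name any mammal.", return_tensors="pt").input_ids

with torch.no_grad():
    h1 = model(torch.cat([filler_ids, q1_ids], dim=1)).last_hidden_state
    h2 = model(torch.cat([filler_ids, q2_ids], dim=1)).last_hidden_state

n = filler_ids.shape[1]
# Because of the causal mask, activations at the filler positions are the same
# no matter which question follows, so no question-specific work can happen there.
print(torch.allclose(h1[:, :n], h2[:, :n], atol=1e-5))  # expected: True
```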
Also, how could it treat tokens differently? Wouldn’t it need to be specially trained and have some additional input to do that? Or are you just thinking of the wrapper software doing something?
I’m not imagining any wrapper software, etc. I think this behavior could be an artifact of pretraining: language models are trained to precompute features that are useful for predicting all future token positions, not just the immediate next token, because gradients flow from each predicted token back to all previous token positions (see, e.g., “How LLMs are and are not myopic”).
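To illustrate the gradient-flow claim, here’s a rough sketch (again just GPT-2 via Hugging Face transformers, chosen for convenience; the prompt is arbitrary): the loss for predicting a single later token backpropagates to every earlier input position, which is the training pressure that pushes earlier positions to precompute broadly useful features.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of France is Paris", return_tensors="pt").input_ids
# Feed embeddings directly so we can take gradients w.r.t. each input position.
embeds = model.transformer.wte(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits

# Loss for predicting only the final token (" Paris") from the position before it.
loss = torch.nn.functional.cross_entropy(logits[0, -2:-1], ids[0, -1:])
loss.backward()

# Gradient norms per input position: nonzero at every earlier position (they all
# feed into this one prediction), and zero at the last position, which the
# causal mask hides from it.
print(embeds.grad[0].norm(dim=-1))
```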