Unless, of course, GPT-4 secretly does hold state somehow, and that’s why its results are different.
Both ChatGPT-4 and Bing Sydney (an early GPT-4 checkpoint initially, unknown currently) are known to have internal scratchpads, and possibly other kinds of secret state hidden from the user, where they can do computations without the user seeing any of it, only the final result (if even that). However, the OP is using the OA API and calling the underlying models directly, so we should be able to rule those out.
Ah, interesting, I didn’t know that was a thing. But yeah, if we can exclude that sort of thing, I can’t see how having to generate the filler tokens could have any effect. In practice, the only computation that matters for the final step is the one whose input is [initial prompt] + [filler tokens], with the added burden that the LLM has to figure out that those tokens were the response to a previous task and go from there. If merely having some pointless padding in the context helps, then you should be able to simply paste some Lorem Ipsum at the beginning of your prompt yourself and call it a day.
It’s possible that the model treats filler tokens differently in the “user” vs “assistant” part of the prompt, so they aren’t identical. That being said, I chose to generate tokens rather than appending to the prompt because it’s more superficially similar to chain of thought.
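To make that distinction concrete: under a chat-formatted model, filler in the user turn and filler generated in the assistant turn produce different raw token sequences. A toy sketch (the template below is purely hypothetical, not any provider's actual format):

```python
def chat_format(user, assistant_prefix=""):
    # Hypothetical chat template, for illustration only.
    return f"<|user|>{user}<|assistant|>{assistant_prefix}"

q = "What is 17 * 23?"
filler = " ..." * 10

# Padding pasted into the user message by the user:
ctx_user = chat_format(q + filler)

# The same filler generated by the assistant before answering:
ctx_asst = chat_format(q, assistant_prefix=filler)

# The raw strings the model actually conditions on differ, so the two
# setups are not guaranteed to behave identically, even though the
# "content" of the filler is the same.
assert ctx_user != ctx_asst
```

So even with a stateless API, "user pastes padding" and "model emits filler" are distinguishable to the model via the surrounding role formatting.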
Also, adding a padding prefix to the original question wouldn’t act as a filler token because the model can’t attend to future tokens.
It could be prepended then, but also, does it make a difference? It won’t attend to the filler while going over the question, but it will attend to the question while going over the filler. Also, how could it treat tokens differently? Wouldn’t it need to be specially trained and have some additional input to do that? Or are you just thinking of the wrapper software doing something?
It could be prepended then, but also, does it make a difference? It won’t attend to the filler while going over the question, but it will attend to the question while going over the filler.
I think you’re saying there should be no difference between “<filler><question>” and “<question><filler>”. Your reasoning is: In the first layout the model attends to filler tokens while going over the question, and in the second the model attends to the question while going over the filler.
But the first layout doesn’t actually get us anything: computation at the filler token positions can’t use information from future token positions (i.e. the question). Thanks for asking this though, I hadn’t actually explicitly thought through putting the filler before the question rather than after.
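The asymmetry can be sketched with a minimal single-head causal self-attention in NumPy (random weights and toy dimensions, not any particular model): the outputs at filler positions are computed before, and independently of, any question tokens that follow them.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def causal_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf  # each position sees only itself and the past
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

filler = rng.standard_normal((3, d))      # 3 filler-token embeddings
question_a = rng.standard_normal((4, d))  # two different "questions"
question_b = rng.standard_normal((4, d))

out_a = causal_attention(np.vstack([filler, question_a]))
out_b = causal_attention(np.vstack([filler, question_b]))

# In the "<filler><question>" layout, the filler positions' outputs are
# identical no matter what question follows:
assert np.allclose(out_a[:3], out_b[:3])
# ...while the question positions do attend back to the filler (and to
# their own differing inputs), so their outputs differ:
assert not np.allclose(out_a[3:], out_b[3:])
```

This is why only the "<question><filler>" layout gives the filler positions any chance to do question-relevant computation.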
Also, how could it treat tokens differently? Wouldn’t it need to be specially trained and have some additional input to do that? Or are you just thinking of the wrapper software doing something?
I’m not imagining any wrapper software, etc. I think this behavior could be an artifact of pretraining. Language models are trained to precompute features that are useful for predicting all future token positions, not just the immediate next token, because gradients flow from each token being predicted back to all previous token positions. (e.g. see How LLMs are and are not myopic)
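That gradient flow can be shown with a toy scalar example (made-up numbers, obviously not an actual language model): a feature computed at an early position feeds predictions at every later position, so the gradient on the early position's weights accumulates one term per future loss.

```python
x1, y2, y3 = 0.5, 1.0, -2.0   # one input, two future targets (made up)
w, a, b = 0.3, 1.5, -0.7      # toy parameters

h1 = w * x1        # "feature" computed at position 1
p2 = a * h1        # prediction at position 2 reuses h1
p3 = b * h1        # prediction at position 3 also reuses h1
loss = (p2 - y2) ** 2 + (p3 - y3) ** 2

# Backprop by hand: dL/dw is a sum of one term per future position.
grad_from_pos2 = 2 * (p2 - y2) * a * x1
grad_from_pos3 = 2 * (p3 - y3) * b * x1
grad_w = grad_from_pos2 + grad_from_pos3

# Sanity-check against a numerical derivative:
eps = 1e-6
def total_loss(w_):
    h = w_ * x1
    return (a * h - y2) ** 2 + (b * h - y3) ** 2

numeric = (total_loss(w + eps) - total_loss(w - eps)) / (2 * eps)
assert abs(grad_w - numeric) < 1e-6
assert abs(grad_from_pos3) > 0  # position 3's loss really does reach w
```

So pretraining pressure alone could teach the model to build features at early positions (including filler positions) that later positions find useful, with no special training or extra input required.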