FWIW I think this is basically right in pointing out that there are a bunch of errors in reasoning when people claim a large deep neural network “knows” something or “doesn’t know” something.
I think this exhibits another issue, though: by strongly changing the contextual prefix, you’ve confounded the comparison in a bunch of ways that are worth explicitly pointing out:
Longer contexts use more compute to generate the same-size answer, since they attend over more tokens of input (and it’s reasonable to think that in some cases more compute → a better answer; the first sketch below gives a rough sense of the scaling).
Few-shot examples are a very compact way of expressing a structure or steering the model toward a certain output, but that is very different from testing whether the model can implicitly understand the context (or its absence). I liked the original question because it seemed to be aimed at exactly this comparison, in a way that your few-shot example isn’t.
There is sometimes an invisible first token (an ‘end-of-text’ token) which indicates that the given context is the beginning of a new document. If this token is not present, then it’s very possible (in fact very probable) that the context sits somewhere in the middle of a document whose earlier part the model can’t see. This turns the task into something more like “what document am I in the middle of, where the text just before this is <context>?”. Explicitly prefixing end-of-text signals that there is no earlier document the model needs to guess at or account for (the second sketch below shows what I mean by this prefixing).
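To put a rough number on the compute point: here’s a back-of-the-envelope sketch, assuming a hypothetical transformer whose dimensions (d_model=4096, 32 layers) are made up purely for illustration. It only counts the attention FLOPs that scale with context length, not the per-token projection/MLP cost, which is constant.

```python
# Rough illustration: self-attention FLOPs per *newly generated* token grow
# linearly with the number of context tokens attended over.
# The model dimensions below are hypothetical, chosen only for illustration.

def attention_flops_per_token(context_len, d_model=4096, n_layers=32):
    # Per layer: the new token's query scores against `context_len` keys
    # (~2 * d_model FLOPs each) and then mixes `context_len` values
    # (~2 * d_model FLOPs each).
    per_layer = 2 * 2 * d_model * context_len
    return n_layers * per_layer

for n in (10, 100, 1000):
    flops = attention_flops_per_token(n)
    print(f"{n:>5} context tokens -> ~{flops:,} attention FLOPs per new token")
```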
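And here’s a minimal sketch of what I mean by explicitly prefixing end-of-text, using GPT-2 via HuggingFace transformers as a stand-in (its end-of-text token is the literal string `<|endoftext|>`); the question prompt is just a placeholder, not anything from the original post.

```python
# Minimal sketch: prepending the end-of-text token so the model treats the
# prompt as the start of a fresh document rather than the middle of one.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Q: What is the capital of France?\nA:"  # placeholder prompt

# Without the prefix, the model is free to "assume" this text sits in the
# middle of some larger document it can't see.
bare_ids = tokenizer(prompt, return_tensors="pt").input_ids

# With the prefix, the context is marked as the beginning of a new document.
prefixed_ids = tokenizer(tokenizer.eos_token + prompt, return_tensors="pt").input_ids

for ids in (bare_ids, prefixed_ids):
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(out[0][ids.shape[1]:]))
```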
My basic point here is that your prompt implies a much different ‘prior document’ distribution than the original post’s does.
In general I’m pretty appreciative of efforts to help us get a clearer understanding of neural networks, but that understanding often doesn’t fit cleanly into “the model knows X” or “the model doesn’t know X”.