I mean big in terms of number of tokens, and I am thinking about this question specifically in the context of training windows vs context windows. This question is inspired by an Andrew Mayne tweet covered in AI #80: Never Have I Ever:
Most AI systems are trained on less than 2,000 words per sample. They can generalize across all of its training but no system to date trains on 80,000 words per sample. This will change with more memory and processing power. When that happens AI will be able to actually “read” entire novels and understand story structure on a fundamental level.
Gwern says context windows solve this. I’m pretty sure I am behind the curve on this one, but concretely: if LLMs cannot write a good novel because they are constrained by a training window of 2,000 tokens, would they also, under step-by-step prompting, tend to produce steps subject to that same constraint?
On the flip side, how does adding a larger context window resolve this, especially when the model needs to draw on information outside of what was provided in that context window?
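To make the question concrete, here is the kind of step-by-step loop I have in mind (a made-up sketch; `generate`, `write_novel`, and the chunk size are illustrative stand-ins, not a real API). Each step’s output stays within the short training window, and the open question is whether a large context window, by letting later steps see everything written so far, is enough to make the result coherent.

```python
# Hypothetical sketch of "step-by-step" novel writing. Each call produces a
# chunk no longer than the lengths the model was trained to emit, while the
# prompt grows to include everything written so far.

def generate(prompt: str, max_new_tokens: int = 2_000) -> str:
    """Stand-in for an actual LLM call; not a real API."""
    raise NotImplementedError

def write_novel(outline: str, num_chapters: int = 40) -> str:
    chapters: list[str] = []
    for i in range(num_chapters):
        prompt = (
            f"Outline:\n{outline}\n\n"
            # Only fits if the context window is large enough for the whole draft.
            f"Story so far:\n{''.join(chapters)}\n\n"
            f"Write chapter {i + 1}, consistent with everything above:\n"
        )
        chapters.append(generate(prompt))  # each step's output stays short
    return "".join(chapters)
```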
I think people are conflating several distinct claims:
A) “the models with large context windows can process the large context in order to produce an output dependent on that context.”
B) “the models have been trained to produce extra reliable and/or pertinent short outputs from a very long input, making good use of their new long context windows.”
C) “the models have been trained to produce very long outputs after being given just a short input prompt (e.g. giving a summary blurb from the back of a novel as a prompt, and receiving a full novel as a response).”
D) “the models with large context windows can take a small amount of context and reliably produce a large output that fills the rest of their large context in a coherent, useful way that logically follows from that small context.”
What we currently have is A. I don’t know if anyone is planning to try B in the near term; maybe they are? I’m pretty sure no one is planning to try C in the near term, which seems much harder than B. I think D is basically the result of deploying a model trained like C, and thus I don’t expect D without someone first doing C.
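To make the difference between B and C concrete, here is a minimal sketch (the class and the token counts are illustrative, not anyone’s actual training setup) of how training examples for the two cases differ in shape:

```python
# Illustrative only: B-style vs C-style training examples differ in which side
# of the context window is long, not in how big a window they need.
from dataclasses import dataclass

@dataclass
class TrainingExample:
    prompt_tokens: int  # length of the conditioning input the model sees
    target_tokens: int  # length of the text the model is trained to produce

# Case B: very long input, short output (e.g. "read this novel, summarize it").
example_b = TrainingExample(prompt_tokens=80_000, target_tokens=500)

# Case C: short input, very long output (e.g. "expand this blurb into a novel").
example_c = TrainingExample(prompt_tokens=500, target_tokens=80_000)

for name, ex in [("B", example_b), ("C", example_c)]:
    window_needed = ex.prompt_tokens + ex.target_tokens
    print(f"Case {name}: {ex.prompt_tokens:>6} tokens in -> "
          f"{ex.target_tokens:>6} tokens out (window needed: {window_needed})")
```

Both cases need roughly the same context window, which is why A alone doesn’t automatically give you B or C; what differs is which side of that window the model is trained to generate.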
I think Andrew’s quote is saying that we now have A, but do not yet have B, and that B would likely be an improvement over A.
Not enough context to know what Gwern means from the short summary you give, but I’d guess he could mean either:
x) A is good enough for the thing Andrew is describing; we don’t need B. Maybe B will happen, and maybe it will be better, but Andrew’s expressed desire is possible with A alone.
y) He agrees that what Andrew is describing requires B, but is implying that A unlocks B, and that we should thus expect B to arrive pretty soon due to competitive pressures.