This is based on:
The Q&A you mention
GPT-3 not being trained on even one pass of its training dataset
“Use way more compute” achieving outsized gains by training longer, for a fixed model size, rather than by most other architectural modifications (you’re correct that a bigger model trains faster, but you’re trading that off against ease of deployment, and models much bigger than GPT-3 become increasingly difficult to serve in production; plus, we know from the Q&A that it’s about the same size). See the sketch after this list.
Some experience with undertrained enormous language models underperforming relative to expectation
This is not to say that GPT-4 won’t have architectural changes. Sam mentioned a longer context at the least. But those sorts of architectural changes probably qualify as “small” in the parlance of the above conversation.
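To make the reasoning in the list above concrete, here is a minimal Python sketch of the usual power-law picture, loss ≈ E + A/N^α + B/D^β in parameter count N and training tokens D. The functional form is the standard scaling-law parameterization; the constants below are illustrative placeholders rather than fitted values, so only the qualitative behavior matters: at a fixed GPT-3-sized parameter count, more training tokens (i.e. more compute) keeps pushing loss down, while a much larger but undertrained model can land worse than a smaller, longer-trained one.

# Illustrative only: toy loss model of the form
#   loss(N, D) = E + A / N**alpha + B / D**beta
# where N is parameter count and D is training tokens. The constants are
# made-up placeholders chosen to show the qualitative tradeoff, not
# measured values for any real model.

def toy_loss(n_params: float, n_tokens: float,
             E: float = 1.7, A: float = 400.0, B: float = 400.0,
             alpha: float = 0.34, beta: float = 0.28) -> float:
    """Hypothetical loss estimate for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Fixed GPT-3-scale model (~175e9 params): more tokens keeps lowering loss.
# This is the "same size, just train longer with more compute" lever.
for tokens in (300e9, 1e12, 3e12):
    print(f"175B params, {tokens:.0e} tokens -> loss ~ {toy_loss(175e9, tokens):.3f}")

# An undertrained, much larger model can come out worse than a smaller,
# better-trained one -- the "enormous but undertrained" failure mode.
print(f"1T params,   100e9 tokens -> loss ~ {toy_loss(1e12, 100e9):.3f}")
print(f"175B params, 1e12 tokens  -> loss ~ {toy_loss(175e9, 1e12):.3f}")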
To be clear: Do you remember Sam Altman saying that “they’re simply training a GPT-3-variant for significantly longer”, or is that an inference from ~”it will use a lot more compute” and ~”it will not be much bigger”?
Because if you remember him saying that, then that contradicts my memory (and, uh, the notes that people took that I remember reading), and I’m confused.
Whereas if it’s an inference: sure, that’s a non-crazy guess, and I take your point that smaller models are easier to deploy. I just want it flagged as a claimed deduction, not as a remembered statement.
(And I maintain my impression that something more is going on; especially since I remember Sam generally talking about how models might use more test-time compute in the future, and be able to think for longer on harder questions.)
Honestly, at this point, I don’t remember if it’s inferred or primary-sourced. Edited the above for clarity.