I think the long gap between GPT-3 and GPT-4 can be explained by Chinchilla. That was the point where OpenAI realized their models were undertrained for their size, and switched focus from scaling to fine-tuning for a couple of years. InstructGPT, Codex, text-davinci-003, and GPT-3.5 were all released in this period.
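To put a number on “undertrained”: here’s a rough back-of-the-envelope sketch, assuming the commonly cited ~20-tokens-per-parameter rule of thumb from the Chinchilla paper (Hoffmann et al. 2022) and GPT-3’s reported ~300B training tokens.

```python
# Rough sketch: how undertrained was GPT-3 by Chinchilla's standard?
# Assumes the commonly cited compute-optimal rule of thumb of
# ~20 training tokens per model parameter (Hoffmann et al. 2022).

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training token count for a given model size."""
    return n_params * tokens_per_param

gpt3_params = 175e9  # GPT-3: 175B parameters
gpt3_tokens = 300e9  # reported ~300B training tokens (Brown et al. 2020)

optimal = chinchilla_optimal_tokens(gpt3_params)
print(f"Chinchilla-optimal tokens: {optimal:.1e}")              # ~3.5e+12
print(f"GPT-3 trained on {gpt3_tokens / optimal:.0%} of that")  # ~9%
```

By that estimate GPT-3 saw less than a tenth of the data a compute-optimal model of its size would have, which is what makes the retrench-and-retrain reading plausible.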
You’re likely correct, but I’m not sure that’s relevant. For one, Chinchilla wasn’t announced until 2022, nearly two years after the release of GPT-3. So the slowdown is still apparent even if we assume OpenAI was nearly done training an undertrained GPT-4 (a claim for which I’ve seen no evidence).
Moreover, the focus on efficiency is itself evidence of an approaching wall. To take an example from the 20th century, machines got much more energy-efficient after the 1970s, which is also when energy stopped getting cheaper. Why didn’t OpenAI pivot to fine-tuning and efficiency after the release of GPT-2? Because GPT-2 was cheap to train and used only a tiny fraction of the available data, so neither mattered much yet. Efficiency is typically a reaction to scarcity.