Maybe Sam knows a lot that I don’t, but here are some reasons why I’m skeptical about the end of scaling large language models:
From scaling laws we know that more compute and data reliably lead to better performance, so scaling seems like a low-risk investment.
I’m not sure how much GPT-4 cost to train, but GPT-3 reportedly cost only $5-10 million, which isn’t much for large tech companies (e.g. Meta spends billions on the metaverse every year).
There are limits to how big and expensive supercomputers can be, but I doubt we’re near them. I’ve heard that GPT-4 was trained on ~10,000 GPUs, which is a lot but not an insane amount (~$300m worth of GPUs). If you could pack 100 GPUs per square metre, all 10,000 GPUs would fit in a 10m x 10m room (100 m²). A model trained with millions of GPUs is not inconceivable and is probably technically and economically possible today.
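To make that arithmetic explicit, here is a tiny sketch; the GPU count, price per GPU, and packing density are the same rough assumptions as above, not reported figures:

```python
# Back-of-the-envelope numbers; every constant here is an assumption,
# not a reported figure.
n_gpus = 10_000            # rumoured GPT-4 training cluster size
price_per_gpu = 30_000     # assumed ~$30k per data-centre GPU
gpus_per_m2 = 100          # assumed packing density, racks included

total_cost = n_gpus * price_per_gpu    # -> $300,000,000
floor_area = n_gpus / gpus_per_m2      # -> 100 m^2
side_length = floor_area ** 0.5        # -> 10m x 10m room

print(f"hardware cost: ${total_cost:,}")
print(f"floor area: {floor_area:.0f} m^2 (~{side_length:.0f}m x {side_length:.0f}m)")
```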
Because scaling laws are power laws (loss falls off as a small power of compute, so each fixed improvement requires a multiplicative increase in resources), there are diminishing returns to more compute, but I doubt we’ve reached the point where the marginal cost of training larger models exceeds the marginal benefit. Think of a company like Google: building the biggest and best model is immensely valuable in a global, winner-takes-all market like search.
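As a quick illustration of what "power law" means for returns, here is a toy sketch; the coefficients a and alpha are made up for illustration, not fitted values from any scaling-law paper:

```python
# Illustrative power-law scaling curve: L(C) = a * C**(-alpha).
# a and alpha below are invented, purely to show the shape of the curve.
a, alpha = 10.0, 0.05

def loss(compute):
    """Hypothetical loss as a function of training compute (arbitrary units)."""
    return a * compute ** (-alpha)

# Each 10x increase in compute buys a smaller absolute improvement in loss...
for c in [1e21, 1e22, 1e23, 1e24]:
    print(f"compute={c:.0e}  loss={loss(c):.3f}")
# ...but whether that improvement is worth paying for depends on the value of
# the marginal gain (e.g. in a winner-takes-all market), not just its size.
```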
I’m reaching the same conclusions.
And this is in a world where Google has already announced that they’re going to build an even bigger model of their own.
We have to upgrade our cluster with a fresh batch of Nvidia gadgets.