If I understand point 6 correctly, you are proposing that Hoffmann’s scaling laws lead to shorter timelines because data-efficiency can be improved algorithmically. To me it seems that depending on algorithmic innovations, as opposed to the improvements in compute that would let us increase parameter count, might just as well make timelines longer. There seems to be more uncertainty about whether people will keep coming up with the novel ideas required to improve data efficiency than about whether the available compute will continue to increase in the near to mid-term future. If the available data really is exhausted within the next few years, then improving the quality of models will be more dependent on such novel ideas under Hoffmann’s laws than under Kaplan’s.
To me it seems that depending on algorithmic innovations, as opposed to the improvements in compute that would let us increase parameter count, might just as well make timelines longer.
I’ll give you an analogy:
Suppose your friend is running a marathon. You hear that at the halfway point she has a time of 1 hour 30 minutes. You think “okay I estimate she’ll finish the race in 4 hours”. Now you hear she has been running with her shoelaces untied. Should you increase or decrease your estimate?
Well, decrease. The time of 1:30 is more impressive if you learn her shoelaces were untied! It’s plausible your friend will notice and tie up her shoelaces.
But note that if you didn’t condition on the 1:30 information, then your estimate would increase if you learned her shoelaces were untied for the first half.
Now for Large Language Models:
Believing Kaplan’s scaling laws, we figure that the performance of LLMs depends on N, the number of parameters. But maybe there’s no room for improvement in N-efficiency. LLMs aren’t much more N-inefficient than the human brain, which is our only reference point for general intelligence. So we expect little algorithmic innovation. LLMs will only improve because N and D grow.
On the other hand, believing Hoffmann’s scaling laws, we figure that the performance of LLMs depends on D, the number of datapoints. But there is likely room for improvement in D-efficiency. LLMs are far more D-inefficient than the brain. So LLMs have been metaphorically running with their shoelaces untied. There is room for improvement, so we’re less surprised by algorithmic innovation. LLMs will still improve as N and D grow, but this isn’t the only path.
So Hoffmann’s scaling laws shorten our timeline estimates.
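To make this concrete, here is a minimal sketch (not from the original post) of the parametric loss fit from the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β, using roughly the fitted constants Hoffmann et al. report; the 10^13-token data cap below is an arbitrary illustrative ceiling, not a real estimate of how much text exists.

```python
# Minimal sketch of the Chinchilla-style parametric loss from Hoffmann et al. (2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# The constants are roughly the fitted values reported in the paper.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Hold data fixed at 1e13 tokens (an arbitrary illustrative ceiling) and scale N alone.
D_CAP = 1e13
for n in (1e11, 1e12, 1e13, 1e14):
    print(f"N = {n:.0e}, D = {D_CAP:.0e}  ->  predicted loss ~ {loss(n, D_CAP):.3f}")

# As N grows, the A / N**alpha term vanishes, but the predicted loss never drops
# below E + B / D_CAP**BETA. Under this reading, once data is the binding
# constraint, further gains must come from more data or from better D-efficiency,
# i.e. algorithmic innovation.
```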
This is an important observation to grok. If you’re already impressed by how an algorithm performs, and you learn that the algorithm has a flaw which would disadvantage it, then you should increase your estimate of future performance.
This analogy is misleading because it pumps the intuition that we know how to generate the algorithmic innovations that would improve future performance, much as we know how to tie our shoelaces once we notice they are untied. This is not the case. Research programmes can and do stagnate for long periods because crucial insights are hard to come by and hard to implement correctly at scale. Predicting the timescale on which algorithmic innovations occur is a very different proposition from predicting the timescale on which it will be feasible to increase parameter count.
This is an important observation to grok. If you’re already impressed by how an algorithm performs, and you learn that the algorithm has a flaw which would disadvantage it, then you should increase your estimate of future performance.
It’s not clear to me that this is the case. You have found evidence both that large improvements are available, AND that there is now one fewer large improvement left than there was previously. It seems to depend on your priors which way you should update your expectation of finding similar improvements in the future.
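To illustrate (with numbers invented purely for this toy example): suppose each large improvement that exists would independently have been found by now with probability 0.5, and we observe that exactly one has been found. Under a prior where improvements, if any exist, probably come alone, the discovery uses up the pool; under a prior where the pool is either empty or deep, the same discovery is evidence the pool is deep.

```python
from math import comb

def expected_remaining(prior, p_found=0.5, n_observed=1):
    """Posterior expected number of still-undiscovered improvements, given that
    exactly n_observed improvements have been found, when each of the K existing
    improvements is independently found by now with probability p_found.
    `prior` maps K (how many improvements exist) to its prior probability."""
    weights = {}
    for k, pk in prior.items():
        if k < n_observed:
            continue  # cannot have found more improvements than exist
        likelihood = comb(k, n_observed) * p_found**n_observed * (1 - p_found)**(k - n_observed)
        weights[k] = pk * likelihood
    z = sum(weights.values())
    return sum(w * (k - n_observed) for k, w in weights.items()) / z

# Prior 1: if any improvement exists, it is probably the only one.
concentrated = {0: 0.5, 1: 0.5}
# Prior 2: either the well is dry or it is deep.
lumpy = {0: 0.9, 10: 0.1}

for name, prior in (("concentrated", concentrated), ("lumpy", lumpy)):
    # Expected number of not-yet-found improvements before observing anything.
    before = sum(pk * k * (1 - 0.5) for k, pk in prior.items())
    after = expected_remaining(prior)
    print(f"{name}: expected undiscovered improvements {before:.2f} -> {after:.2f}")
```

The concentrated prior updates downward (the one improvement has been spent) while the lumpy prior updates upward (one discovery is evidence the pool is deep), which is exactly the prior-dependence claimed above.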
Very interesting! After reading chinchilla’s wild implications, I was hoping someone would write something like this!