First of all, serious points for making predictions! And thanks for the thoughtful response.
Before I address specific points: I’ve been working on a research project that’s intended to help resolve the debate about LLMs and general reasoning. If you have a chance to take a look, I’d be very interested to hear whether you’d find the results of the proposed experiment compelling; if not, why not, and what changes would make the evidence more compelling to you?
Humans are eager to find meaning and tend to project their own thoughts onto external sources. We even go so far as to attribute consciousness and intelligence to inanimate objects, as seen in animistic traditions. In the case of LLMs this behaviour could lead to an overly optimistic extrapolation of capabilities from toy problems.
Absolutely! And then on top of that, it’s very easy to mistake using knowledge from the truly vast training data for actual reasoning.
But in 2024 the overhang has been all but consumed. Humans continue to produce more data, at an unprecedented rate, but still nowhere near enough to keep up with the demand.
This does seem like one possible outcome. That said, it seems more likely to me that continued algorithmic improvements will result in better sample efficiency (certainly humans need far fewer language examples to learn language), and that multimodal data / synthetic data / self-play / simulated environments will continue to improve. I suspect capabilities researchers would have made more progress on all those fronts if it hadn't been so easy, up to now, to just throw more data at the models.
In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5–1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me (although that would basically be the generation of GPT-5 and peer models; it seems likely to me that the generation past that will require progress on one or more of the fronts I named above).
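To put rough numbers on that: as an illustrative assumption (not a figure from this thread), take a current frontier text corpus on the order of 15 trillion tokens, roughly what Llama-3-scale models were reported to be trained on:

```python
# Back-of-envelope: what another 0.5-1 orders of magnitude of text would mean,
# assuming (purely for illustration) a current corpus of ~15 trillion tokens.
current_tokens = 15e12
half_oom = current_tokens * 10 ** 0.5  # ~4.7e13, i.e. roughly 47T tokens
full_oom = current_tokens * 10 ** 1.0  # 1.5e14, i.e. roughly 150T tokens
print(f"{half_oom:.1e} to {full_oom:.1e} tokens")
```

So under that assumption, "another 0.5–1 order of magnitude" cashes out as very roughly 50–150 trillion tokens of human-generated text.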
Taking the globe representation as an example, it is unclear to me how much of the resulting globe (or atlas) is actually the result of choices the authors made. The decision to map distance vectors in two or three dimensions seems to change the resulting representation. So, to what extent are these representations embedded in the model itself versus originating from the authors' minds?
I think that’s a reasonable concern in the general case. But in cases like the ones mentioned, the authors are retrieving information (eg lat/long) using only linear probes. I don’t know how familiar you are with the math there, but if something can be retrieved with a linear probe, it means that the model is already going to some lengths to represent that information and make it easily accessible.
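To make that concrete: a linear probe is just a single linear map fit on frozen activations, with no capacity to compute the answer itself. Here's a minimal sketch of the method (GPT-2 small as a stand-in model, a handful of hand-entered cities, an arbitrarily chosen middle layer; the actual work used much larger models, thousands of places, and held-out test sets):

```python
# Minimal sketch: fit a linear probe from a model's hidden states to
# latitude/longitude. GPT-2 small and a few hand-picked cities stand in
# for the larger models and datasets used in the actual papers.
import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import Ridge

cities = {
    "Paris":        (48.9,    2.4),
    "Tokyo":        (35.7,  139.7),
    "Nairobi":      (-1.3,   36.8),
    "Buenos Aires": (-34.6, -58.4),
    "Toronto":      (43.7,  -79.4),
    "Sydney":       (-33.9, 151.2),
    "Oslo":         (59.9,   10.8),
    "Cairo":        (30.0,   31.2),
}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def activation(name: str) -> np.ndarray:
    """Hidden state of the name's last token at a middle layer (layer 6 here)."""
    inputs = tokenizer(name, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[6][0, -1].numpy()

X = np.stack([activation(name) for name in cities])  # (n_cities, d_model)
Y = np.array(list(cities.values()))                  # (n_cities, 2): lat, long

# The "probe" is a plain linear regression on frozen activations: no hidden
# layers, no nonlinearity. In the real experiments it's evaluated on
# held-out places; here we just fit and inspect the predictions.
probe = Ridge(alpha=1.0).fit(X, Y)
print(probe.predict(X[:2]))  # predicted (lat, long) for Paris and Tokyo
```

Since the probe is nothing but a learned matrix and bias, any accuracy it achieves has to come from the coordinates already being laid out along linear directions in the model's own activation space, rather than from anything the probe computes.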
In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5–1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me
Epoch’s analysis from June supports this view, and suggests it may even be a bit too conservative:
(and that’s just for text—there are also other significant sources of data for multimodal models, eg video)