The main update is to undermine confidence in the generality and utility of these ‘Scaling Laws’. It’s clear that the current LLM transformer recipe does not scale to AGI: it is vastly too data-inefficient. Human brains are a proof of concept that it’s possible to train systems using orders of magnitude less data while simultaneously reaching higher levels of performance on the key downstream linguistic tasks.
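For a rough sense of that gap, here is a quick order-of-magnitude comparison; both figures are commonly cited ballpark assumptions, not numbers from this comment.

```python
import math

# Ballpark assumptions (illustrative, not from the comment above):
# ~1e8-1e9 words of linguistic input over a human lifetime vs
# ~1e13+ tokens for a frontier-scale LLM pretraining run.
human_words = 1e9
llm_tokens = 1e13

gap = math.log10(llm_tokens / human_words)
print(f"LLM pretraining sees ~{gap:.0f} orders of magnitude more text than a human")
```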
Also, there is now mounting evidence that LLMs trained on internet-scale data are memorizing the test sets of many downstream tasks, a problem that only gets worse as you feed them ever more training data.
Not really? If they were just memorizing the data without any intelligence, their memory requirements would have to scale linearly with the data (roughly N parameters for N tokens); instead we see compression with a much smaller constant, which essentially requires actual intelligence rather than rote memorization of all that data.
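To make that scaling intuition concrete, here is a minimal back-of-the-envelope sketch; the parameter count, token count, and bytes-per-token figures are illustrative assumptions, not numbers from this thread.

```python
# Back-of-the-envelope sketch: pure memorization would need parameter memory
# comparable to the corpus size; the figures below are illustrative guesses.
params = 70e9            # hypothetical parameter count
bits_per_param = 16      # fp16 storage
model_bits = params * bits_per_param

tokens = 10e12           # hypothetical pretraining-token count
bits_per_token = 32      # ~4 bytes of raw text per token, roughly
corpus_bits = tokens * bits_per_token

print(f"parameter memory: {model_bits / 8e12:.2f} TB")
print(f"raw corpus:       {corpus_bits / 8e12:.1f} TB")
print(f"corpus / parameters ~ {corpus_bits / model_bits:.0f}x")
```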
I didn’t say they were simply memorizing; it’s more complex than that: you would need to look at how the compression ratio scales with parameters versus data similarity/repetition, and compare against simpler SOTA compressors. Regardless of whether it’s ‘true’ memorization or not, exposure to downstream-task test sets distorts evaluations (this is already a problem for humans, since many answers are available on the internet; it’s just much more of a problem for an AI that actually digests the entire internet).
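As a toy illustration of why data similarity/repetition matters here, the sketch below uses zlib as a stand-in for a simple baseline compressor (not a SOTA one): the same-sized corpus compresses far better when it is mostly repeats of one chunk, so any LLM-vs-compressor comparison has to control for how repetitive the training data is.

```python
import random
import zlib

# Toy illustration: a generic compressor's ratio depends heavily on how
# repetitive the data is, so raw compression ratios are hard to interpret
# without controlling for data similarity/repetition.
random.seed(0)
chunk = bytes(random.getrandbits(8) for _ in range(10_000))    # 10 KB of noise
fresh = bytes(random.getrandbits(8) for _ in range(100_000))   # 100 KB, all new
repeated = chunk * 10                                          # 100 KB, 10x repeats

for name, data in [("low-repetition data ", fresh), ("high-repetition data", repeated)]:
    ratio = len(data) / len(zlib.compress(data, 9))
    print(f"{name}: compression ratio {ratio:.1f}x")
```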