There is research suggesting that, once the share of synthetic (AI-generated) data in training reaches some critical point, the model degenerates completely (“AI dementia”), and that “organic”, human-generated data is crucial not only for training an initial model, but also for maintaining model “intelligence” in later generations of models.
So it may be the other way around: human input is vastly more valuable, and the need for quality human input will only grow with time.
They even managed to publish it in Nature. But if you don’t throw out the original data and instead train on both the original data and the generated data, this doesn’t seem to happen (see also). Besides, there is the empirical observation that o1 in fact works at GPT-4 scale, so a similar methodology might survive more scaling, at least at the upcoming ~5e26 FLOPs level of next year that is the focus of this post, in the hypothetical where an open weights release arrives before there is an open source reproduction of o1’s methodology, which then makes that model much stronger in a way that wasn’t accounted for when deciding to release those weights.
AlphaZero was trained purely on synthetic data, and humans (note congenitally blind humans, so video data isn’t crucial) use maybe 10,000 times less natural data than Llama-3-405B (15 trillion tokens) to reach better performance, though each of us individually knows far fewer facts. So clearly there is some way to get very far with merely 50 trillion natural tokens, though this is not relevant to o1 specifically.
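(For scale, that is 15 trillion / 10,000 ≈ 1.5 billion tokens of natural language per person, which seems at least plausibly the right order of magnitude for lifetime language exposure.)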
Another point is that data can be repeated when training LLMs (5-15 repetitions with good results, up to 60 with slight further improvement; then there is double descent, with the worst performance at around 200 repetitions, so improvement might resume after hundreds of repetitions). This suggests it might be possible to repeat natural data many times to balance out a much larger amount of unique synthetic data.
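As a back-of-the-envelope sketch of what that balance could look like (all numbers here are hypothetical: the 50 trillion natural-token figure from the previous comment, a repetition count near the top of the “good results” range, and an arbitrary 50/50 mixture target):

```python
# Hypothetical token-budget arithmetic, not a description of any real training run.
NATURAL_TOKENS = 50e12   # unique natural-text tokens (figure from the comment above)
REPEATS = 15             # repetitions still reported to give good results
NATURAL_SHARE = 0.5      # desired fraction of natural token-exposures in the mix (arbitrary)

natural_exposures = NATURAL_TOKENS * REPEATS
synthetic_tokens = natural_exposures * (1 - NATURAL_SHARE) / NATURAL_SHARE
total = natural_exposures + synthetic_tokens

print(f"natural token-exposures (repeated): {natural_exposures:.1e}")  # 7.5e+14
print(f"unique synthetic tokens supported:  {synthetic_tokens:.1e}")   # 7.5e+14
print(f"total training tokens:              {total:.1e}")              # 1.5e+15
```

Even at a 50/50 mix, repeating the natural pool 15 times would leave room for hundreds of trillions of unique synthetic tokens.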
Yeah, I’m extremely skeptical of the paper, and to point out one of its flaws: if you don’t throw away the original data, the collapse doesn’t happen.
There are almost certainly other assumptions that are likely wrong, but that one alone makes me extremely skeptical of the model collapse concept, and while a lot of people want it to be true, there is no good reason to believe it is true.
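To make the keep-the-original-data point concrete, here is a minimal toy sketch (a Gaussian-fitting caricature of iterated retraining, not the Nature paper’s actual setup; all sizes are arbitrary): “training” is fitting a 1-D Gaussian and “generation” is sampling from the fit. When each generation keeps only the latest synthetic samples, the fitted spread drifts toward zero; when synthetic data is accumulated on top of the original data, the fit stays close to the true distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
real_data = rng.normal(loc=0.0, scale=1.0, size=100)  # the "organic" data

def fit(data):
    # Maximum-likelihood Gaussian fit stands in for "training a model".
    return data.mean(), data.std()

def run(generations=1000, keep_original=False):
    pool = real_data.copy()
    mu, sigma = fit(pool)
    for _ in range(generations):
        synthetic = rng.normal(mu, sigma, size=100)  # model-generated data
        pool = np.concatenate([pool, synthetic]) if keep_original else synthetic
        mu, sigma = fit(pool)
    return sigma

print("train only on the latest synthetic data:", run(keep_original=False))  # drifts toward 0
print("accumulate synthetic on top of original:", run(keep_original=True))   # stays near 1.0
```

The qualitative contrast, not the particular numbers, is the point.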