They even managed to publish it in Nature. But if you don’t throw out the original data and instead train on both the original data and the generated data, this doesn’t seem to happen (see also). Besides, there is the empirical observation that o1 in fact works at GPT-4 scale, so similar methodology might survive more scaling. This matters at least at the upcoming ~5e26 FLOPs level of next year, which is the focus of this post: it leaves open the hypothetical where an open weights release arrives before there is an open source reproduction of o1’s methodology, which subsequently makes that model much stronger in a way that wasn’t accounted for when deciding to release it.
AlphaZero is trained purely on synthetic data, and humans (including congenitally blind humans, so video data isn’t crucial) use maybe 10,000 times less natural data than Llama-3-405B (15 trillion tokens) while achieving better performance, though each of us individually knows far fewer facts. So clearly there is some way to get very far with merely 50 trillion natural tokens, though this is not relevant to o1 specifically.
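A quick sanity check on the arithmetic implied above (the 10,000x factor is the text’s rough estimate, not a measured quantity):

```python
# Numbers from the text; the 10,000x factor is a rough estimate.
llama3_tokens = 15e12          # Llama-3-405B pretraining data: ~15 trillion tokens
human_ratio = 10_000           # humans use maybe 10,000x less natural data
human_tokens = llama3_tokens / human_ratio

print(f"human lifetime intake: ~{human_tokens:.1e} tokens")
# Roughly 1.5 billion tokens; spread over ~20 years that is on the order
# of 200k tokens per day, a plausible ballpark for continuous experience.
print(f"per day over 20 years: ~{human_tokens / (20 * 365):.0f} tokens")
```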
Another point is that you can repeat the data for LLMs: 5-15 repetitions give good results, up to 60 give slight further improvement, and beyond that there is double descent with the worst performance at around 200 repetitions, so improvement might resume after hundreds of repetitions. This suggests that it might be possible to repeat natural data many times to balance out a lot more unique synthetic data.
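To make the repetition figures concrete, here is a back-of-the-envelope sketch of the raw token-pass counts involved, assuming the ~50 trillion natural tokens mentioned above. Note this counts passes over the data, not effective fresh data; per the cited results, the benefit per pass drops off past ~15 repetitions and goes negative near 200:

```python
# Illustration of the repetition figures in the text. "Token-passes" here
# is just repetitions x corpus size, NOT effective fresh data; the benefit
# per pass drops off past ~15 repetitions per the cited results.
natural_tokens = 50e12   # ~50 trillion natural tokens, per the text

for repeats, note in [(5, "good results"),
                      (15, "good results, upper end"),
                      (60, "slight further improvement"),
                      (200, "worst point of double descent")]:
    token_passes = natural_tokens * repeats
    print(f"{repeats:>3} repetitions -> {token_passes:.1e} token-passes ({note})")
```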