We don’t know how o3 works, but we can speculate. If it’s like the open-source Hugging Face kinda-replication, then it uses all kinds of expensive methods to build the next level of reward model, and that model teaches a simpler student model. The expensive methods are then only needed once, during training.
In other words, you use expensive methods (process supervision, test-time compute, MCTS) to bootstrap the next level of labels/supervision, which trains a cheaper student model. This is essentially bootstrapping superhuman synthetic data/supervision.
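To make the loop concrete, here’s a minimal, hypothetical sketch of that bootstrap in Python. Everything here is an assumption: the names (`expensive_search`, `distill`, `bootstrap`) and the toy `Model` are illustrative stand-ins, not anything from OpenAI or the Hugging Face replication, and the “training” step just memorizes pairs so the example runs.

```python
# Hypothetical sketch of the bootstrap loop described above.
# Nothing here reflects o3's actual training; all names are stand-ins.

from dataclasses import dataclass, field


@dataclass
class Model:
    """Toy stand-in for a policy model: maps prompts to answers."""
    answers: dict = field(default_factory=dict)

    def generate(self, prompt: str, n: int = 1) -> list[str]:
        # A real model would sample n candidate solutions here.
        return [self.answers.get(prompt, "draft answer")] * n


def expensive_search(teacher: Model, reward_model, prompt: str, n: int = 8) -> str:
    """Expensive step: sample many candidates (stand-in for test-time
    compute / MCTS / process supervision) and keep the highest-reward one."""
    candidates = teacher.generate(prompt, n=n)
    return max(candidates, key=lambda ans: reward_model(prompt, ans))


def distill(student: Model, dataset: list[tuple[str, str]]) -> Model:
    """Cheap step: train the student on the synthetic (prompt, answer)
    pairs. Here we just memorize them; a real run would fine-tune with SGD."""
    student.answers.update(dataset)
    return student


def bootstrap(prompts: list[str], reward_model, rounds: int = 3) -> Model:
    """Repeat: expensive search -> synthetic labels -> cheap student.
    Each round's student becomes the next round's teacher."""
    teacher = Model()
    for _ in range(rounds):
        synthetic = [(p, expensive_search(teacher, reward_model, p)) for p in prompts]
        teacher = distill(Model(), synthetic)
    return teacher


if __name__ == "__main__":
    # Toy usage: reward longer answers, just to make the loop runnable.
    final = bootstrap(["2+2=?"], reward_model=lambda p, a: len(a))
    print(final.generate("2+2=?"))
```

The point the sketch captures is that the expensive search cost is paid once per round at training time, while the student serves cheap answers at inference.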
o3 seems to have shown that this bootstrapping process can be repeated beyond the limits of human training data.
If this is true, we’ve reached peak cheap data. Not peak data.