I don’t know if we can be confident in the exact 95% result, but it is the case that o1 consistently performs at a roughly similar level on math across a variety of different benchmarks (e.g., AIME and other people have found strong performance on other math tasks which are unlikely to have been in the training corpus).
Do we know that the test set isn’t in the training data?
I don’t know if we can be confident in the exact 95% result, but it is the case that o1 consistently performs at a roughly similar level on math across a variety of different benchmarks (e.g., AIME and other people have found strong performance on other math tasks which are unlikely to have been in the training corpus).