I recently translated 100 AIME-level math questions from another language into English as a test set for a Kaggle competition. The best model was GPT-4-32k, which solved only 5-6 questions correctly. The rest of the models managed just 1-3.
Then, I tried the MATH dataset. While the difficulty level was similar, the results were surprisingly different: 60-80% of the problems were solved correctly.
I don't see any improvement from o1 on this.
Is this a well-known phenomenon, or am I onto something significant here?
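For anyone who wants to run a similar comparison, here is a minimal sketch of what such a scoring loop could look like. It is not the setup I used; it assumes the openai Python SDK, a hypothetical `problems.jsonl` file with `problem`/`answer` fields, and simple exact-match grading on a final integer answer (AIME answers are integers from 0 to 999).

```python
# Minimal sketch of a scoring loop for a translated problem set.
# Assumes the openai Python SDK; the file name, prompt, and exact-match
# grading on a final integer answer are illustrative assumptions.
import json
import re
from openai import OpenAI

client = OpenAI()

def extract_final_integer(text: str):
    """Take the last integer appearing in the model's reply as its final answer."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None

def score(model: str, path: str = "problems.jsonl") -> float:
    correct = total = 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)  # expected keys: "problem", "answer"
            reply = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system",
                     "content": "Solve the problem. End your reply with the final integer answer."},
                    {"role": "user", "content": item["problem"]},
                ],
            )
            pred = extract_final_integer(reply.choices[0].message.content)
            correct += int(pred == int(item["answer"]))
            total += 1
    return correct / total
```

With something like this, the same loop can be pointed at both the translated set and a MATH subset to make the accuracy gap directly comparable.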
Are you saying that o1 did not do any better than 5-6% on your AIME-equivalent dataset? That would be interesting, given that o1 did far better on the 2024 AIME, which presumably was released after its training cutoff: https://openai.com/index/learning-to-reason-with-llms/
How did you translate the dataset, and what is the translation quality?