I recently translated 100 AIME-level math questions from another language into English as a test set for a Kaggle competition. The best model was GPT-4-32k, which solved only 5-6 questions correctly. The rest of the models managed just 1-3.
Then, I tried the MATH dataset. While the difficulty level was similar, the results were surprisingly different: 60-80% of the problems were solved correctly.
I don't see any improvement from o1 on this.
Is this a well-known phenomenon, or am I onto something significant here?
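For anyone who wants to run a similar comparison, here is a minimal sketch of what such a scoring loop could look like. It is not the setup I used; it assumes the openai Python SDK, a hypothetical `problems.jsonl` file with `problem`/`answer` fields, and simple exact-match grading on a final integer answer (AIME answers are integers from 0 to 999).

```python
# Minimal sketch of a scoring loop for a translated problem set.
# Assumes the openai Python SDK; the file name, prompt, and exact-match
# grading on a final integer answer are illustrative assumptions.
import json
import re
from openai import OpenAI

client = OpenAI()

def extract_final_integer(text: str):
    """Take the last integer appearing in the model's reply as its final answer."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None

def score(model: str, path: str = "problems.jsonl") -> float:
    correct = total = 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)  # expected keys: "problem", "answer"
            reply = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system",
                     "content": "Solve the problem. End your reply with the final integer answer."},
                    {"role": "user", "content": item["problem"]},
                ],
            )
            pred = extract_final_integer(reply.choices[0].message.content)
            correct += int(pred == int(item["answer"]))
            total += 1
    return correct / total
```

With something like this, the same loop can be pointed at both the translated set and a MATH subset to make the accuracy gap directly comparable.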
Are you saying that o1 did not do any better than 5-6% on your AIME-equivalent dataset? That would be interesting, given that o1 did far better on the 2024 AIME, which presumably was released after its training cutoff: https://openai.com/index/learning-to-reason-with-llms/
How did you translate the dataset, and what is the translation quality?