The model’s performance is still well below human performance
At this point I have to ask what exactly is meant by this. The bigger model beats the average human performance on the national math exam in Poland. Sure, the people taking this exam are usually not adults, but for many it may be where they peak in their mathematical abilities, so I wouldn’t be surprised if it beats average human performance in the US. It’s all rather vague though; looking at the MATH dataset paper all I could find regarding human performance was the following:
Human-Level Performance. To provide a rough but informative comparison to human-level performance, we randomly sampled 20 problems from the MATH test set and gave them to humans. We artificially require that the participants have 1 hour to work on the problems and must perform calculations by hand. All participants are university students. One participant who does not like mathematics got 8⁄20 = 40% correct. A participant ambivalent toward mathematics got 13⁄20. Two participants who like mathematics got 14⁄20 and 15⁄20. A participant who got a perfect score on the AMC 10 exam and attended USAMO several times got 18⁄20. A three-time IMO gold medalist got 18⁄20 = 90%, though missed questions were exclusively due to small errors of arithmetic. Expert-level performance is theoretically 100% given enough time. Even 40% accuracy for a machine learning model would be impressive but have ramifications for cheating on homework.
So, for solving undergraduate-level math problems, this model would be somewhere between university students who dislike mathematics and ones who are neutral towards it? Maybe. It would be nice to get more details here. I assume they didn’t think much about human-level performance since the previous SOTA was clearly very far from it.
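To make the comparison concrete, here is a rough sketch that converts the scores quoted above into percentages and finds which two participants bracket a given accuracy. The scores come from the quoted paragraph; the `model_accuracy` argument and the function names are my own, purely for illustration, and with one participant per category on 20 problems the brackets are very coarse.

```python
# Scores out of 20, taken from the quoted human-performance paragraph.
NUM_PROBLEMS = 20
human_scores = {
    "dislikes mathematics": 8,
    "ambivalent toward mathematics": 13,
    "likes mathematics (first participant)": 14,
    "likes mathematics (second participant)": 15,
    "perfect AMC 10 score, USAMO attendee": 18,
    "three-time IMO gold medalist": 18,
}


def percent(score, total=NUM_PROBLEMS):
    """Convert a raw score into a percentage."""
    return 100.0 * score / total


def bracket(model_accuracy):
    """Return the (lower, upper) participants around a hypothetical accuracy in percent.

    lower is the highest-scoring participant at or below model_accuracy,
    upper is the lowest-scoring participant strictly above it.
    Either can be None if the accuracy falls outside the observed range.
    """
    ranked = sorted(human_scores.items(), key=lambda kv: kv[1])
    lower = None
    upper = None
    for label, score in ranked:
        if percent(score) <= model_accuracy:
            lower = label
        elif upper is None:
            upper = label
    return lower, upper


if __name__ == "__main__":
    for label, score in human_scores.items():
        print(f"{label}: {percent(score):.0f}%")
    # 50.0 is an illustrative placeholder, not a reported result.
    print(bracket(50.0))
```

Any accuracy from 40% up to (but not including) 65% lands between the participant who dislikes mathematics and the ambivalent one, which is the range the "somewhere between" reading implies.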
For math problems, they test on the basic tier (Poziom podstawowy) of the Matura. In countries with Matura-based education, the basic-tier math test is not usually what mathematically inclined students aim for: it exists because the law requires anyone going to a public university to pass some sort of math exam beforehand. Students who want to study anything where mathematics skills are needed would take the higher tier (Poziom rozszerzony). Can someone from Poland confirm this?
A quick estimate of the percentage of high-school students taking the Polish Matura exams is 50%-75%, though. If the number of students taking the higher tier is not too large, then average performance on the basic tier corresponds to essentially average human-level performance on this kind of test.
Note that many students taking the basic math exam only want to pass and not necessarily perform well; and some of the bottom half of the 270k students are taking the exam for the second or third time after failing before.