I would like to note that this dataset is not as hard as it might look. Humans did not perform that well because there was a strict time limit; I don’t remember exactly, but it was something like 1 hour for 25 tasks (and IIRC the medalist only made arithmetic errors). I am pretty sure any IMO gold medalist would typically score 100% given (say) 3 hours.
Nevertheless, it’s very impressive, and AIMO results are even more impressive in my opinion.