I kind of disagree. (I was on the South Korean IMO team.) I agree IMO problems are in a category of tasks closer to research math than to high school math, but since IMO problems are intended to be solvable within a time limit, there is a (quite low, in an absolute sense) upper limit to their difficulty. Basically, the intended solution is no longer than a single page. Research math problems have no such limit and can be arbitrarily difficult, or have an arbitrarily long solution.
Edit: Apart from the time limit, length limit, and difficulty limit, another important aspect is that IMO problems are already solved, so they are known to be solvable. IMO problems are “Prove X”. Research math problems, even if they are stated as “Prove X”, are really “Prove or disprove X”, and sometimes this matters.
“Prove or disprove X” is only like 2x harder than “Prove X.” Sometimes the gap is larger for humans because of psychological difficulties, but a machine can literally just pursue both in parallel. (That said, research math involves a ton of problems other than prove or disprove.)
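(For concreteness, here is a minimal sketch of what “pursue both in parallel” could mean. The `search_proof` and `negate` functions are hypothetical stand-ins, not any real prover’s API; the point is just that racing the two searches costs at most roughly 2x one search.)

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def prove_or_disprove(statement, search_proof, negate):
    """Race a proof search on X against a proof search on not-X.

    `search_proof` and `negate` are assumed stand-ins: the former
    returns a proof object or None on failure, the latter forms the
    negation of a statement. Worst case this does ~2x the work of a
    single "Prove X" search.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            pool.submit(search_proof, statement): "proved",
            pool.submit(search_proof, negate(statement)): "disproved",
        }
        # Take whichever search succeeds first; a failed search returns
        # None, in which case we keep waiting on the other one. (The
        # losing search runs to completion here; a real system would
        # cancel it.)
        for fut in as_completed(futures):
            proof = fut.result()
            if proof is not None:
                return futures[fut], proof
    return "open", None  # neither search found anything
```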
I basically agree that IMO problems are significantly easier than research math or other realistic R&D tasks. However, I think that they are very much harder than the kinds of test questions that machines have solved so far. I’m not sure the difference is about high school math vs research math so much as about very easy problems vs problems designed to be challenging and require novel thinking.
My view, having spent a fair amount of time on IMO problems as well as on theoretical research and more practical R&D, is that the IMO is significantly easier but just not very far away from the kind of work human scientists need to do in order to be productive.
I think the biggest remaining difference is that the hardest research math problems operate over a timescale about 2-3 orders of magnitude longer than IMO problems, and I would guess transformative R&D requires operating over a timescale somewhere in between. (While IMO problems are themselves about 2-3 orders of magnitude longer for humans than questions that you can solve automatically.)
Research problems also involve a messier set of data, so training on “all IMO problems” is more like getting good at an incredibly narrow form of R&D. And I do think it’s just cognitively harder, but by an amount that feels like much less than a GPT-3-to-GPT-4-sized gap.
I’d be personally surprised if you couldn’t close the gap between IMO gold and transformative R&D with 3-4 orders of magnitude of compute (or equivalent algorithmic progress) + an analogous effort to construct relevant data and feedback for particular R&D tasks. If we got an IMO gold in 2023 I would intuitively expect transformative AI to happen well before 2030, and I would shift my view from focusing more on compute to focusing more on data and adapting R&D workflows to benefit from AI.
At least in certain areas of mathematics, research problems are often easier than the harder IMO problems. That is to say, you can get pretty far if you know a lot of previously proven results and combine them in relatively straightforward ways. This seems especially true in areas where it is hard for a single human to know a lot of results, just because it takes a long time to read and learn things.
In the MIRI dialogues from 2021/2022, I thought you said you would update to a 40% chance of AGI by 2040 if AI got an IMO gold medal by 2025? Did I misunderstand, or have you shifted your thinking (and if so, how)?
I agree timescale is a good way to think about this. My intuition is that if high school math problems are 1, then IMO math problems are 100 (1e2) and typical research math problems are 10,000 (1e4). So exactly halfway! I don’t have first-hand experience with the hardest research math problems, but from what I’ve heard about their timescale they seem to reach 1,000,000 (1e6). I’d rate typical practical R&D problems 1e3 and transformative R&D problems 1e5.
Edit: Using this scale, I rate GPT-3 at 1 and GPT-4 at 10. This suggests GPT-5 for IMO, which feels uncomfortable to me! Thinking about this, while there is lots of training data at levels 1 and 10, there is considerably less at level 100, and above that most things are not written down at all. But maybe that is an excuse and it doesn’t matter.
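To make the extrapolation explicit, here is the back-of-envelope as a toy Python calculation. The scale values are just my guesses above (not any benchmark), and the one-level-per-generation assumption is exactly the part that feels uncomfortable:

```python
import math

# My guessed log-difficulty scale from above (not an established benchmark).
difficulty = {
    "high school math / GPT-3": 1e0,
    "GPT-4": 1e1,
    "IMO problems": 1e2,
    "typical practical R&D": 1e3,
    "typical research math": 1e4,
    "transformative R&D": 1e5,
    "hardest research math": 1e6,
}

def generations_to(target, current=difficulty["GPT-4"]):
    """Orders of magnitude from `current` to `target` -- one per GPT
    generation, under the (shaky) assumption that each generation
    climbs one level of this scale."""
    return math.log10(target / current)

print(generations_to(difficulty["IMO problems"]))        # 1.0 -> "GPT-5 for IMO"
print(generations_to(difficulty["transformative R&D"]))  # 4.0 on this toy model
```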