Ha! Okay then. My probability is at least 16%, though I’d have to think more and Look into Things, and maybe ask for such sad little metrics as are available, before I’d be confident saying how much more. Paul?
EDIT: I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle that can prevent a proposition like this from being judged true even where the technical capability exists. I’ll stand by a >16% probability of the technical capability existing by end of 2025, as reported on, e.g., solving a non-trained/held-out dataset of past IMO problems, conditional on such a dataset being available; I frame no separate sociological prediction about whether somebody is willing to open-source the AI model that does it.
I don’t care about whether the AI is open-sourced (I don’t expect anyone to publish the weights even if they describe their method) and I’m not that worried about our ability to arbitrate overfitting.
Ajeya suggested that I clarify: I’m significantly more impressed by an AI getting a gold medal than getting a bronze, and my 4% probability is for getting a gold in particular (as described in the IMO Grand Challenge). There are some categories of problems that can be solved with easy automation (I’d guess about 5-10% could be done with no deep learning and modest effort). Together with modest progress in deep-learning-based methods, and a somewhat serious effort, I wouldn’t be surprised by people getting up to 20-40% of problems. The bronze cutoff is usually 3⁄6 problems, and the gold cutoff is usually 5⁄6 (assuming the AI doesn’t get partial credit). The difficulty of problems also increases very rapidly for humans: there are often 3 problems that a human can do more-or-less mechanically, while the remaining ones are far harder.
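To make the cutoff arithmetic above concrete, here is a minimal sketch in Python. The 7-points-per-problem scoring is standard IMO practice; the problem-count thresholds are just the rough “usually” figures from the paragraph above, not any particular year’s actual cutoffs:

```python
# Minimal sketch of the medal arithmetic described above. The per-problem
# score of 7 is standard IMO scoring; the problem-count thresholds are the
# rough "usually" figures from the text, not any specific year's cutoffs.
POINTS_PER_PROBLEM = 7
GOLD_PROBLEMS = 5    # "gold cutoff is usually 5/6"
BRONZE_PROBLEMS = 3  # "bronze cutoff is usually 3/6"

def outcome(problems_fully_solved: int) -> str:
    """Assumes no partial credit, as in the text."""
    score = problems_fully_solved * POINTS_PER_PROBLEM
    if problems_fully_solved >= GOLD_PROBLEMS:
        return f"gold ({score} points)"
    if problems_fully_solved >= BRONZE_PROBLEMS:
        return f"at least bronze ({score} points)"
    return f"below bronze ({score} points)"
```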
I could tighten any of these estimates by looking at the distribution more carefully rather than going off of my recollections from 2008, and if this was going to be one of a handful of things we’d bet about I’d probably spend a few hours doing that and some other basic digging.
I looked at a few recent IMOs to get better calibrated. I think the main update is that I significantly underestimated how many years you can get a gold with only 4⁄6 problems.
For example I don’t have the same “this is impossible” reaction about IMO 2012 or IMO 2015 as about most years. That said, I feel like they do have to get reasonably lucky with the IMO content, and someone has to make a serious and mostly-successful effort; but I’m at least a bit scared by that. There’s also quite often a geo problem as 3 or 6.
Might be good to make some side bets:
Conditioned on winning, I think there’s only maybe a 20% probability of getting all 6 problems (whereas I think you might have a higher probability on jumping right past human level, or at least have 50% on 6 vs. 5?).
Conditioned on a model getting 3+ problems, I feel like we have a pretty good guess about what algorithm will be SOTA on this problem (e.g. I’d give 50% to a pretty narrow class of algorithms with some uncertain bells and whistles, with no inside knowledge). Whereas I’d guess you have a much broader distribution.
But more useful to get other categories of bets. (Maybe in programming, investment in AI, economic impact from robotics, economic impact from chatbots, translation?)
Going through the previous ten IMOs, and imagining a very impressive automated theorem prover, I think:
2020 - unlikely, need 5⁄6 and probably can’t get problems 3 or 6. Also good chance to mess up at 4 or 5
2019 - tough but possible, 3 seems hard but even that is not unimaginable, 5 might be hard but might be straightforward, and it can afford to get one wrong
2018 - tough but possible, 3 is easier for machine than human but probably still hard, 5 may be hard, can afford to miss one
2017 - tough but possible, 3 looks out of reach, 6 looks hard but not sure about that, 5 looks maybe hard, 1 is probably easy. But it can miss 2, which could happen.
2016 - probably not possible, 3 and 6 again look hard, and good chance to fail on 2 and 5, only allowed to miss 1
2015 - seems possible, 3 might be hard but like 50-50 it’s simple for machine, 6 is probably hard, but you can miss 2
2014 - probably not possible, can only miss 1, and would probably miss one of 2 or 5 as well as 6
2013 - probably not possible, 6 seems hard, 2 seems very hard, can only miss 1
2012 - tough but possible, 6 and 3 look hard but you can miss 2
2011 - seems possible, allowed to miss two and both 3 and 6 look brute-forceable
Overall this was much easier than I expected. 4⁄10 seem unlikely, 4⁄10 seem tough but possible, 2⁄10 I can imagine a machine doing it. There are a lot of problems that look really hard, but there are a fair number of tests where you can just skip those.
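A rough way to encode the eyeball judgments above: treat each year as a set of problems that look out of reach for the machine, plus the number of misses the gold cutoff allows that year, and check whether the hard problems fit within the allowed misses. The entries below transcribe a few of the verdicts from the list above; they are illustrative, not a careful re-grading:

```python
# Illustrative encoding of the year-by-year eyeballing above: a year looks
# winnable if the problems that seem out of reach can all be skipped within
# the number of misses the gold cutoff allows that year. The entries below
# transcribe a few verdicts from the list above; not a careful re-grade.
YEARS = {
    2020: {"likely_out_of_reach": {3, 6}, "allowed_misses": 1},  # "need 5/6"
    2015: {"likely_out_of_reach": {6},    "allowed_misses": 2},  # "you can miss 2"
    2012: {"likely_out_of_reach": {3, 6}, "allowed_misses": 2},  # "you can miss 2"
    2011: {"likely_out_of_reach": set(),  "allowed_misses": 2},  # "brute-forceable"
}

def gold_looks_reachable(year: int) -> bool:
    info = YEARS[year]
    return len(info["likely_out_of_reach"]) <= info["allowed_misses"]

for year in sorted(YEARS, reverse=True):
    print(year, "reachable" if gold_looks_reachable(year) else "unlikely")
```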
That said, even to get the possible ones you do need to be surprisingly impressive, and that probability is getting cut down by something like 25-50% by the need for a solvable test. Still, they get to keep trying (assuming they get promising results in early years), and eventually they will hit one of the easier years.
It also looks fairly likely to me that if one of DeepMind or OpenAI tries seriously, they will be able to get an HM (honorable mention) with a quite reasonable chance at bronze. That is maybe enough of a PR coup to motivate work, and then it’s more likely there will be a large effort subsequently to finish the job or to opportunistically take advantage of an easy test.
Overall I’m feeling bad about my 4%; I deserve to lose some points regardless, but might think about what my real probability is after looking at tests (though I was also probably moved by other folks in EA systematically giving higher estimates than I did).
What do you think of DeepMind’s new whoop-de-doo about doing research-level math assisted by GNNs?
Not surprising in any of the ways that good IMO performance would be surprising.
Based on the other thread I now want to revise this prediction, both because 4% was too low and “IMO gold” has a lot of noise in it based on test difficulty.
I’d put 4% on “For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem” where “hardest problem” = “usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra.” (Would prefer to just pick the hardest problem after seeing the test, but it seems better to commit to a procedure.)
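The selection procedure in that clause is mechanical enough to write down. A small sketch that just encodes the stated rule (the topic labels are whatever the eventual arbiter would assign; nothing here is fixed by the bet itself):

```python
def hardest_problem(problem3_topic: str, problem6_topic: str) -> int:
    """Encode the procedure above: the 'hardest problem' is usually #6,
    but #3 instead if (i) #6 is geometry, or (ii) #3 is combinatorics
    and #6 is algebra. Topic strings are whatever the arbiter assigns."""
    if problem6_topic == "geometry":
        return 3
    if problem3_topic == "combinatorics" and problem6_topic == "algebra":
        return 3
    return 6
```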
Maybe I’ll go 8% on “gets gold” instead of “solves hardest problem.”
Would be good to get your updated view on this so that we can treat these as staked-out predictions.
(News: OpenAI has built a theorem-prover that solved many AMC12 and AIME competition problems, and 2 IMO problems, and they say they hope this leads to work that wins the IMO Grand Challenge.)
It feels like this bet would look a lot better if it were about something that you predict at well over 50% (with people in Paul’s camp still maintaining less than 50%). So, we could perhaps modify the terms such that the bot would only need to surpass a certain rank or percentile-equivalent in the competition (and not necessarily receive the equivalent of a Gold medal).
The relevant question is which rank/percentile you think is likely to be attained by 2025 under your model but you predict would be implausible under Paul’s model. This may be a daunting task, but one way to get started is to put a probability distribution over what you think the state-of-the-art will look like by 2025, and then compare to Paul’s.
Edit: Here are, for example, the individual rankings for 2021: https://www.imo-official.org/year_individual_r.aspx?year=2021
I expect it to be hella difficult to pick anything where I’m at 75% that it happens in the next 5 years and Paul is at 25%. Heck, it’s not easy to find things where I’m at over 75% that aren’t just obvious slam dunks; the Future isn’t that easy to predict. Let’s get up to a nice crawl first, and then maybe a small portfolio of crawlings, before we start trying to make single runs that pierce the sound barrier.
I frame no prediction about whether Paul is under 16%. That’s a separate matter. I think a little progress is made toward eventual epistemic virtue if you hand me a Metaculus forecast and I’m like “lol wut” and double their probability, even if it turns out that Paul agrees with me about it.
My model of Eliezer may be wrong, but I’d guess that this isn’t a domain where he has many over-50% predictions of novel events at all? See also ‘I don’t necessarily expect self-driving cars before the apocalypse’.
My Eliezer-model has a more flat prior over what might happen, which therefore includes stuff like ‘maybe we’ll make insane progress on theorem-proving (or whatever) out of the blue’. Again, I may be wrong, but my intuition is that you’re Paul-omorphizing Eliezer when you assume that >16% probability of huge progress in X by year Y implies >50% probability of smaller-but-meaningful progress in X by year Y.
(Ah, EY already replied.)
If this task is bad for operationalization reasons, there are other theorem-proving benchmarks. Unfortunately, it looks like there aren’t a lot of people currently trying to improve on the known benchmarks, as far as I’m aware.
The code generation benchmarks are slightly more active. I’m personally partial to Hendrycks et al.’s APPS benchmark, which includes problems that “range in difficulty from introductory to collegiate competition level and measure coding and problem-solving ability.” (GitHub link).
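For what “doing well” on a benchmark like APPS means in practice: solutions are typically judged by functional correctness against held-out test cases. The snippet below is a rough, unofficial sketch of that kind of check (it is not APPS’s actual evaluation harness, and it runs untrusted code without sandboxing, so it is for illustration only):

```python
import os
import subprocess
import sys
import tempfile

def passes_io_tests(solution_code, tests, timeout_s=4.0):
    """Rough sketch of functional-correctness scoring for a code-generation
    benchmark: run the candidate program on (stdin, expected stdout) pairs.
    Not the official APPS harness; no sandboxing, illustration only."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code)
        path = f.name
    try:
        for stdin_text, expected in tests:
            try:
                result = subprocess.run(
                    [sys.executable, path],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0:
                return False
            if result.stdout.strip() != expected.strip():
                return False
        return True
    finally:
        os.remove(path)

# Example usage with a toy problem ("echo the sum of two integers"):
candidate = "a, b = map(int, input().split())\nprint(a + b)\n"
print(passes_io_tests(candidate, [("1 2", "3"), ("10 -4", "6")]))  # True
```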