I predict that people will explicitly collect much larger datasets of human behavior as the economic stakes rise. This is in contrast to e.g. theorem-proving working well, although I think that theorem-proving may end up being an important bellwether because it allows you to assess the capabilities of large models without multi-billion-dollar investments in training infrastructure.
Well, it sounds like I might be more bullish than you on theorem-proving, possibly: not on it being useful or profitable, but in terms of the underlying technology making progress on non-profitable, amazing demo feats. Is there anything you think it shouldn't be able to do in the next 5 years?
I’m going to make predictions by drawing straight-ish lines through metrics like the ones in the GPT-f paper. Big unknowns are then (i) how many orders of magnitude of “low-hanging fruit” there are before theorem-proving even catches up to the rest of NLP, and (ii) how hard their benchmarks are compared to other tasks we care about. On (i) my guess is maybe 2? On (ii) my guess is “they are pretty easy” / “humans are pretty bad at these tasks,” but it’s somewhat harder to quantify. If you think your methodology is different from that, then we will probably end up disagreeing.
Looking towards more ambitious benchmarks, I think that the IMO grand challenge is currently significantly more than 5 years away. In 5 years’ time my median guess (with almost no thinking about it) is that automated solvers can do 10% of non-geometry, non-3-variable-inequality IMO shortlist problems.
So yeah, I’m happy to play ball in this area, and I expect my predictions to be somewhat more right than yours after the dust settles. Is there some way of measuring such that you are willing to state any prediction?
(I still feel like I’m basically looking for any predictions at all beyond sometimes saying “my model wouldn’t be surprised by <vague thing X>”, whereas I’m pretty constantly throwing out made-up guesses which I’m happy to refine with more effort. Obviously I’m going to look worse in retrospect than you if we keep up this way, though; that particular asymmetry is a lot of the reason people mostly don’t play ball. ETA: that’s a bit unfair; the romantic chatbot vs self-driving car prediction is one where we’ve both given off-the-cuff takes.)
I have a sense that there’s a lot of latent potential for theorem-proving to advance if more energy gets thrown at it, in part because current algorithms seem a bit weird to me—that we are waiting on the equivalent of neural MCTS as an enabler for AlphaGo, not just a bigger investment, though of course the key trick could already have been published in any of a thousand papers I haven’t read. I feel like I “would not be surprised at all” if we get a bunch of shocking headlines in 2023 about theorem-proving problems falling, after which the IMO challenge falls in 2024 - though of course, as events like this lie in the Future, they are very hard to predict.
Can you say more about why or whether you would, in this case, say that this was an un-Paulian set of events? As I have trouble manipulating my Paul model, it does not exclude Paul saying, “Ah, yes, well, they were using 700M models in that paper, so if you jump to 70B, of course the IMO grand challenge could fall; there wasn’t a lot of money there.” Though I haven’t even glanced at any metrics here, let alone metrics that the IMO grand challenge could be plotted on, so if smooth metrics rule out IMO in 5yrs, I am more interested yet—it legit decrements my belief, but not nearly as much as I imagine it would decrement yours.
(Edit: Also, on the meta-level, is this, like, anywhere at all near the sort of thing you were hoping to hear from me? Am I now being a better epistemic citizen, if maybe not a good one by your lights?)
Yes, the IMO challenge falling in 2024 is surprising to me at something like the 1% level or maybe even more extreme (though that could also go down if I thought about it a lot or if commenters brought up relevant considerations; e.g., I’d look at IMO problems and gold medal cutoffs and think about what tasks ought to be easy or hard; I’m also happy to make more concrete per-question predictions). I do think that there could be huge amounts of progress from picking the low-hanging fruit and scaling up spending by a few orders of magnitude, but I still don’t expect it to get you that far.
I don’t think this is an easy prediction to extract from a trendline, in significant part because you can’t extrapolate trendlines this early that far out. So this is stress-testing different parts of my model, which is fine by me.
At the meta-level, this is the kind of thing I’m looking for, though I’d prefer to have some kind of quantitative measure of how not-surprised you are. If you are only saying 2%, then we probably want to talk about things less far in your tails than the IMO challenge.
Okay, then we’ve got at least one Eliezerverse item, because I’ve said below that I think I’m at least 16% for IMO theorem-proving by end of 2025. The drastic difference here causes me to feel nervous, and my second-order estimate has probably shifted some in your direction just from hearing you put 1% on 2024, but that’s irrelevant because it’s first-order estimates we should be comparing here.
So we’ve got huge GDP increases for before-End-days signs of Paulverse and quick IMO proving for before-End-days signs of Eliezerverse? Pretty bare portfolio but it’s at least a start in both directions. If we say 5% instead of 1%, how much further would you extend the time limit out beyond 2024?
I also don’t know at all what part of your model forbids theorem-proving from falling in a shocking headline followed by another headline a year later—it doesn’t sound like it’s from looking at a graph—and I think that explaining the reasons behind our predictions in advance, not just making quantitative predictions in advance, will help others a lot here.
EDIT: Though the formal IMO challenge has a barnacle about the AI being open-sourced, which is a separate sociological prediction I’m not taking on.
I think an IMO gold medal could come well before massive economic impact; I’m just surprised if it happens in the next 3 years. After a bit more thinking (but not actually looking at IMO problems or the state of theorem-proving), I probably want to bump that up a bit, maybe to 2%; it’s hard reasoning about the tails.
I’d say <4% on end of 2025.
I think this is the flipside of me having an intuition where I say things like “AlphaGo and GPT-3 aren’t that surprising”—I have a sense for what things are and aren’t surprising, and not many things happen that are so surprising.
If I’m at 4% and you are 12% and we had 8 such bets, then I can get a factor of 2 if they all come out my way, and you get a factor of ~1.5 if one of them comes out your way.
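(A rough check of that arithmetic, on my reading of how the bets would be scored as relative likelihood ratios; the 4%, 12%, and 8 bets are from the exchange above, the scoring convention is my assumption, and the exact factors depend on how the bets are settled.)

```python
# Rough check of the likelihood-ratio arithmetic above (my reading of the
# scoring; exact factors depend on how the bets are settled).
p_low, p_high = 0.04, 0.12  # the two per-event probabilities quoted above
n = 8                       # number of such bets

# If none of the 8 events happen, the 4% forecaster's relative evidence factor:
factor_low = ((1 - p_low) / (1 - p_high)) ** n          # ~2.0

# If exactly one event happens and the other 7 don't, the 12% forecaster's factor:
factor_high = (p_high / p_low) * ((1 - p_high) / (1 - p_low)) ** (n - 1)  # ~1.6

print(round(factor_low, 2), round(factor_high, 2))
```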
I might think more about this and get a more coherent probability distribution, but unless I say something else by end of 2021, you can consider 4% on end of 2025 as my prediction.
Maybe another way of phrasing this—how much warning do you expect to get, how far out does your Nope Vision extend? Do you expect to be able to say “We’re now in the ‘for all I know the IMO challenge could be won in 4 years’ regime” more than 4 years before it happens, in general? Would it be fair to ask you again at the end of 2022 and every year thereafter if we’ve entered the ‘for all I know, within 4 years’ regime?
Added: This question fits into a larger concern I have about AI soberskeptics in general (not you, the soberskeptics would not consider you one of their own) where they saunter around saying “X will not occur in the next 5 / 10 / 20 years” and they’re often right for the next couple of years, because there’s only one year where X shows up for any particular definition of that, and most years are not that year; but also they’re saying exactly the same thing up until 2 years before X shows up, if there’s any early warning on X at all.

It seems to me that 2 years is about as far as Nope Vision extends in real life, for any case that isn’t completely slam-dunk; when I called upon those gathered AI luminaries to say the least impressive thing that definitely couldn’t be done in 2 years, and they all fell silent, and then a single one of them named Winograd schemas, they were right that Winograd schemas at the stated level didn’t fall within 2 years, but very barely so (they fell the year after).

So part of what I’m flailingly asking here, is whether you think you have reliable and sensitive Nope Vision that extends out beyond 2 years, in general, such that you can go on saying “Not for 4 years” up until we are actually within 6 years of the thing, and then, you think, your Nope Vision will actually flash an alert and you will change your tune, before you are actually within 4 years of the thing. Or maybe you think you’ve got Nope Vision extending out 6 years? 10 years? Or maybe theorem-proving is just a special case and usually your Nope Vision would be limited to 2 years or 3 years?
This is all an extremely Yudkowskian frame on things, of course, so feel free to reframe.
I think I’ll get less confident as our accomplishments get closer to the IMO grand challenge. Or maybe I’ll get much more confident if we scale up from $1M → $1B and pick the low-hanging fruit without getting fairly close, since at that point further progress gets a lot easier to predict.
There’s not really a constant time horizon for my pessimism; it depends on how long and robust a trend you are extrapolating from. 4 years feels like a relatively short horizon because theorem-proving has not had much investment, so compute can be scaled up several orders of magnitude, there is likely lots of low-hanging fruit to pick, and we just don’t have much to extrapolate from (compared to more mature technologies, or to how I expect AI will be shortly before the end of days); for similar reasons there aren’t really any benchmarks to extrapolate.
(Also note that it matters a lot whether you know what problems labs will try to take a stab at. For the purpose of all of these forecasts, I am trying, insofar as possible, to set aside all knowledge about what labs are planning to do, though that’s obviously not incentive-compatible and there’s no particular reason you should trust me to do that.)
Possibly helpful: Metaculus currently puts the chances of the IMO grand challenge falling by 2025 at about 8%. Their median is 2039.
I think this would make a great bet, as it would definitely show that your model can strongly outperform a lot of people (and potentially Paul too). And the operationalization for the bet is already there—so little work will be needed to do that part.
Ha! Okay then. My probability is at least 16%, though I’d have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more. Paul?
EDIT: I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists. I’ll stand by a >16% probability of the technical capability existing by end of 2025, as reported on eg solving a non-trained/heldout dataset of past IMO problems, conditional on such a dataset being available; I frame no separate sociological prediction about whether somebody is willing to open-source the AI model that does it.
I don’t care about whether the AI is open-sourced (I don’t expect anyone to publish the weights even if they describe their method) and I’m not that worried about our ability to arbitrate overfitting.
Ajeya suggested that I clarify: I’m significantly more impressed by an AI getting a gold medal than getting a bronze, and my 4% probability is for getting a gold in particular (as described in the IMO grand challenge). There are some categories of problems that can be solved using easy automation (I’d guess about 5-10% could be done with no deep learning and modest effort). Together with modest progress in deep learning based methods, and a somewhat serious effort, I wouldn’t be surprised by people getting up to 20-40% of problems. The bronze cutoff is usually 3⁄6 problems, and the gold cutoff is usually 5⁄6 (assuming the AI doesn’t get partial credit). The difficulty of problems also increases very rapidly for humans—there are often 3 problems that a human can do more-or-less mechanically.
I could tighten any of these estimates by looking at the distribution more carefully rather than going off of my recollections from 2008, and if this was going to be one of a handful of things we’d bet about I’d probably spend a few hours doing that and some other basic digging.
I looked at a few recent IMOs to get better calibrated. I think the main update is that I significantly underestimated how many years you can get a gold with only 4⁄6 problems.
For example, I don’t have the same “this is impossible” reaction about IMO 2012 or IMO 2015 as about most years. That said, I feel like they do have to get reasonably lucky with the IMO content, and someone has to make a serious and mostly-successful effort, but I’m at least a bit scared by that. There’s also quite often a geo problem as 3 or 6.
Might be good to make some side bets:
Conditioned on winning I think it’s only maybe 20% probability to get all 6 problems (whereas I think you might have a higher probability on jumping right past human level, or at least have 50% on 6 vs 5?).
Conditioned on a model getting 3+ problems I feel like we have a pretty good guess about what algorithm will be SOTA on this problem (e.g. I’d give 50% to a pretty narrow class of algorithms with some uncertain bells and whistles, with no inside knowledge). Whereas I’d guess you have a much broader distribution.
But more useful to get other categories of bets. (Maybe in programming, investment in AI, economic impact from robotics, economic impact from chatbots, translation?)
Going through the previous ten IMOs, and imagining a very impressive automated theorem prover, I think:
2020 - unlikely, need 5⁄6 and probably can’t get problems 3 or 6. Also good chance to mess up at 4 or 5
2019 - tough but possible, 3 seems hard but even that is not unimaginable, 5 might be hard but might be straightforward, and it can afford to get one wrong
2018 - tough but possible, 3 is easier for machine than human but probably still hard, 5 may be hard, can afford to miss one
2017 - tough but possible, 3 looks out of reach, 6 looks hard but not sure about that, 5 looks maybe hard, 1 is probably easy. But it can miss 2, which could happen.
2016 - probably not possible, 3 and 6 again look hard, and good chance to fail on 2 and 5, only allowed to miss 1
2015 - seems possible, 3 might be hard but like 50-50 it’s simple for machine, 6 is probably hard, but you can miss 2
2014 - probably not possible, can only miss 1, probably miss one of 2 or 5 and 6
2013 - probably not possible, 6 seems hard, 2 seems very hard, can only miss 1
2012 - tough but possible, 6 and 3 look hard but you can miss 2
2011 - seems possible, allowed to miss two and both 3 and 6 look brute-forceable
Overall this was much easier than I expected. 4⁄10 seem unlikely, 4⁄10 seem tough but possible, 2⁄10 I can imagine a machine doing it. There are a lot of problems that look really hard, but there are a fair number of tests where you can just skip those.
That said, even to get the possible ones you do need to be surprisingly impressive, and that’s getting cut down by like 25-50% for a solvable test. Still, they get to keep trying (assuming they get promising results in early years) and eventually they will hit one of the easier years.
It also looks fairly likely to me that if one of DeepMind or OpenAI tries seriously they will be able to get an HM (honorable mention) with a quite reasonable chance at bronze, and this is maybe enough of a PR coup to motivate work; and then it’s more likely there will be a large effort subsequently to finish the job or to opportunistically take advantage of an easy test.
Overall I’m feeling bad about my 4%, I deserve to lose some points regardless but might think about what my real probability is after looking at tests (though I was also probably moved by other folks in EA systematically giving higher estimates than I did).
What do you think of DeepMind’s new whoop-de-doo about doing research-level math assisted by GNNs?
Not surprising in any of the ways that good IMO performance would be surprising.
Based on the other thread I now want to revise this prediction, both because 4% was too low and “IMO gold” has a lot of noise in it based on test difficulty.
I’d put 4% on “For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem” where “hardest problem” = “usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra.” (Would prefer to just pick the hardest problem after seeing the test, but it seems better to commit to a procedure.)
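(The “hardest problem” rule above is mechanical enough to write down directly. A minimal sketch, assuming the topic labels are whatever the bet’s arbiter assigns; the function name and signature are just for illustration.)

```python
def hardest_problem(p3_topic: str, p6_topic: str) -> int:
    """Which problem number counts as the 'hardest problem' under the rule above.

    Topic labels (e.g. 'geometry', 'combinatorics', 'algebra', 'number theory')
    are assumed to be assigned by whoever arbitrates the bet.
    """
    if p6_topic == "geometry":
        return 3  # case (i): problem 6 is geo
    if p3_topic == "combinatorics" and p6_topic == "algebra":
        return 3  # case (ii): P3 is combinatorics and P6 is algebra
    return 6      # default: problem 6
```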
Maybe I’ll go 8% on “gets gold” instead of “solves hardest problem.”
Would be good to get your updated view on this so that we can treat these as staked-out predictions.
(News: OpenAI has built a theorem-prover that solved many AMC12 and AIME competition problems, and 2 IMO problems, and they say they hope this leads to work that wins the IMO Grand Challenge.)
It feels like this bet would look a lot better if it were about something that you predict at well over 50% (with people in Paul’s camp still maintaining less than 50%). So, we could perhaps modify the terms such that the bot would only need to surpass a certain rank or percentile-equivalent in the competition (and not necessarily receive the equivalent of a Gold medal).
The relevant question is which rank/percentile you think is likely to be attained by 2025 under your model but you predict would be implausible under Paul’s model. This may be a daunting task, but one way to get started is to put a probability distribution over what you think the state-of-the-art will look like by 2025, and then compare to Paul’s.
Edit: Here are, for example, the individual rankings for 2021: https://www.imo-official.org/year_individual_r.aspx?year=2021
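(One way to run the comparison suggested above, sketched with made-up numbers purely for illustration; the only figures taken from this thread are the 16% and 4% at the top end, and the pairing of those with a particular percentile cutoff is my assumption.)

```python
# Illustrative only: hypothetical P(bot reaches >= this percentile among IMO
# contestants by 2025) for two forecasters, used to find the cutoff where
# they disagree most -- the natural thing to bet on.
forecasts = [
    # (percentile cutoff, forecaster A, forecaster B) -- all but the last pair made up
    (10, 0.60, 0.40),
    (25, 0.45, 0.25),
    (50, 0.30, 0.12),
    (75, 0.20, 0.06),
    (90, 0.16, 0.04),  # roughly the gold-medal end of the scale
]

cutoff, p_a, p_b = max(forecasts, key=lambda row: row[1] - row[2])
print(cutoff, p_a, p_b)  # the cutoff with the largest disagreement
```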
I expect it to be hella difficult to pick anything where I’m at 75% that it happens in the next 5 years and Paul is at 25%. Heck, it’s not easy to find things where I’m at over 75% that aren’t just obvious slam dunks; the Future isn’t that easy to predict. Let’s get up to a nice crawl first, and then maybe a small portfolio of crawlings, before we start trying to make single runs that pierce the sound barrier.
I frame no prediction about whether Paul is under 16%. That’s a separate matter. I think a little progress is made toward eventual epistemic virtue if you hand me a Metaculus forecast and I’m like “lol wut” and double their probability, even if it turns out that Paul agrees with me about it.
My model of Eliezer may be wrong, but I’d guess that this isn’t a domain where he has many over-50% predictions of novel events at all? See also ‘I don’t necessarily expect self-driving cars before the apocalypse’.
My Eliezer-model has a more flat prior over what might happen, which therefore includes stuff like ‘maybe we’ll make insane progress on theorem-proving (or whatever) out of the blue’. Again, I may be wrong, but my intuition is that you’re Paul-omorphizing Eliezer when you assume that >16% probability of huge progress in X by year Y implies >50% probability of smaller-but-meaningful progress in X by year Y.
(Ah, EY already replied.)
If this task is bad for operationalization reasons, there are other theorem-proving benchmarks. Unfortunately, it looks like there aren’t a lot of people who are currently trying to improve on the known benchmarks, as far as I’m aware.
The code generation benchmarks are slightly more active. I’m personally partial to Hendrycks et al.’s APPS benchmark, which includes problems that “range in difficulty from introductory to collegiate competition level and measure coding and problem-solving ability.” (GitHub link).
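(For concreteness: APPS-style benchmarks score a model by running its generated program against held-out input/output test cases. Below is a minimal sketch of that kind of check, not the official APPS harness; the helper name and the toy problem are made up.)

```python
import subprocess

def passes_tests(program_src, tests, timeout=4.0):
    """Run a candidate program against (stdin, expected_stdout) pairs.

    A stand-in for how APPS-style benchmarks score solutions; the real
    harness also handles sandboxing, resource limits, partial credit, etc.
    """
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                ["python3", "-c", program_src],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

# Made-up example: a toy "add two numbers" problem.
candidate = "a, b = map(int, input().split())\nprint(a + b)"
print(passes_tests(candidate, [("1 2", "3"), ("10 -4", "6")]))  # True
```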
I think Metaculus is closer to Eliezer here: conditioned on this problem being resolved it seems unlikely for the AI to be either open-sourced or easily reproducible.
My honest guess is that most predictors didn’t see that condition and the distribution would shift right if someone pointed that out in the comments.