You might want to rethink the definition of “hardest problem” because I wouldn’t expect the IMO committee to care particularly much that their problems are hard for machines as well as humans.
For example, look at the 2005 IMO, but suppose that problem 6 had been geometry, so that problem 3 counted as the “hardest”; then you’re in trouble. Problem 3 has a one-line proof “by sum_of_squares”, assuming that someone has written a sum-of-squares tactic in time and that the sum-of-squares certificate is small enough to find within the time limit.
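To make that concrete, here is a rough sketch of what such a one-liner could look like in Lean 4 with mathlib (the statement is IMO 2005 Problem 3; the `sum_of_squares` tactic is the hypothetical ingredient and doesn’t exist in mathlib today, so this is illustrative rather than something that compiles):

```lean
import Mathlib

-- IMO 2005 Problem 3: for positive reals x, y, z with xyz ≥ 1,
-- the following cyclic sum is nonnegative.
theorem imo2005_p3 (x y z : ℝ) (hx : 0 < x) (hy : 0 < y) (hz : 0 < z)
    (hxyz : 1 ≤ x * y * z) :
    0 ≤ (x ^ 5 - x ^ 2) / (x ^ 5 + y ^ 2 + z ^ 2)
      + (y ^ 5 - y ^ 2) / (y ^ 5 + z ^ 2 + x ^ 2)
      + (z ^ 5 - z ^ 2) / (z ^ 5 + x ^ 2 + y ^ 2) := by
  sum_of_squares
  -- `sum_of_squares` is the hypothetical tactic posited above: it would search
  -- for an SOS certificate. The closest existing mathlib tactics (`nlinarith`,
  -- `polyrith`, `positivity`) are not known to close this goal as stated.
```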
Alternatively, the IMO committee might set a problem without realizing that it’s close to a special case of something already in mathlib.
As defined, I would put higher probability on solving the “hardest problem” than getting a gold medal and would bet on that.
Yeah, this may well be right. I agree that IMO 2005 problem 3 can be solved by machine. I think both the geometry problems and the 3-variable inequalities could be a very big problem (in that lots of instances are easy for machines and hard for humans). Some diophantine equations and functional equations are also very easy for machines. By 2022, my vague understanding is that the committee is much less likely to put that kind of question in the hardest slot, since increasingly many contestants can basically do the machine-like proofs. But maybe I shouldn’t count on that, since the big problem is that we are letting it win at any of 4 IMOs (2022, 2023, 2024, 2025).
I’d guess we have like a 2-3% chance of a 3-variable inequality in the hardest spot in a given year (it hasn’t happened in the last 20 years and is getting less likely, but I’m not that confident). By now I think it would be considered unreasonable to put a Muirhead-bashable problem as 3/6 (though even the other 3-variable inequalities will likely be much easier for machines than humans); even in 2005 I think that was probably considered a mistake. But still, lots of these may be easy for machines, so over the 4 IMOs you may easily be looking at something like a 10% chance that one of the “hardest problems” is a 3-variable inequality and a 50% chance of solving it conditioned on a serious effort, which would already be >4%.
Other candidate definitions would have been:
Instead of just using the type of problem, try to explicitly flag inequalities in a fixed number of variables, diophantine equations, and functional equations as special cases to be ranked lower.
Have a stronger preference for combo, e.g. pick from amongst 3/6 (or even 2/3/5/6) with a preference for combo, then NT, then algebra, then geo. (But I’m scared of e.g. a combo 2/5 that can be solved with an easy monovariant or that kind of thing; I strongly suspect many of those are also automatically solvable without deep learning.)
Actually ask someone who doesn’t yet know the results of running a machine which problem seems hardest for machine.
I also agree that “special case of existing result” is plausible, but I think that’s going to be <1% (and it’s fairly likely to be missed by machines even so). Though that could get worse if many more formal proofs are added in the interim as part of increased investment in theorem proving, and that’s very correlated with a serious effort that could effectively exploit such data.
If Eliezer wants to actually bet about the hardest-problem claim, I may want to argue for some kind of reformulation (even though I said I didn’t want to change my predictions). If Eliezer doesn’t want to bet about it, I think it’s OK to ding me epistemically if the AI ends up getting a “hardest problem” that is a 3-variable inequality or a bashable functional or diophantine equation, though I think that’s a bit of a technicality / a general error in my forecasting rather than a claim about AI.
Isn’t it just a lot more interesting to bet on gold than on hardest problem?
The former seems like it averages out over lots of random bits of potential bad luck. And if the bot can do the hardest problem, wouldn’t it likely be able to do the other problems as well?
Problems 1 and 4 are often very easy and nearly mechanical for humans. And it’s not uncommon for 2 of the other problems to happen to be easy as well (2/5 are much easier than 3/6 and it’s considered OK for them to be kind of brute-forceable, and I expect that you could solve most IMO geometry problems without any kind of AI if you tried hard enough, even the ones that take insight for a human).
Sometimes you can get a gold by solving only 4/6 of the problems, and that’s not very well-correlated with whether the problems happen to be easy for machines. And even when you need 5/6, it looks like sometimes you could get a very lucky break.
So if you get 4 swings, it looks fairly likely that one of them will be quite easy, and that dominates P(gold). E.g., I wouldn’t be too surprised by someone getting a bronze this year, taking a swing next year and getting a silver or gold depending on how lucky they get, and then, if they get a silver, trying again the next year just to be able to say they did it.
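Just to illustrate how much the 4 swings matter (with a made-up per-year number purely for illustration, not one of my actual estimates): if each IMO independently had chance p of being an easy-enough year for a serious effort to get gold, then

$$P(\text{gold in some year}) = 1 - (1 - p)^4, \qquad \text{e.g. } p = 0.2 \;\Rightarrow\; 1 - 0.8^4 \approx 0.59.$$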
What I was expressing a more confident prediction about was an idea like “The hardest IMO problems, and especially the ad hoc problems, seem very hard for machines.” Amongst humans, solving those problems is pretty well-correlated with getting golds, but unfortunately for machines it looks quite noisy, and so my probability for gold had to go way up.
Unfortunately “hardest problem” is also noisy in its own way (in large part because a mechanical definition of “hardest” isn’t good) so this isn’t a great fix.
Hmm, in that case, would “all the problems” be better than either “hardest problem” or “gold”?