A retrospective on this bet:

Having thought about each of these milestones more carefully, and having already updated towards short timelines months ago, I think it was, in hindsight, really bad to make this bet, even on medium-to-long timeline views. Honestly, I’m surprised more people didn’t want to bet us, since anyone familiar with the relevant benchmarks probably could have noticed that we were making quite poor predictions.
I’ll explain what I mean by going through each of these milestones individually.
“A model/ensemble of models achieves >80% on all tasks in the MMLU benchmark”
The trend on this benchmark suggests that we will reach >90% performance within a few years. You can get 25% on this benchmark by guessing randomly (previously I thought it was 20%), so a score of 80% would not even indicate high competency at any given task.
“A credible estimate reveals that an AI lab deployed EITHER >10^30 FLOPs OR hardware that would cost $1bn if purchased through competitive cloud computing vendors at the time on a training run to develop a single ML model (excluding autonomous driving efforts)”
The trend was for compute to double every six months. Plugging in the relevant numbers reveals that we would lose this prediction easily if the trend kept up for another 3.5 years.
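As a rough sketch of that arithmetic (the ~$50M starting figure for a mid-2022 frontier training run is an illustrative assumption of mine, not a number from the bet, and I treat cost as scaling in step with compute; only the six-month doubling time comes from the trend above):

```python
# Back-of-the-envelope extrapolation of the compute/cost trend (illustrative only).
start_cost_usd = 50e6      # assumed cost of a frontier training run in mid-2022
doubling_time_years = 0.5  # compute doubles every six months (trend cited above)
horizon_years = 3.5        # roughly the time from the bet to its 2026-01-01 deadline

doublings = horizon_years / doubling_time_years   # 7 doublings
projected_cost = start_cost_usd * 2 ** doublings  # ~$6.4bn
print(f"{doublings:.0f} doublings -> ~${projected_cost / 1e9:.1f}bn per training run")
```

Under those assumptions the bet’s $1bn threshold is crossed well before the 2026-01-01 deadline, even with a starting figure several times smaller.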
“A model/ensemble of models will achieve >90% on the MATH dataset using a no-calculator rule.”
Having looked at the dataset, I estimate that about 80% of the problems in the MATH benchmark are simple plug-and-chug problems that don’t rely on sophisticated mathematical intuition. Therefore, getting above 90% mostly requires that models acquire basic competency on competition-level math.
“A model/ensemble of models achieves >80% top-1 strict accuracy on competition-level problems on the APPS benchmark”
DeepMind’s AlphaCode paper revealed that the APPS benchmark was pretty weak, since it was possible to write code that passed all the test cases without actually being correct. I missed this at the time.
“A gold medal for the IMO Grand Challenge”
As noted by Daniel Paleka, it seems that Paul Christiano may have exaggerated how difficult it is to obtain gold at the IMO (despite having attended the IMO himself in 2008).
“A robot that can, from beginning to end, reliably wash dishes, take them out of an ordinary dishwasher and stack them into a cabinet, without breaking any dishes, and at a comparable speed to humans (<120% the average time)”
Unlike the other predictions, I still suspect we were fundamentally right about this one.
“Tesla’s full-self-driving capability makes fewer than one major mistake per 100,000 miles”
For this one, I’m not sure what I was thinking, since self-driving technology is already quite good. I think the tech has problems with reliability, but 100,000 miles is not actually a very long distance: most human drivers cover well over 100,000 miles in their lifetime. I remember aiming for a bar that was something like “meets human-level performance”, but I don’t think I realized what human-level driving looked like in concrete terms.
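To put the 100,000-mile figure in perspective, here is a rough sketch; the annual mileage and driving span are assumptions for illustration (roughly in line with commonly cited US figures), not numbers from the bet:

```python
# Rough illustration of why 100,000 miles is a short horizon for a driver.
miles_per_year = 13_000   # assumed annual mileage (roughly a typical US figure)
driving_years = 50        # assumed length of a driving lifetime

lifetime_miles = miles_per_year * driving_years   # 650,000 miles
mistakes_allowed = lifetime_miles / 100_000       # ~6.5 at the bet's threshold
print(f"{lifetime_miles:,} lifetime miles -> at most ~{mistakes_allowed:.0f} "
      "major mistakes at the 'fewer than one per 100,000 miles' bar")
```

In other words, the threshold allows only a handful of major mistakes across an entire driving lifetime.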
I’m quite shocked that I failed to notice the poor reasoning behind almost every single one of the predictions we made at the time. I guess it’s good that I learned from this experience, though.
Thanks for posting this retrospective.

Considering your terms were so in favour of the bet takers, I was also surprised last summer when so few people actually committed, especially given that there were dozens, if not hundreds, of LW members with short timelines who saw your original post.
Perhaps that says something about actual beliefs versus professed beliefs?
Well, to be fair, I don’t think many people realized how weak some of these benchmarks were. It is hard to tell without digging into the details, which, regrettably, I did not do either.
You said that you updated and shortened your median timeline to 2047 and your mode to 2035. But it seems to me that you need to shorten your timelines again.

The post “It’s time for EA leadership to pull the short-timelines fire alarm” says:
“it seems very possible (>30%) that we are now in the crunch-time section of a short-timelines world, and that we have 3-7 years until Moore’s law and organizational prioritization put these systems at extremely dangerous levels of capability.”
It seems that the purpose of the bet was to test this hypothesis:
“we are offering to bet up to $1000 against the idea that we are in the “crunch-time section of a short-timelines”
My understanding is that if AI progress occurred slowly and no more than one of the listed advancements was made by 2026-01-01, then this short-timelines hypothesis would be proven false and could then be ignored.
However, the bet was conceded on 2023-03-16, much earlier than the deadline, so the bet failed to prove the hypothesis false.
It seems to me that the rational action now is to update toward believing that this short-timelines hypothesis is true. 3-7 years from 2022 is 2025-2029, which is substantially earlier than 2047.
I don’t really agree, although it might come down to what you mean. When some people talk about their AGI timelines they often mean something much weaker than what I’m imagining, which can lead to significant confusion.
If your bar for AGI was “score very highly on college exams” then my median “AGI timelines” dropped from something like 2030 to 2025 over the last 2 years. Whereas if your bar was more like “radically transform the human condition”, I went from ~2070 to 2047.
I just see a lot of ways that we could have very impressive software programs and yet still take a long time to fundamentally transform the human condition, for example because of regulation, or because we experience setbacks due to war. My fundamental model hasn’t changed here, although I became substantially more impressed with current tech than I used to be.
(Actually, I think there’s a good chance that there will be no major delays at all and the human condition will be radically transformed some time in the 2030s. But because of the long list of possible delays, my overall distribution is skewed right. This means that even though my median is 2047, my mode is like 2034.)
I don’t agree with the first point:

“a score of 80% would not even indicate high competency at any given task”
Although the MMLU task is fairly straightforward given that there are only 4 options to choose from (25% accuracy for random choices) and experts typically score about 90%, getting 80% accuracy still seems quite difficult for a human given that average human raters only score about 35%. Also, GPT-3 only scores about 45% (GPT-3 fine-tuned still only scores 54%), and GPT-2 scores just 32% even when fine-tuned.
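For a quick side-by-side of the figures quoted above (as stated in this thread, not re-checked against the paper):

```python
# MMLU accuracy figures as quoted in the discussion above (approximate).
scores = {
    "random guessing (4 options)": 25,
    "GPT-2 (fine-tuned)": 32,
    "average human raters": 35,
    "GPT-3": 45,
    "GPT-3 (fine-tuned)": 54,
    "bet threshold": 80,
    "experts (approx.)": 90,
}
for name, acc in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:<30}{acc}%")
```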
One of my recent posts has a nice chart showing different levels of MMLU performance.
Extract from the abstract of the MMLU paper (2021):

“To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average.”