My first ‘dunk’ on April 18, about a 5-year shortening of Metaculus timelines in response to evidence that didn’t move me at all, asking about a Metaculus forecast of the Metaculus forecast 3 years later, implicitly predicts that Metaculus will update again within 3 years.
My second ‘dunk’ on May 12 is about Metaculus updating that much again in that direction, one month later.
I do admit, it’s not a good look that I once again understate my position by so much compared to what the reality turns out to be, especially after having made that mistake a few times before.
I do however claim it as a successful advance prediction, if something of a meta one, and cast a stern glance in your direction for failing to note this over the course of your attempting to paint me in a negative light by using terms like ‘dunk’.
asking about a Metaculus forecast of the Metaculus forecast 3 years later, implicitly predicts that Metaculus will update again within 3 years. [emphasis mine]
I feel like this is missing the key claim underlying this post: that verbal statements making implicit predictions are too hard to judge and too easy to hindsight bias about, and so aren’t strong evidence about a person’s foresight.
For instance, if Metaculus did not, in fact, update again over the following 3 years, and you were merely optimizing for the appearance of accuracy, you could claim that you weren’t making a prediction, merely voicing a question. And more likely, you and everyone else would simply have forgotten about this tweet.
I don’t particularly want to take a stance on whether verbal forecasts like that one ought to be treated as part of one’s forecasting record. But insofar as the author of this post clearly doesn’t think they should be, this comment is not addressing his objection.
These sorts of observations sound promising for someone’s potential as a forecaster. But by themselves, they are massively easier to cherry-pick, fudge, omit, or redefine than proper forecasts are.
When you see other people make non-specific “predictions”, how do you score them? How do you know the scoring that you’re doing is coherent, and isn’t rationalizing? How do you avoid the various pitfalls that Tetlock wrote about? How do you *ducks stern glance* score yourself on any of that, in a way that you’ll know isn’t rationalizing?
For emphasis, in this comment you reinforce that you consider it a successful advance prediction. This gives very little information about your forecasting accuracy. We don’t even know what your actual distribution is, and it’s a long time before this resolves; all we know is that it moved in your direction. I claim that to critique other people’s properly scored forecasts, you should be transparent and give your own.
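For what it’s worth, here is a minimal sketch in Python (the probabilities below are hypothetical, not a reconstruction of anyone’s actual forecast) of why “it went in your direction” carries so little information on its own: under a proper scoring rule such as the log score, the credit a correct call earns depends entirely on the probability that was actually stated.

```python
import math

def log_score(p: float, outcome: bool) -> float:
    """Log score for a stated probability p on a binary question.
    Higher (closer to 0) is better."""
    return math.log(p if outcome else 1.0 - p)

# The same "correct direction", three very different hypothetical forecasts:
for stated_p in (0.55, 0.75, 0.95):
    print(stated_p, round(log_score(stated_p, outcome=True), 3))
# Without knowing which probability (or full distribution) was actually held,
# there is nothing to plug into the scoring rule at all.
```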
EDIT: Pasted from another comment I wrote:
Instead of that actual [future resolution] reality, and because of how abruptly the community ended up shifting, Eliezer seems to be interpreting the shift to mean that his position about that reality is not extreme enough. Those two things are somewhat related, but only weakly, so it seems like rationalizing for him to frame it as showing his forecast isn’t extreme enough.
My first ‘dunk’ on April 18, about a 5-year shortening of Metaculus timelines in response to evidence that didn’t move me at all, asking about a Metaculus forecast of the Metaculus forecast 3 years later, implicitly predicts that Metaculus will update again within 3 years.
I do however claim it as a successful advance prediction, if something of a meta one
Wait, unless I misunderstand you, there’s a reasoning mistake here. You request epistemic credit for predicting implicitly that the Metaculus median was going to drop by five years at some point in the next three years. But that’s a prediction the majority of Metaculites would also have made, and over an interval as long as three years it was a given that it would happen. It’s a correct advance prediction, if you did make it (let’s assume so and not get into inferring implicit past predictions with retrospective text analysis), but it’s not even slightly impressive.
As an example to explain why, I predict (with 80% probability) that there will be a five-year shortening in the median on the general AI question at some point in the next three years. And I also predict (with 85% probability) that there will be a five-year lengthening at some point in the next three years.
I’m predicting both that Metaculus timelines will shorten and that they will lengthen! What gives? Well, I’m predicting volatility… Should I be given much epistemic credit if I later turned out to be right on both predictions? No: it’s very predictable, and you don’t need to be a good forecaster to anticipate it. If you think you should get some credit for your prediction, then I should get much more from these two predictions. But neither I nor you should get much.
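To make the volatility point concrete, here is a toy Monte Carlo sketch (the random-walk model and the monthly volatility number are my own invented assumptions, not anything fitted to Metaculus data) that estimates how often a drifting “median timeline” produces a five-year shortening, a five-year lengthening, and both, within three years. Under assumptions like these, each one-sided move alone already comes out as more likely than not.

```python
import random

def move_probabilities(monthly_sd_years=2.0, horizon_months=36,
                       move_years=5.0, trials=20_000, seed=0):
    """Monte Carlo estimate, under a toy random-walk model of the community
    median, of the chance of (a) a shortening of at least `move_years`,
    (b) a lengthening of at least `move_years`, and (c) both, at some point
    within the horizon. All parameters are invented for illustration."""
    rng = random.Random(seed)
    shorten = lengthen = both = 0
    for _ in range(trials):
        x, hit_down, hit_up = 0.0, False, False
        for _ in range(horizon_months):
            x += rng.gauss(0.0, monthly_sd_years)
            hit_down = hit_down or x <= -move_years
            hit_up = hit_up or x >= move_years
        shorten += hit_down
        lengthen += hit_up
        both += hit_down and hit_up
    return {"shortening": shorten / trials,
            "lengthening": lengthen / trials,
            "both": both / trials}

print(move_probabilities())
```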
Are there inconsistencies in the AGI questions on Metaculus? Within a question’s forecast timeline, with other questions, with the resolution criteria? Yes, there are plenty! Metaculus is full of glaring inconsistencies. The median on one question will contradict the median on another. An AI question with a stronger operationalization will have a lower median than a question with a weaker operationalization. The current community forecast assigns a four percent chance to AGI having already been developed. The resolution criteria on a question will say it can’t resolve at the upper bound, and the community distribution will put 14% on it resolving at the upper bound anyway.
It’s commendable to notice these inconsistencies, and it’s right to downgrade your opinion of Metaculus because of them. But from the fact that you can frequently observe such glaring inconsistencies, and predict in advance that specific ones will happen, including changes over time in the median that are predictable even in expected value after accounting for skew, it’s wrong to conclude (even with weak confidence) that you are a better forecaster than most of the Metaculites forecasting on these questions, even just on AGI questions (and the implicit claim of being “a slightly better Bayesian” actually seems far stronger and more general than that).
Why? Because Metaculites know there are glaring inconsistencies everywhere: they identify them often, they know there are more, and they could find and fix most of them easily. It’s not that you’re a better forecaster, just that you have unreasonable expectations of a community of forecasters who are almost all effectively unpaid volunteers.
It’s not surprising that the Metaculus median will change over time in specific and predictable ways that are inconsistent with good Bayesianism. That doesn’t mean the forecasters are that bad (let us see you do better, after all); it’s because people’s energy and interest are scarce. The questions in tournaments with money prizes get more engagement, as do questions about things currently in the news. Even those questions still have glaring inconsistencies, because the engagement is still not enough to fix them all. (Also because the site’s tools for building and checking your distributions are time-consuming to use.)
There are only 601 forecasters who have more than 1000 points on Metaculus: that means only 601 forecasters who have done even a pretty basic amount of forecasting. One of the two forecasters with exactly 1000 points has made predictions on only six questions, for example. You can do that in less than one hour, so it’s really not a lot.
If 601 sounds like a lot, consider that there are thousands of questions on the site, each with a wall of text describing the background and the resolution criteria. Predictions need to be updated constantly! The most active predictors on the site burn out because it takes so much time.
It’s not reasonable to expect to see no inconsistencies, no predictable changes in the median, and so on. It’s not that they’re bad forecasters. Of course you can do better on one or a few specific questions, but that doesn’t mean much. If you want even a small but worthwhile amount of evidence, from correct advance predictions, that you are a better forecaster than other Metaculites, you need, for example, to go and win a tournament, one of the tournaments with money prizes that many people are participating in.
Evaluating forecasting track records in practice is hard and depends heavily on the scoring rule you use (PredictionBook rankings, for example, vary a lot with the methodology you use to evaluate relative performance). You need a lot of high-quality data to get significant evidence. If you have only a little low-quality data, you just aren’t going to get a useful amount of evidence.
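As a minimal illustration of that scoring-rule dependence (the forecasts and outcomes below are made up, not anyone’s real track record), two forecasters can swap ranks depending on whether you aggregate with the Brier score or the log score, because the log score punishes a single very confident miss far more heavily:

```python
import math

def brier(p, outcome):      # lower is better
    return (outcome - p) ** 2

def log_score(p, outcome):  # higher (closer to 0) is better
    return math.log(p if outcome else 1.0 - p)

def mean(xs):
    return sum(xs) / len(xs)

# Twenty hypothetical binary questions that all resolved Yes.
outcomes = [1] * 20
forecasters = {
    "A": [0.98] * 19 + [0.001],  # sharp, with one very confident miss
    "B": [0.75] * 20,            # mild and uniform
}

for name, ps in forecasters.items():
    print(name,
          "Brier:", round(mean([brier(p, o) for p, o in zip(ps, outcomes)]), 3),
          "Log:", round(mean([log_score(p, o) for p, o in zip(ps, outcomes)]), 3))
# A comes out ahead under the Brier score, B under the log score.
```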
You’re right that volatility is an additional category of reasons why his not giving his actual distribution makes it less informative.
It’s interesting to me that in his comment, he states:
I do admit, it’s not a good look that I once again understate my position by so much compared to what the reality turns out to be, especially after having made that mistake a few times before.
He sees it as significant evidence that his position wasn’t extreme enough. But he didn’t even clearly give his position, and “the reality” is something that will be determined by the specific question resolution when that day comes. Instead of that actual reality, and because of how abruptly the community ended up shifting, Eliezer seems to be interpreting the shift to mean that his position about that reality is not extreme enough. Those two things are somewhat related, but only weakly, so it seems like rationalizing for him to frame it as showing his forecast isn’t extreme enough.
I don’t expect him to spend time engaging with me, but for what it’s worth, the comment he wrote here doesn’t address anything I brought up; it’s essentially just him restating that he interprets this as a nice addition to his “forecasting track record”. He certainly could have made it part of a meaningful track record! It was a tantalizing candidate for such a thing, but he doesn’t want to, and yet he expects people to interpret it the same way, which doesn’t make sense.
As an example to explain why, I predict (with 80% probability) that there will be a five-year shortening in the median on the general AI question at some point in the next three years. And I also predict (with 85% probability) that there will be a five-year lengthening at some point in the next three years.
Both of these things have happened. The community prediction was June 28, 2036 at one time in July 2022, July 30, 2043 in September 2022 and is March 13, 2038 now. So there has been a five-year shortening and a five-year lengthening.
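For concreteness, the gaps between those three community medians can be checked directly from the dates quoted above; the snippet below is just that arithmetic.

```python
from datetime import date

median_jul_2022 = date(2036, 6, 28)   # community prediction at one time in July 2022
median_sep_2022 = date(2043, 7, 30)   # community prediction in September 2022
median_now = date(2038, 3, 13)        # community prediction at the time of writing

lengthening = (median_sep_2022 - median_jul_2022).days / 365.25
shortening = (median_sep_2022 - median_now).days / 365.25
print(round(lengthening, 1), round(shortening, 1))  # both gaps exceed five years
```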