I disagree that this is a meaningful forecasting track record. There are massive degrees of freedom, the mentioned events seem unresolvable, and it’s highly ambiguous how they demonstrate any particular degree of error unless they were properly disambiguated in advance. Log score or it didn’t happen.
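(For readers who haven’t seen the term: below is a minimal sketch of the log scoring rule being invoked here. The probabilities and outcomes are made up purely for illustration, not taken from anyone’s actual forecasts.)

```python
import math

def log_score(p: float, happened: bool) -> float:
    # Log score for a binary forecast: the log of the probability
    # assigned to whatever actually happened. Closer to 0 is better.
    return math.log(p if happened else 1.0 - p)

# Illustrative numbers only: a well-calibrated 80% call beats a hedged 55% call
# when the event resolves YES, and a confident miss is punished hard.
print(round(log_score(0.80, True), 2))   # -0.22
print(round(log_score(0.55, True), 2))   # -0.6
print(round(log_score(0.95, False), 2))  # -3.0
```

Scoring like this only works if the question and resolution criteria were pinned down before the fact, which is the “disambiguated in advance” condition above.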
I want to register a gripe, re your follow-up post: when Eliezer says that he, Demis Hassabis, and Dario Amodei have a good “track record” because of their qualitative prediction successes, you object that the phrase “track record” should be reserved for things like Metaculus forecasts.
But when Ben Garfinkel says that Eliezer has a bad “track record” because he made various qualitative predictions Ben disagrees with, you slam the retweet button.
I already thought this narrowing of the term “track record” was weird. If you’re saying that we shouldn’t count Linus Pauling’s achievements in chemistry, or his bad arguments for Vitamin C megadosing, as part of Pauling’s “track record”, because they aren’t full probability distributions over concrete future events, then I worry a lot that this new word usage will cause confusion and lend itself to misuse. As long as it’s used even-handedly, though, it’s ultimately just a word.
(On my model, the main consequence of this is just that “track records” matter a lot less, because they become a much smaller slice of the evidence we have about a lot of people’s epistemics, expertise, etc.)
But if you’re going to complain about “track record” talk when the track record is alleged to be good but not when it’s alleged to be bad, then I have a genuine gripe with this terminology proposal. It already sounded a heck of a lot like an isolated demand for rigor to me, but if you’re going to redefine “track record” to refer to a narrow slice of the evidence, you at least need to do this consistently, and not crow some variant of ‘Aha! His track record is terrible after all!’ as soon as you find equally qualitative evidence that you like.
This was already a thing I worried would happen if we adopted this terminological convention, and it happened immediately.
I see what you’re saying, but it looks like you’re strawmanning me yet again with a more extreme version of my position. You’ve done that several times and you need to stop that.
What you’ve argued here would prevent me from questioning the forecasting performance of any pundit I can’t formally score, which is ~all of them.
Yes, it’s not a real forecasting track record unless it meets the sort of criteria that are fairly well understood in Tetlockian research. And Ben Garfinkel’s post doesn’t give us a forecasting track record either, not in the Metaculus sense.
But if a non-track-recorded person suggests they’ve been doing a good job anticipating things, it’s quite reasonable to point out non-scorable things they said that seem incorrect, even with no way to score them.
In an earlier draft of my essay, I considered getting into bets he’s made (several of which he’s lost). I ended up not including those, partly because my focus was waning and it was more attainable to stick to the meta-level point, and partly because I thought the essay might be better if it was more focused. I don’t think there is literally zero information about his forecasting performance (that’s not plausible), but it seemed like it would be more of a distraction from my epistemic point. Bets are not as informative as Metaculus-style forecasts, but they are better than nothing. This stuff is a spectrum; even Metaculus doesn’t retain some kinds of information about the forecaster. Still, I didn’t get into it, though I could have.
But I did later edit in a link to one of Paul’s comments, where he describes some reasons that Robin looks pretty bad in hindsight, but also lists several things Eliezer said that seem quite off. None of those are scorable. I linked it anyway, because Eliezer explicitly claimed he came across better in that debate, and while overall he may have, the picture is more mixed than that, which is relevant to my meta-point that one can obfuscate these things without a proper track record. Ben Garfinkel’s post is relevant in the same way.
If the community were more ambivalent about Eliezer’s forecasts, or even if Eliezer were more ambivalent about his own forecasts, and then some guy were trying to convince people he has made bad forecasts, then your objection of one-sidedness would make much more sense to me. That’s not what this is.
Eliezer actively tells people he’s anticipating things well, but he deliberately prevents his forecasts from being scorable. Pundits do that too, and you bet I would eagerly criticize vague non-scorable stuff they said that seems wrong. And yes, I would retweet someone criticizing those things too. Does that also bother you?
IMO that’s a much more defensible position, and is what the discussion should have initially focused on. From my perspective, the way the debate largely went is:
Jotto: Eliezer claims to have a relatively successful forecasting track record, along with Dario and Demis; but this is clearly dissembling, because a forecasting track record needs to look like a long series of Metaculus predictions.
Other people: (repeat without qualification the claim that Eliezer is falsely claiming to have a “forecasting track record”; simultaneously claim that Eliezer has a subpar “forecasting track record”, based on evidence that wouldn’t meet Jotto’s stated bar)
Jotto: (signal-boosts the inconsistent claims other people are making, without noting that this is equivocating between two senses of “track record” and therefore selectively applying two different standards)
Rob B: (gripes and complains)
Whereas the way the debate should have gone is:
Jotto: I personally disagree with Eliezer that the AI Foom debate is easy to understand and cash out into rough predictions about how the field has progressed since 2009, or how it is likely to progress in the future. Also, I wish that all of Eliezer, Robin, Demis, Dario, and Paul had made way more Metaculus-style forecasts back in 2010, so it would be easier to compare their prediction performance. I find it frustrating that nobody did this, and think we should start doing this way more now. Also, I think this sharper comparison would probably have shown that Eliezer is significantly worse at thinking about this topic than Paul, and maybe than Robin, Demis, and Dario.
Rob B: I disagree with your last sentence, and I disagree quantitatively that stuff like the Foom debate is as hard-to-interpret as you suggest. But I otherwise agree with you, and think it would have been useful if the circa-2010 discussions had included more explicit probability distributions, scenario breakdowns, quantitative estimates, etc. (suitably flagged as unstable, spitballed ass-numbers). Even where these aren’t cruxy and don’t provide clear evidence about people’s quality of reasoning about AGI, it’s still just helpful to have a more precise sense of what people’s actual beliefs at the time were. “X is unlikely” is way less useful than knowing whether it’s more like 30%, or more like 5%, or more like 0.1%, etc. (A small worked comparison of those numbers follows after this exchange.)
I think the whole ‘X isn’t a real track record’ thing was confusing, and made your argument sound more forceful than it should have.
Plus maybe some disagreements about how possible it is in general to form good models of people and of topics like AGI in the absence of Metaculus-ish forecasts, and disagreement about exactly how informative it would be to have a hundred examples of narrow-AI benchmark predictions over the last ten years from all the influential EAs?
(I think it would be useful, but more like ‘1% to 10% of the overall evidence for weighing people’s reasoning and correctness about AGI’, not ’90% to 100% of the evidence’.)
(An exception would be if, e.g., it turned out that ML progress is way more predictable than Eliezer or I believe. ML’s predictability is a genuine crux for us, so seeing someone else do amazing at this prediction task for a bunch of years, with foresight rather than hindsight, would genuinely update us a bunch. But we don’t expect to learn much from Eliezer or Rob trying to predict stuff, because while someone else may have secret insight that lets them predict the future of narrow-AI advances very narrowly, we are pretty sure we don’t know how to do that.)
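(A minimal sketch of the “30% vs. 5% vs. 0.1%” point above, with made-up numbers: once an event resolves, those three forecasts earn very different log scores, which is exactly the information a bare “X is unlikely” throws away.)

```python
import math

def log_score(p: float, happened: bool) -> float:
    # Log of the probability assigned to what actually happened; closer to 0 is better.
    return math.log(p if happened else 1.0 - p)

for p in (0.30, 0.05, 0.001):
    print(f"p={p}: {log_score(p, True):.2f} if it happens, "
          f"{log_score(p, False):.2f} if it doesn't")
# p=0.3: -1.20 if it happens, -0.36 if it doesn't
# p=0.05: -3.00 if it happens, -0.05 if it doesn't
# p=0.001: -6.91 if it happens, -0.00 if it doesn't
```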
Part of what I object to is that you’re a Metaculus radical, whose Twitter bio says “Replace opinions with forecasts.”
This is a view almost no one in the field currently agrees with or tries to live up to.
Which is fine, on its own. I like radicals, and want to hear their views argued for and hashed out in conversation.
But then you selectively accuse Eliezer of lying about having a “track record”, without noting how many other people are also expressing non-forecast “opinions” (and updating on these), and while using language in ways that make it sound like Eliezer is doing something more unusual than he is, and making it sound like your critique is more independent of your nonstandard views on track records and “opinions” than it actually is.
That’s the part that bugs me. If you have an extreme proposal for changing EA’s norms, argue for that proposal. Don’t just selectively take potshots at views or people you dislike more, while going easy on everyone else.
I think Jotto has argued for the proposal in the past. Whether he did it in that particular comment is not very important, so long as he holds everyone to the same standards.
As for his standards: I think he sees Eliezer as an easy target because he’s high-status in this community and has explicitly said that he thinks his track record is good (in fact, better than other people’s). On its own, therefore, it’s not surprising that Eliezer would get singled out.
I no longer see exchanges with you as a good use of energy, unless you’re able to describe some of the strawmanning of me you’ve done and come clean about that.
EDIT: Since this is being downvoted, here is a comment chain where Rob Bensinger interpreted me in ways that are bizarre, such as suggesting that I think Eliezer is saying he has “a crystal ball”, or that “if you record any prediction anywhere other than Metaculus (that doesn’t have similarly good tools for representing probability distributions), you’re a con artist”. Things that sound thematically similar to what I was saying, but were weird, persistent extremes that I don’t see as good-faith readings of me. It kept happening over Twitter, then again on LW. At no point have I felt he’s trying to understand what I actually think. So I don’t see the point of continuing with him.
This is a strawman. Ben Garfinkel never says that Yudkowsky has a bad track record. In fact the only time the phrase “bad track record” comes up in Garfinkel’s post is when you mention it in one of your comments.
The most Ben Garfinkel says about Yudkowsky’s track record is that it’s “at least pretty mixed”, which I think the content of the post supports, especially the clear-cut examples. He even emphasizes that he is deliberately cherry-picking bad examples from Eliezer’s track record in order to make a point, e.g. about Eliezer never having addressed his own bad predictions from the past.
It’s not enough to say “my world model was bad in such and such ways and I’ve changed it” to address your mistakes; you have to say “I made this specific prediction and it later turned out to be wrong”. Can you cite any instance of Eliezer ever doing that?
This is a strawman. Ben Garfinkel never says that Yudkowsky has a bad track record.
In the post, he says “his track record is at best fairly mixed” and “Yudkowsky may have a track record of overestimating or overstating the quality of his insights into AI”; and in the comments, he says “Yudkowsky’s track record suggests a substantial bias toward dramatic and overconfident predictions”.
What makes a track record “bad” is relative, but if Ben objects to my summarizing his view with the imprecise word “bad”, then I’ll avoid doing that. It doesn’t strike me as an important point for anything I said above.
The most Ben Garfinkel says about Yudkowsky’s track record is that it’s “at least pretty mixed”, which I think the content of the post supports, especially the clear-cut examples.
As long as we agree that “track record” includes the kind of stuff Jotto was saying it doesn’t include, I’m happy to say that Eliezer’s track record includes failures as well as successes. Extremely important failures and extremely important successes, no less. Indeed, I think that would make way more sense.
about Eliezer never having addressed his own bad predictions from the past.
Maybe worth mentioning in passing that this is of course false?
It’s not enough to say “my world model was bad in such and such ways and I’ve changed it” to address your mistakes; you have to say “I made this specific prediction and it later turned out to be wrong”. Can you cite any instance of Eliezer ever doing that?
Sure! “I wouldn’t have predicted AlphaGo and lost money betting against the speed of its capability gains”.
In the post, he says “his track record is at best fairly mixed” and “Yudkowsky may have a track record of overestimating or overstating the quality of his insights into AI”; and in the comments, he says “Yudkowsky’s track record suggests a substantial bias toward dramatic and overconfident predictions”.
Yes, I think all of that checks out. It’s hard to say, of course, because Eliezer rarely makes explicit predictions, but insofar as he does make them, I think he clearly puts a lot of weight on his inside view of things.
That doesn’t make his track record “bad” but it’s something to keep in mind when he makes predictions.
Sure! “I wouldn’t have predicted AlphaGo and lost money betting against the speed of its capability gains”.
This counts as a mistake, but I don’t think it’s important relative to the bad prediction about AI timelines Ben brings up in his post. If Eliezer explained why he had been wrong, it would make his current position more convincing, especially given his condescending attitude towards e.g. Metaculus forecasts.
I still think there’s something about the way Eliezer admits he was wrong that rubs me the wrong way, but it’s hard to explain what that is right now. It’s not correct to say he doesn’t admit his mistakes per se, but there’s some other problem with how much he seems to “internalize” the fact that he was wrong.
I’ve retracted my original comment because of your example, as it was not correct (despite having the right “vibe”, whatever that means).