Evaluating Predictions in Hindsight
Epistemic Status: Confident I have useful things to say, but apologies for the long post because I don’t think it’s worth my time to make it shorter. Better to get thoughts down for those who want to read them.
Scott Alexander’s latest post points to the question of how best to evaluate predictions. The way he characterized leading predictions on Trump and Brexit, that ‘prediction is hard,’ instinctively bothered me. Characterizing the coronavirus situation the same way bothered me even more.
(Before anyone asks, I was ahead of the public but definitely dropped the ball on the early coronavirus prediction front, in the sense that I failed to make a lot of money and failed to warn the public, and waited far too long to move out of New York City. A large part of that fault was that I predicted things badly by my own standards. A large part was other failures. I take full responsibility. But that’s not what I want to talk about right now.)
How does one evaluate past predictions?
As someone who used to place wagers and/or price prediction markets for a living, and later traded with what I believe are some of the best traders on the planet, I’ve thought about this question a lot.
We can divide situations into easy mode, where there are a large number of independent predictions made and robust markets or probabilities against which one can evaluate those predictions, and hard mode, where this is not true, and you are often evaluating an individual prediction of a singular event.
Easy Mode
Most of my time was spent largely in the ‘easy mode.’ Here, easy mode is when one is predicting lots of things for which there are established market prices, or at a minimum baseline fair values that one can be evaluated against. You have a lot of data points, and are comparing your predictions and decisions to a known baseline.
Easy mode makes it realistic to seek a metric that cannot be easily fooled, where you can use your results as evidence to prove what you are doing ‘works’ in some sense.
There is no one metric that is best even in easy mode. There are a few different ones that have merit. I’ll go through at least some of them.
Method One: Money Talks, Bull*** Walks
Did you make money?
If you did, congratulations. Good predicting.
If you didn’t, sorry. Bad predicting. If you didn’t bet, it doesn’t count.
This method has a lot to recommend it. It’s especially great over long periods of time with lots of distinct opportunities of relatively constant size and odds, where gains or losses from individual trades are well bounded.
There are, however, some severe problems, and one should try to seek other methods.
If your trades have tail risk, and can have huge positive or negative payoffs, then that risk can often dominate your expected value (your Alpha) but not impact your observed results, or impact your observed results out of proportion to (or even in the opposite direction of) the Alpha involved.
If you sometimes trade big and sometimes trade small, that reflects your confidence and should be weighed, but also can lead to your bigger trades being all that matters. Which trades are big is often a matter of opportunity and circumstance, or an attempt to manipulate results, rather than reflecting what we want to measure.
Often trades and predictions are highly correlated even if they don’t look correlated.
Trading results often reflect other trading skills, such as speed and negotiation and ability to spot obvious errors. It’s very possible to have a trading strategy that spends most of its time doing things at random, but occasionally someone else typos or makes a huge mental error or there’s a bug in someone’s code, or you find a really dumb counter-party who likes to play very big, and suddenly you make a ton.
The exact method of trading, and which instruments are used, often has a dramatic effect on results even though it expresses the same underlying predictions.
Adverse selection shows up in real world trades where and how you least expect it to. Also exactly how you most expect it to.
And so on. It gets ugly.
These and other things allow someone trying to demonstrate skill to manipulate their results and usually get away with it, or allow someone with good luck to look much better than they are, often for a remarkably long time.
In general, see the book Fooled by Randomness, and assume it’s worse than that.
Still, it’s money, and it’s useful.
Method Two: Trading Simulation, Where Virtual Money Talks
In this method, we start with a set of predictions, or a model that makes those predictions from data. Then we establish rules for how it will trade based on that information, and see whether the system makes money.
The big risk is that we can cheat. Our system would have made lots of money. Uh huh. Lots of ways to cheat.
Thus, the best simulations are where the person making the predictions is distinct from the person running the simulation, and the simulation runs in real time.
My good friend Seth Burn will run these simulations on sports models. He’ll take Nate Silver’s or ESPN’s or someone else’s predictions, translate them into win probabilities if necessary, then see what would happen if they were allowed to wager at market odds using Kelly betting. Sometimes it goes well. Other times it goes poorly. I generally consider making such models go full Kelly, without adjusting beliefs at all for market odds, a bit harsh. You Never Go Full Kelly. But I do get it.
When I run simulations on my own stuff, or often other people’s stuff, I will instead use threshold betting. If something is good enough, one unit will be wagered, and sometimes this will be scaled up to two or three units gradually as perceived edge improves. But we won’t force the system to Go Full Kelly. Because that would lead to heavily distorted results. The way you know if your plan is profitable is if it can make money a little at a time, not whether it would get lucky or blow itself up if you didn’t take reasonable precautions. And again, You Never Go Full Kelly.
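For concreteness, here’s a minimal sketch in Python of the difference between full Kelly, fractional Kelly, and the kind of flat-unit threshold betting described above. The edge cutoffs and stake sizes are invented for illustration, not anyone’s actual system.

```python
def kelly_fraction(p, decimal_odds):
    """Full Kelly fraction of bankroll for a single binary bet.

    p: your model's probability the bet wins.
    decimal_odds: total payout per unit staked, including the stake
    (2.0 means even money).
    """
    b = decimal_odds - 1.0                # net odds received on a win
    q = 1.0 - p
    return max((b * p - q) / b, 0.0)      # classic Kelly; never bet a negative fraction


def threshold_units(p, market_prob, max_units=3):
    """Illustrative threshold rule: wager 1-3 flat units as perceived edge grows.

    The edge cutoffs here are invented for the example; the point is that the
    stake is capped and grows slowly instead of tracking Kelly directly.
    """
    edge = p - market_prob
    if edge < 0.03:
        return 0
    if edge < 0.06:
        return 1
    if edge < 0.10:
        return 2
    return max_units


# Example: model says 55%, market offers even money (implying 50%).
p, market_prob, odds = 0.55, 0.50, 2.0
print(kelly_fraction(p, odds))           # ~0.10 -> full Kelly risks 10% of bankroll
print(0.25 * kelly_fraction(p, odds))    # a more cautious fractional-Kelly stake
print(threshold_units(p, market_prob))   # 1 -> one flat unit under the threshold rule
```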
Simulated trading is vital if you are planning to do actual trading. If you don’t simulate the things you intend to actually do, you can find yourself effectively testing a dramatically different hypothesis than the hypothesis you expected to test. That can end very badly.
These types of tests are good sanity checks and gut checks, all around. They make it much harder to fool yourself, if implemented reasonably.
Of course, in other ways, they make it easier to fool yourself.
Overfitting on results of any kind is highly dangerous, and this can encourage that and make it much worse. Often simulations are doing much more highly correlated things than one realizes, on any number of levels. Unscrupulous people of course can easily manipulate such results; it can become the worst kind of p-hacking, taken up a level.
A big risk is that you can think your predictions are good when really you have a handful of data errors. If your predictions are remotely sane, then any large error in historical prices will be something your simulation jumps on. You’ll make a ton on those, whereas in real life any attempt to take advantage of those opportunities would not have been allowed, and also not all that impressive an act of prediction. Guarding against this is super important, and usually involves manually looking at any situations where you think your edge is super large to ensure your recorded market prices are real.
Most of all, this method doesn’t actually reward accurate predictions. It rewards predictions that tend to disagree in the correct direction. That’s a very different thing.
Thus, think of this as an indicative and necessary method of evaluation wherever it is available, but in no way as a sufficient method, even when implemented properly in real time. But certainly, if the real time simulated test keeps working, I will consider updating my priors away from the market prices, and putting real money on the line after a while.
Method Three: The Green Knight Test
The Green Knight test gets its name from a character in the Arthurian legend. You get to swing at The Green Knight, then The Green Knight gets to swing at you.
Thus, you get to trade against the market at its fair price. Then the market gets to trade against you, at the model’s fair price, for the same amount. So if it’s a prediction market and the market says 50% and you say 60%, your net price is 55%. Whereas if you say 90%, your average price will be 70%, and you’ll do a lot worse.
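To make that arithmetic concrete, here’s a toy sketch of the expected value of the two legs, under the simplifying assumption that your model is above the market so you are the buyer both times. The “true” probabilities are made up for illustration.

```python
def green_knight_ev(model_p, market_p, true_p):
    """Expected profit of the two-leg Green Knight test (simplified sketch).

    Leg 1: you buy one contract at the market's price.
    Leg 2: the market sells you another contract at your model's price.
    You end up long two contracts at the average of the two prices.
    """
    avg_price = (market_p + model_p) / 2.0
    return 2.0 * (true_p - avg_price)


# Market says 50%. Suppose the event was really 55% likely.
print(green_knight_ev(0.60, 0.50, 0.55))  # ~0.00 -> directionally right, but no profit
print(green_knight_ev(0.55, 0.50, 0.55))  # +0.05 -> you profit only when your number beats the market's
print(green_knight_ev(0.90, 0.50, 0.55))  # -0.30 -> overconfidence is punished hard
```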
How much you are allowing yourself to consider market prices, when deciding on your own beliefs, is your decision. If the answer isn’t ‘quite a lot’ it can get very expensive.
The point of The Green Knight Test is to use markets and trades to put you to the test, but to treat your model and the market as equals. The question is not whether you can directionally spot market inefficiencies. That’s (relatively) easy. I firmly believe that one can spot some amount of inefficiency in any market.
The question is, can you come up with better values than the market? That’s very, very hard if your process doesn’t heavily weigh the existing market prices. If you can pass this test without looking directly at the market prices at all, and you’ve confirmed that the market prices in question were real, your prices really are better than the market’s prices.
The even harder version of the test is to fully reverse the scenario. You take only the role of the market maker, allowing the market to trade at your model’s fair prices. If you can survive without a substantial loss, now you can fully reject the market’s prices, and treat your model’s prices as real.
The advantage of The Green Knight Test is it reminds you exactly how much you do not know, and holds you to a very high standard. Unless you are doing a pure math exercise like pricing derivatives, it’s expected that you will fail this test. It’s perfectly fine. The goal is to fail it less, and to remember that you fail it. Except when you actually pass it, then the sky’s the limit.
And yes, on one occasion that didn’t involve a derivative, I did pass this test convincingly. That’s a story for another day.
Method Four: Log Likelihood
I have no idea why I needed actual Eliezer Yudkowsky to first point out to me I should be using this, but once he did point this out it became obvious. Log likelihood for probabilistic outcomes is the obvious go-to standard thing to try.
If your goal is to reward accuracy and punish inaccuracy, log likelihood will do that in expectation. Your score on any given event is the natural log of your model’s probability of the outcome that happened.
Every time you improve your probability estimates, your expected score improves. Make your model worse, and it gets worse. Be highly overconfident and it will cost you quite a lot.
The best feature of log likelihood is that it provides perfect incentives.
The problem is that when you look at a score, you have no idea what you are looking at. There is no intuitive association between an LL score and a level of accuracy in prediction. Part of that is that we’re not used to them. The bigger issue is that a score doesn’t mean anything outside of the exact context and sample the score is based upon.
LL scores only mean something when you compare model one to model two on the exact same set of predictions.
They are all but useless with even tiny variations in what predictions are being scored. One additional unlikely event happening, or even one event being a foregone conclusion rather than a coin flip, will wipe out massive gains from model improvements, sometimes across thousands of predicted events.
What is meaningful is, we have this set of predictions, and we compare it to the market’s implicit predictions, and/or to another model or version of the model, and see which is better. Now we can get an idea of the magnitude of improvement (although again, what that magnitude means won’t be intuitive, other than to compare different score gaps with each other).
All of that skepticism assumes that everyone’s model is doing something sane. If someone is making huge mistakes, LL scores will pick it up very loudly as long as there is time to get punished for those mistakes enough times. If you’re going around saying 99% on 75% shots, or 20% on 50% shots, that will cut through a lot of noise.
Of course, if you were making errors that severe, there hopefully isn’t much need to use LL in order to realize that.
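Putting those last few points together, here’s a minimal sketch of log likelihood scoring, comparing two hypothetical models on the same set of events, with all numbers invented. The overconfident model gets the same directions right but is punished badly the one time its 99% call misses.

```python
import math

def log_likelihood(probs, outcomes):
    """Sum of ln(probability the model assigned to what actually happened).

    probs: the model's probability that each event happens.
    outcomes: 1 if the event happened, 0 if it didn't.
    Scores are negative; closer to zero is better.
    """
    return sum(math.log(p if o else 1.0 - p) for p, o in zip(probs, outcomes))


outcomes = [1, 0, 1, 0, 0]                 # what actually happened
model_a  = [0.7, 0.3, 0.6, 0.8, 0.4]       # sane, roughly calibrated guesses
model_b  = [0.99, 0.05, 0.9, 0.99, 0.1]    # same directions, wildly overconfident

print(log_likelihood(model_a, outcomes))   # ~ -3.34
print(log_likelihood(model_b, outcomes))   # ~ -4.88, dragged down by one missed 99% call
```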
Method Five: Calibration Testing
This is the way Scott Alexander scores his predictions.
The principle is that your 60% predictions should happen 60% of the time, your 70% predictions should happen 70% of the time, and so on. If they happen more often than that, you’re under-confident. If they happen less often than that, you’re over-confident.
This is certainly a useful thing to check. If you’re consistently coming in with bad calibration, or are reliably badly calibrated at a particular point (e.g. perhaps your 10% chances are really 5%, but your 30%+ chances are roughly fair) then you can correct that particular mistake.
At a minimum, this is a bar that any predictor needs to clear if it wants to keep making probabilistic predictions with a straight face.
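Here’s a minimal sketch of the bookkeeping, with toy data. The binning to the nearest 10% level and the convention of folding sub-50% predictions onto their complements are choices of mine for the example, not anything canonical.

```python
from collections import defaultdict

def calibration_table(predictions):
    """Compare stated confidence with observed frequency, bucketed by level.

    predictions: iterable of (stated_probability, happened) pairs.
    """
    buckets = defaultdict(list)
    for prob, happened in predictions:
        # One common convention: fold onto the 50-100% side, since a
        # 30% "yes" is the same claim as a 70% "no".
        if prob < 0.5:
            prob, happened = 1.0 - prob, not happened
        buckets[round(prob, 1)].append(1 if happened else 0)
    for level in sorted(buckets):
        hits = buckets[level]
        print(f"stated {level:.0%}: happened {sum(hits)}/{len(hits)} = {sum(hits)/len(hits):.0%}")


# Toy data: (stated probability, did it happen?)
calibration_table([(0.9, True), (0.9, True), (0.9, False),
                   (0.7, True), (0.7, False), (0.3, False)])
# stated 70%: happened 2/3 = 67%
# stated 90%: happened 2/3 = 67%   <- the 90% bucket looks overconfident
```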
If you won’t put probabilities on your predictions, this test won’t work, except that we’ve already shown you aren’t doing very good predicting.
In most cases this will quickly reveal that someone isn’t trying to choose realistic probabilities. They’re saying words that they think will have a particular impact.
Such people can still be making useful predictions. To choose a very blatant example of someone doing this constantly, when Scott Adams says something is 100% going to happen, he neither believes this nor considers himself to be lying. To him, that’s just ‘good persuasion’ to anchor people high and force them to update. What he means is, ‘I think event X is more likely than you would think, so increase your probability estimate of X.’
There might or might not be a ‘substantially more than 50%’ actual prediction in there. If you read more than that into his statement, he’d say that’s your fault for being bad at persuasion.
Certainly he does not think that the numerous times something he called 100% to happen did not happen should send him to Bayes’ hell or cause people to dismiss his statements as worthless. He also doesn’t think one should ignore such misses, but why would you take someone’s stated numbers seriously?
Thus, asking if someone is well-calibrated is a way of asking if they are for reals attempting to provide accurate information, and if they have developed some of the basic skills required to do so. Learning whether this is so is very good and useful.
The problem with calibration testing is that you can get a perfect score on calibration without providing any useful predictions.
The direct cheat is one option. It’s very easy to pick things in the world that are 90% to happen, or 75%, or 50%, or 1%, if you are making up the statements yourself.
The more subtle cheat is another. You can have your 75% predictions be half things that are definitely true, and half things that are true half the time. Maybe you’re making a real error when you conflate them. Maybe you’re doing it on purpose. Hard to say.
This is typically what happens when people who are ‘well-calibrated’ give 90% (or 95% or 98% or 99.9%) probabilities. They’re mostly building in a chance they are making a stupid mistake or misunderstood the question, or other similar possibilities. Which you have to do.
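Here’s a tiny illustration of that cheat: a ‘predictor’ that slaps 75% on everything, where half the claims are near-certainties and half are coin flips. The calibration chart comes out looking fine; the individual numbers tell you almost nothing.

```python
import random

random.seed(0)

sure_things = [random.random() < 0.99 for _ in range(500)]  # ~99% of these happen
coin_flips  = [random.random() < 0.50 for _ in range(500)]  # ~50% of these happen
outcomes = sure_things + coin_flips

# Every one of the 1000 claims was labeled "75%".
print(sum(outcomes) / len(outcomes))  # ~0.745 -- looks like a well-calibrated 75% bucket
```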
Calibration is a good sanity check. It’s no substitute for actual evaluation.
Method Six: Looking for Obviously Wrong Probabilities
This method is where you look for an obviously wrong probability. Obviously wrong can be on the level of ‘a human who understands the space would know this instantly’ or it can be on the level of ‘upon reflection that number can’t possibly be right, or it contradicts your other answers that you’re still sticking with.’ The level required to spot a mistake, and how big a mistake you can spot, are ways of measuring how good the predictions are.
Often when you find an obviously wrong statement, you find something important about whoever made the statement. In many cases, you learn that person is a bullshit artist. In other cases, you learn that there’s something important they don’t or didn’t know or understand, or something they overlooked. Or you find something important about their world view that caused this strange answer.
And of course sometimes they’re right and you’re wrong. Also a great time to learn something.
Same thing for a model. If you find a model saying something clearly wrong, then you can use that to find a flaw in the model. Ideally you can then fix the flaw. Failing that, you hope to know what the flaw is so you can correct for it if it happens again – you can flag the model explicitly as not taking factor X into account.
Other times they made a sign or data entry error. There’s always bugs in the code. It’s not always a revelation.
That leads into the concept of evaluating an individual prediction. Which is what one must do in hard mode.
Hard Mode
In hard mode, our metrics don’t work. We need to use reason to think carefully about particular spots.
Looking back, we ask the question of whether our predictions and probabilities were good, what reasonable predictions and probabilities would have been and why, and what information we should have looked for or would have changed our opinions. There are a few different ways to evaluate.
One question to ask is, suppose we were to rewind time. How often would things again turn out the way they did, versus another way? How close was this event’s outcome? Could random events from there have changed the outcome often? What about initial conditions you had no way of knowing about? What about conditions you didn’t know about but could have checked, or should have checked, and what were those conditions? What would have had to have gone differently?
In some cases, one looks back and the result looks inevitable. In others, it was anything but inevitable, and if it had rained in different cities, or one person makes a different hard decision, or news stories happen to slant a different way or something, on the crucial day the other candidate gets elected. In others, it was inevitable if you were omniscient, but given your information it was anyone’s game.
Sports are a great tool for this question because remarkably few things in sports are truly inevitable. Sports are full of guessing games and physical randomness. Any Given Sunday really does mean something, and one can look back and say a game was 50% vs. 65% vs. 80% vs. 95% vs. 99% vs. 99.9% vs. 99.99% for the favorite to win. The question of ‘what was the real probability’ is truly meaningful. Someone who said the wrong number by a sufficient margin can be objectively wrong, regardless of whether that favorite actually won.
That’s not true for many other things, but it is a useful perspective to treat it as if it were more true, from more perspectives and in more ways, than people think.
Obviously this is not an exact science.
For sports, I could go into endless examples and the merits of various methods of evaluation. One good standard there is ‘what would be the odds if they played another game next week?’ Which has some weird stuff in it but is mostly a concrete way of thinking about ‘what would happen if we re-ran the event and randomized the details of the initial conditions?’
Another good general approach is ‘what do I now know that I didn’t know before, and how does that change my prediction?’ Where did my model of events go wrong?
A third thing to do is to look at the components of your predictions. In hindsight, do the implied conditional probabilities make sense? When things started to happen, how did you update your model? If they had gone differently, how would you have updated, and would those updates have added up to an expected value close to zero?
A fourth thing to do is look at the hidden assumptions. What are your predictions assuming about the world that you didn’t realize you were assuming, or that turned out not to be true? Often you can learn a lot here.
A key takeaway from doing my analysis below of various predictions is that my opinion of the prediction often depends almost not at all on the outcome. Your prediction’s logic is still its logic. In many cases, the actual outcome is only one additional data point.
One cannot point out too many times how easy it is to fool yourself with such questions, if you are looking to be fooled, or even not looking to not be fooled.
Since most of my audience is not deep into the sportsball, I will illustrate further only with non-sports examples.
It makes sense to start with the two that inspired this post, then go from there.
Note that I’ll be doing political analysis, but keeping this purely to probabilities of events. No judgments here, no judgments in the comments, please.
Scott’s two examples
Scott’s two examples from his recent post were Brexit and the 2016 Presidential Election.
In both cases, predictors that are at least trying to try, such as Nate Silver and Tetlock’s forecasters, put the chances of things going the historical way at roughly 25% right before the elections happened. Also in both cases, mainstream pundits and conventional wisdom mostly claimed at the time that the chance was far lower, in many cases very close to (but not quite) 0%. In both cases, there were people who predicted the other outcome and thought it was likely to happen, but not many. Also in both cases, the result may have partially been caused by the expectation of the other result. If voters had realized the elections were close, voters might have decided differently.
Importantly, in both cases, the polls, which are the best first-level way to predict any election, had the wrong side ahead but by amounts that historically and statistically were insufficient to secure victory.
Both elections were very close. Remain had almost as many votes as leave, to the extent that different weather in different areas of the United Kingdom could have made the difference (London voted heavily remain, other places for leave). Trump lost the popular vote and barely won the electoral college, after many things broke his way in the final week and day.
These are textbook cases, in this system, of results that were very much not inevitable. It is very, very easy to tell stories of slightly different sequences of events in the final week or days that end in the opposite result. If everything visible had been the same but the outcome went the other way, it would not have been more surprising than what happened even in hindsight.
As we were warned would happen, both results were then treated as far more inevitable than they actually were. Media and people in general rushed to form a narrative that these results were always going to happen. The United Kingdom treated a tiny majority as an inviolate will of the people rather than what it was, evidence that the country was about evenly split. Everyone wrote about the United States completely differently than if a hundred thousand votes had been distributed differently, or any number of decisions had been made a different way.
If you bet on Trump or on Leave at the available market prices, you made a great trade.
But, if you claimed that those sides were definitely going to win, that it was inevitable (e.g. the Scott Adams position) then you were more wrong than those who said the same thing about Remain and Clinton. This seems clear to me despite your side actually winning.
The only way to believe that predicting a Trump win as inevitable was a reasonable prediction is to assume facts about the world not in evidence. To me, it is a claim that the election either was stolen, or would have been stolen if Trump had been about to lose it. Same or similar thing with Leave.
The generalized version of that, as opposed to election fraud, is a more common pattern than is commonly appreciated. The way that things that look close are actually inevitable is that the winning side had lots of things up their sleeve, or had effectively blocked the scenarios where they might lose, in ways that are hard to observe. Try to change the outcome and the world pushes back hard. They didn’t pull out their ace in the hole because they didn’t need it, but it was there.
I don’t think that Nate Silver’s ~25% chance for Trump (and 10% chance to win despite losing the popular vote!) was merely what Scott Alexander called it, a bad prediction but ‘the best we could do.’ I think it was actually a pretty great prediction; the reasonable hindsight range is something like 20% to 40%. You need to give a decent chunk of the distribution to Trump, and he can’t be the favorite. If your prediction was way off of this in either direction, I think you were wrong. I think Remain vs. Leave follows a very similar pattern.
(For the 2020 Election, I similarly think that anyone who thinks either candidate is a huge favorite is wrong, and will almost certainly in hindsight still have been wrong in this way regardless of the eventual outcome, because so many things could happen on multiple fronts. To be confident you’d need to be confident at a minimum of the politics and the economics and the epidemiology. That doesn’t mean it will be close on election day, or in October.)
Scott’s calibration exercise
Scott’s predictions are a clean set of probabilities that are clearly fair game. Sticking there seems reasonable.
Let’s look at Scott’s predictions for the year 2019 next. How do they look?
By his convention, strikethroughs mean it didn’t happen, lack of a strikethrough means it happened.
The House impeached Trump for something that, as of the time of the prediction, hadn’t happened yet. It is clear the actual barrier to convincing Pelosi was high. If things had been enough worse, impeachment might not have happened because of resignation. So you could reasonably say the 40% number looks somewhat high in hindsight. The argument for it not being high is if you think Trump always keeps escalating until impeachment happens, especially if you think Trump actively wanted to be impeached. I’m inclined to say that on its own 40% seems reasonable, as would 20% or 30%.
The 90% number is all-cause remaining President. Several percent of the time Trump dies of natural causes, as he’s in his 70s. Several percent more has to be various medical conditions that prevent him from serving. Again, he’s in his 70s. World leaders also sometimes get shot, we’ve lost multiple presidents that way. Also, he’s impulsive and weird and looks like he often hates being president so maybe he decides to declare America great again and quit. And if there’s a dramatic change to world conditions and the USA doesn’t have a president anymore, he’s not president. Small probabilities but they add up. The majority of the 10% has to be baked in. We can reduce some of those a little in hindsight but not much.
So saying 90% is actually giving a very small probability of Trump leaving office for other reasons, especially given a 40% chance of impeachment – his probability of surviving politically conditional on the House being willing to impeach has to be at least 90%. Given the ways Trump did react and might have reacted to such conditions, and that some of the time the underlying accusations are much worse than what we got, this looks overconfident at 90% and I’d prefer to see 80%. But a lot of that is the lack of precision available when you only predict by 10% increments; 85% would have been fine.
3. Kamala Harris leads the Democratic field: 20%
4. Bernie Sanders leads the Democratic field: 20%
5. Joe Biden leads the Democratic field: 20%
6. Beto O’Rourke leads the Democratic field: 20%
(Disclosure, at PredictIt I sold at various points all but three candidates, one of those three was Joe Biden, and my mistake in hindsight was not waiting longer to sell a few of them along with not selling one of the other two when I had the chance).
Scott’s nominee predictions, however, seem really sloppy. These four candidates were not equally likely. The prediction markets didn’t think so, their backgrounds and the polls didn’t think so. The dynamics we saw play out don’t think so, either. Things came down to a former vice president to a popular president who led in the polls most of the way versus the previous cycle’s runner up.
Putting them on equal footing with a random congressman from Texas who lost a close race once while looking exciting, or a more traditionally plausible alternative candidate like Kamala Harris, doesn’t age well.
Nor does having these all be 20% and adding to 80%, leaving 20% left for the other 16 or so candidates including Elizabeth Warren, plus any unexpected late entries.
The defense of the 20% on Biden is to say Biden was known to be old and a terrible candidate who predictably ran a terrible primary campaign, so he was overrated even though he ended up winning, while Harris and O’Rourke were plausibly very good candidates given what we knew at the time. I do think there’s broad range for such arguments, but not to this extent.
This is where calibration makes you look good but shouldn’t. Name the four leading candidates (or at least four plausible-to-be-top-four candidates, to be generous) and give them each 20% and your calibration will look mostly fine even if that evaluation doesn’t make sense and the remaining field is really more like 30-40% than 20%.
This is also where the human element can warp your findings. There’s a lot of ‘X has to be higher than Y’, or ‘X ~= Y here looks sloppy’ or ‘X can’t be an underdog given Z’ or what not. We have a lot of rules of thumb, and those who break those rules will look worse than they deserve, while those that follow those rules but otherwise talk nonsense will look better.
As usual, use a variety of evaluation methods and switch them up when it looks like someone might be Goodharting.
7. Trump is still leading in prediction markets to be Republican nominee: 70%
8. Polls show more people support the leading Democrat than the leading Republican: 80%
This 70% number seems to miss low to me if you accept Scott’s other predictions above. In Scott’s model, Trump is 90% to be President, which means he is at least twice as likely to be President while losing the nomination fight as he is to not be President at all, despite at the time facing zero credible opposition. If you again take out the 5%+ chance that Trump is physically unfit for office and leaves because of it, that makes it many times more likely, in Scott’s numbers, that Trump can’t get the nomination but stays President than that he steps down. I can’t come up with a good defense of less than 80% or so in this context.
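As a quick back-of-the-envelope check on that argument, using only Scott’s stated numbers (the inequality is just inclusion-exclusion):

```python
p_president = 0.90  # Scott: Trump still President at year's end
p_leading   = 0.70  # Scott: Trump still leading prediction markets for the nomination

# P(president and not leading) >= P(president) - P(leading)
p_president_not_leading_floor = p_president - p_leading   # at least 0.20
p_not_president               = 1.0 - p_president         # 0.10

print(p_president_not_leading_floor / p_not_president)    # >= 2.0
```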
Predicting the Democratic candidate as likely to be ahead seems right, as that had been largely both true and stable for a while for pretty much any plausible Democratic candidate. 80% seems a little overconfident if we’re interpreting this as likely voters, but not crazy. A year is a long time, the baseline scenario was for a pretty good economy, and without anything especially good for Trump happening we saw some close polls.
Of course, if we interpret this as all Americans then 80% seems too low, since non-voters and especially children overwhelmingly support Democrats. And if we literally read this as people anywhere then it should be 95% or more. A reminder of how important it is to word predictions carefully.
The 90% number, for Trump’s approval rating staying below 50%, seems overconfident to me, although 80% would have been too low. It’s saying that the world is definitely in ‘nothing matters’ mode and meaningful things are unlikely to happen. This of course goes along with the 80% chance he’ll be behind in the polls, since if he’s above 50% approval he’s going to be ahead in the polls almost every time.
50% for approval ratings below 40 seems clearly more right than 40% or 60% would have been. This is an example of predictions needing to be evaluated at the appropriate level of precision. It’s easy to say “roughly 50%” here, so the ‘smart money’ is the ones who can say 53% instead of 50% and have it be accurate. So credit here for staying sane, which is something.
11. Current government shutdown ends before Feb 1: 40%
12. Current government shutdown ends before Mar 1: 80%
13. Current government shutdown ends before Apr 1: 95%
14. Trump gets at least half the wall funding he wants from current shutdown: 20%
15. Ginsburg still alive: 50%
I would not have been 95% confident that the shutdown wouldn’t extend past April 1. It doesn’t seem implausible to me at all that the two sides could have deadlocked for much longer, since it’s a zero-sum game with at least one of the players a pure zero-sum thinker, and where the players hate each other. There were very plausible paths where there were no reasonable lines of retreat. Once we get into March, the chances of things resolving seem like they go down, not up. I think the 40% and 80% predictions look slightly high, but reasonable.
I am not enough of a medical expert to speak to Ginsburg’s chances of survival, but I’m guessing 50% was too low.
ECON AND TECH
16. Bitcoin above 1000: 90%
17. Bitcoin above 3000: 50%
18. Bitcoin above 5000: 20%
19. Bitcoin above Ethereum: 95%
20. Dow above current value of 25000: 80%
21. SpaceX successfully launches and returns crewed spacecraft: 90%
22. SpaceX Starship reaches orbit: 10%
23. No city where a member of the general public can ride self-driving car without attendant: 90%
24. I can buy an Impossible Burger at a grocery store within a 30 minute walk from my house: 70%
25. Pregabalin successfully goes generic and costs less than $100/month on GoodRx.com: 50%
26. No further CRISPR-edited babies born: 80%
The first question I always wonder when I see predictions about Bitcoin is whether the prediction implies a buy or implies a sale.
At the time of these predictions, Bitcoin was trading at roughly $3,500.
Scott thought Bitcoin was a SCREAMING BUY.
The reason this represents a screaming buy is that Scott has Bitcoin at almost 50% to be trading higher versus lower. But when Bitcoin is higher, it is often double its current price or more, which in fact happened. You have a long tail in one direction only. Even in Scott’s numbers, the 20% vs. 10% asymmetry at 5000 and 1000 points towards this.
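Here’s a toy expected-value calculation to show why. The bucket probabilities follow from Scott’s numbers; the representative price inside each bucket is an invented fill-in of mine, so treat the output as purely illustrative.

```python
# Scott's implied distribution for Bitcoin at year's end (spot price ~$3,500):
#   below 1000: 10%, 1000-3000: 40%, 3000-5000: 30%, above 5000: 20%
buckets = [
    (0.10,  600),    # representative prices are invented for illustration
    (0.40, 2000),
    (0.30, 4000),
    (0.20, 9000),    # the long upside tail does most of the work
]
expected_price = sum(p * price for p, price in buckets)
print(expected_price)  # ~3860 -> above the ~3500 spot even with modest tail assumptions
```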
Was that right, given what he knew? I… think so? Probably? I was already sufficiently synthetically long that I didn’t buy (if you’re founding a company that builds on blockchain, investing more in blockchains is much less necessary), but I did think that the mean value of Bitcoin a year later was probably substantially higher than its $3,500 price.
What is clearly wrong is expecting so little variance in the price of Bitcoin. We have Bitcoin more likely to be in the 3000-5000 range, or the 2000-3000 range, than to be above 5000 or below 1000. That doesn’t seem remotely reasonable to me, and I thought so at the time. That’s the thing about Bitcoin. It’s a wild ride. To think you shouldn’t be on the ride at all, given the upside available, you have to think the ride likely ends in a crash.
Bitcoin above Ethereum at 95% depends on how seriously you treat the correlation. At the time Ethereum was roughly $120 per coin, or about 3% of a Bitcoin. Most of Ethereum’s variance for years has been Bitcoin’s variance, and they’ve been highly correlated.
Note that this isn’t ETH market cap above BTC market cap, it’s ETH above BTC, which requires an extra doubling.
If we think about three scenarios – BTC up a lot, BTC down a lot, BTC mostly unchanged – we see that ETH going up 3000% more than BTC seems like a very crazy outcome in at least two of those scenarios. Given how little variance we’ve put into BTC, giving ETH that much variance in the upside or mostly unchanged scenarios doesn’t make sense.
So the 5% probability is mostly coming from a BTC collapse that ETH survives. BTC being below 1000 is only 10% in this model. Of that 10%, most of the time this is a general blockchain collapse, and ETH does as badly or worse. So again, aside from general model uncertainty and ‘5% of the time strange things happen,’ 5% seems super high for the full flippening to have happened, and felt so at the time.
And of course, again, if ETH is 5% to be above BTC and costs 3% of BTC, then ETH is super cheap relative to BTC! It’s worth more just based on this scenario sometimes happening! Anyone who holds BTC is a complete fool given this other opportunity, unless they are really into balancing a portfolio.
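Here’s the same point as bare arithmetic, a lower bound that ignores every scenario except the flippening itself:

```python
eth_price_in_btc = 0.03  # ETH was trading at roughly 3% of a Bitcoin
p_flippening     = 0.05  # the model's chance that ETH ends the year above BTC

# ETH is worth at least 1 BTC in the flippening scenario and at least 0 otherwise,
# so its expected year-end value in BTC terms is at least:
lower_bound = p_flippening * 1.0
print(lower_bound, ">", eth_price_in_btc)  # 0.05 > 0.03 -> ETH is cheap relative to BTC under this model
```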
It’s important to note when predictions are making super bold claims, especially when the claims do not look that bold.
The Dow being 80% to be above its current value, by contrast, is a very safe and reasonable estimate, since crashes down tend to be large and we expect the market on average to have positive returns. Given rounding, can’t argue with that, and wouldn’t regardless of the outcome unless there was a known factor about to crash it (e.g. something analogous to covid-19 that was knowable at the time).
On to SpaceX. Being 90% confident of anything being accomplished in space travel for the first time by a new institution within a given year seems like a mistake given what I know about space travel. But I have not been following developments, so perhaps this was reasonable (e.g. they had multiple opportunities and well-planned-out missions to do this, and it took a lot to make it not happen). Others can fill this in better than I can. I have no idea how to evaluate their chances of reaching orbit, since that depends on the plausibility of the schedule in question, and how much they would care about the milestone for various reasons.
The self-driving car prediction depends on exactly what would have counted. If this would have to have been on the level of ‘hail a driverless cab to and from large portions of a real city’ then 10% seems very reasonable. If it would have been sufficient to have some (much lesser) way in which a member of the public could ride a driverless car, I think that wasn’t that far away from happening and this would have been too low.
I am very surprised that Scott couldn’t at the time buy an Impossible Burger within a 30 minute walk from his house. I know where his house is. I can buy one now, within a 30 minute walk from my house (modulo my complete unwillingness to set foot in a grocery store, and also my unwillingness to buy an Impossible Burger), and in fact have even passed “meat” sections that were sold out except for Impossible Burgers. Major fast food chains sell them. Of course, they had a very good year, almost certainly much better than expected. So 70% seems fine here, to me, with the 30% largely being that Impossible Burgers don’t do as well as they did, and only a small portion of it being that Scott’s area mysteriously doesn’t carry them. Seriously, this is weird.
The prediction on Pregabalin I have no way to evaluate.
The question of CRISPR-edited babies should have been worded ‘are known to have been born’ or something similar, to make this something we can evaluate. Beyond that, it’s a hard one to think about.
WORLD
27. Britain out of EU: 60%
28. Britain holds second Brexit referendum: 20%
29. No other EU country announces plan to leave: 80%
30. China does not manage to avert economic crisis (subjective): 50%
31. Xi still in power: 95%
32. MbS still in power: 95%
33. May still in power: 70%
34. Nothing more embarrassing than Vigano memo happens to Pope Francis: 80%
Once again, the 95% numbers seem too high to me even when I can’t think of an exact scenario where they lose power, but again it’s not a major mistake.
The Vigano memo seems unusually embarrassing as a thing that happens to the Pope relative to the average year, thinking historically. Most years nothing terribly embarrassing happens to Popes, the continuing abuse scandal seems like the only plausible source for embarrassing things, and Francis seems if anything less likely than par to generate embarrassing things. So if anything 80% seems low, unless I’m forgetting other events.
The China prediction is subjective, and I don’t think I would have ruled it the same way Scott did, so it’s really tough to judge. But in general 50% chance of economic crisis within one year is a very bold prediction, so I’d want to know what made that year so different and whether it proved important.
Now it’s time to talk about the EU, and what happens after you vote for Brexit. It’s definitely been a chaotic series of events. It definitely could have gone differently at various points. Sometimes I wonder what would have happened if Boris Johnson had liked his Remain speech rather than his Leave speech.
I like 60% as a reasonable number for Britain out of EU in 2019. There were a lot of forces pushing Britain to leave given the vote. There were also practical reasons why it was not going to be easy, and overwhelming support for remaining in the EU in parliament if members got to vote their own opinions. Lots of votes throughout the year seemed in doubt several times over, with May and others making questionable tactical decisions that backfired and missing opportunities all the time. The EU itself could have reacted in several different ways. Even now we can see a lot of ways this could have gone.
How about 20% for a second referendum? We can consider two classes of referendum, related but to me they seem importantly distinct.
There’s the class where Her Majesty’s Government decides to do what the EU often does, which is have the voters keep voting until they get the right result. Given the vote was very close, and that leaving turned out to not look like voters were promised, the only thing preventing this from working was some sort of mystical ‘the tribe has spoken’ vibe that took over the country.
Then there’s the class where the EU won’t play ball, or the UK politicians want to vomit when they see the kind of ball the EU was always prepared to play. They’re looking at a full Hard Brexit, and want to put the decision of whether or not to accept that onto the people.
Thus it’s not obvious in hindsight whether the referendum was more likely in the “Britain leaves” world or the “Britain stays” world, given that was already up in the air. Certainly it feels like something unlikely would have had to happen, so we’re well under 50%, but that it wasn’t that far from happening, so it was probably more than 10%. 20% seems fine.
May being 70% to stay in power, however, feels too high. May was clearly facing an impossible problem, while being committed to a horrible path, in a world where prime ministers are expected to resign if they don’t get their way. How often would Britain still be in the EU at the end of the year while May survived? That seems pretty unlikely to me, especially in hindsight, whereas Britain leaving without May seems at least as likely. So May at 70% and leaving at 60% doesn’t seem right.
SURVEY
35. …finds birth order effect is significantly affected by age gap: 40%
36. …finds fluoxetine has significantly less discontinuation issues than average: 60%
37. …finds STEM jobs do not have significantly more perceived gender bias than non-STEM: 60%
(#38 got thrown out as confusing and I don’t know how to evaluate it anyway)
I would have been more confident on the merits in 35 and 37. Birth order effects have to come from somewhere, and the ‘affected’ side gets both directions. And the STEM prediction wins both if perceived bias comes out about the same and if it comes out lower, and I had no particular reason to believe it would come out bigger or smaller.
What’s more interesting, although obviously from a small sample size, is that all three proved true. So Scott’s hunches worked out. Should we suspect Scott was underconfident here?
This could be a case of Unknown Knowns. Scott has good reason to believe in these results, the survey has enough power to find results if they’re there, but Scott’s brain refuses to be that confident in a scientific hypothesis without seeing the data from a well-run randomized controlled trial.
I kid, but also there’s almost certainly a modesty issue happening here. I would predict that Scott would be reliably under-confident in his hunches that he thought enough of to include in his survey.
I started to go over Scott’s personal predictions, but found it mostly not to be a useful exercise. I don’t have the context.
There is of course one obvious thing to note.
PERSONAL – PROJECTS
63. I finish at least 10% more of [redacted]: 20%
64. I completely finish [redacted]: 10%
65. I finish and post [redacted]: 5%
66. I write at least ten pages of something I intend to turn into a full-length book this year: 20%
67. I practice calligraphy at least seven days in the last quarter of 2019: 40%
68. I finish at least one page of the [redacted] calligraphy project this year: 30%
69. I finish the entire [redacted] calligraphy project this year: 10%
70. I finish some other at-least-one-page calligraphy project this year: 80%
PERSONAL – PROFESSIONAL
71. I attend the APA Meeting: 80%
72. [redacted]: 50%
73. [redacted]: 40%
74. I still work in SF with no plans to leave it: 60%
75. I still only do telepsychiatry one day with no plans to increase it: 60%
76. I still work the current number of hours per week: 60%
77. I have not started (= formally see first patient) my own practice: 80%
78. I lease another version of the same car I have now: 90%
None of the personal projects happened. Almost all the professional predictions happened, most of which predict the continued status quo. That all seems highly linked, more like two big predictions than lots of different predictions. One would want to ask what the actual relevant predictions were.
Overall, clearly this person is trying. And there’s clearly a tension between getting 95% of 95% predictions right, and having most of them actually be 95% likely. Occasionally you screw up big and your 95% is actually 50%, and that can often be the bulk of the times such things fail. Or some of them are 85%, but again that can easily be the bulk of the failures. So it’s not entirely fair to complain about a 95% that should be 99% unless standards are super high.
Mostly, I’d like to encourage looking back more in this type of way when possible, in addition to any use of numeric metrics.
I also should look at my own predictions, but also want to make that a distinct post, because its subject matter will have a different appeal on its own merits.
I hope this was helpful, fun, interesting or some combination of all three. I don’t intend it to be perfectly thought out. Rather, I thought it was a useful thing for those interested, so I’d write it quickly, but not let it take too much time/effort away from other higher priority things.
Evaluating Predictions in Hindsight
Link post
Epistemic Status: Confident I have useful things to say, but apologies for the long post because I don’t think it’s worth my time to make it shorter. Better to get thoughts down for those who want to read them.
Scott Alexander’s latest post points to the question of how best to evaluate predictions. The way he characterized leading predictions on Trump and Brexit, that ‘prediction is hard,’ instinctively bothered me. Characterizing the coronavirus situation the same way bothered me even more.
(Before anyone asks, I was ahead of the public but definitely dropped the ball on the early coronavirus prediction front, in the sense that I failed to make a lot of money and failed to warn the public, and waited far too long to move out of New York City. A large part of that fault was that I predicted things badly by my own standards. A large part was other failures. I take full responsibility. But that’s not what I want to talk about right now.)
How does one evaluate past predictions?
As someone who used to place wagers and/or price prediction markets for a living, and later traded with what I believe are some of the best traders on the planet, I’ve thought about this question a lot.
We can divide situation into easy mode, where there are a large number of independent predictions made and robust markets or probabilities against which one can evaluate those predictions, and hard mode, where this is not true, and you are often evaluating an individual prediction of a singular event.
Easy Mode
Most of my time was spent largely in the ‘easy mode.’ Here, easy mode is when one is predicting lots of things for which there are established market prices, or at a minimum baseline fair values that one can be evaluated against. You have a lot of data points, and are comparing your predictions and decisions to a known baseline.
Easy mode makes it realistic to seek a metric that cannot be easily fooled, where you can use your results as evidence to prove what you are doing ‘works’ in some sense.
There is no one metric that is best even in easy mode. There are a few different ones that have merit. I’ll go through at least some of them.
Method One: Money Talks, Bull*** Walks
Did you make money?
If you did, congratulations. Good predicting.
If you didn’t, sorry. Bad predicting. If you didn’t bet, it doesn’t count.
This method has a lot to recommend it. It’s especially great over long periods of time with lots of distinct opportunities of relatively constant size and odds, where gains or losses from individual trades are well bounded.
There are however some severe problems, and one should try to seek other method.
If your trades have tail risk, and can have huge positive or negative payoffs, then that risk can often dominate your expected value (your Alpha) but not impact your observed results, or impact your observed results out of proportion to (or even in the opposite direction of) the Alpha involved.
If you sometimes trade big and sometimes trade small, that reflects your confidence and should be weighed, but also can lead to your bigger trades being all that matters. Which trades are big is often a matter of opportunity and circumstance, or an attempt to manipulate results, rather than reflecting what we want to measure.
Often trades and predictions are highly correlated even if they don’t look correlated.
Trading results often reflect other trading skills, such as speed and negotiation and ability to spot obvious errors. It’s very possible to have a trading strategy that spends most of its time doing things at random, but occasionally someone else typos or makes a huge mental error or there’s a bug in someone’s code, or you find a really dumb counter-party who likes to play very big, and suddenly you make a ton.
The exact method of trading, and which instruments are used, often has a dramatic effect on results even though it expresses the same underlying predictions.
Adverse selection shows up in real world trades where and how you least expect it to. Also exactly how you most expect it to.
And so on. It gets ugly.
These and other things allow someone trying to demonstrate skill to manipulate their results and usually get away with it, or for someone with good luck to look much better than they are, often for a remarkably long time.
In general, see the book Fooled by Randomness, and assume it’s worse than that.
Still, it’s money, and it’s useful.
Method Two: Trading Simulation, Where Virtual Money Talks
In this method, we start with a set of predictions, or a model that makes those predictions from data. Then we establish rules for how it will trade based on that information, and see whether the system makes money.
The big risk is that we can cheat. Our system would have made lots of money. Uh huh. Lots of ways to cheat.
Thus, the best simulations are where the person making the predictions is distinct from the person running the simulation, and the simulation runs in real time.
My good friend Seth Burn will run these simulations on sports models. He’ll take Nate Silver’s or ESPN’s or someone else’s predictions, translate them into win probabilities if necessary, then see what would happen if they were allowed to wager at market odds using Kelly betting. Sometimes it goes well. Other times it goes poorly. I generally consider making such models go full Kelly, without adjusting beliefs at all for market odds, a bit harsh. You Never Go Full Kelly. But I do get it.
When I run simulations on my own stuff, or often other people’s stuff, I will instead use threshold betting. If something is good enough, one unit will be wagered, and sometimes this will be scaled up to two or three units gradually as perceived edge improves. But we won’t force the system to Go Full Kelly. Because that would lead to heavily distorted results. The way you know if your plan is profitable is if it can make money a little at a time, not whether it would get lucky or blow itself up if you didn’t take reasonable precautions. And again, You Never Go Full Kelly.
Simulated trading is vital if you are planning to do actual trading. If you don’t simulate the things you intend to actually do, you can find yourself effectively testing a dramatically different hypothesis than the hypothesis you expected to test. That can end very badly.
These types of tests are good sanity checks and gut checks, all around. They make it much harder to fool yourself, if implemented reasonably.
Of course, in other ways, they make it easier to fool yourself.
Overfitting on results of any kind is highly dangerous, and this can encourage that and make it much worse. Often simulations are doing much more highly correlated things than one realizes, on any number of levels. Unscrupulous people of course can easily manipulate such results, it can become the worst kind of p-hacking taken up a level.
A big risk that is that you can think that your predictions are good because you have a handful of data errors. If your predictions are remotely sane, then any large error in historical prices will be something your simulation jumps on. You’ll make a ton on those, whereas in real life any attempt to take advantage of those opportunities would not have been allowed, and also not all that impressive an act of prediction. Guarding against this is super important, and usually involves manually looking at any situations where you think your edge is super large to ensure your recorded market prices are real.
Most of all, this method doesn’t actually reward accurate predictions. It rewards predictions that tend to disagree in the correct direction. That’s a very different thing.
Thus, think of this as an indicative and necessary method of evaluation wherever it is available, but in no way as a sufficient method, even when implemented properly in real time. But certainly, if the real time simulated test keeps working, I will consider updating my priors away from the market prices, and putting real money on the line after a while.
Method Three: The Green Knight Test
The Green Knight test gets its name from a character in the Arthurian legend. You get to swing at The Green Knight, then The Green Knight gets to swing at you.
Thus, you get to trade against the market at its fair price. Then the market gets to trade against you, at the model’s fair price, for the same amount. So if it’s a prediction market and the market says 50% and you say 60%, your net price is 55%. Whereas if you say 90%, your average price will be 70%, and you’ll do a lot worse.
How much you are allowing yourself to consider market prices, when deciding on your own beliefs, is your decision. If the answer isn’t ‘quite a lot’ it can get very expensive.
The point of The Green Knight Test is to use markets and trades to put you to the test, but to treat your model and the market as equals. The question is not whether you can directionally spot market inefficiencies. That’s (relatively) easy. I firmly believe that one can spot some amount of inefficiency in any market.
The question is, can you come up with better values than the market? That’s very, very hard if your process doesn’t heavily weigh the existing market prices. If you can pass this test without looking directly at the market prices at all, and you’ve confirmed that the market prices in question were real, your prices really are better than the market’s prices.
The even harder version of the test is to fully reverse the scenario. You take only the role of the market maker, allowing the market to trade at your model’s fair prices. If you can survive without a substantial loss, now you can fully reject the market’s prices, and treat your model’s prices as real.
The advantage of The Green Knight Test is it reminds you exactly how much you do not know, and holds you to a very high standard. Unless you are doing a pure math exercise like pricing derivatives, it’s expected that you will fail this test. It’s perfectly fine. The goal is to fail it less, and to remember that you fail it. Except when you actually pass it, then the sky’s the limit.
And yes, on one occasion that didn’t involve a derivative, I did pass this test convincingly. That’s a story for another day.
Method Four: Log Likelihood
I have no idea why I needed actual Eliezer Yudkowsky to first point out to me I should be using this, but once he did point this out it became obvious. Log likelihood for probabilistic outcomes are the obvious go-to standard thing to try.
If your goal is to reward accuracy and punish inaccuracy, log likelihood will do that in expectation. Your score on any given event is the natural log of your model’s probability of the outcome that happened.
Every time you improve your probability estimates, your expected score improves. Make your model worse, and it gets worse. Be highly overconfident and it will cost you quite a lot.
The best feature of log likelihood is that it provides perfect incentives.
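A minimal sketch of the scoring rule, with made-up probabilities and outcomes purely for illustration:

```python
import math

def log_likelihood(probs, outcomes):
    """Sum of ln(probability the model assigned to what actually happened).
    probs: the model's probability that each event happens.
    outcomes: 1 if the event happened, 0 if it did not."""
    return sum(math.log(p if o == 1 else 1 - p) for p, o in zip(probs, outcomes))

outcomes = [1, 0, 1, 1, 0]             # what actually happened
model_a  = [0.7, 0.3, 0.6, 0.8, 0.4]   # hypothetical model
model_b  = [0.6, 0.4, 0.5, 0.7, 0.5]   # a vaguer version of the same model

print(log_likelihood(model_a, outcomes))   # ≈ -1.96
print(log_likelihood(model_b, outcomes))   # ≈ -2.76
```

The absolute numbers mean little by themselves; the comparison between the two models, scored on identical outcomes, is where the information is.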
The problem is that when you look at a score, you have no idea what you are looking at. There is no intuitive association between an LL score and a level of accuracy in prediction. Part of that is that we’re not used to them. The bigger issue is that a score doesn’t mean anything outside of the exact context and sample the score is based upon.
LL scores only mean something when you compare model one to model two on the exact same set of predictions.
They are all but useless with even tiny variations in what predictions are being scored. One additional unlikely event happening, or even one event being a foregone conclusion rather than a coin flip, will wipe out massive gains from model improvements, sometimes across thousands of predicted events.
What is meaningful is, we have this set of predictions, and we compare it to the market’s implicit predictions, and/or to another model or version of the model, and see which is better. Now we can get an idea of the magnitude of improvement (although again, what that magnitude means won’t be intuitive, other than to compare different score gaps with each other).
All of that skepticism assumes that everyone’s model is doing something sane. If someone is making huge mistakes, LL scores will pick it up very loudly as long as there is time to get punished for those mistakes enough times. If you’re going around saying 99% on 75% shots, or 20% on 50% shots, that will cut through a lot of noise.
Of course, if you were making errors that severe, there hopefully isn’t much need to use LL in order to realize that.
Method Five: Calibration Testing
This is the way Scott Alexander scores his predictions.
The principle is that your 60% predictions should happen 60% of the time, your 70% predictions should happen 70%, and so on. If they happen more often than that, you’re under-confident. If they happen less than that, you’re over-confident.
This is certainly a useful thing to check. If you’re consistently coming in with bad calibration, or are reliably badly calibrated at a particular point (e.g. perhaps your 10% chances are really 5%, but your 30%+ chances are roughly fair) then you can correct that particular mistake.
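A minimal sketch of the bookkeeping (nothing here is specific to how Scott actually tallies his; it just groups predictions by stated probability and compares to the observed frequency):

```python
from collections import defaultdict

def calibration_table(probs, outcomes):
    """For each stated probability, report how many predictions used it
    and how often those events actually happened."""
    buckets = defaultdict(list)
    for p, o in zip(probs, outcomes):
        buckets[p].append(o)
    return {p: (len(os), sum(os) / len(os)) for p, os in sorted(buckets.items())}

# e.g. {0.6: (20, 0.65), 0.9: (10, 0.80)} means your 60% calls came in 65% of
# the time (fine) and your 90% calls only 80% (over-confident, if it holds up
# over a larger sample).
```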
At a minimum, this is a bar that any predictor needs to clear if it wants to keep making probabilistic predictions with a straight face.
If you won’t put probabilities on your predictions, this test won’t work, except that we’ve already shown you aren’t doing very good predicting.
In most cases this will quickly reveal that someone isn’t trying to choose realistic probabilities. They’re saying words that they think will have a particular impact.
Such people can still be making useful predictions. To choose a very blatant example of someone doing this constantly, when Scott Adams says something is 100% going to happen, he neither believes this nor considers himself to be lying. To him, that’s just ‘good persuasion’ to anchor people high and force them to update. What he means is, ‘I think event X is more likely than you would think, so increase your probability estimate of X.’
There might or might not be a ‘substantially more than 50%’ actual prediction in there. If you read more than that into his statement, he’d say that’s your fault for being bad at persuasion.
Certainly he does not think that the numerous times something he called 100% to happen did not happen should send him to Bayes’ hell or cause people to dismiss his statements as worthless. He also doesn’t think one should ignore such misses entirely, but from his perspective, why would you have taken the stated numbers literally in the first place?
Thus, asking if someone is well-calibrated is a way of asking if they are for reals attempting to provide accurate information, and if they have developed some of the basic skills required to do so. Learning whether this is so is very good and useful.
The problem with calibration testing is that you can get a perfect score on calibration without providing any useful predictions.
The direct cheat is one option. It’s very easy to pick things in the world that are 90% to happen, or 75%, or 50%, or 1%, if you are making up the statements yourself.
The more subtle cheat is another. You can have your 75% predictions be half things that are definitely true, and half things that are true half the time. Maybe you’re making a real error when you conflate them. Maybe you’re doing it on purpose. Hard to say.
This is typically what happens when people who are ‘well-calibrated’ give 90% (or 95% or 98% or 99.9%) probabilities. They’re mostly building in a chance they are making a stupid mistake or misunderstood the question, or other similar possibilities. Which you have to do.
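A quick illustration of the subtle cheat from two paragraphs up, with a made-up mix of questions:

```python
import random

random.seed(0)
sure_things = [1] * 50                                    # things that were ~certain
coin_flips  = [1 if random.random() < 0.5 else 0 for _ in range(50)]
labeled_75  = sure_things + coin_flips                    # all filed under "75%"

print(sum(labeled_75) / len(labeled_75))   # comes out near 0.75, so the
                                           # calibration chart looks perfect
```

Every individual 75% in that bucket was badly wrong in one direction or the other, but the bucket as a whole grades out clean.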
Calibration is a good sanity check. It’s no substitute for actual evaluation.
Method Six: The One Mistake Rule
This method is where you look for an obviously wrong probability. Obviously wrong can be on the level of ‘a human who understands the space would know this instantly’ or it can be on the level of ‘upon reflection that number can’t possibly be right, or it contradicts your other answers that you’re still sticking with.’ The level required to spot a mistake, and how big a mistake you can spot, are ways of measuring how good the predictions are.
Often when you find an obviously wrong statement, you find something important about whoever made the statement. In many cases, you learn that person is a bullshit artist. In other cases, you learn that there’s something important they don’t or didn’t know or understand, or something they overlooked. Or you find something important about their world view that caused this strange answer.
And of course sometimes they’re right and you’re wrong. Also a great time to learn something.
Same thing for a model. If you find a model saying something clearly wrong, then you can use that to find a flaw in the model. Ideally you can then fix the flaw. Failing that, you hope to know what the flaw is so you can correct for it if it happens again – you can flag the model explicitly as not taking factor X into account.
Other times they made a sign or data entry error. There’s always bugs in the code. It’s not always a revelation.
That leads into the concept of evaluating an individual prediction. Which is what one must do in hard mode.
Hard Mode
In hard mode, our metrics don’t work. We need to use reason to think carefully about particular spots.
Looking back, we ask the question of whether our predictions and probabilities were good, what reasonable predictions and probabilities would have been and why, and what information we should have looked for or would have changed our opinions. There are a few different ways to evaluate.
One question to ask is, suppose we were to rewind time. How often would things again turn out the way they did, versus another way? How close was this event’s outcome? Could random events from there have changed the outcome often? What about initial conditions you had no way of knowing about? What about conditions you didn’t know about but could have checked, or should have checked, and what were those conditions? What would have had to have gone differently?
In some cases, one looks back and the result looks inevitable. In others, it was anything but inevitable, and if it had rained in different cities, or one person had made a different hard decision, or news stories had happened to slant a different way or something, on the crucial day the other candidate gets elected. In others, it was inevitable if you were omniscient, but given your information it was anyone’s game.
Sports are a great tool for this question because remarkably few things in sports are truly inevitable. Sports are full of guessing games and physical randomness. Any Given Sunday really does mean something, and one can look back and say a game was 50% vs. 65% vs. 80% vs. 95% vs. 99% vs. 99.9% vs. 99.99% for the favorite to win. The question of ‘what was the real probability’ is truly meaningful. Someone who said the wrong number by a sufficient margin can be objectively wrong, regardless of whether that favorite actually won.
That’s not true for many other things, but it is a useful perspective to treat it as if it were more true, from more perspectives and in more ways, than people think.
Obviously this is not an exact science.
For sports, I could go into endless examples and the merits of various methods of evaluation. One good standard there is ‘what would be the odds if they played another game next week?’ Which has some weird stuff in it but is mostly a concrete way of thinking about ‘what would happen if we re-ran the event and randomized the details of the initial conditions?’
Another good general approach is ‘what do I now know that I didn’t know before, and how does that change my prediction?’ Where did my model of events go wrong?
A third thing to do is to look at the components of your predictions. In hindsight, do the implied conditional probabilities make sense? When things started to happen, how did you update your model? If they had gone differently, how would you have updated, and would those updates have added up to an expected value close to zero?
A fourth thing to do is look at the hidden assumptions. What are your predictions assuming about the world that you didn’t realize you were assuming, or that turned out not to be true? Often you can learn a lot here.
A key takeaway from doing my analysis below of various predictions is that my opinion of the prediction often depends almost not at all on the outcome. Your prediction’s logic is still its logic. In many cases, the actual outcome is only one additional data point.
One cannot point out too many times how easy it is to fool yourself with such questions, if you are looking to be fooled, or even not looking to not be fooled.
Since most of my audience is not deep into the sportsball, I will illustrate further only with non-sports examples.
It makes sense to start with the two that inspired this post, then go from there.
Note that I’ll be doing political analysis, but keeping this purely to probabilities of events. No judgments here, no judgments in the comments, please.
Scott’s two examples
Scott’s two examples from his recent post were Brexit and the 2016 Presidential Election.
In both cases, predictors that are at least trying to try, such as Nate Silver and Tetlock’s forecasters, put the chances of things going the historical way at roughly 25% right before the elections happened. Also in both cases, mainstream pundits and conventional wisdom mostly claimed at the time that the chance was far lower, in many cases very close to (but not quite) 0%. In both cases, there were people who predicted the other outcome and thought it was likely to happen, but not many. Also in both cases, the result may have partially been caused by the expectation of the other result. If voters had realized the elections were close, voters might have decided differently.
Importantly, in both cases, the polls, which are the best first-level way to predict any election, had the wrong side ahead but by amounts that historically and statistically were insufficient to secure victory.
Both elections were very close. Remain had almost as many votes as leave, to the extent that different weather in different areas of the United Kingdom could have made the difference (London voted heavily remain, other places for leave). Trump lost the popular vote and barely won the electoral college, after many things broke his way in the final week and day.
These are textbook cases, in this system, of results that were very much not inevitable. It is very, very easy to tell stories of slightly different sequences of events in the final week or days that end in the opposite result. If everything visible had been the same but the outcome went the other way, it would not have been more surprising than what happened even in hindsight.
As we were warned would happen, both results were then treated as far more inevitable than they actually were. Media and people in general rushed to form a narrative that these results were always going to happen. The United Kingdom treated a tiny majority as an inviolate will of the people rather than what it was, evidence that the country was about evenly split. Everyone wrote about the United States completely differently than if a hundred thousand votes had been distributed differently, or any number of decisions had been made a different way.
If you bet on Trump or on Leave at the available market prices, you made a great trade.
But, if you claimed that those sides were definitely going to win, that it was inevitable (e.g. the Scott Adams position), then you were more wrong than those who said the same thing about Remain and Clinton. This seems clear to me despite your side actually winning.
The only way to believe that predicting a Trump win as inevitable was a reasonable prediction is to assume facts about the world not in evidence. To me, it is a claim that the election either was stolen, or would have been stolen if Trump had been about to lose it. Same or similar thing with Leave.
The generalized version of that, as opposed to election fraud, is a more common pattern than is commonly appreciated. The way that things that look close are actually inevitable is that the winning side had lots of things up their sleeve, or had effectively blocked the scenarios where they might lose, in ways that are hard to observe. Try to change the outcome and the world pushes back hard. They didn’t pull out their ace in the hole because they didn’t need it, but it was there.
I don’t think that Nate Silver’s ~25% chance for Trump (and 10% chance to win despite losing the popular vote!) was merely what Scott Alexander called it, a bad prediction but ‘the best we could do.’ I think it was actually a pretty great prediction; the reasonable hindsight range is something like 20% to 40%. You need to give a decent chunk of the distribution to Trump, and he can’t be the favorite. If your prediction was way off of this in either direction, I think you were wrong. I think Remain vs. Leave follows a very similar pattern.
(For the 2020 Election, I similarly think that anyone who thinks either candidate is a huge favorite is wrong, and will almost certainly in hindsight still have been wrong in this way regardless of the eventual outcome, because so many things could happen on multiple fronts. To be confident you’d need to be confident at a minimum of the politics and the economics and the epidemiology. That doesn’t mean it will be close on election day, or in October.)
Scott’s calibration exercise
Scott’s predictions are a clean set of probabilities that are clearly fair game. Sticking there seems reasonable.
Let’s look at Scott’s predictions for the year 2019 next. How do they look?
By his convention, strikethroughs mean it didn’t happen, lack of a strikethrough means it happened.
Politics (Reminder, strategic discussions only, please)
Donald Trump remains president: 90%
Donald Trump is impeached by the House: 40%
The House impeached Trump for something that, as of the time of the prediction, hadn’t happened yet. It is clear the actual barrier to convincing Pelosi was high. If things had been enough worse, impeachment might not have happened because of resignation. So you could reasonably say the 40% number looks somewhat high in hindsight. The argument for it not being high is if you think Trump always keeps escalating until impeachment happens, especially if you think Trump actively wanted to be impeached. I’m inclined to say that on its own 40% seems reasonable, as would have 20% or 30%.
The 90% number is all-cause remaining President. Several percent of the time Trump dies of natural causes, as he’s in his 70s. Several percent more has to be various medical conditions that prevent him from serving. Again, he’s in his 70s. World leaders also sometimes get shot, we’ve lost multiple presidents that way. Also, he’s impulsive and weird and looks like he often hates being president so maybe he decides to declare America great again and quit. And if there’s a dramatic change to world conditions and the USA doesn’t have a president anymore, he’s not president. Small probabilities but they add up. The majority of the 10% has to be baked in. We can reduce some of those a little in hindsight but not much.
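To make the arithmetic explicit (these component numbers are my own rough guesses, purely to show how little room 90% leaves):

```python
# All numbers are illustrative guesses, not claims of precision.
dies_or_incapacitated = 0.04   # health risk for a man in his 70s over a year
assassination         = 0.01
quits_or_world_breaks = 0.02
baked_in_exit         = dies_or_incapacitated + assassination + quits_or_world_breaks

political_removal = 0.10 - baked_in_exit   # what a 90% prediction leaves over
print(round(baked_in_exit, 2), round(political_removal, 2))   # 0.07 and 0.03
```

Three percent left over for removal, resignation under pressure, or anything else political is not much, given a 40% chance the House is angry enough to impeach.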
So saying 90% is actually giving a very small probability of Trump leaving office for other reasons, especially given a 40% chance of impeachment – his probability of surviving politically conditional on the House being willing to impeach has to be at least 90%. Given the ways Trump did react and might have reacted to such conditions, and that some of the time the underlying accusations are much worse than what we got, this looks overconfident at 90% and I’d prefer to see 80%. But a lot of that is the lack of precision available when you only predict by 10% increments; 85% would have been fine.
3. Kamala Harris leads the Democratic field: 20%
4. Bernie Sanders leads the Democratic field: 20%
5. Joe Biden leads the Democratic field: 20%
6. Beto O’Rourke leads the Democratic field: 20%
(Disclosure: at PredictIt I sold at various points all but three candidates, one of those three was Joe Biden, and my mistake in hindsight was not waiting longer to sell a few of them, along with not selling one of the other two when I had the chance.)
Scott’s nominee predictions, however, seem really sloppy. These four candidates were not equally likely. The prediction markets didn’t think so, their backgrounds and the polls didn’t think so. The dynamics we saw play out don’t think so, either. Things came down to a former vice president to a popular president who led in the polls most of the way versus the previous cycle’s runner up.
Putting them on equal footing with a random congressman from Texas who lost a close race once while looking exciting, or a more traditionally plausible alternative candidate like Kamala Harris, doesn’t age well.
Nor does having these all be 20% and adding to 80%, leaving 20% left for the other 16 or so candidates including Elizabeth Warren, plus any unexpected late entries.
The defense of the 20% on Biden is to say Biden was known to be old and a terrible candidate who predictably ran a terrible primary campaign, so he was overrated even though he ended up winning, while Harris and O’Rourke were plausibly very good candidates given what we knew at the time. I do think there’s broad range for such arguments, but not to this extent.
This is where calibration makes you look good but shouldn’t. Name the four leading candidates (or at least four plausible-to-be-top-four candidates, to be generous) and give them each 20% and your calibration will look mostly fine even if that evaluation doesn’t make sense and the remaining field is really more like 30-40% than 20%.
This is also where the human element can warp your findings. There’s a lot of ‘X has to be higher than Y’, or ‘X ~= Y here looks sloppy’ or ‘X can’t be an underdog given Z’ or what not. We have a lot of rules of thumb, and those who break those rules will look worse than they deserve, while those that follow those rules but otherwise talk nonsense will look better.
As usual, use a variety of evaluation methods and switch them up when it looks like someone might be Goodharting.
7. Trump is still leading in prediction markets to be Republican nominee: 70%
8. Polls show more people support the leading Democrat than the leading Republican: 80%
This 70% number seems like a miss on the low side to me if you accept Scott’s other predictions above. In Scott’s model, Trump is 90% to be President, which means he’s now twice as likely to be President but losing the nomination fight as he is to not be President at all, despite at the time facing zero credible opposition. If you again take out the 5%+ chance that Trump is physically unfit for office and leaves because of it, that makes it many times more likely, in Scott’s model, that Trump can’t get the nomination but stays President, versus him stepping down. I can’t come up with a good defense of less than 80% or so in this context.
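The inconsistency is easier to see if you write out the minimum overlap that Scott’s own numbers force (a sketch using only his stated 90% and 70%):

```python
p_president       = 0.90   # prediction 1
p_leading_markets = 0.70   # prediction 7

# Smallest possible probability of "still President, but no longer the
# prediction-market favorite for the nomination" consistent with those two:
p_president_not_leading = max(0.0, p_president - p_leading_markets)
p_not_president         = 1 - p_president

print(round(p_president_not_leading, 2), round(p_not_president, 2))   # 0.2 and 0.1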
Predicting the Democratic candidate as likely to be ahead seems right, as that had been largely both true and stable for a while for pretty much any plausible Democratic candidate. 80% seems a little overconfident if we’re interpreting this as likely voters, but not crazy. A year is a long time, the baseline scenario was for a pretty good economy, and without anything especially good for Trump happening we saw some close polls.
Of course, if we interpret this as all Americans then 80% seems too low, since non-voters and especially children overwhelmingly support Democrats. And if we literally read this as people anywhere then it should be 95% or more. A reminder of how important it is to word predictions carefully.
9. Trump’s approval rating below 50: 90%
10. Trump’s approval rating below 40: 50%
90% seems overconfident to me, although 80% would have been too low. It’s saying that the world is definitely in ‘nothing matters’ mode and meaningful things are unlikely to happen. This of course goes along with the 80% chance he’ll be behind in the polls, since if he’s above 50% approval he’s going to be ahead in the polls almost every time.
50% for approval ratings below 40 seems clearly more right than 40% or 60% would have been. This is an example of predictions needing to be evaluated at the appropriate level of precision. It’s easy to say “roughly 50%” here, so the ‘smart money’ is the ones who can say 53% instead of 50% and have it be accurate. So credit here for staying sane, which is something.
11. Current government shutdown ends before Feb 1: 40%
12. Current government shutdown ends before Mar 1: 80%
13. Current government shutdown ends before Apr 1: 95%
14. Trump gets at least half the wall funding he wants from current shutdown: 20%
15. Ginsberg still alive: 50%
I would not have been 95% confident that the shutdown wouldn’t extend past April 1. It doesn’t seem implausible to me at all that the two sides could have deadlocked for much longer, since it’s a zero-sum game with at least one of the players as a pure zero-sum thinker and where the players hate each other. There were very plausible paths where there were no reasonable lines of retreat. Once we get into March, chances of things resolving seem like they go down, not up. I think the 40% and 80% predictions look slightly high, but reasonable.
I am not enough of a medical expert to speak to Ginsberg’s chances of survival, but I’m guessing 50% was too low.
ECON AND TECH
16. Bitcoin above 1000: 90%
17. Bitcoin above 3000: 50%
18. Bitcoin above 5000: 20%
19. Bitcoin above Ethereum: 95%
20. Dow above current value of 25000: 80%
21. SpaceX successfully launches and returns crewed spacecraft: 90%
22. SpaceX Starship reaches orbit: 10%
23. No city where a member of the general public can ride self-driving car without attendant: 90%
24. I can buy an Impossible Burger at a grocery store within a 30 minute walk from my house: 70%
25. Pregabalin successfully goes generic and costs less than $100/month on GoodRx.com: 50%
26. No further CRISPR-edited babies born: 80%
The first question I always wonder when I see predictions about Bitcoin is whether the prediction implies a buy or implies a sale.
At the time of these predictions, Bitcoin was trading at roughly $3,500.
Scott thought Bitcoin was a SCREAMING BUY.
The reason this represents a screaming buy is that Scott has Bitcoin at almost 50% to be trading higher rather than lower a year out. But when Bitcoin is higher, it is often double its current price or more, which in fact happened. You have a long tail in one direction only. Even in Scott’s numbers, the 20% vs. 10% asymmetry at 5000 and 1000 points towards this.
Was that right, given what he knew? I… think so? Probably? I was already sufficiently synthetically long that I didn’t buy (if you’re founding a company that builds on blockchain, investing more in blockchains is much less necessary), but I did think that the mean value of Bitcoin a year later was probably substantially higher than its $3,500 price.
What is clearly wrong is expecting so little variance in the price of Bitcoin. We have Bitcoin more likely to be in the 3000-5000 range, or the 1000-3000 range, than to be above 5000 or below 1000. That doesn’t seem remotely reasonable to me, and I thought so at the time. That’s the thing about Bitcoin. It’s a wild ride. To think you shouldn’t be on the ride at all, given the upside available, you have to think the ride likely ends in a crash.
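Here is roughly how I read the implied distribution. The bucket probabilities follow from Scott’s three thresholds; the conditional averages are my own guesses, included only to show how much work the long tail does against a ~$3,500 spot price:

```python
# Bucket probabilities implied by 90% > 1000, 50% > 3000, 20% > 5000:
buckets = {
    'below 1000': (0.10,  600),   # (probability, assumed average price in bucket)
    '1000-3000':  (0.40, 2200),
    '3000-5000':  (0.30, 3900),
    'above 5000': (0.20, 9000),   # the fat-tail guess does the heavy lifting
}

expected_price = sum(p * avg for p, avg in buckets.values())
print(expected_price)   # ≈ 3,910 on these made-up conditional averages
```

Different guesses for the conditional averages move the answer around, but as long as the ‘above 5000’ scenario averages well above 5000, the mean comes out above the spot price even with only ~50% odds of finishing higher at all.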
Bitcoin above Ethereum at 95% depends on how seriously you treat the correlation. At the time Ethereum was roughly $120 per coin, or about 3% of a Bitcoin. Most of Ethereum’s variance for years has been Bitcoin’s variance, and they’ve been highly correlated.
Note that this isn’t ETH market cap above BTC market cap, it’s ETH above BTC, which requires an extra doubling.
If we think about three scenarios – BTC up a lot, BTC down a lot, BTC mostly unchanged – we see that ETH going up 3000% more than BTC seems like a very crazy outcome in at least two of those scenarios. Given how little variance we’ve put into BTC, giving ETH that much variance in the upside or mostly unchanged scenarios doesn’t make sense.
So the 5% probability is mostly coming from a BTC collapse that ETH survives. BTC being below 1000 is only 10% in this model. Of that 10%, most of the time this is a general blockchain collapse, and ETH does as badly or worse. So again, aside from general model uncertainty and ‘5% of the time strange things happen,’ 5% seems super high for the full flippening to have happened, and felt so at the time.
And of course, again, if ETH is 5% to be above BTC and costs 3% of BTC, then ETH is super cheap relative to BTC! It’s worth more just based on this scenario sometimes happening! Anyone who holds BTC is a complete fool given this other opportunity, unless they are really into balancing a portfolio.
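The arithmetic on that last point, as a sketch (the ‘worth at least one BTC in the flippening scenario’ assumption is mine, and deliberately conservative):

```python
p_flippening  = 0.05   # Scott's implied chance ETH ends the year above BTC
current_ratio = 0.03   # ETH was trading around 3% of a BTC at the time

# Value of ETH, in BTC, from the flippening scenario alone, assuming ETH is
# worth at least 1 BTC whenever it ends up above BTC and zero otherwise:
floor_value = p_flippening * 1.0

print(floor_value > current_ratio)   # True: 0.05 BTC of value vs. a 0.03 BTC price
```

And that values the other 95% of scenarios at literally zero.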
It’s important to note when predictions are making super bold claims, especially when the claims do not look that bold.
The Dow being 80% to be above its current value, by contrast, is a very safe and reasonable estimate, since crashes down tend to be large and we expect the market on average to have positive returns. Given rounding, can’t argue with that, and wouldn’t regardless of the outcome unless there was a known factor about to crash it (e.g. something analogous to covid-19 that was knowable at the time).
On to SpaceX. Being 90% confident of anything being accomplished in space travel for the first time by a new institution within a given year seems like a mistake given what I know about space travel. But I have not been following developments, so perhaps this was reasonable (e.g. they had multiple opportunities and well-planned-out missions to do this, and it took a lot to make it not happen). Others can fill this in better than I can. I have no idea how to evaluate their chances of reaching orbit, since that depends on the plausibility of the schedule in question, and how much they would care about the milestone for various reasons.
The self-driving car prediction depends on exactly what would have counted. If this would have to have been on the level of ‘hail a driverless cab to and from large portions of a real city,’ then 10% seems very reasonable. If it would have been sufficient to have some (much lesser) way in which a member of the public could ride a driverless car, I think that wasn’t that far away from happening and this would have been too low.
I am very surprised that Scott couldn’t at the time buy an Impossible Burger within a 30 minute walk from his house. I know where his house is. I can buy one now, within a 30 minute walk from my house (modulo my complete unwillingness to set foot in a grocery store, and also my unwillingness to buy an Impossible Burger), and in fact have even passed “meat” sections that were sold out except for Impossible Burgers. Major fast food chains sell them. Of course, they had a very good year, almost certainly much better than expected. So 70% seems fine here, to me, with the 30% largely being that Impossible Burgers don’t do as well as they did, and only a small portion of it being that Scott’s area mysteriously doesn’t carry them. Seriously, this is weird.
The prediction on Pregabalin I have no way to evaluate.
The question about CRISPR-edited babies should have been worded ‘are known to have been born’ or something similar, to make this something we can evaluate. Beyond that, it’s a hard one to think about.
WORLD
27. Britain out of EU: 60%
28. Britain holds second Brexit referendum: 20%
29. No other EU country announces plan to leave: 80%
30. China does not manage to avert economic crisis (subjective): 50%
31. Xi still in power: 95%
32. MbS still in power: 95%
33. May still in power: 70%
34. Nothing more embarrassing than Vigano memo happens to Pope Francis: 80%
Once again, the 95% numbers seem too high to me even when I can’t think of an exact scenario where they lose power, but again it’s not a major mistake.
The Vigano memo seems unusually embarrassing as a thing that happens to the Pope relative to the average year, thinking historically. Most years nothing terribly embarrassing happens to Popes, the continuing abuse scandal seems like the only plausible source for embarrassing things, and Francis seems if anything less likely than par to generate embarrassing things. So if anything 80% seems low, unless I’m forgetting other events.
The China prediction is subjective, and I don’t think I would have ruled it the same way Scott did, so it’s really tough to judge. But in general 50% chance of economic crisis within one year is a very bold prediction, so I’d want to know what made that year so different and whether it proved important.
Now it’s time to talk about the EU, and what happens after you vote for Brexit. It’s definitely been a chaotic series of events. It definitely could have gone differently at various points. Sometimes I wonder what would have happened if Boris Johnson had liked his Remain speech rather than his Leave speech.
I like 60% as a reasonable number for Britain out of EU in 2019. There were a lot of forces pushing Britain to leave given the vote. There were also practical reasons why it was not going to be easy, and overwhelming support for remaining in the EU in parliament if members got to vote their own opinions. Lots of votes throughout the year seemed in doubt several times over, with May and others making questionable tactical decisions that backfired and missing opportunities all the time. The EU itself could have reacted in several different ways. Even now we can see a lot of ways this could have gone.
How about 20% for a second referendum? We can consider two classes of referendum, related but to me they seem importantly distinct.
There’s the class where Her Majesty’s Government decides to do what the EU often does, which is have the voters keep voting until they get the right result. Given the vote was very close, and that leaving turned out to not look like voters were promised, the only thing preventing this from working was some sort of mystical ‘the tribe has spoken’ vibe that took over the country.
Then there’s the class where the EU won’t play ball, or the UK politicians want to vomit when they see the kind of ball the EU was always prepared to play. They’re looking at a full Hard Brexit, and want to put the decision of whether or not to accept that onto the people.
Thus it’s not obvious in hindsight whether the referendum was more likely in the “Britain leaves” world or the “Britain stays” world, given that was already up in the air. Certainly it feels like something unlikely would have had to happen, so we’re well under 50%, but that it wasn’t that far from happening, so it was probably more than 10%. 20% seems fine.
May being 70% to stay in power, however, feels too high. May was clearly facing an impossible problem, while being committed to a horrible path, in a world where prime ministers are expected to resign if they don’t get their way. How often would Britain still be in the EU at the end of the year while May survived? That seems pretty unlikely to me, especially in hindsight, whereas Britain leaving without May seems at least as likely. So May at 70% and leaving at 60% doesn’t seem right.
SURVEY
35. …finds birth order effect is significantly affected by age gap: 40%
36. …finds fluoxetine has significantly less discontinuation issues than average: 60%
37. …finds STEM jobs do not have significantly more perceived gender bias than non-STEM: 60%
(#38 got thrown out as confusing and I don’t know how to evaluate it anyway)
I would have been more confident on the merits in 35 and 37. Birth order effects have to come from somewhere, and the ‘affected’ side gets both directions. And the STEM prediction lets you have both about as much perceived bias and less bias, and I had no particular reason to believe it would come out bigger or smaller.
What’s more interesting, although obviously from a small sample size, is that all three proved true. So Scott’s hunches worked out. Should we suspect Scott was underconfident here?
This could be a case of Unknown Knowns. Scott has good reason to believe in these results, the survey has enough power to find results if they’re there, but Scott’s brain refuses to be that confident in a scientific hypothesis without seeing the data from a well-run randomized controlled trial.
I kid, but also there’s almost certainly a modesty issue happening here. I would predict that Scott would be reliably under-confident in his hunches that he thought enough of to include in his survey.
I started to go over Scott’s personal predictions, but found it mostly not to be a useful exercise. I don’t have the context.
There is of course one obvious thing to note.
PERSONAL – PROJECTS
63. I finish at least 10% more of [redacted]: 20%
64. I completely finish [redacted]: 10%
65. I finish and post [redacted]: 5%
66. I write at least ten pages of something I intend to turn into a full-length book this year: 20%
67. I practice calligraphy at least seven days in the last quarter of 2019: 40%
68. I finish at least one page of the [redacted] calligraphy project this year: 30%
69. I finish the entire [redacted] calligraphy project this year: 10%
70. I finish some other at-least-one-page calligraphy project this year: 80%
PERSONAL – PROFESSIONAL
71. I attend the APA Meeting: 80%
72. [redacted]: 50%73. [redacted]: 40%
74. I still work in SF with no plans to leave it: 60%
75. I still only do telepsychiatry one day with no plans to increase it: 60%
76. I still work the current number of hours per week: 60%
77. I have not started (= formally see first patient) my own practice: 80%
78. I lease another version of the same car I have now: 90%
None of the personal projects happened. Almost all the professional predictions happened, most of which predict the continued status quo. That all seems highly linked, more like two big predictions than lots of different predictions. One would want to ask what the actual relevant predictions were.
Overall, clearly this person is trying. And there’s clearly a tension between getting 95% of 95% predictions right, and having most of them actually be 95% likely. Occasionally you screw up big and your 95% is actually 50%, and that can often be the bulk of the times such things fail. Or some of them are 85%, but again that can easily be the bulk of the failures. So it’s not entirely fair to complain about a 95% that should be 99% unless standards are super high.
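A toy version of that arithmetic, with made-up numbers: suppose twenty predictions get labeled 95%, nineteen of them really are about 97%, and one is secretly a coin flip.

```python
misses_from_honest_items = 19 * 0.03   # ≈ 0.57 expected misses
misses_from_the_blunder  = 1 * 0.50    # 0.50 expected misses from one bad label
print(misses_from_honest_items, misses_from_the_blunder)
```

Roughly half of the expected misses come from the single mislabeled item, and the overall hit rate (~94.6% in expectation) still looks close to 95%.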
Mostly, I’d like to encourage looking back more in this type of way when possible, in addition to any use of numeric metrics.
I also should look at my own predictions, but also want to make that a distinct post, because its subject matter will have a different appeal on its own merits.
I hope this was helpful, fun, interesting or some combination of all three. I don’t intend it to be perfectly thought out. Rather, I thought it was a useful thing for those interested, so I’d write it quickly, but not let it take too much time/effort away from other higher priority things.