Raising the forecasting waterline (part 2)
Previously: part 1
The three tactics I described in part 1 are most suited to making an initial forecast. I will now turn to a question that was raised in comments on part 1 - that of updating when new evidence arrives. But first, I’d like to discuss the notion of a “well-specified forecast”.
Well-specified forecasts
It is often surprisingly hard to frame a question in terms that make a forecast reasonably easy to verify and score. Questions can be ambiguous (consider “X will win the U.S. presidential election”—do we mean win the popular vote, or win re-election in the electoral college?). They can fail to cover all possible outcomes (so “which of the candidates will win the election” needs a catch-all “Other”).1
Another way to make questions ambiguous is to leave out their sell-by date. Consider the question, “Google Earth / Street View will become a video gaming platform.” This seems designed to prompt a “by when?” On PredictionBook and the Good Judgment Project, questions come with a “known on” date and a “closing” date, respectively. But the question itself generally includes a date: “Super Mario Bros. Z episode 9 will be released on or before November 24, 2012”. The “known on” or “closing” date often leaves a margin of safety, for cases where some time may pass between the event happening and the outcome becoming known.2
Questions for GJP are provided by IARPA, the tournament sponsor. Both IARPA and the GJP research team go to some lengths to make sure that questions leave little room for interpretation or ambiguity: a deadline is always specified, and a “judge’s statement” clarifies the meaning of terms (even such obvious-seeming ones as “by” a certain date, which is expanded into “by midnight of that date”) and which sources will be taken as authoritative (for instance, “gain control of the Somali town of Kismayo” was to be verified by one of BBC, Reuters or the Economist announcing a control transition and failing to then announce a reversal within 48 hours). Some questions have been voided (not scored) due to ambiguity in the wording. This is one of the things I appreciate about GJP.
Tool 4 - Prepare lines of retreat
Many forecasts are long-range, and many unexpected things might happen between making your initial forecast and reaching the deadline, or the occurrence of one of the specified outcomes. There are two pitfalls to avoid: one is that you will over-react to new information, swinging from “almost certain” to “cannot happen” every time you hear something in the news; the other is that you will find a way to interpret any new information as confirming your initial forecast (confirmation bias).
One of my recent breakthroughs with GJP was when I started laying out my lines of retreat in advance: I now try to ask myself, “What would change my mind about this”, and write that in the comments that you can optionally leave on a forecast, as a non-repudiable reminder. For instance, on the question “Will the Colombian government and FARC commence official talks before 1 January 2013?”, I entered a 90% “Yes” forecast on 9/18 when a date for the talks was set, but added: “Will revise significantly towards ‘No’ if the meeting fails to happen on October 8 or is pushed back.” This was to prevent my thinking later “Oh, it’s just a small delay, it will surely happen anyway”. On October 1st, a delay was announced, and I duly scaled my forecast back to 75%.
Advice of this sort was part of the “process feedback” that we received from the GJP team at the start of Season 2, pointing out behaviors that they associated with high-performing forecasters, and in particular the quantity and quality of the comments these participants posted with their forecasts. I only recently started really getting the hang of this, but now, more likely than not, my forecasts are accompanied by a mini-summary (a few paragraphs) where I briefly lay out the status quo, reference classes or base rates if applicable, current evidence pointing to likely change, and what kind of future reports might change my mind. These are generally based not on my background knowledge, which is often embarrassingly scant, but on between a few minutes and an hour of Googling and Wikipedia-ing.
Tool 5 - Abandon sunk costs
The GJP scores forecasts using what’s known as the “Brier score”, which consists of taking your probability and squaring it if the event did not happen, or the complement of your probability and squaring it if the event did happen. (That’s for binary outcomes; for multiple outcomes, it’s the sum of these squares over each outcome.)
You’ll notice that the best you can hope for is 0: you assign 100% probability to something that happens, or 0% to something that doesn’t happen. In any other situation your score is positive; so a good score is a low score.
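To make that concrete, here is a small Python sketch of the rule as described above (my own illustration, not GJP's actual code; the function names are mine):

```python
def brier_binary(p_yes, happened):
    """Square the probability you gave the event if it did not happen,
    or the complement of that probability if it did."""
    return (1 - p_yes) ** 2 if happened else p_yes ** 2

def brier_multi(probs, outcome_index):
    """Multi-outcome version: sum, over all outcomes, of the squared gap
    between the probability you gave and what happened (1 or 0)."""
    return sum((p - (1.0 if i == outcome_index else 0.0)) ** 2
               for i, p in enumerate(probs))

print(brier_binary(0.90, happened=True))    # ~0.01: confident and right
print(brier_binary(0.90, happened=False))   # ~0.81: confident and wrong
print(brier_multi([1.0, 0.0, 0.0], 1))      # 2.0: worst case with multiple outcomes
```

Note that with multiple outcomes the worst possible score is 2 rather than 1: misplaced probability counts against you twice, once as missing mass on what happened and once as wasted mass on what didn't.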
The sunk cost fallacy consists of letting past costs (or losses) affect decisions on future bets. Say you have entered a forecast of 90% on the question “Will 1 Euro buy less than $1.20 US dollars at any point before 1 January 2013?”, as I did in mid-July. If this fails to happen, the penalty is a fairly large .81. A few months later, propped up by governmental decisions in the Euro zone, the continent’s currency is still strong.
You may find yourself thinking “Well if I change my mind now, half-way through, I’m only getting at best the average of .81 and the Brier score of whatever I change it to; .4 in the best case. But really, there’s still a chance it will hit that low… then I’d get a very nice .01 for sticking to my guns.” That’s the sunk cost fallacy. It’s silly because “sticking to my guns” will penalize me even worse if my forecast does fail. Whatever is going to happen to the Euro will happen; the loss from my past forecast is already determined. The Brier score is a “proper” scoring rule, which can only be optimized by accurately stating my degree of uncertainty.
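Here is a quick way to convince yourself of that last point (the 30% below is a made-up stand-in for whatever I now actually believe): the expected Brier score, from today onward, is lowest when I report my current belief, regardless of what I forecast earlier.

```python
def expected_brier(reported_p, believed_p):
    """Expected Brier score for reporting reported_p on a yes/no question,
    if the event's real chance of happening is believed_p."""
    return believed_p * (1 - reported_p) ** 2 + (1 - believed_p) * reported_p ** 2

# Suppose I now think the Euro dip has only a 30% chance of happening.
for reported in (0.9, 0.7, 0.5, 0.3, 0.1):
    print(reported, round(expected_brier(reported, believed_p=0.3), 3))
# The expectation bottoms out at reported = 0.3; "sticking to my guns"
# at 90% costs nearly three times as much, in expectation, from here on.
```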
What’s scary is that, while I knew about the sunk cost fallacy in theory, I think it pretty much describes my thoughts on the Euro question. I only retreated to 80% at first, then 70% - before finally biting the bullet and admitting my new belief, predominantly on the “No” side. (That question isn’t scored yet.)
Detach from your sunk costs: treat your past forecasts as if they’d been made by someone else, and if you now have grounds to form a completely different forecast, go with that.
Tool 6 - Consider your loss function
The Brier score is also known as the “squared error loss” and can be seen as a “loss function”: informally, you can think of a “loss function” as “what is at stake if I call this wrong”. In poker, the loss function would be not the probability that your hand loses, or the Brier score associated with your estimate of that probability, but the probability multiplied by the size of the pot—the real-world consequences of your estimate, in other words. This is why you may play the same hand aggressively or conservatively according to the circumstances.
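As a toy illustration of that contrast, with invented numbers (and ignoring everything that makes real poker hard): the same 30% estimate of winning the hand can lead to opposite decisions once the stakes enter the calculation.

```python
def call_ev(p_win, pot, cost_to_call):
    """Expected value of calling: the pot times your chance of winning,
    minus the cost of the call times your chance of losing."""
    return p_win * pot - (1 - p_win) * cost_to_call

print(call_ev(p_win=0.30, pot=100, cost_to_call=10))  # about +23: call
print(call_ev(p_win=0.30, pot=100, cost_to_call=60))  # about -12: fold
```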
The Brier score is more “forgiving” in one sense than another common loss function, the logarithmic scoring rule—which instead of the square takes the log of your probability (or of its complement). If you use probabilities close to 0 or 1, you can end up with huge negative scores! With the Brier score, on the other hand, the worst you can do is a score of 1, and a forecast of 100% isn’t much worse than 95%, even if the event fails to happen.
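To see the difference, compare the two penalties for being confidently wrong; here I use the lower-is-better form of the log rule (the negative log of the probability you gave the outcome that occurred), which is the usual convention.

```python
import math

def brier_loss(p_given_to_outcome):
    """Brier penalty, given the probability you assigned to what occurred."""
    return (1 - p_given_to_outcome) ** 2

def log_loss(p_given_to_outcome):
    """Logarithmic penalty for the same probability; unbounded near zero."""
    return -math.log(p_given_to_outcome)

# The event fails to happen, so what matters is how much you left for "No".
for p_no in (0.05, 0.01, 0.001):  # i.e. forecasts of 95%, 99%, 99.9% "Yes"
    print(p_no, round(brier_loss(p_no), 3), round(log_loss(p_no), 2))
# The Brier penalty creeps from about 0.90 toward its ceiling of 1, while
# the log penalty grows without bound as the leftover probability shrinks.
```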
The GJP system computes your Brier score for each day that the question is open, according to what your forecast is on that day, and averages over all days. Forecasts have a sell-by date, which is somewhat artificially imposed by the contest rather than intrinsic to the situation. This means there is an asymmetry to some situations, such that the best forecast may not reflect your actual beliefs. One example was the question “Will any government force gain control of the Somali town of Kismayo before 1 November 2012?”.
When this question opened, I quickly learned that government forces were preparing an attack on the town. *If* the assault succeeded, then the question would probably resolve in a few days, and the Brier score would be dominated by my initial short-term forecast. If, on the other hand, the assault failed, the status quo would likely continue for quite some time; I could then change my forecast, and the initial would be “diluted” over the following weeks. So I picked a “Yes” value more extreme (90%) than my actual confidence, planning to quickly “retreat” back to the “No” side if the assault failed or gave any sign of turning into a protracted battle.
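Roughly, the arithmetic behind that choice looks like this (the day counts and the 30% retreat are invented for illustration, and this is only my understanding of how the daily averaging works):

```python
def average_daily_brier(daily_forecasts, happened):
    """Score each open day on that day's standing forecast, then average."""
    daily = [(1 - p) ** 2 if happened else p ** 2 for p in daily_forecasts]
    return sum(daily) / len(daily)

# Assault succeeds: the question resolves "Yes" after 5 days at my bold 90%.
print(average_daily_brier([0.9] * 5, happened=True))
# ~0.01: the aggressive opening forecast is all there is, and it pays off.

# Assault stalls: I retreat to 30% and the question runs another 55 days
# before resolving "No".
print(average_daily_brier([0.9] * 5 + [0.3] * 55, happened=False))
# ~0.15: the early 90% is diluted; holding it for 60 days would score 0.81.
```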
This can feel a little like cheating—but, if your objective is to do well in the forecasting contest (as opposed to having correct beliefs on the more general question “will Kismayo fall”, which does not have a real deadline), it’s perfectly sensible.
We have reached something that feels a little like a “trick” of forecasting, and thus are probably leaving the domain of “basic skills to raise yourself above a low waterline”. I’ll leave you with these, and hope this summary has encouraged you to try your hand at forecasting, if you hadn’t done so before.
If you do: have fun! And please report back here with whatever new and useful tricks you’ve learned.
1 PB only has binary questions, so that isn’t an issue there, but it is on GJP, where multiple-choice questions are common.
2 On PB, there appears to be a tacit convention that the “known on” date also serves as a deadline for the question, if no such deadline was specified.
Thanks for writing this. It’s always great to see an article with specific techniques backed up by examples.
Seconded. In particular, this sort of approach to this kind of subject is very fulfilling, giving the message in clear understandable bits. I feel like I got a lot from reading this, and that always is something I appreciate.
It’s much harder to make well-formed predictions than one would initially suspect. The fun part about PB is trying to craft them yourself, which is something you don’t get to do on GJP.
nitpick: should it be a very nice 0.01 rather than a very nice 0.10?
Yep!
Good article. As a fellow GJPer, my only nitpick is that the Brier rule is a squared rule, so there is a bigger loss between 95% and 100% than just 0.05. It’s not as bad as a logarithm based rule though. Also, the way they do it, the maximum loss is 2 not 1.
Look forward to the next part!
It seems like it would be useful to break predictions into many different parts so that you can tell exactly where you went wrong. So, to take the first example, you would estimate the probability that a certain candidate was nominated, predict one or two of their most probable opponents, estimate the probability that they win against the first one, estimate the probability they win against the second one, estimate how much they win by and where they win states.
This is infinitely regressive, but it would be helpful to divide different types of aspects of your predictions so that you can know what areas you are weak in. So, if you usually do well with predicting whether or not a major news event will occur, but do badly with predicting the effects of that event on your prediction, then you can make your mental model much more responsive to new data relevant to the second aspect of your prediction about the potential major news event.
In other words, macro predictions are sometimes done so that people get a sense of which fields they are strong and weak in. Breaking macro predictions down into smaller predictions would allow them to get a sense of which aspects of predictions they are strong at within those fields. This would result in more accurate models.
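To make that concrete with some made-up numbers, the simplest version of this is just summing over the scenarios:

```python
# Hypothetical decomposition of "candidate X wins the election": who the
# eventual opponent turns out to be, and X's chances against each.
scenarios = {
    "opponent_A": {"p_nominated": 0.6, "p_x_wins_vs": 0.55},
    "opponent_B": {"p_nominated": 0.3, "p_x_wins_vs": 0.40},
    "other":      {"p_nominated": 0.1, "p_x_wins_vs": 0.50},
}

# Law of total probability: P(X wins) is the sum over opponents of
# P(that opponent is nominated) * P(X wins against that opponent).
p_x_wins = sum(s["p_nominated"] * s["p_x_wins_vs"] for s in scenarios.values())
print(p_x_wins)  # ~0.50 with these made-up inputs
```

Each of those sub-estimates can then be scored on its own once the outcome is known.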
The DAGGRE project is based on just that, decomposition of forecasts. This PDF explains how it works. It’s an interesting approach, and the reason I mentioned in an aside, in part 1, that I might have liked to join that team.
The GJP, on the other hand, uses different tools—as I understand it some teams have “survey” type interfaces, where we enter just a probability and our reasoning, while other teams have “prediction market” interfaces.
I don’t personally find it very useful (yet?) to explicitly decompose my forecasts.
For instance a recent question was “Will the sentence of any of the three members of the band Pussy Riot who were convicted of hooliganism be reduced, nullified, or suspended before 1 December 2012?” It’s not clear how you’d decompose that:
chance that each individual girl member of PR would have her sentence reduced
chance for each possible grounds for a sentence reduction
chance for each possible political influence on sentencing (public opinion, Putin, Medvedev)
ISTM that making a fine-grained forecast on any of the above is to presume way too much of my detailed knowledge of the situation. Maybe someone close to the case might have predicted that Yekaterina would walk while the other two would serve a full sentence. The reason given was “because she was thrown out of the cathedral by guards before she could remove her guitar from its case and take part in the performance.” I only learned about that just now, looking at news reports on the appeal result; this was never mentioned previously.
So, I don’t know how I feel about decomposition. What I’m reminded of is the distinction between “fox” and “hedgehog” approaches, which Tetlock borrowed from Isaiah Berlin and which Silver discusses in his book: “Hedgehogs know one big thing, while foxes know many little things.”
Silver says that a “fox” usually does better because they approach different predictions in different ways and bring a variety of perspectives to each, whereas the “hedgehog” tends to be more ideological, to insist that there is One True Way to tackle every forecast. The decomposition approach strikes me as less fox-y and more hedgehog-gy.
The results of the questionnaire I filled when I joined GJP identified me as more of a hedgehog: 4.5 on a 1-7 scale, compared to a mean of 3.81, SD .52. I’m pretty sure that my actual forecasting behavior, at least this year, is foxier.
I also scored slightly toward the hedgehog end of the scale. I think people who like to “think about thinking” are already slightly hedgehog. True foxes don’t believe in such grand theories.
The decomposition approach seems more foxy to me because it applies more specific details and approaches different aspects of the problems in different ways. The analogy I see is that the decomposition approach is similar to using a Taylor series to approximate a function in calculus, whereas the other approach is much less fine-tuned. Except it’s more like using multiple Taylor series in a piecewise function than just a vanilla Taylor series, so the analogy isn’t really great.
With Pussy Riot, I would just say that you can identify the macro factors which affect the probability of conviction, and then determine the probability that those macro factors exist by estimating the probability that they change. So, for instance, you could learn who their judge would be and what that judge’s rulings have been like in the past. You could try to get a rough understanding of that judge’s thought process. You could do the same with lawyers. You could check what time or what month the trial was scheduled for, and then apply data on how judges tend to be kinder at the beginning of the day or when the weather is nice (this latter factor is more of a guess but it seems to me like it would be accurate).
Those specific factors I mentioned are actually somewhat weak and wouldn’t have much overall effect. But they’re intended as examples of what can be done in a more general sense. Although I agree that the Pussy Riot conviction specifically would be difficult to predict, and although I’m doing a bad job of explaining this specific case, I think that’s more because I haven’t really been following that news story at all, because it seemed boring to me. If I had done more reading on it, even if I didn’t have many more specific details, then I think I would be able to identify more powerful factors that a decomposition approach could use.
You mention that you often don’t have specific information, but I think that there’s general information which is available that can be applied to specific aspects of the decomposition, and that this would improve the accuracy of the prediction.
To use a different example that I feel more comfortable with, let’s say we had to evaluate the probability that the current government of China collapses by 2035. Economic trends are going to be a huge factor in this. We can break that down further. Here are some major things that will determine the economic health of China: 1. Demographics: what China does with its aging population, and what the economic effects of the one-child policy and lack of females are. 2. Environmental health: the environment is the foundation of the economy. 3. Infrastructure is important.
So, looking at 1, we can say that there’s a high probability that China will at least initially attempt to support its aging population. However, the population is so old that I can’t see this working for long. I think China will lose lots of money on its elderly people, but then will just let them die (this would have social consequences, those should be put as inputs into a different section of our overall prediction). I also think that China is going to have much of its male population move overseas, and that it’s going to start encouraging foreign females to immigrate. These policies won’t be very successful because China is awful. Overall, China will still lose lots of money.
Looking at 2, China’s environment is trash, and their priority is on short term growth, so this will stay the same for the foreseeable future. Unless China gets some bold new leader who is so charismatic he can keep the people from revolting while their short term growth slows and who implements major environmental reforms, then China is doomed. I don’t think that any such leaders exist in China right now. However, another solution would arise if some fantastic new technology allowed China to change the basis of its production system. I don’t know what the probability of cheap and widespread nanotech or something similar by 2035 is, but I’m going to assume that it’s low and so this also won’t happen and so China’s environment is in trouble.
Looking at 3, China won’t be able to maintain infrastructure as a consequence of 1 and 2. This will aggravate underlying problems. Also, China’s already running into problems because other countries in the region are trying to take the cheap industrial production section of the market China used to have. China’s been trying to switch towards a more technologically advanced production economy, but their education is subpar. This will get worse during the aging crisis.
China is basically dependent on economic growth to keep its population from revolting, from what I’ve read.
So I conclude that unless something very important happens with China’s leadership or cultural/social practices, the economy of China will collapse by 2035 and China will then disintegrate into a civil war. I’m going to put about an 80% probability estimate on this one. This is what I actually believe, although this is sort of an incomplete description and also it oversimplifies and formalizes a lot of what is just raw intuition in my head.
This example also hopefully makes clear the risks of infinite regression. Stuff is complicated.
The “unless something very important” hedge makes this prediction rather hard to judge: it’s vague enough that anyone might confidently predict “something very important happens with China’s leadership or cultural/social practices between now and 2035”, because something is bound to happen that arguably falls under this specification.
This is kind of a converse rule of thumb to the “prefer status quo”, which—I should maybe have said—is valid at fairly short timeframes. Over long enough timeframes, the reverse is true—pretty much everything we can specify is going to change. (Over long enough timeframes, continents will move, mountains will be eroded to sea level, and eventually the planet itself will no longer exist.)
Over a twenty year timeframe, if you phrase your prediction in terms of (say) “the rate of exchange between USD and CNY”, there is some chance that one of the currencies will cease to exist, or that the notion of “rate of exchange” will stop making sense.
Nitpick: I don’t think this is much beyond epsilon. Even if a currency ‘ceases to exist’, the currency still exists in whatever physical embodiment. There is still an exchange rate from Saddam’s dinars to US dollars: it’s however many pieces of old paper $1 will buy you on eBay. The exchange rates for currencies that cease to exist can even go up over time (how much would an American Revolution-era shinplaster cost ya?).
If something important happens, it is not an excuse. My prediction will still have been wrong.
Ooh. As long as I’m making predictions, I believe Romney will defy the polls and win. 65% chance I’m right.
The economy is doing poorly.
Obama is an incumbent.
Romney is taller.
I think polls (and 538) are overestimating what Democratic participation will be like.
These are the only factors I considered. I intentionally took a different approach to this prediction.
This prediction is very hedgehoggy, but I think that is the right tool for the job here. I feel comfortable doing that here but not with China, because I believe macro trends work much better as predictors of things like human psychology and things like democratic votes. In contrast, I don’t think I can form a good reference class to put China (2012) in that will allow me to evaluate the probability that it collapses, so I chose to instead use specific data. Also, I’ve examined specific facts about China, but I don’t feel confident in my ability to do that when it comes to the elections, because of media bias and cultural biases.
How confident are you? IEM and InTrade would like to know.
It’s far too late to get in on IEM or InTrade action—with like 24 days to go, it’d take almost all of that time to deposit funds via check or get ID cleared.
That said, I’d be happy to wager with chaosmosis that Romney will lose and Obama will win: say, his $30 against my $20?
Nope, but thanks.
I’m confident, but not meta-level confident, because I haven’t been doing this for a while and I’m using new techniques and the experts are against me. I also have low risk tolerance for cash losses.
Additionally, you’re probably putting the terms of the bargain like that because I said 65% chance, but I expect that you’ve got an even higher chance in your mind that Obama would win. Obama winning is conventional wisdom, I would only accept a bet that rewarded me for being an outlier.
Fine—you drive a hard bargain! $20:$20 or $30:$30?
Not gonna do it. Maybe that means I should revise my estimates, or that I have irrational feelings about bets in general. I’ll figure that out later.
I notice that I would be willing to bet on China. That’s because: 1. I’ve looked through the data more and taken a foxy approach to it. 2. I think that China’s fate is less subject to random events. 3. I’m not really against conventional wisdom with China. 4. I feel more meta-level confident about my ability to predict world affairs, though I’m not sure why.
I think the most likely reason that I’m not wanting to bet is that I don’t know whether or not I’m safe in doubting conventional wisdom and current polls and most experts in this specific field, and that I actually suspect that I’m not. For now I’ll revise my estimates, I’ll still incorporate my move towards Romney from the mainstream but I’ll start at the mainstream, which would be about 80% Obama, and I’ll move towards Romney, to 60% for Obama.
Also, I might have been in backlash mode or trying to be TOO pessimistic when I made the initial prediction that Obama would lose. Either I’ve gotten deeper into the liberal echo chamber from redditing and whatnot and my perceptions are getting distorted, or else I was being too pessimistic before (because I tried to force factors like height into my model, and I assumed that substance was basically irrelevant to political success, but now I’m not so sure.) I think I wanted to see myself as pessimistic so I wrote off the mainstream opinion and maybe I overdid it.
Additionally, I’m not wanting to go through the hassle of getting a paypal or whatever might happen. I would like the information on this (whatever the standard betting venue or monetary exchange thingy on LessWrong would be, if there is one, even though I’m not going to be betting here). I think I might be scared that I’ll lose the bet and then decide to not pay you, also, and that’s something else that’s making me reluctant to bet. Now that I think about it, I should also get an Intrade account in advance now.
There are many complicating factors. Demands for money helped me identify some.
FWIW, my own Obama probability is more like 65-70%.
Bets are typically paid via PayPal, Bitcoin, or donations to specified charities.
I personally don’t use Intrade because the new fees make it very expensive for low-stakes long-term bets like most bets on LW (assuming the market even exists).
I was tired when I wrote this. I’m going to bed now, and I’m not going to fix the grammar or the colloquialisms or the general disorganization.