Summary of my Participation in the Good Judgment Project
Follow-Up to Good Judgment Project, Season Three.
During the last forecasting season I took part in the Good Judgment Project (GJP; see also the blog) and this is a short summary of my participation (actually triggered by hamnox's comment).
The GJP estimates world events like
Ukraine conflict
Arctic ice cap melting
Ebola outbreak duration
Chinese sea conflict
ISIS attacks
Terrorist attacks
Oil price
Certain exchange rates
Election results
and many other political events
To participate in the study one has to register (I can’t remember where exactly I stumbled over the link, possibly the one at the top), complete a preparatory online course, and pass an online test. At least I had to complete it. Whether the result affected my assignment to any group I can’t say. The course explains the scoring and gives recommendations for making good forecasts (choose forecasts one has an edge in, estimate early, update often, do post-mortems). The test seems to test for calibration and accuracy by asking about known (mostly political) events and how sure one is about them.
The current forecasting season started in November 2014 and has just ended. I invested significantly less than half an hour a week on 8 questions of about 100 (and thus less than I projected in an early questionnaire). I did 2 to 15 updates for these questions and I earned a score in the middle range (mostly due to getting hit by an unexpected terrorist attack). As I just learned, I was assigned to the study condition where I could see neither the total group estimate nor the estimates of the other group members, only their comments. I was somewhat disappointed by this as I had hoped to learn something from how the scores developed. Too bad I wasn’t in a prediction market group. But I hope to get the study results later.
I will not take part in further rounds as I shy away from the effort for these types of forecasts, which are mostly political. They are political because the sponsor (guess who) is interested mostly in political events and less in economic, environmental, scientific or other types. But I enjoyed forecasting Arctic ice cap melting and Ebola, and netted a better than average score on those.
The scoring, at least in this group, is interesting: it uses an averaged Brier score, averaged a) over all forecast questions and b) within a question over all the days for which a forecast is provided. I intended to game that by betting on questions that a) I could forecast well and b) had an expected reliable outcome. Sadly there were few of type a.
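The two levels of averaging described above can be sketched roughly as follows. This is only an illustration, not GJP's actual scoring code: it assumes yes/no questions, one forecast per day, and the two-outcome form of the Brier score (squared error summed over "yes" and "no", so a score runs from 0 to 2). The function names and the example numbers are made up.

```python
from statistics import mean

def daily_brier(p_yes, happened):
    """Brier score for one day's forecast on a yes/no question.

    Assumed form: squared error summed over both outcomes,
    so 0 is perfect and 2 is maximally wrong.
    """
    outcome = 1.0 if happened else 0.0
    return (p_yes - outcome) ** 2 + ((1 - p_yes) - (1 - outcome)) ** 2

def question_score(daily_forecasts, happened):
    """Average the daily scores over all days with a forecast."""
    return mean(daily_brier(p, happened) for p in daily_forecasts)

def overall_score(questions):
    """Average the per-question scores over all forecast questions."""
    return mean(question_score(forecasts, happened)
                for forecasts, happened in questions)

# Hypothetical forecasts: (probability of "yes" on each day, actual outcome)
questions = [
    ([0.05, 0.02, 0.0], False),  # confident and right from the start
    ([0.3, 0.6, 0.9], True),     # started out wrong, updated towards the truth
]
print(overall_score(questions))
```

Because every day counts equally within a question, an early wrong forecast keeps weighing on the question's score even after it has been corrected, which is presumably why the course recommends updating often.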
From this experience I learned that
such prediction organizations ask mostly for political events,
political events are hard to predict and
predicting political events requires a lot of background information.
I’m below average in predicting political events (at least compared to my group which I’d guess has more interest in politics than I) but
I’m above average on non-political topics.
I was also involved. I did not need to do a training, and my group was in a prediction market, so it was a very different group. The data should be available once the papers are published; that may take a year or two. (If you’re a grad student, at least, you might be able to ask for the data for writing your own papers; when I talked to one of the organizers, he seemed to think that it could be worked out.)
IARPA is the intelligence version of DARPA, so they were in fact interested in the types of questions that intel analysts would face. On the other hand, the places where prediction is useful are not at all limited to these areas, and GJP is interested in extending that in the future.
Thanks for posting this. The GJP’s sparked only sporadic discussion here, maybe because it focuses so much on world politics as opposed to stereotypically LWesque STEM stuff, and that’s a bit of a shame. I’m a STEM nerd myself, but in a way that made the GJP more enticing because I thought participating in it might nudge me to learn a tiny bit about world politics (it did), and because I wanted to see whether I could beat the averages despite having minimal domain-specific knowledge (I could).
IIRC I filled out a pre-registration form that just asked for bare-bones demographic info like occupation and highest-level education qualification. After the GJP let me into the study, but before they assigned me to a group, I think I filled out a longer background survey about myself, and did the political knowledge/calibration test.
I did the short training session after getting the group assignment. Presumably the (sub)group assignments are randomized so the researchers can make causal inferences about which treatments generate better forecasts.
It’s actually still running for my group. We have 31 questions still open which don’t close until the 8th or 9th.
I wound up putting in more time than I think I anticipated, probably more than half an hour a week most weeks, and so far I’ve made 335 predictions on 36 questions. Since GJP started displaying my rank in my group, my overall Brier score’s consistently been in the lowest 20%.
Maybe we were in the same group. My group also had no prediction markets, but I could read “tips” written by other people who were apparently chatting to each other in a forum to which I didn’t have access. I also couldn’t/can’t see other users’ predictions in real time, although I could see the group’s median Brier score for each question after it was closed.
Ah, but if you didn’t make a prediction on a question, you still got a Brier score for it — GJP gave you the median score of the group members who did make a prediction. (Or that’s how it worked for me, anyway.) So the trick is to choose questions where you expect to do better than the median predictor, even if those questions look difficult. (Perhaps especially questions which look difficult to you, because other people might be overconfident about them.) The sample size is small, but on each of the 4 questions where my Brier score was high (≥ 0.5) I scored 0.07-0.15 fewer points than the group score, which really helped drive down my overall score.
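A small sketch of why that works, assuming the overall score is a plain average over all questions and that skipped questions are filled in with the group median, as described above. The function name, question labels and numbers are hypothetical, not taken from GJP.

```python
from statistics import mean

def overall_with_imputation(own_scores, group_medians):
    """Overall Brier score when skipped questions get the group median.

    own_scores:    question -> my Brier score (answered questions only)
    group_medians: question -> group median Brier score (every question)
    """
    return mean(own_scores.get(question, median)
                for question, median in group_medians.items())

# Hypothetical group medians; "q1" is a hard question with a high median.
group_medians = {"q1": 0.60, "q2": 0.10, "q3": 0.30}

# Skipping everything just reproduces the average of the medians.
skip_all = overall_with_imputation({}, group_medians)

# Answering only the hard question and scoring a poor-looking 0.50
# still beats the 0.60 median, so the overall score goes down.
answer_hard = overall_with_imputation({"q1": 0.50}, group_medians)

print(skip_all, answer_hard)
```

Under these assumptions what matters is not whether a question's score is low in absolute terms, but whether it is lower than the median that would otherwise have been assigned.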
Mostly true, although election-result questions tended to be nice & easy. A few other political events weren’t obvious slam-dunks if I looked at them from a distance, but became very obvious slam-dunks as soon as I investigated them.
Example: “Will a referendum on Quebec’s affiliation with Canada be held before 31 December 2014?”, which I didn’t touch until October. But when I ran Google News searches about it, the lack of positive evidence for expecting a referendum was stark, and I immediately gave it only a 5% probability. During October I monotonically lowered that as tips came in pointing out that the one party pushing for a referendum was unpopular and leader-less, and that a referendum would take time to organize. For all of November & December I had that question at 0%, and my final Brier score for it halved the (already tiny) group score.
I also discovered that the prediction difficulty of the political questions was often time-dependent. IARPA tried to pick relevant & topical questions, which meant that a lot of questions were provoked by news coverage. But because the news prefers dramatic, sudden events, quite a few of the resulting questions were about transient crises or other hot issues that rapidly cooled down and became highly predictable within days or weeks, leaving them easy to predict for most of the (months-long) prediction windows.
A good tactic therefore turned out to be: just wait. It’d be interesting to see how people would do in a GJP re-run where the questions had shorter prediction windows, and that tactic would surely be less successful.
Yes. Actually it is. Somehow I misinterpreted one of the last mails. At least it’s closed on all my forecasts.
Maybe. The best forecaster is grossz18 in my group.
No. 1 in my group is morrell. Our groups are probably different after all...or GJP is feeding us different rankings as part of the experiment!
Thank you for your detailed contribution!
Hm, yes, that makes sense as these scores are listed in grey in my column too. I just didn’t make that connection and can’t seem to remember that it was explicitly explained that way, but maybe I misunderstood which averaging applies to which. Esp. before actually seeing the UI.
Yes. That seems like another sensible strategy to game it.
I recommend that you send this as a reply to one of the last mails. They seem to really read them.
I don’t think “little interest” is a fair description. Searching LW for Good Judgment Project provides 290 search hits.
I just did a search for “("Good Judgment Project" OR GJP)” and got only 87 hits, so most of your results might merely have been recent comments/posts in LW’s sidebar.
Looking through the first couple of pages of hits I see
a link post for GJP season 3, and the only comments are the ones I linked in the grandparent (I upvoted them anyway because they’re interesting feedback)
a link post about an earlier GJP round, which does actually have a lot of GJP talk among its 55 comments
this Gunnar_Zarncke post
part 1 of Morendil’s 2012 “Raising the forecasting waterline”, about participating in the GJP, which has 108 comments but most aren’t about the GJP
a short follow-up by gwern to post 2, with 2 comments
part 2 of Morendil’s “Raising the forecasting waterline” (and 22 comments)
your user page, which comes up because of the parent comment
a link post to an FT article on forecasting, with comments that don’t talk about the GJP
the list of recent comments for LW’s Discussion section, which comes up because of the parent comment
VipulNaik’s “Some historical evaluations of forecasting”, which discusses the GJP for a paragraph (the only comment doesn’t mention the GJP)
the list of Discussion posts tagged “tetlock”, which matches because post 8 comes up
Morendil starting a short subthread about the GJP under “The Martial Art of Rationality”
post 5 at a different URL
an unrelated post which only comes up because Google indexed it while my GJP-mentioning comment was in the sidebar
another VipulNaik post which again discusses the GJP for a paragraph; all 3 comments talk about something else
Morendil’s “Raising the waterline” mentions the GJP a few times (none of its comments do)
VipulNaik’s “An overview of forecasting for politics, conflict, and political violence” lists various forecasting efforts, and discusses the GJP as one of them across several bullet points (0 comments)
VipulNaik’s list of submitted posts
Morendil mentioning the GJP in a one-sentence comment.
VipulNaik again giving the GJP a paragraph in “Domains of forecasting” (none of the 4 comments mention the GJP)
That is more commentary than I remembered (I’d definitely forgotten about Morendil’s 3 top-level posts), and yeah, “little interest” is too strong. I’ll change that to “sporadic discussion”, which I think is fair. Aside from Morendil’s posts and this G_Z post, most of the mentions of GJP on LW seem to be asides or links to external articles, and they’re spread out over about 4 years.