Conversation on forecasting with Vaniver and Ozzie Gooen

[Cross-posted to the EA Forum]

This is a transcript of a conversation on forecasting between Vaniver and Ozzie Gooen, with an anonymous facilitator (inspired by the double crux technique). The conversation was transcribed by a professional service and the transcript was edited by Jacob Lagerros.

I (Jacob) decided to record, transcribe, edit and post it as:

  • Despite an increase in interest and funding for forecasting work in recent years, there seems to be a disconnect between the mental models of the people working on it and those of the people who aren’t. I want to move the community’s frontier of insight closer to that of the forecasting subcommunity.

  • I think this is true for many more topics than forecasting. It’s incredibly difficult to be exposed to the frontier of insight unless you happen to be in the right conversations, for no better reason than that people are busy, preparing transcripts takes time and effort, and there are no standards and unclear expected rewards for doing so. This is an inefficiency in the economic sense. So it seems good to experiment with ways of alleviating it.

  • This was a high-effort activity where two people dedicated several hours to collaborative, truth-seeking dialogue. Such conversations usually look quite different from comment sections (even good ones!) or most ordinary conversations. Yet there are very few records of actual, mind-changing conversations online, despite their importance in the rationality community.

  • Posting things publicly online increases the surface area of ideas to the people who might use them, and can have very positive, hard-to-predict effects.


Introduction

Facilitator: One way to start would be to get a bit of both of your senses of the importance of forecasting, maybe Ozzie starting first. Why are you excited about it and what caused you to get involved?

Ozzie: Actually, would it be possible for you to start first? Because there are just so many …

Vaniver: Yeah. My sense is that predicting the future is great. Forecasting is one way to do this. The question of whether this will connect to things being better is the difficult part. In particular, Ozzie had this picture before of, on the one hand, data science-y repeated things that happen a lot, and on the other hand judgement-style forecasting, a one-off thing where people are relying on whatever models they have because they can’t do the “predict the weather”-style things.

Vaniver: My sense is that most of the things that we care about are going to be closer to the right hand side and also most of the things that we can do now to try and build out forecasting infrastructures aren’t addressing the core limitations in getting to these places.

Is infrastructure really what forecasting needs? (And clarifying the term “forecasting”)

Vaniver: My main example here is something like: prediction markets are pretty easy to run, but they aren’t being adopted in many of the places that we’d like to have them, for reasons that are not … “we didn’t get a software engineer to build them.” That feels like my core reason to be pessimistic about forecasting as intellectual infrastructure.

Ozzie: Yeah. I wanted to ask you about this. Forecasting is such a big type of thing. For one thing, we have maybe five to ten people doing timelines, direct forecasting, at OpenAI, OpenPhil and AI Impacts. My impression is that you’re not talking about that kind of forecasting. You’re talking about infrastructural forecasting, where we have a formal platform and people making formalised things.

Vaniver: Yeah. When I think about infrastructure, I’m thinking about building tooling for people to do work in a shared space, as opposed to individual people doing individual work. If we think about dentistry or something, what dentists’ infrastructure would look like is very different from people actually modifying mouths. It feels to me like OpenAI and similar people are doing more of the direct style work than infrastructure.

Ozzie: Yeah, okay. Another question I have is about a lot of the trend extrapolation stuff, e.g. for particular organizations, “how much money do you think they will have in the future?” Or for LessWrong, “how many posts are going to be there in the future?” and things like that. There’s a lot of that happening. Would you call that formal forecasting? Or would you say that’s not really tied to existing infrastructure and they don’t really need infrastructure support?

Vaniver: That’s interesting. I noticed earlier I hadn’t been including Guesstimate or similar things in this category because that felt to me more like model building tools or something. What do I think now …

Vaniver: I’m thinking about two different things. One of them is the “does my view change if I count model building tooling as part of this category, or does that seem like an unnatural categorization?” The other thing that I’m thinking about is if we have stuff like the LessWrong team trying to forecast how many posts there will be… If we built tools to make that more effective, does that make good things happen?

Vaniver: I think on that second question the answer is mostly no because it’s not clear that it gets them better counterfactual analysis or means they work on better projects or something. It feels closer to … The thing that feels like it’s missing there is something like how them being able to forecast how many posts there will be on LessWrong connects to whether LessWrong is any good.

Fragility of value and difficulty of capturing important uncertainties in forecasts

Vaniver: There was this big discussion that happened recently about what metric the team should be trying to optimize for the quarter. My impression is this operationalization step connected people pretty deeply to the fact that the things that we care about are actually just extremely hard to put numbers on. This difficulty will also be there for any forecasts we might make.

Ozzie: Do you think that there could be value in people in the EA community figuring out how to put numbers on such things? For instance, groups evaluating these things in the future in formal ways. Maybe not for LessWrong but for other kinds of projects.

Vaniver: Yeah. Here I’m noticing this old LessWrong post…. Actually I don’t know if this was one specific post, but this claim of the “fragility of value” where it’s like “oh yeah, in fact the thing that you care about is this giant mess. If you drill it down to one consideration, you probably screwed it up somehow”. But it feels like even though I don’t expect you to drill it down to one consideration, I do think having 12 is an improvement over having 50. That would be evidence of moral progress.

Ozzie: That’s interesting. Even so, the agenda that I’ve been talking about is quite broad. It’s very much a lot of interesting things: a combination of forecasting and better evaluations. For forecasting itself, there are a lot of different ways to do it. That does probably mean that there is more work for us to do going back and forth on specific types and their likelihood, which makes this a bit challenging. It’ll give you a wide conversation.

Ben G: Is it worth going over the double cruxing steps, the general format? I’m sorry. I’m not the facilitator.

Vaniver: Yeah. What does our facilitator think?

Facilitator: I think you’re doing pretty well and exploring each other’s stuff. Pretty cool… I’m also sharing a sense that forecasting has been replaced with a vague “technology” or something.

Ozzie: I think in a more ideal world we’d have something like a list of every single application and for each one say what are the likelihoods that I think it’s going to be interesting, what you think is going to be interesting, etc.

Ozzie: We don’t have a super great list like that right now.

Vaniver: I’m tickled because this feels like a very forecasting way to approach the thing where it’s like “we have all these questions, let’s put numbers on all of them”.

Ozzie: Yeah of course. What I’d like to see, what I’m going for, is a way that you could formally ask forecasters these things.

Vaniver: Yeah.

Ozzie: That is a long shot. I’d say that’s more on the experimental side. But if you could get that to work, that’d be amazing. More likely, that is something that is kind of infrequent.

Vaniver’s conceptual model of why forecasting works

Vaniver: When I think about these sorts of things, I try to have some sort of conceptual model of what’s doing the work. It seems to me the story behind forecasting is that there’s a lot of, I’m going to say, intelligence for hire out there, and that the thing that we need to build is this marketplace that connects the intelligence for hire and the people who need cognitive work done. The easiest sorts of work to use this for are predictions about the future, because it’s easy to verify later and …

Vaniver: I mean the credit allocation problem is easy, because everyone who moved the prediction in a good direction gets money and everyone who moved it in the wrong direction loses money. Whereas if we’re trying to develop a cancer drug and we do scientific prizes, it may be very difficult to do the credit allocation for “here’s a billion dollars for this drug”. Now all the scientists who made some sort of progress along the way have to figure out who gets what share of that money.
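
[Editor’s note: a minimal sketch of the kind of credit allocation Vaniver describes, assuming a logarithmic scoring rule. It is illustrative only, not any platform’s actual payout code: each forecaster is scored on how much their update improved (or worsened) the running forecast’s log score. Names and numbers are made up.]

```python
import math

def log_score(p, outcome):
    """Log score of a probability forecast, given whether the event happened."""
    return math.log(p if outcome else 1.0 - p)

def payouts(updates, outcome, prior=0.5):
    """Each forecaster's payout is how much their update improved (positive)
    or worsened (negative) the running forecast's log score."""
    probs = [prior] + updates
    return [log_score(probs[i + 1], outcome) - log_score(probs[i], outcome)
            for i in range(len(updates))]

# Three forecasters move the probability 0.5 -> 0.7 -> 0.6 -> 0.9; the event occurs.
print(payouts([0.7, 0.6, 0.9], outcome=True))
# The two who pushed the forecast toward the truth gain; the one who pushed it away loses.
```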

Vaniver: I’m curious how that connects with your conception of the thing. Does that seem basically right, or is there some part that’s missing or that you would characterize differently or something?

Ozzie: There are different aspects to it. One is that I think that’s one of the possible benefits. Hypothetically, it may be one of the main benefits. But even if it’s not an actual benefit, even if it doesn’t come out to be true, I think that there are other ways that this type of stuff would be quite useful.

Background on prediction markets and the Good Judgement Project

Ozzie: Also, to stand back a little bit, I’m not that excited about prediction markets in a formal way. My impression is that A) they’re not very legal in the US, and B) it’s very hard to incentivize people to forecast the right questions. Then C), with a lot of these forecasting systems there are issues around people wanting private information and stuff. There’s a lot of nasty things with those kinds of systems. They could be used for some portion of this.

Ozzie: The primary area I’m more interested in is forecasting applications similar to Metaculus and PredictionBook, and one that I’m working on right now. They work differently: basically, people build up good reputations by having good track records. Then there’s basically a variety of ways to pay people. The Good Judgement Project does it by basically paying people a stipend. There are around 125 superforecasters who work on specific questions for specific companies. I think you pay like $100,000 to get a group of them.

Ozzie: Just a quick question, are you guys familiar with how they do things specifically? Not many people are.

Ozzie: This is maybe one of the most interesting examples of paid forecasters. For them, they basically have the GJP Open, where they find the really good forecasters. Those then become the superforecasters. There are about 200 of these, and 125 are the ones that they’re charging other companies for.

Vaniver: Can you paint me more of a picture of who is buying the forecasting service and what they’re doing it for?

Ozzie: Yeah. For one thing, I’ll say that this area is pretty new. This is still on the cutting edge and small. OpenPhil bought some of their questions … I think they basically bought one batch. The questions I know about them asking were things like “what are the chances of nuclear war between the US and Russia?” “What are the chances of nuclear war between different countries?”, where one of the main ones was Pakistan and India. Also specific questions about outcomes of interventions that they were sponsoring. OpenPhil already internally does forecasting on most of its grant applications. When a grant is made, internally they would have forecasts about how well it’s going to do, and they track that. That is a type of forecasting.

Ozzie: The other groups that use them are often businesses. There are two buckets in how that’s useful. One of them is to drive actual answers. A second one is to get the reasoning behind those answers. A lot of times what happens—although it may be less useful for EAs—is that these are companies that maybe do not have optimal epistemologies, but instead have systematic biases. They basically purchase this team of people who do provably well at some of these types of questions. Those people have discussions about their reasoning, and the companies find that reasoning interesting.

Vaniver: Yeah. Should I be imagining an oil company that is deciding whether to build a bunch of wells in Ghana and has decided that they just want to outsource the question of what the political environment in Ghana is going to be for the next 10 years?

Ozzie: That may be a good interpretation. Or there’d be the specific question of what’s the possibility that there’ll be a violent outbreak.

Vaniver: Yeah. This is distinct from Coca Cola trying to figure out which of their new ad campaigns would work best.

Ozzie: This is typically different. They’ve been focused on political outcomes mostly. That’s assuming they were working with businesses at all. A lot of GJP stuff is covered by NDA, so we can’t actually talk about it. We don’t have that much information.

Ozzie: My impression is that some groups have found it useful and a lot of businesses don’t know what to do with those numbers. They get a number like 87% and they don’t have ways to directly make that interact with the rest of their system.

Ozzie: That said, there are a lot of nice things about it hypothetically. Of course some of it does come down to the users. A lot of businesses do have pretty large biases. That is a known thing. It’s hard to know if you have a bias or not. Having a team of people who have a track record of accuracy is quite nice if you want to get a third party check. Of course another thing for them is that it is just another way to outsource intellectual effort.

Positive cultural externalities of forecasting AI

Facilitator: Vaniver, is this changing your mind on anything essentially important?

Vaniver: The thing that I’m circling around now is a question closer to “in what contexts does this definitely work?” and then trying to build out from that to “in what ways would I expect it to work in the future?”. For example here, Ozzie didn’t mention this, but a similar thing that you might do is have pundits just track their predictions, or somehow encourage them to make predictions that then feed into some reputation score that may matter in the future. The people who consistently get economic forecasts right actually get more mindshare or whatever. There are versions of this that rely on the users caring about it, and then there are other versions that rely less on this.

Vaniver: The AI related thing that might seem interesting is something like: 2.5 years ago Eliezer asked this question at the Asilomar AI conference, which was “What’s the least impressive thing that you’re sure won’t happen in two years?” Somebody came back with the response of “We’re not going to hit 90% on the Winograd Schema.” [Editor’s note: the speaker was Oren Etzioni] This is relevant because a month ago somebody hit 90% on the Winograd Schema, which turned out to be 2.5 years after the prediction. This person did successfully predict the thing that would happen right after the deadline.

Vaniver: I think many people in the AI space would like there to be this sort of sense of “people are actually trying to forecast near progress”. Or sorry, maybe I should say medium term progress. Predicting a few years of progress is actually hard. But it’s categorically different from three months. You can imagine something where people building up the infrastructure to be good at this sort of forecasting does actually make the discourse healthier in various ways and gives us better predictions of the future.

Importance of software engineering vs. other kinds of infrastructure

Vaniver: Also I’m having some question of how much of this is infrastructure and how much of this is other things. For example when we look at the Good Judgement Project I feel like the software engineering is a pretty small part of what they did as compared to the selection effects. It may still be the sort of thing where we’re talking about infrastructure, though we’re not talking about software engineering.

Vaniver: The fact that they ran this tournament at all is the infrastructure, not the code underneath the tournaments. Similarly, even if we think about a Good Judgment Project for research forecasting in general, this might be the sort of cool thing that we could do. I’m curious how that landed for you.

Ozzie: There’s a lot of stuff in there. One thing is that on the question of “can we just ask pundits or experts”, I think my prior is that that would be a difficult thing, specifically in that in “Expert Political Judgment” Tetlock tried to get a lot of pundits to make falsifiable predictions and none of them wanted to …

Vaniver: Oh yeah. It’s bad for them.

Facilitator: Sorry. Can you tell me what you thought were the main points of what Vaniver was just saying then?

Ozzie: Totally. Some of them …

Facilitator: Yeah. I had a sense you might go “I have a point about everything he might have said so I’ll say all of them” as opposed to the key ones.

Ozzie: I also have to figure out what he said in that last bit as opposed to the previous bit. It’s one of them. There’s a question, the most recent one, when it comes to the Good Judgment Project: how much of it was technology versus other things that they did?

Ozzie: I have an impression that you’re focused on the AI space. You do talk about the AI space a lot. It’s funny because I think we’re both talking a bit on points that help the other side, which is kind of nice. You mentioned one piece where prediction was useful in the AI space. My impression is that you’re skeptical about whether we could get a lot more wins like that, especially if we tried to do it with a more systematic effort.

Vaniver: I think I actually might be excited about that instead of skeptical. We run into similar problems as we did with getting pundits to predict things. However, the thing that’s going on with professors and graduate students and research scientists is very different from the thing that’s going on with pundits and newspaper editors and newspaper readers.

Vaniver: Also it ties into the ongoing question of “is science real?” that the psychology replication stuff is connected to. Many people in computer science research in particular are worried about how machine learning research is too close to engineering or too finicky in various ways. So I could imagine a “Hey, will this paper replicate?” market catching on in computer science. I could imagine getting from that to a “Which state-of-the-art results will fall when?” thing. It also seems quite plausible that we could make that happen.

Ozzie: I have a few points now that connect to that. On pundits and experts, I think we probably agree that pundits often can be bad. Also, experts often are pretty bad at forecasting, it seems. That’s a repeatable finding.

Ozzie: For instance in the AI expert surveys, a lot of the distributions don’t really make sense with each other. But the people who do seem to be pretty good are this specific class of forecasters, specifically ones that we have evidence for, and that’s really nice. We only have so many of them right now but it is possible that we can get more of them.

Ozzie: It would be nice for more pundits to be more vocal about this stuff. I think Kelsey at Vox with their Future Perfect group is talking about making predictions. They’ve done some. I don’t know how much they’ll end up doing.

Privacy

Ozzie: When it comes to the AI space, there are questions about “what would interesting projects look like right now?” I’ve actually been dancing around AI, in part because I could imagine a bad world, or a possibly bad world, where we really help make it obvious what research directions are exciting, and then we help speed up AI progress by five years, and that could be quite bad. Though managing to do that in an interesting way could be important.

Ozzie: There are other questions about privacy. There’s the question of “is this interesting?”, and the question of “conditional on it being kind of interesting, should we be private about it?” Right now we’re playing for that first question.

Orgs using internal prediction tools, and the action-guidingness of quantitative forecasts

Ozzie: Some other things I’d like to bring into this discussion: a lot of this is already being systemized right now. They say that when you are an entrepreneur or something and try to build a tool, it’s nice to find that there are already internal tools. A lot of these groups are making internal systematic predictions at this point. They’re just not doing it using very formal methods.

Ozzie: For example, OpenPhil formally specifies a few predictions for grants. OpenAI also has a setup for internal forecasting; these are people at OpenAI who are ML experts, basically. That’s a decent sized thing.

Ozzie: There are several other organizations that are using internal forecasting for calibration. It’s just a fun game that forces them to get a sense of what calibration is like. Then for that there are questions of “How useful is calibration?”, “Does it give you better calibration over time?”
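
[Editor’s note: for readers unfamiliar with what an internal calibration exercise looks like in practice, here is a minimal sketch, not any particular organization’s actual setup: log each prediction with its stated probability and eventual outcome, then check whether events assigned, say, roughly 70% actually happen about 70% of the time. The example data is made up.]

```python
from collections import defaultdict

# Hypothetical log of (stated probability, did it happen?) pairs.
predictions = [(0.9, True), (0.7, True), (0.7, False), (0.3, False), (0.6, True)]

def calibration_table(predictions, n_buckets=10):
    """Bucket predictions by stated probability and compare each bucket's
    average stated probability with the observed frequency of the event."""
    buckets = defaultdict(list)
    for p, happened in predictions:
        buckets[min(int(p * n_buckets), n_buckets - 1)].append((p, happened))
    return {bucket: (sum(p for p, _ in rows) / len(rows),      # mean stated probability
                     sum(h for _, h in rows) / len(rows))       # observed frequency
            for bucket, rows in sorted(buckets.items())}

print(calibration_table(predictions))
```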

Ozzie: Right now none of them seem to be using PredictionBook. We could also talk a bit about … I think that thing is nice and shows a bit of promise. It may be that there are some decent wins to be had by making better tools for those people who right now aren’t using any specific tools, because they looked at them and found them to be inadequate. It’s also possible that even if they did use those tools it’d be a small win and not a huge win. That’s one area where there could be some nice value. But it’s not super exciting, so I don’t know if you want to push back against that and say “there’ll be no value in that.”

Vaniver: There I’m sort of confused. What are the advantages to making software as a startup where you make companies’ internal prediction tools better? This feels similar to Atlassian or something, where it’s like “yeah, we made their internal bug reporting or other things better”. It’s like yeah, sure, I can see how this is valuable. I can see how I’d make them pay for it. But I don’t see how this is …

Vaniver: …a leap towards the utopian goals, if we take something like Futarchy or … in your initial talk you painted some pictures of how in the future, if you had much more intelligence or much more sophisticated systems, you could do lots of cool things. [Editor’s note: see Ozzie’s sequence “Prediction-Driven Collaborative Reasoning Systems” for background on this] The software as a service vision doesn’t seem like it gets us all that much closer, and also feels like it’s not pushing at the hardest bit, which is something like the “getting companies to adopt it” thing. Or maybe what I think there is that the organizations themselves have to be structured very differently. It feels like there’s some social tech.

Ozzie: When you say very differently, do you mean very differently? Right now they’re already doing some predictions. Do you mean very differently as in predictions would be a very important aspect of the company? Because right now it is kind of small.

Vaniver: My impression is something like, going back to your point earlier, about getting back answers like 87% and not really knowing what to do with them. Similarly, I was in a conversation with Oli earlier about whether or not organizations have beliefs or world models. There’s some extent to which the organization has a world model that doesn’t live in a person’s head. It’s going to be something like: its beliefs are these forecasts on all these different questions, and also the actions that the organization takes are just driven by those forecasts without having a human in the loop. Whereas it feels to me that right now, often the thing that will happen is some executive will be unsure about a decision. Maybe they’ll go out to the forecasters. The forecasters will come back with 87%. Now the executive is still making the decision using their own mind. Whether or not that “87%” lands as “the actual real number 0.87” or something else is unclear, or not sensibly checked, or something. Does that make sense?

Ozzie: Yeah. Everything’s there. Let’s say that the 87% example is something that A) comes up if you’re a bit naïve about what you want, and B) comes up depending on how systematic your organization is about using numbers for things. If you happen to have a model of what the 87% means, that could be quite valuable. We see different organizations at different parts of the spectrum. Probably the one that’s most intense about this is GiveWell. GiveWell has their multiple gigantic sheets of lots of forecasts, essentially. It’s possible that it’ll be hard to make tooling that’ll be super useful to them. I’ve been talking with them. There are experiments to be tried there. They’re definitely in the case that as specific things change they may change decisions, and they’ll definitely change recommendations.

Ozzie: Basically they have this huge model where people estimate a bunch of parameters about moral decision making and a lot of other parameters about how well the different interventions are going to do. Out of all of that comes recommendations for what the highest expected values are.

Ozzie: That said, they are also in the domain that’s probably the most certain of all the EA groups in some ways, so they’re able to do that more. I think OpenAI is probably a little bit… I haven’t seen their internal models, but my guess is that they do care a lot about the specifics of the numbers and also are more reasonable about what to do with them.

Ozzie: I think the 87% example is a case of most CEOs don’t seem to know what a probability distribution is but I think the EA groups are quite a bit better.

Vaniver: When I think about civilization as a whole, there’s a disconnect between groups that think numbers are real and groups that don’t think numbers are real. There’s some amount of “ah, if we want our society to be based on numbers are real, somehow we need the numbers-are-real-orgs to eat everyone else. Or successfully infect everyone else.”

Vaniver’s steelman of Ozzie

Vaniver: What’s up?

Facilitator: Vaniver, given what you can see from all the things you discussed and touched on in the forecasting space, I wonder if you have some sense of the thing Ozzie is working on. If you imagine yourself actually being Ozzie and doing the things that he’s doing, I’m curious what the main things are that you don’t actually buy about what he’s doing.

Vaniver: Yeah. One of the things … maybe this is fair. Maybe this isn’t. I’ve rounded it up to something like personality difference where I’m imagining someone who is excited about thinking about this sort of tool and so ends up with “here’s this wide range of possibilities and it was fun to think about all of them, but of the wide range, here’s the few that I think are actually good”.

Vaniver: When I imagine dropping myself into your shoes, there’s much more of the … for me, the “actually good” is the bit that’s interesting (though I want to consider much of the possibility space for due diligence). I don’t know if that’s actually true. Maybe you’re like, “No. I hated this thing but I came into it because it felt like the value is here.”

Ozzie: I’m not certain. You’re saying I wasn’t focused on … this was a creative … it was enjoyable to do and then I was trying to rationalize it?

Vaniver: Not necessarily rationalize but I think closer to the exploration step was fun and creative. Then the exploitation step of now we’re actually going to build a project for these two things was guided by the question of which of these will be useful or not useful.

How to explore the forecasting space

Vaniver: When I imagine trying to do that thing, my exploration step looks very different. But this seems connected, because there’s still some amount of everyone having different exploration steps that are driven by their interests. Then also you should expect many people to not have many well-developed possibilities outside of their interests.

Vaniver: This may end up being good to the extent that people do specialize in various ways. If we just randomly reassigned jobs to everyone, productivity would go way down. But this thing where the interests matter, where you should actually only explore things that you find interesting, makes sense. There’s a different thing where I don’t think I see the details of Ozzie’s strategic map, in the sense of “here are the long term north star type things that are guiding us.” The one bit that I’ve seen that was medium term was the “yep, we could do the AI forecasting stuff, but it is actually unclear whether this is speeding up capabilities more than it’s useful”. How many years is a “fire alarm for general intelligence” worth? [Editor’s note: Vaniver is referring to this post by Eliezer Yudkowsky] Maybe the answer to that is “0” because we won’t do anything useful with the fire alarm even if we had it.

Facilitator: To make sure I followed, the first point was: you have a sense of Ozzie having explored a lot of the space initially and now exploiting some of the things he thinks may be more useful. But you wouldn’t have explored it that way yourself, potentially, because you wouldn’t really have felt that there would be something especially useful to find if you continued exploring?

Facilitator: Secondly, you’re also not yet sufficiently sold on the actual medium term things to think that the exploiting strategies are worth taking?

Vaniver: “Not yet sold” feels too strong. I think it’s more that I don’t see it. Not being sold implies something like … I would normally say I’m not sold on X when I can see it but I don’t see the justification for it yet, whereas here I don’t actually have a crisp picture of what seven year success looks like.

Facilitator: Ozzie which one of those feels more like “Argh, I just want to tell Vaniver what I’m thinking now”?

Ozzie: So, on exploration and exploitation. On the one hand, not that much time or resources have gone into this yet. Maybe a few full-time months to think about it and then several for making web apps. Maybe that was too much. I think it wasn’t.

Ozzie: The amount and variety of types of proposals that are on the table right now, compared to when I started, I’m pretty happy with for a few months of thinking. Especially since for me to get involved in AI would have taken quite a bit more time of education and stuff. It did seem like there were a few cheap wins at this point. I still kind of feel like that.

Importance and neglectedness of forecasting work

Ozzie: I also do get the sense that this area is still pretty neglected.

Vaniver: Yeah. I guess in my mind neglected means both that people aren’t working on it and that people should be working on it. Is that true for you also?

Ozzie: There are three aspects: importance, tractability, and neglectedness. It could be neglected but not important. I’m just saying here that it’s neglected.

Vaniver: Okay. You are just saying that people aren’t working on it.

Ozzie: Yeah. You can then talk about the questions of importance and tractability.

Facilitator: I feel like there are a lot of things that one can do. One can try to start a group house in Cambridge, one can try to teach rationality at the FHI. Forecasting … something about “neglected” doesn’t feel like it quite gets at the thing, because the space is sufficiently vast.

Ozzie: Yeah. The next part would be importance. I obviously think that it’s higher in importance than a lot of the other things that seem similarly neglected. Let’s say basically the combination of importance, neglectedness and tractability was pretty good for forecasting. I’m happy to spend a while getting into that.

Tractability of forecasting work

Vaniver: I guess I actually don’t care all that much about the importance, because I buy that if we could … in my earlier framing, move everyone to a “numbers-are-real” organization, that would be excellent. The thing that I feel most doomy about is something like the tractability, where it feels like most of the wins that people were trying to get before turned out to be extremely difficult and not really worth it. I’m interested in seeing the avenues that you think are promising in this regard.

Ozzie: Yeah. It’s an interesting question. I think a lot of people have the notion that we’ve had tons and tons of attempts at forecasting systems since Robin Hanson started talking about prediction markets; all of those have failed, therefore prediction markets have failed, and it’s not worth spending another person on it. It’s like a heap of dead bodies.

Ozzie: The viewpoint that I have is that it definitely doesn’t look that way. For one thing, the tooling: if you actually look at a lot of the tooling that’s been done, a lot of it is still pretty basic. One piece of evidence for that is the fact that almost no EA organizations are using it themselves.

Ozzie: That could also be because it’s really hard to make good tooling. If you look at it, there are basically the non-prediction-market systems; in terms of prediction markets there were also a few attempts, but the area is kind of illegal. Like I said, there are issues with prediction markets.

Ozzie: If you look at non-prediction-market tournament applications, basically you have a few. The GJP doesn’t make their own; they’ve used Cultivate Labs. Now they’re starting to try to make their own systems as well. But the GJP people are mostly political scientists and such, not developers.

Ozzie: A lot of the experiments they’ve done are political. It’s not like engineering questions about how to build an awesome engineering infrastructure. My take on that is that if you put some really smart engineer/entrepreneur in that type of area, I’d expect them to generally have a very different approach.

Vaniver: There’s a saying from Nintendo: “if your game is not fun with programmer art, it won’t be fun in the final product” or something. Similarly, I can buy that there’s some minimum level of tooling that we need for these sorts of forecasts to be sensible at all. But it feels to me that if I expected forecasting to be easy in the relevant ways, the shitty early versions would have succeeded without us having to build later good versions.

Ozzie: There’s a question of what “enough” is. They definitely have succeeded to some extent. PredictionBook has been used by Gwern and a lot of other people. Some also use their own setups, and Metaculus and stuff… So you can actually see a decent amount of activity. I don’t see many other areas that have nearly that level of experimentation. There are very few other things being used to the extent that predictions are used that we could imagine as future EA web apps.

Vaniver: The claim that I’m hearing there is something like “I should be comparing PredictionBook and Metaculus and similar things to reciprocity.io or something, as this is just a web app made in their spare time and if it actually sees use that’s relevant”.

Ozzie: I think that there’s a lot of truth to that, though it may not exactly be the case. Maybe we’re a bit past reciprocity.

Vaniver: Beeminder also feels like it’s in this camp to me, although less EA specific.

Ozzie: Yeah. Or like Anki.

Ozzie: Right.

Technical tooling for Effective Altruism

Ozzie: There’s one question, which is A) do we think that there’s room for technical tooling around Effective Altruism? And B) if there is, what are the areas that seem exciting? I don’t see many other exciting areas. Of course, that is another question. It’s not exactly about defending forecasting, but more like: if you don’t like forecasting, what do you like? Because one conclusion is that we just don’t like EA tools and there’s almost nothing in the space, since there’s not much that seems obviously more exciting. But that’s a very different side of the argument.

Vaniver: Yeah. It’s interesting, because on the one hand I do buy the frame that it might make sense to just try to make EA tools and then figure out what the most promising EA tool is. Then also I can see the thing going in the reverse direction, which is something like: if none of the opportunities for EA tools are good, then people shouldn’t try it. Also, if we do in fact come up with 12 great opportunities for EA tools, this should be a wave of EA grants or whatever.

Vaniver: I would be excited about something double crux-shaped. But I worry this runs into the problems that argument mapping and mind mapping have all run into before. There’s something that’s nice about doing a double crux which makes it grounded out in the trace that one particular conversation takes, as opposed to actually trying to represent minds. I feel like most of the other EA tools would be … in my head it starts as silly one-offs. I’m thinking of things like, for the 2016 election, there was a vote-swapping thing to try to get third party voters in swing states to vote for whatever party in exchange for third party votes in safe states. I think Scott Aaronson promoted it but I don’t think he made it. But it feels to me like that sort of thing. We may end up seeing lots of things like that, where it’s like “if we had software engineers ready to go, we would make these projects happen”. Currently I expect it’s sufficient that people do that just for the glory of having done it. But the Beeminder style things are more like, “oh yeah, actually this is the sort of thing where if it’s providing value then we should have people working on it, and the people will be paid by the value they’re providing”. Though that move is a bit weird because that doesn’t quite capture how LessWrong is being paid for...

Ozzie: Yeah. Multiple questions on that. This could be a long winding conversation. One would be “should things like this be funded by the users or by other groups?”

Ozzie: One thing I’d say is that … I joined 80,000 Hours about four years ago. I worked with them to help them with their application and decided at that point that it should be much less of an application and more like a blog. I helped them scale it down.

Ozzie: I was looking for other opportunities to make big EA apps. At that point there was not much money. I kind of took a detour and I’m coming back to it in some ways. In a way I’ve experienced this with Guesstimate, which has been used a bit. Apps for Effective Altruism have advantages and disadvantages. One disadvantage is that writing software is an expensive thing. An advantage is that it’s very tractable. By tractable I mean you could say “if I spent $200,000 and three engineer years I could expect to get this thing out”. Right now we are in a situation where we do hypothetically have a decent amount of money, if it could beat a specific bar. The programmers don’t even have to be these intense EAs (although it is definitely helpful).

5-MIN BREAK

Tractability of forecasting within vs outside EA

Ozzie: I feel like we both kind of agree that, hypothetically, if a forecasting system was used and people decided it was quite useful, and we could get to the point where EA orgs were making decisions in big ways with it, that could be a nice thing to have. But there’s disagreement about whether that’s a realistic possibility, and whether existing evidence shows us that it won’t happen.

Vaniver: I’m now also more excited about the prospects of this for the EA space. Where I imagine a software engineer coming out of college saying “My startup idea is prediction markets”, my response is “let’s do some market research!” But in the EA space the market research is quite different, because people are more interested in using the thing, and there’s more money for crazy long-shots… or not crazy long-shots, but rather, “if we can make this handful of people slightly more effective, there are many dollars on the line”.

Ozzie: Yeah.

Vaniver: It’s similar to a case where you have this obscure tool for Wall Street traders, and even if you only sell to one firm you may just pay for yourself.

Ozzie: I’m skeptical whenever I hear an entrepreneur saying “I’m doing a prediction market thing”. It’s usually crypto related. Interestingly most prediction platforms don’t predict their own success, and that kind of tells you something…

(Audience laughter)

Vaniver: Well this is just like the prediction market on “will the universe still exist”. It turns out it’s just asymmetric who gets paid out.

Medium-term goals and lean startup methodology

Facilitator: Vaniver, your earlier impression was that you didn’t have a sense of what medium term progress would look like?

Vaniver: It’s important to flag that I changed my mind. When I think about forecasting as a service for the EA space, I’m now more optimistic, compared to when I think of it as a service on the general market. It’s not surprising OpenPhil bought a bunch of Good Judgement forecasters. Whereas it would be a surprise if Exxon bought GJP questions.

Vaniver: Ozzie do you have detailed visions of what success looks like in several years?

Ozzie: I have multiple options. The way I see it is that… when lots of YC startups come out, they have a sense that “this is an area that seems kind of exciting”. We kind of have evidence that it may be interesting, and also evidence that it may not be interesting. We don’t know what success looks like for an organisation in this space, though hopefully we’re competent and we could work quickly to figure it out. And it seems things are exciting enough for it to be worth that effort.

Ozzie: So AirBnB and the vast majority of companies didn’t have a super clear idea of how they were going to be useful when they started. But they did have good inputs, and a vague sense of what kinds of cool outputs there could be.

Ozzie: There’s evidence that statistically this seems to be what works in startup land.

Ozzie: There is also some evidence against. There was a question of “if you have a few small things that are working but are not super exciting, does that make it pretty unlikely you’ll see something in this space?”

Ozzie: It would be hard to make a strong argument that YC wouldn’t fund any companies in such cases. They do fund things without any evidence of success.

Vaniver: But also if you’re looking for moonshots, mild success the first few times is evidence against “the first time it just works and everything goes great”.

Limitations of current forecasting tooling

Ozzie: Of course in that case your question is exactly what it is that’s been tried. I think there are arguments that there are more exciting things on the horizon which haven’t been tried.

Ozzie: Now we have PredictionBook, Metaculus, and hypothetically Cultivate Labs and another similar site. Cultivate Labs does enterprise gigs, and is used by big companies like Exxon for ideation and similar things. They’re a YC company and have around 6 people. But they haven’t done amazingly well. They’re pretty expensive to use. At this point you’d have to spend around $400 per month for one instance. And even then you get a specific enterprise-y app that’s kind of messy.

Ozzie: Then if you actually look at the amount of work done on PredictionBook and Metaculus, it’s not that much. PredictionBook might have had 1-2 years of engineering effort, around 7 years ago. People think it’s cool, but it’s not really a serious site. As for Metaculus, I have a lot of respect for their team. That project was probably around 3-5 engineering years.

Ozzie: They have a specific set of assumptions I kind of disagree with. For example, everyone has to post their questions in one main thread, and separate communities only exist by having subdomains. They’re mostly excited about setting up those subdomains for big projects.

Ozzie: So if a few of us wanted to experiment with “oh, let’s make a small community, have some privacy, and start messing around with questions” it’s hard to do that…

Vaniver: So what would this be for? Who wants their own instances? MMO guilds?

Jacob: Here’s one example of the simplest thing you currently cannot do. (Or could not do around January 1st 2019.) Four guys are hanging out, and they wonder “When will people next climb Mount Everest?” They then just want to note down their distributions for this and get some feedback, without having to specify everything in a Google doc or a spreadsheet which doesn’t have distributions.

Facilitator: Which bit breaks?

Jacob: You cannot make small private channels for multiple people, which take 5 minutes to set up, where everyone records custom distributions.

Vaniver: So I see what you can’t do. What I want is the group that wants to do it. For example, one of my housemates loves these sites, but also is the sort of nerd that loves these kinds of sites in general. So should I just imagine there’s some MIT fraternity where everyone is really into forecasting so they want a private domain?

Ozzie: I’d say there’s a lot of uncertainty. A bunch of groups may be interested, and if a few are pretty good and happen to be excited, that would be nice. We don’t know who those are yet, but we have ideas. There are EA groups now, and a lot of them are kind of already doing this; we could enable them to do it without having to pay $400-$1000 per month, or in a way that could make stuff public knowledge between groups… For other smaller EA groups that just wanted to experiment, the current tooling would create some awkwardness.

Ozzie: If we want to run experiments on interesting things to forecast, e.g. “how valuable is this thing?”, or stuff around evaluation of LessWrong posts, we’d have to set up a new instance for each. Or maybe we could have one instance and use it for all experiments, but that would force a single privacy setting for all those experiments.

Ozzie: Besides that, at this point, I raised some money and spent like $11,000 to get someone to program. So a lot of this tooling work is already done and these things are starting to be experimented with.

Knowledge graphs and moving beyond questions-as-strings

Ozzie: In the medium-term there are a lot of other interesting things. With the systems right now, a lot of them assume all questions are strings. So if you’re going to have 1000 questions, it’s impossible to understand them and for other people to get value from them. So if you wanted to organise something like “for every EA org, how much money and personnel will they have each year for the coming 10 years?”, it would be impossible with current methods.

Vaniver: Instead we’d want like a string prefix combined with a list of string postfixes?

Ozzie: There are many ways to do it. I’m experimenting with using a formal knowledge graph where you have formal entities.

Vaniver: So there would be a pointer to the MIRI object instead of a string?

Ozzie: Yeah, and that would include information about how to find information about it from Wikipedia, etc. So if someone wanted to set up an automated system to do some of this, they could. Combining this with bot support would enable experiments where data scientists and ML people basically augment human forecasts with AI bots.

Vaniver: So, bot support here is like participants in the market (I’ll just always call a “forecast-aggregator” a market)? Somehow we have an API where they can just ingest questions and respond with distributions?

Ozzie: Even without bots, just organising structured questions in this way makes it easier for both participants and observers to get value.
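
[Editor’s note: a rough sketch of what “questions as entities rather than strings” plus bot support might look like. The data model, identifiers, and the bot interface here are illustrative assumptions, not a description of any existing platform.]

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Entity:
    """A formal entity a question can point to, instead of a free-form string."""
    id: str                         # hypothetical identifier, e.g. "org:miri"
    wikipedia_url: Optional[str]    # where a bot could fetch background data

@dataclass(frozen=True)
class StructuredQuestion:
    subject: Entity
    metric: str                     # e.g. "annual_budget_usd" or "staff_count"
    year: int

miri = Entity("org:miri", "https://en.wikipedia.org/wiki/Machine_Intelligence_Research_Institute")

# "For every EA org, how much money and personnel will they have for the coming 10 years"
# becomes a grid of structured questions rather than hundreds of hand-written strings.
questions = [StructuredQuestion(miri, metric, year)
             for metric in ("annual_budget_usd", "staff_count")
             for year in range(2020, 2030)]

def bot_forecast(question: StructuredQuestion) -> Tuple[float, float]:
    """A bot ingests a structured question and returns a distribution,
    summarized here as (mean, standard deviation) of a normal."""
    ...  # e.g. look up historical data for question.subject and extrapolate
```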

Summary of cruxes

Facilitator: Yeah, I don’t know… You chatted for a while, I’m curious what feels like some of the things you’ll likely think a bit more about, or things that seem especially surprising?

Ozzie: I got the sense that we agreed on more things than I was expecting to. It seems a lot of it now may be fleshing out what the mid-term would look like, and seeing if there are parts of it you agree are surprisingly useful, or if it does seem like all of them are long-shots.

Vaniver: When I try to summarise your cruxes, what would change your mind about forecasting, it feels like 1) if you thought there was a different app/EA tool to build, you would bet on that instead of this.

Ozzie: I’d agree with that.

Vaniver: And 2) if the track-record of attempts were more like… I don’t know what word to use, but maybe like “sophisticated” or “effortful”? If there were more people who were more competent than you and failed, then you’d decide to give up on it.

Ozzie: I agree.

Vaniver: I didn’t get the sense that there were conceptual things about forecasting that you expected to be surprised by. In my mind, getting data scientists to give useful forecasts, even if the questions are in some complicated knowledge graph or something, seems moderately implausible. Maybe I could transfer that intuition, but maybe the response is “they’ll just attempt to do base-rate forecasting, and it’s just an NLP problem to identify the right base rates”.

Vaniver: Does it feel like it’s missing some of your cruxes?

Facilitator: Ozzie, can you repeat the ones he did say?

Audience: Good question.

Ozzie: I’m bad at this part. Now I’m a bit panicked because I feel like I’m getting cornered or something.

Vaniver: My sense was… 1) if there are better EA tools to build, you’d build them instead. 2) if better tries had failed, it would feel less tractable. And 3) absence of conceptual uncertainties that we could resolve now. It feels like it’s not “previous systems are bad because they got the questions wrong” or “question/answer is not the right format”. It’s closer to “previous systems are bad because their question data structure doesn’t give us the full flexibility that we want”.

Vaniver: Maybe that’s a bad characterization of the automation and knowledge graph stuff.

Ozzie: I’d definitely agree with the first two, although the first one is a bit more expansive than tools. If there was, e.g., a programming tool that I’d be better suited for and that had higher EV, I’d do that instead. Number two, on tries: I agree, if there were one or two other top programming teams who tried a few of these ideas and were very creative about it, and failed, and especially if they had software we could use now! (I’d feel much better about not having to make software.) Then for three, the absence of conceptual uncertainties: I don’t know exactly how to pin this down.

Facilitator: I don’t know if we should follow this track.

Vaniver: I’m excited about hearing what Ozzie’s conceptual uncertainties are.

Facilitator: Yeah, I agree actually.

Ozzie’s conceptual uncertainties

Ozzie: I think the way I’m looking at this problem is one where there are many different types of approaches that could be useful. There are many kinds of people who could be doing the predicting. There are many kinds of privacy. Maybe there would be more EAs using it, or maybe we want non-EAs of specific types. And within EA vs non-EA, there are many different kinds of things we might want to forecast. There are many creative ways of organising questions such that forecasting leads to improved accuracy. And I have a lot of uncertainty about this entire space, and about what areas will be useful and what won’t.

Ozzie: I think I find it unlikely that absolutely nothing will be useful. But I do find it very possible that it’ll just be too expensive to find out useful things.

Vaniver: If it turned out nothing was useful, would it be the same reason for different applications, or would it be “we just got tails on every different application?”

Ozzie: If it came out people just hate using the tooling, then no matter what application you use it for it will kind of suck.

Ozzie: For me a lot of this is a question of economics. Basically, it requires some cost to both build the system and then get people to do forecasts, and then to make the questions and do the resolution. In some areas the cost will be higher than the value, and in some the value will be higher than the cost. It kind of comes down to a question of efficiency. Though it’s hard to know, because there’s always the question of “maybe if I had implemented this feature, things would have been different?”

Vaniver: That made me think of something specific. When we look at the success stories, they are things like weather and sports, where for sports you had to do some amount of difficult operationalisation, but you sort of only had to do it once. The step I expect to be hard across most application domains is the “I have a question, and now I need to turn it into a thing-that-can-be-quantitatively-forecasted” step. Then I became kind of curious whether we could get relatively simple NLP systems that could figure out the probability that a question is well-operationalised or not, and have some sort of automatic suggestions like “ah, consider these cases” or whatever, or “write the question this way rather than that way”.

Ozzie: From my angle, you could kind of call those “unique questions”, where the marginal cost per question is pretty high. I think that if we were in an ecosystem where things were tremendously useful, the majority of questions would not be like this.

Vaniver: Right so if I ask about the odds I will still be together with my partner a while from now, I’d be cloning the standard “will this relationship last?” question and substituting new pointers?

Ozzie: Yeah. And a lot of questions would be like “GDP for every country for every year” so there could be a large set of question templates in the ecosystem. So you don’t need any fancy NLP; you could get pretty far with trend analysis and stuff.
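
[Editor’s note: a toy illustration of the question-template idea Ozzie describes here: one carefully operationalised template stamped out across many parameter values, so the operationalisation cost is paid once. The template text and parameter values are made up for the example.]

```python
from itertools import product

TEMPLATE = "What will {metric} be for {country} in {year}?"

def instantiate(metric, countries, years):
    """Stamp one operationalised template out over many parameter combinations."""
    return [TEMPLATE.format(metric=metric, country=country, year=year)
            for country, year in product(countries, years)]

questions = instantiate("GDP (current US$)", ["Ghana", "India", "Pakistan"], range(2020, 2031))
print(len(questions))  # 33 questions from a single template
```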

Ozzie: On the question of whether data scientists would be likely to use it, that comes down to funding and incentive structures.

Ozzie: If you go on Upwork and pay $10k to a data scientist, they could give you a decent extrapolation system, and you could then just build that into a bot and hypothetically keep pumping out these forecasts as new data come in. Pipelines like that already exist. What this would be doing is providing infrastructure to help support them, basically.
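
[Editor’s note: a minimal sketch of the sort of extrapolation bot Ozzie gestures at: fit a simple trend to historical data and emit a forecast distribution for the next period. The data is illustrative, and how the result gets posted back to a forecasting platform is left out.]

```python
import statistics

def linear_trend_forecast(history):
    """Fit a least-squares line to (year, value) pairs and return a (mean, stdev)
    forecast for the year after the last observation."""
    xs = [x for x, _ in history]
    ys = [y for _, y in history]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in history)
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in history]
    next_year = max(xs) + 1
    return slope * next_year + intercept, statistics.pstdev(residuals)

# Illustrative data only; a real bot would re-run this as new data come in
# and submit the resulting distribution to the forecasting platform.
print(linear_trend_forecast([(2015, 3.1), (2016, 3.4), (2017, 3.9), (2018, 4.2)]))
```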

END OF TRANSCRIPT

At this point the conversation opened up to questions from the audience.

While this conversation was inspired by the double-crux technique, there is a large variation in how such sessions might look. Even when both participants retain the spirit of seeking the truth and changing their minds in that direction, some disagreements dissipate after less than an hour, others take 10+ hours to resolve, and some remain unresolved for years. It seems good to have more public examples of genuine truth-seeking dialogue, but at the same time it should be noted that such conversations might look very different from this one.