Nice work. A few comments/questions:
I think you’re being harsh on yourselves by emphasising the cost/benefit ratio. For one, the forecasters were asked to predict Elizabeth van Norstrand’s distributions rather than their mean, right? So this method of scoring would actually reward them for being worse at their jobs, if they happened to put all their mass near the resolution’s mean as opposed to predicting the correct distribution. IMO a more interesting measure is the degree of agreement between the forecasters’ predictions and Elizabeth’s distributions, although I appreciate that that’s hard to condense into an intuitive statistic.
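To make that worry concrete, here is a minimal sketch (my own toy example with made-up normal distributions, nothing taken from the experiment) of how a forecast that piles its mass near the resolution's mean can beat a forecast that matches the full resolution distribution when scored only at the mean, while losing badly on a distribution-agreement measure such as KL divergence:

```python
# Hypothetical example: scoring at the resolution's mean vs. measuring
# agreement with the resolution distribution. All numbers are made up.
import numpy as np
from scipy import stats

resolution = stats.norm(loc=10, scale=3)       # hypothetical resolution distribution
matching = stats.norm(loc=10, scale=3)         # forecaster who predicts the distribution
overconfident = stats.norm(loc=10, scale=0.5)  # forecaster who piles mass near the mean

# Log density evaluated only at the resolution's mean: the overconfident
# forecast wins, even though it is the worse distributional prediction.
for name, f in [("matching", matching), ("overconfident", overconfident)]:
    print(name, "log density at mean:", round(f.logpdf(resolution.mean()), 3))

# Agreement with the full distribution, KL(resolution || forecast),
# estimated by Monte Carlo; lower is better, and now "matching" wins.
samples = resolution.rvs(size=200_000, random_state=0)
for name, f in [("matching", matching), ("overconfident", overconfident)]:
    kl = np.mean(resolution.logpdf(samples) - f.logpdf(samples))
    print(name, "KL divergence:", round(float(kl), 3))
```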
An interesting question this touches on is “Can research be parallelised?”. It would be nice to investigate this more closely. It feels as though different types of research questions might be amenable to different forms of parallelisation, involving more or less communication between individual researchers and more or less sophisticated aggregation functions. For example, a strategy where each researcher is explicitly assigned a separate portion of the problem to work on, with the conclusions synthesised in a discussion among the researchers at the end, might be appropriate for some questions. Do you have any plans to explore directions like these, or do you think that what you did in this experiment (as I understand it, ad-hoc cooperation among the forecasters, with each submitting a distribution and these then being averaged) is appropriate for most questions? If so, why?
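For concreteness, here is a minimal sketch of the aggregation I have in mind, under my reading of “averaged” as an equal-weight mixture of the submitted distributions (the experiment's actual pooling may well differ):

```python
# Hypothetical illustration of pooling forecaster submissions as an
# equal-weight mixture; the distributions below are invented.
import numpy as np
from scipy import stats

forecasts = [
    stats.norm(loc=8, scale=2),
    stats.norm(loc=12, scale=4),
    stats.lognorm(s=0.5, scale=10),
]

def aggregate_pdf(x, components):
    """Equal-weight mixture density of the submitted distributions."""
    return np.mean([c.pdf(x) for c in components], axis=0)

xs = np.linspace(0.01, 40, 4000)
density = aggregate_pdf(xs, forecasts)
print("mixture integrates to roughly 1:",
      round(float(np.sum(density) * (xs[1] - xs[0])), 3))
```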
About the anticorrelation between importance and “outsourceability”: investigating which types of questions are outsourceable would be super interesting. You’d think there’d be some connection between outsourceable questions and parallelisable problems in computer science. Again, different aggregation functions/incentive structures will lead to different questions being outsourceable.
One potential use case for this kind of thing could be as a way of finding reasonable distributions over answers to questions that require so much information that a single person or small group couldn’t do the research in an acceptable amount of time or correctly synthesise their conclusions by themselves. One could test how plausible this is by looking at how aggregate performance tracks complexity on problems where one person can do the research alone. So an experiment like the one you’ve done, but on questions of varying complexity, starting from trivial up to the limit of what’s feasible.
Great questions! I’ll try to respond to the points in order.
Question 1
The distinction between the forecasters/Elizabeth predicting her initial distributions versus the final mean was indeed rather confusing. I later wrote some internal notes to think through some of the implications in more detail. You can see them here.
I have a lot of uncertainty about how best to structure these setups. I think, though, that for cost-effectiveness purposes Elizabeth’s initial distributions should be seen as estimates of the correct value, which is what she occasionally gave later. As such, for cost-effectiveness we are interested in how well the forecasters did at estimating this correct value vs. how well she did at estimating it.
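As a minimal sketch of that comparison (the distributions and resolved value here are invented for illustration, and our actual scoring may differ), one could score both the forecasters’ aggregate and Elizabeth’s initial distribution against the later correct value:

```python
# Hypothetical comparison of the forecaster aggregate vs. Elizabeth's initial
# distribution, both scored against a later point estimate of the correct
# value. Everything below is invented for illustration.
from scipy import stats

correct_value = 11.2                                 # hypothetical later estimate
elizabeth_initial = stats.norm(loc=9, scale=4)       # hypothetical initial distribution
forecaster_aggregate = stats.norm(loc=11, scale=3)   # hypothetical pooled forecast

for name, dist in [("Elizabeth (initial)", elizabeth_initial),
                   ("Forecaster aggregate", forecaster_aggregate)]:
    # Log score against the resolved point value; higher is better.
    print(name, "log score:", round(dist.logpdf(correct_value), 3))
```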
Separately, that correct value is of course itself an estimate; there’s further theoretical work to be done on what exactly it should have been estimating, and empirical work to be done to get a sense of how well it holds up against even more trustworthy estimates.
I personally don’t regard the cost-effectiveness here as that crucial; I’d instead treat much of this experiment as a set of structures that could be applied to more important things in other cases. Elizabeth’s time was rather inexpensive compared to other people/procedures we may want to use in the future, and we could also invest fixed costs in improving the marginal costs of such a setup.
Question 2
We haven’t talked about this specific thing, but I could definitely imagine it. The general hope is that even without such a split, many splits would happen automatically. One big challenge is getting the splits right. One might initially think that forecaster work should be split by partitions of questions, but this may be pretty suboptimal. It may be that some forecasters have significant comparative advantages in techniques that span questions; for instance, some people are great at making mathematical models, and others are great at adjusting the tails of distributions to account for common biases. I think of this more as dividing cognitive work by trading strategies than by questions.
There are a whole ton of possible experiments to be done here, because there are many degrees of freedom. Pursuing these in an effective way is one of our main questions. Of course, if we could have forecasters help forecast which experiments would be effective, then that could help bootstrap a process.
Question 3
We’ve come up with a few “rubrics” to evaluate how effective a given question or question set will be. The main factors are things like:
Tractability (How much progress for how many resources can be made? What if all the participants are outside the relevant organizations/work?)
Importance (How likely is this information to be valuable for changing important decisions?)
Risk (How likely is it that this work will really anger someone or lead to significant downsides?)
I think it’s really easy to spend a lot of money predicting ineffective things if you are not careful. Finding opportunities that are EV-positive is a pretty significant challenge here. I think my general intended strategy is a mix of “try a bunch of things” and “try to set up a system so the predictors themselves could predict the rubric elements or similar for a bunch of things they could predict.”
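As an illustrative sketch of how such a rubric could be encoded (the fields mirror the factors above, but the combination into a single score is purely a toy example of mine, not our actual procedure):

```python
# Illustrative only: a tiny data structure for the rubric factors above,
# with a toy prioritisation that combines them. Fields and weighting are invented.
from dataclasses import dataclass

@dataclass
class QuestionRubric:
    tractability: float  # 0-1: how much progress per unit of resources
    importance: float    # 0-1: how likely to change important decisions
    risk: float          # 0-1: chance of significant downsides

    def priority(self) -> float:
        """Toy prioritisation: expected value scaled down by risk."""
        return self.tractability * self.importance * (1 - self.risk)

candidate = QuestionRubric(tractability=0.6, importance=0.8, risk=0.2)
print("priority score:", round(candidate.priority(), 3))
```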
Question 4
Agreed! That said, there are many possible dimensions for “complexity”, so there’s a lot of theoretical and practical work to be done here.
Question 3
It seems like Ozzie is answering on a more abstract level than the one at which the question was asked. There’s a difference between “How valuable will it be to answer question X?” (what Ozzie addressed) and “How outsourceable is question X?” (what Lawrence’s question was about).
I think that outsourceability would be a sub-property of Tractability.
In more detail, some properties I imagine affect outsourceability are whether the question:
1) Requires in-depth domain knowledge/experience
2) Requires substantial back-and-forth between question asker and question answerer to get the intention right
3) Relies on hard-to-communicate intuitions
4) Cannot easily be converted into a quantitative distribution
5) Has independent subcomponents which can be answered separately and don’t rely on each other (related to Lawrence’s point about tractability)