Questions around Making Reliable Evaluations
Most existing forecasting platform questions are for very clearly verifiable questions:
“Who will win the next election?”
“How many cars will Tesla sell in 2030?”
But many of the questions we care about are much less verifiable:
“How much value has this organization created?”
“What is the relative effectiveness of AI safety research vs. bio risk research?”
One attempted solution would be to have an “expert panel” assess these questions, but this opens up a bunch of issues. How could we know how much to trust this group to be accurate, precise, and understandable?
The question “How can we trust a person or group to give reasonable answers to abstract questions?” is quite generic and abstract, but it’s a start.
I’ve decided to investigate this as part of my overall project on forecasting infrastructure. I’ve recently been working with Elizabeth on some high-level research.
I believe that this general strand of work could be useful both for forecasting systems and also for the more broad-reaching evaluations that are important in our communities.
Early concrete questions in evaluation quality
One concrete topic that’s easily studiable is evaluation consistency. If the most respected philosopher gives wildly different answers to “Is moral realism true?” on different dates, it makes you question the validity of their belief. Or perhaps their belief is fixed, but we can determine that there was significant randomness in the processes that determined it.
Daniel Kahneman apparently thinks a version of this question is important enough to be writing his new book on it.
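As a minimal sketch of how one might start studying this, you could ask the same evaluator the same question on several dates and look at the spread of their answers; the evaluator and numbers below are made up.

```python
import statistics

# Hypothetical repeated answers from one evaluator to the same question,
# e.g. "What is the probability that moral realism is true?", asked on
# four different dates.
repeated_answers = [0.7, 0.4, 0.8, 0.3]

mean = statistics.mean(repeated_answers)
spread = statistics.stdev(repeated_answers)

# A large spread relative to the scale of the answer suggests the
# evaluation process itself is noisy, not merely uncertain.
print(f"mean = {mean:.2f}, within-evaluator stdev = {spread:.2f}")
```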
Another obvious topic is the misunderstanding of terminology. If an evaluator understands “transformative AI” very differently from the people reading their statements about transformative AI, they may make statements that get misinterpreted.
These are two specific examples of questions, but I’m sure there are many more. I’m excited about understanding existing work in this overall space more, and getting a better sense of where things stand and what the next right questions are to be asking.
> “How much value has this organization created?”
Can insights from prediction markets help us select better proxies and decision criteria, or do we expect people to be too poorly entangled with the truth of these matters for that to work? Do orgs always require someone managing the ontology and incentives to be super competent at that in order to do well? De facto improvements here are worth billions (project management tools, Slack, email add-ons for assisting with management, etc.).
I think that prediction markets can help us select better proxies, but the initial setup (at least) will require people who are pretty clever with ontologies.
For example, say a group comes up with 20 proposals for specific ways of answering the question, “How much value has this organization created?”. A prediction market could then predict the effectiveness of each proposal.
I’d hope that over time people would put together lists of “best” techniques to formalize questions like this, so doing it for many new situations would be quite straightforward.
Another related idea we played around with, but which didn’t make it into the final whitepaper:
What if we just assumed that Brier score was also predictive of good judgement? Then people could create distributions over several measures of “how well will this organization do”, and we could use standard probability theory and aggregation tools to create an aggregated final measure.
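A minimal sketch of that aggregation step, under the stated assumption that Brier score tracks judgement quality; the forecaster names, scores, and weighting function are invented, and point estimates stand in for full distributions.

```python
# Hypothetical forecasters: each has a historical Brier score (lower is
# better) and an estimate of "how well will this organization do" on
# some agreed 0-100 scale.
forecasters = {
    "alice": {"brier": 0.10, "estimate": 70.0},
    "bob":   {"brier": 0.25, "estimate": 40.0},
    "carol": {"brier": 0.18, "estimate": 55.0},
}

# One simple weighting choice: 1 - Brier score, so better-calibrated
# forecasters count for more. Any decreasing function of the Brier
# score would do for this sketch.
def weight(brier: float) -> float:
    return max(0.0, 1.0 - brier)

total_weight = sum(weight(f["brier"]) for f in forecasters.values())
aggregate = sum(
    weight(f["brier"]) * f["estimate"] for f in forecasters.values()
) / total_weight

print(f"aggregated estimate: {aggregate:.1f}")
```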
The way we handled this with Verity was to pick a series of values, like “good judgement”, “integrity,” “consistency” etc. Then the community would select exemplars who they thought represented those values the best.
As people voted on which proposals they liked best, we would weight their votes by:
1. How much other people (weighted by their own score on that value) thought they had that value.
2. How similarly they voted to the exemplars.
This sort of “value judgement” allows for fuzzy representation of high-level judgement, and is a great supplement to more objective metrics like Brier score, which can only measure well-defined questions.
Eigentrust++ is a great algorithm that has the properties needed for this judgement-based reputation; a rough sketch of the weighting idea follows below. The Verity Whitepaper goes into more depth on how this would be used in practice.
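Here is the rough sketch referred to above: a much-simplified EigenTrust-style iteration rather than full Eigentrust++, and not taken from the whitepaper; the people, endorsement numbers, and damping constant are invented.

```python
import numpy as np

# Hypothetical endorsement matrix for one value (say "good judgement"):
# endorse[i][j] is how strongly person i vouches for person j.
people = ["alice", "bob", "carol", "dana"]
endorse = np.array([
    [0.0, 0.8, 0.3, 0.1],
    [0.6, 0.0, 0.5, 0.2],
    [0.7, 0.4, 0.0, 0.3],
    [0.2, 0.3, 0.6, 0.0],
])

# Community-chosen exemplars act as the pre-trusted set.
exemplars = np.array([1.0, 0.0, 1.0, 0.0])
exemplars /= exemplars.sum()

# Row-normalize so each person distributes one unit of endorsement.
C = endorse / endorse.sum(axis=1, keepdims=True)

# Power iteration: trust flows through endorsements, mixed with a bias
# toward the exemplars (the damping idea from EigenTrust).
alpha = 0.15
trust = np.full(len(people), 1.0 / len(people))
for _ in range(50):
    trust = (1 - alpha) * C.T @ trust + alpha * exemplars

# These scores would then be used to weight each person's votes.
for name, score in zip(people, trust):
    print(f"{name}: {score:.3f}")
```

Eigentrust++ adds further robustness against malicious or colluding raters that this sketch leaves out.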
Deference networks seem underrated.
One way to look at this is, where is the variance coming from? Any particular forecasting question has implied sub-questions, which the predictor needs to divide their attention between. For example, given the question “How much value has this organization created?”, a predictor might spend their time comparing the organization to others in its reference class, or they might spend time modeling the judges and whether they tend to give numbers that are higher or lower.
Evaluation consistency is a way of reducing the amount of resources that you need to spend modeling the judges, by providing a standard that you can calibrate against. But there are other ways of achieving the same effect. For example, if you have people predict the ratio of value produced between two organizations, then if the judges consistently score high or consistently score low, it no longer matters, since it affects both organizations equally.
Yep, good points. Ideally one could do a proper or even estimated error analysis of some kind.
Having good units (like, ratios) seems pretty important.
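As a toy illustration of why ratios help, suppose the judges’ bias acts as a roughly common multiplicative factor; the organizations and numbers are invented.

```python
# Hypothetical "true" value created by two organizations, in arbitrary units.
true_a, true_b = 120.0, 60.0

# Model a panel's shared bias as a multiplicative factor: some panels
# score everything generously, others harshly.
for panel_bias in [0.5, 1.0, 2.0]:
    judged_a = true_a * panel_bias
    judged_b = true_b * panel_bias
    # The absolute scores move with the panel; the ratio does not.
    print(f"bias={panel_bias}: A={judged_a:.0f}, B={judged_b:.0f}, "
          f"A/B={judged_a / judged_b:.2f}")
```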
> “What is the relative effectiveness of AI safety research vs. bio risk research?”
If you had a precise definition of “effectiveness”, this shouldn’t be a problem. E.g. if you had predictions for “will humans go extinct in the next 100 years?”, “will we go extinct in the next 100 years if we invest 1M into AI risk research?”, and “will we go extinct if we invest 1M into bio risk research?”, then you should be able to make decisions with that. And these questions should work fine on existing forecasting platforms. Their long-term and conditional nature are problems, of course, but I don’t think that can be helped.
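As a toy illustration of how those three forecasts would feed a decision (all probabilities below are invented):

```python
# Hypothetical forecasts: probability of human extinction within 100 years.
p_baseline = 0.10000   # unconditional
p_if_ai = 0.09990      # conditional on investing 1M in AI risk research
p_if_bio = 0.09995     # conditional on investing 1M in bio risk research

# Risk reduction bought by each grant, according to the forecasts.
ai_reduction = p_baseline - p_if_ai
bio_reduction = p_baseline - p_if_bio

# On this narrow criterion, the grant that buys more risk reduction per
# 1M is the more "effective" one.
print(f"AI grant:  {ai_reduction:.5f} absolute risk reduction")
print(f"Bio grant: {bio_reduction:.5f} absolute risk reduction")
```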
> “How much value has this organization created?”
That’s not a forecast. But if you asked “How much value will this organization create next year?”, along with a clear measure of “value”, then again, I don’t see much of a problem. And although clearly defining value can be tedious (and prone to errors), I don’t think that problem can be avoided. Different people value different things; that can’t be helped.
> One attempted solution would be to have an “expert panel” assess these questions
Why would you do that? What’s wrong with the usual prediction markets? Of course, they’re expensive (they require many participants), but I don’t think a group of experts can be made to work well without a market-like mechanism. Is your project about making such markets more efficient?
> If you had a precise definition of “effectiveness”, this shouldn’t be a problem.
Coming up with a precise definition is difficult, especially if you want multiple groups to agree. Those specific questions are relatively low-level; I think we should ask a bunch of questions like that, but I think we may also want some vaguer things as well.
For example, say I wanted to know how good/enjoyable a specific movie would be. Predicting the ratings according to movie reviewers (evaluators) is an approach I’d regard as reasonable. I’m not sure what a precise definition for movie quality would look like (though I would be interested in proposals), but am generally happy enough with movie reviews for what I’m looking for.
> “How much value has this organization created?”
Agreed that that itself isn’t a forecast; I meant the more general case, for questions like “How much value will this organization create next year?” (as you pointed out). I probably should have used that more specific example, apologies.
> And although clearly defining value can be tedious (and prone to errors), I don’t think that problem can be avoided.
Can you be more explicit about your definition of “clearly”? I’d imagine that almost any proposed value function would have some vagueness. Certificates of Impact get around this by just leaving that to the review of some eventual judges, which is kind of similar to what I’m proposing.
> Why would you do that? What’s wrong with the usual prediction markets?
The goal of this research isn’t to fix something about prediction markets, but to find more useful things for them to predict. If we had expert panels that agreed to evaluate things in the future (for instance, being responsible for deciding on “the value organization X has created” in 2025), then prediction markets and similar mechanisms could predict what they would say.
> For example, say I wanted to know how good/enjoyable a specific movie would be.
My point is that “goodness” is not a thing in the territory. At best it is a label for a set of specific measures (ratings, revenue, awards, etc.). In that case, why not just work with those specific measures? Vague questions have the benefit of being short and easy to remember, but beyond that I see only problems. Motivated agents will do their best to interpret the vagueness in a way that suits them.
Is your goal to find a method to generate specific interpretations and procedures of measurement for vague properties like this one? Like a Schelling point for formalizing language? Why do you feel that can be done in a useful way? I’m asking for an intuition pump.
> Can you be more explicit about your definition of “clearly”?
Certainly there is some vagueness, but it seems that we manage to live with it. I’m not proposing anything that prediction markets aren’t already doing.
Hm… At this point I don’t feel like I have a good intuition for what you find intuitive. I could give more examples, but don’t expect they would convince you much right now if the others haven’t helped.
I plan to write more about this eventually, and hopefully we’ll have working examples up (where people are predicting things). Things should make more sense to you then.
Short back-and-forth comments are a pretty messy communication medium for such work.
There’s something of a problem with sensitivity; if the x-risk from AI is ~0.1, and the difference in x-risk from some grant is ~10^-6, then any difference in the forecasts is going to be completely swamped by noise.
(while people in the market could fix any inconsistency between the predictions, they would only be able to look forward to 0.001% returns over the next century)
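A back-of-the-envelope version of this sensitivity point, using the ~0.1 and ~10^-6 figures above; the per-forecast noise level is an assumption.

```python
p_baseline = 0.1   # x-risk estimate without the grant
effect = 1e-6      # risk reduction attributable to the grant
noise_sd = 0.01    # assumed standard deviation of a single forecast

# To separate the two conditional forecasts at roughly 2 sigma, the
# standard error of their difference must fall below effect / 2; with
# independent estimates of standard deviation noise_sd in each arm,
# that requires about 8 * (noise_sd / effect)**2 samples per arm.
n_needed = 8 * (noise_sd / effect) ** 2
print(f"forecasts needed per condition: {n_needed:,.0f}")

# The market-return framing: correcting a mispricing of size `effect`
# on a probability near 0.1 yields a relative gain of roughly
# effect / p_baseline of the stake.
print(f"relative return: {effect / p_baseline:.4%}")
```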
Making long term predictions is hard. That’s a fundamental problem. Having proxies can be convenient, but it’s not going to tell you anything you don’t already know.
Yeah, in cases like these, having intermediate metrics seems pretty essential.