This is why proper scoring rules are important. As long as you use proper scoring rules, and proper combinations of them, people will be incentivized to predict according to their own beliefs. If we assume that users can't make multiple accounts, and are paid in proportion to their performance according to proper scoring rules, then they shouldn't be able to gain expected earnings by providing overconfident answers.
The log-scoring function we use is a proper scoring rule. The potential winnings if you do a great job are very capped due to this scoring rule.
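To make the "proper" property concrete, here is a small sketch (not the experiment's actual code; the grid search is just for illustration) showing that under log scoring, a forecaster's expected score is maximized by reporting their true belief:

```python
import math

def log_score(report, outcome):
    # Log score for a binary event: log of the probability assigned
    # to whichever outcome actually happened.
    return math.log(report if outcome else 1.0 - report)

def expected_score(belief, report):
    # Expected log score when your true belief is `belief` but you
    # report `report` instead.
    return belief * log_score(report, True) + (1 - belief) * log_score(report, False)

belief = 0.7
reports = [i / 100 for i in range(1, 100)]  # candidate reports 0.01 .. 0.99
best = max(reports, key=lambda r: expected_score(belief, r))
print(best)  # 0.7 — honest reporting maximizes expected score
```

Reporting anything other than 0.7 (e.g. an overconfident 0.95) strictly lowers the expected score, which is exactly the incentive property being relied on.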
In this specific experiment we had some trust in the participants and no obviously fake accounts. If we scaled this, fake accounts would be an issue, but there are ways around it. I would also imagine that a more robust system would look something like having users begin with little "trust", which they would then build up by providing good forecasts. They would only begin being paid once they passed some threshold of trust; but within that level the proper scoring rules should generally create reasonable incentives.
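A rough sketch of that trust-gating idea (the class, threshold, and numbers here are all made up for illustration, not anything that was implemented): users accumulate trust from well-scored forecasts, and payouts only begin once trust passes a threshold.

```python
TRUST_THRESHOLD = 10.0  # illustrative cutoff before payouts begin

class Forecaster:
    def __init__(self):
        self.trust = 0.0

    def record_forecast(self, score):
        # Good (positive, shifted) scores build trust; trust never
        # drops below zero, so throwaway accounts simply fail to
        # unlock payouts rather than going into debt.
        self.trust = max(self.trust + score, 0.0)

    def eligible_for_payment(self):
        return self.trust >= TRUST_THRESHOLD

u = Forecaster()
for s in [3.0, 4.0, 5.0]:  # three decently scored forecasts
    u.record_forecast(s)
print(u.eligible_for_payment())  # True once trust reaches 12.0
```

The point of the gate is that creating a fresh fake account buys you nothing: it starts below the threshold and earns no money until it has demonstrated a track record.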
I have four concerns even given that you’re using a proper scoring rule, which relate to the link between that scoring rule and actually giving people money. I’m not particularly well-informed on this though, so could be totally wrong.
1. To implement some proper scoring rules, you need the ability to confiscate money from people who predict badly. Even when the score always has the same sign, like you have with log-scoring (or when you add a constant to a quadratic scoring system), if you don’t confiscate money for bad predictions, then you’re basically just giving money to people for signing up, which makes having an open platform tricky.
2. Even if you restrict signups, you get an analogous problem within a fixed population who’s already signed up: the incentives will be skewed when it comes to choosing which questions to answer. In particular, if people expect to get positive amounts of money for answering randomly, they’ll do so even when they have no relevant information, adding a lot of noise.
3. If a scoring rule is “very capped”, as the log-scoring function is, then the expected reward from answering randomly may be very close to the expected reward from putting in a lot of effort, and so people would be incentivised to answer randomly and spend their time on other things.
4. Relatedly, people’s utilities aren’t linear in money, so once that’s taken into account the score function might not remain proper. But I don’t think this would be a big effect at the scales this is likely to operate on.
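Concerns 1 and 2 can be illustrated numerically. Suppose losses are capped by flooring the log score and shifting it to be non-negative (a hypothetical variant, chosen just to show the effect): then even a forecaster with zero information earns a positive expected payout by answering 50/50 on every question.

```python
import math

# Hypothetical capped/shifted variant: clamp the raw log score at a
# floor, then shift so every score is non-negative.
FLOOR = math.log(0.01)  # worst case treated as a 1% prediction

def shifted_score(report, outcome):
    raw = math.log(report if outcome else 1.0 - report)
    return max(raw, FLOOR) - FLOOR

# A forecaster with no information reports 0.5 on a fair-coin question.
p_true = 0.5
expected = p_true * shifted_score(0.5, True) + (1 - p_true) * shifted_score(0.5, False)
print(expected > 0)  # True: answering blindly still earns money in expectation
```

So as long as scores are shifted to avoid confiscating money, answering randomly has positive expected value, which is precisely what makes open signups and question selection tricky.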
The fact that we use a “proper scoring rule” definitely doesn’t mean that the entire system, including the participants’ true utility functions, is really “proper”. There is a fair bit of impropriety. For instance, people may also care about their online reputation, and that won’t be captured in the proper scoring rule. The proper scoring rule helps make sure that one specific aspect of the system is “proper” according to a simplified model. This is definitely subideal, but I think it’s still good enough for a lot of things. I’m not sure what type of system would be “perfectly proper”.
Prediction markets have their own disadvantages, since participants don’t behave as perfectly rational agents there either. So I won’t claim that the system is “perfectly aligned”, but I will suggest that it seems “decently aligned” compared to the alternatives, with the ability to improve as we (or others with other systems) add further complexity.
> If you don’t confiscate money for bad predictions, then you’re basically just giving money to people for signing up, which makes having an open platform tricky.
What was done in this case was that participants were paid a fixed fee for participating, plus a second, larger “bonus” that was paid in proportion to their performance according to the scoring rule. This works in experimental settings where we can filter the participants. It would definitely be more work to make the system totally openly available, especially as the prizes increase in value, much for the reason you describe. We’re working to figure out solutions that could hold up (somewhat) in those circumstances, but it is tricky, for the reasons you suggest and others.
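A minimal sketch of that fee-plus-bonus structure (all amounts, names, and the score floor are invented for illustration; this is not the experiment's actual payout code): everyone receives a base fee, and a bonus pool is split in proportion to floored, shifted log scores.

```python
import math

BASE_FEE = 20.0                 # flat participation payment (illustrative)
BONUS_POOL = 80.0               # extra money split in proportion to scores
SCORE_FLOOR = math.log(0.001)   # clamp so one terrible answer can't dominate

def total_payment(scores_by_user):
    # Pay each user the base fee plus a share of the bonus pool
    # proportional to their (floored, shifted-positive) total log score.
    shifted = {u: sum(max(s, SCORE_FLOOR) - SCORE_FLOOR for s in ss)
               for u, ss in scores_by_user.items()}
    total = sum(shifted.values()) or 1.0
    return {u: BASE_FEE + BONUS_POOL * sh / total for u, sh in shifted.items()}

payments = total_payment({
    "alice": [math.log(0.9), math.log(0.8)],   # sharp, accurate forecasts
    "bob":   [math.log(0.5), math.log(0.5)],   # uninformative forecasts
})
print(payments["alice"] > payments["bob"])  # True: better forecasts earn more
```

The base fee makes participation worthwhile regardless of outcome, while the proportional bonus preserves the proper-scoring incentive at the margin; the total payout is fixed at the fee total plus the pool.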
I’d also point out that having a nice scoring system is one challenge out of many challenges. Having nice probability distribution viewers and editors is difficult. Writing good questions and organizing them, and having software that does this well, is also difficult. This is something that @jacobjacob has been spending a decent amount of time thinking about after this experiment, but I’ve personally been focusing on other aspects.
At least in this experiment, the scoring system didn’t seem like a big bottleneck. The participants who won the most money were generally those who seemed to have given thoughtful and useful probability distributions. Things are much easier when you have an audience who is generally acting in good faith and who can be excluded from future rounds if it seems appropriate.
Cool, thanks for those clarifications :) In case it didn’t come through from the previous comments, I wanted to make clear that this seems like exciting work and I’m looking forward to hearing how follow-ups go.
Thanks! I really do appreciate the thoughts & feedback in general, and am quite happy to answer questions. There’s a whole lot we haven’t written up yet, and it’s much easier for me to reply to things than lay everything out.