I had a pretty visceral negative response to this, and it took me a bit to figure out why.
What I’m moving towards with ESCs (epistemic spot checks) is “no gods, no proxies.” It’s about digging in deeply to get to the truth. Throwing a million variables at a wall to see what sticks seems… dissociated? It’s a search for things you do instead of digging for information you evaluate yourself.
“No Gods, No Proxies, Just Digging For Truth” is a good tagline for your blog.
A “spot check” of a few of a book’s claims is supposed to be a proxy for the accuracy of the rest of the claims, right?
Of course there are issues to work through. For example, you’d probably want to have a training set and a test set, like people always do in machine learning, to see if it’s just “what sticks” or whether you’ve actually found a signal that generalizes. And if you published your reasoning then people might game whatever indicators you discovered. (Should still work for older books though.) You might also find that most of the variability in accuracy is per-book rather than per-author or anything like that. (Alternatively, you might find that a book’s accuracy can be predicted better based on external characteristics than by doing a few spot checks, if individual spot checks are comparatively noisy.) But the potential upside is much larger because it could help you save time deciding what to read on any subject.
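To make the training set / test set point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the data is synthetic random noise rather than real books, and the names (`pearson`, `best_feature_on_train`, the `noise_k` features) are hypothetical. It picks whichever feature correlates best with accuracy on the training books, then checks that same feature on held-out books.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def best_feature_on_train(features, accuracy, train_idx, test_idx):
    """Pick the feature most correlated with accuracy on the training books,
    then report its correlation on both the training and held-out books."""
    def corr(name, idx):
        return pearson([features[name][i] for i in idx],
                       [accuracy[i] for i in idx])

    best = max(features, key=lambda name: abs(corr(name, train_idx)))
    return best, corr(best, train_idx), corr(best, test_idx)


# Synthetic stand-in data (an assumption, not real measurements):
# 60 "books" with a random accuracy score and 200 features of pure noise.
rng = random.Random(0)
n_books = 60
accuracy = [rng.random() for _ in range(n_books)]
features = {
    f"noise_{k}": [rng.random() for _ in range(n_books)] for k in range(200)
}

# Hold out a third of the books before looking at any correlations.
idx = list(range(n_books))
rng.shuffle(idx)
train_idx, test_idx = idx[:40], idx[40:]

name, train_r, test_r = best_feature_on_train(
    features, accuracy, train_idx, test_idx
)
print(name, round(train_r, 2), round(test_r, 2))
```

Since every feature here is noise, the winner on the training books is just “what stuck”, and its held-out correlation will typically be much weaker. A feature with a real relationship to accuracy should hold up on both halves.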
Anyway, just an idea.
What’s the difference between John’s suggestion and amplifying ESCs with prediction markets? (not rhetorical)
I don’t immediately see how they’re related. Are you thinking people participating in the markets are answering based on proxies rather than truly relevant information?
I’m thinking that if there were liquid prediction markets for amplifying ESCs, people could code bots to do exactly what John suggests and potentially make money. This suggests to me that there’s no principled difference between the two ideas, though I could be missing something (maybe you think the bot is unlikely to beat the market?)
I think I’d feel differently about John’s list if it contained things that weren’t goodhartable, such as… I don’t know, most things are goodhartable. For example, citation density does probably have an impact (not just a correlation) on credence score. But giving truth or credibility points for citations is extremely gameable. A score based on citation density is worthless as soon as it becomes popular because people will do what they would have anyway and throw some citations in on top. Popular authors may not even have to do that themselves. The difference between what John suggested and a prediction market with a citation-count bot is that if that gaming starts to happen, the citation-count bot will begin failing (which is an extremely useful signal, so I’d be happy to have a citation-count bot participating).
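One way to picture “the bot will begin failing” is a toy sketch like the one below. It is only an illustration under stated assumptions: the forecasts and outcomes are made-up toy numbers, the Brier score stands in for whatever a real market would use (a real bot would simply lose money), and `brier`, `bot_is_failing`, `window`, and `tolerance` are hypothetical names and parameters.

```python
def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def bot_is_failing(forecasts, outcomes, window=4, tolerance=0.1):
    """Flag the bot when its most recent `window` forecasts score clearly
    worse than everything it predicted before that window."""
    if len(forecasts) < 2 * window:
        return False  # not enough history to compare against
    baseline = brier(forecasts[:-window], outcomes[:-window])
    recent = brier(forecasts[-window:], outcomes[-window:])
    return recent > baseline + tolerance


# Toy numbers (assumed): the bot's probability that a book passes a spot
# check, and whether the check actually passed (1) or failed (0).
honest_forecasts = [0.8, 0.2, 0.9, 0.3, 0.7, 0.1]
honest_outcomes = [1, 0, 1, 0, 1, 0]

# After authors start padding citations, the bot stays confident
# but the spot checks mostly fail.
gamed_forecasts = [0.9, 0.9, 0.8, 0.9]
gamed_outcomes = [0, 1, 0, 0]

print(bot_is_failing(honest_forecasts + gamed_forecasts,
                     honest_outcomes + gamed_outcomes))
```

The point of the sketch is that the bot is only useful as long as something independent, like an actual spot check, keeps resolving the questions it bets on.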
Put another way: in a soon-to-air podcast, an author described how reading epistemic spot checks gave them a little shoulder-Elizabeth when writing their own book, pushing them to be more accurate and more justified. That’s a fantastic outcome that I’m really proud of, although I’ll hold the real congratulations for after I read the book. I don’t think a book would be made better by giving the author a shoulder-citation bot, or even a shoulder-complex multivariable function. I suspect some of that is because epistemic spot checks are not a score, they’re a process, and demonstrating a process people can apply themselves, rather than a score they can optimize, leads to better epistemics.
A follow-up question is “would shoulder-prediction markets be as useful?” I think they could be, but that would depend on the prediction market being evaluated by something like the research I do, not a function like John suggests. The prediction markets involve multiple people doing and sometimes sharing research; Ozzie has talked about them as a tool for collaborative learning as opposed to competition (I’ve pinged him and he can say more on that if he likes).
Additionally, John’s suggested metrics are mostly correlated with traditional success in academia, and if I thought traditional academic success was a good predictor of truth I wouldn’t be doing all this work. That’s a testable hypothesis and tests of it might look something like what John suggests, but I would view it as “testing academia”, not “discovering useful metrics”.
This question has spurred some really interesting and useful thoughts for me. Thank you for asking it.
On them being for “collaborative learning”: the specific thing I was thinking was how good prediction systems should really encourage introspectability and knowledge externalities in order to be maximally cost-effective. I wrote a bit about this here