How are you deciding which books to do spot checks for? My instinct is to suggest finding some overarching question which seems important to investigate, so your project does double duty: exploring epistemic spot checks and answering a question which will materially impact the actions of you / people you’re capable of influencing. But you’re a better judge of whether that’s a good idea, of course.
It depends; that is in fact what I’m doing right now, and I’ve done it before, but sometimes I just follow my interests.
I see, interesting.
Here’s another crazy idea. Instead of trying to measure the reliability of specific books, try to figure out what predicts whether a book is reliable. You could do a single spot check for a lot of different books and then figure out what predicts the output of the spot check: whether the author has a PhD/tenure/what their h-index is, company that published the book, editor, length, citation density, quality of sources cited (e.g. # citations/journal prestige of typical paper citation), publication date, # authors, sales rank, amount of time the author spent on the book/how busy they seemed with other things during that time period, use of a ghostwriter, etc. You could code all those features and feed them into a logistic regression and see which were most predictive.
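A minimal sketch of that last step, assuming the features have already been hand-coded into a spreadsheet (Python with scikit-learn; the file name and every column name are hypothetical placeholders, not a real dataset):
```python
# Minimal sketch: predict spot-check outcomes from hand-coded book features.
# The file name and all column names here are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

books = pd.read_csv("spot_checked_books.csv")  # one row per spot-checked book

features = ["author_has_phd", "author_h_index", "citation_density",
            "publication_year", "num_authors", "sales_rank",
            "used_ghostwriter"]
X = StandardScaler().fit_transform(books[features])  # put features on one scale
y = books["passed_spot_check"]                       # 1 = the checked claims held up

model = LogisticRegression(max_iter=1000).fit(X, y)

# With standardized features, coefficient magnitudes give a rough sense of
# which features carry the most predictive weight (sign = direction).
for name, coef in sorted(zip(features, model.coef_[0]), key=lambda t: -abs(t[1])):
    print(f"{name:20s} {coef:+.2f}")
```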
I had a pretty visceral negative response to this, and it took me a bit to figure out why.
What I’m moving towards with ESCs is no gods no proxies. It’s about digging in deeply to get to the truth. Throwing a million variables at a wall to see what sticks seems… dissociated? It’s a search for things you can do instead of digging for information you evaluate yourself.
“No Gods, No Proxies, Just Digging For Truth” is a good tagline for your blog.
A “spot check” of a few of a book’s claims is supposed to be a proxy for the accuracy of the rest of the claims, right?
Of course there are issues to work through. For example, you’d probably want to have a training set and a test set like people always do in machine learning to see if it’s just “what sticks” or whether you’ve actually found a signal that generalizes. And if you published your reasoning then people might game whatever indicators you discovered. (Should still work for older books though.) You might also find that most of the variability in accuracy is per-book rather than per-author or anything like that. (Alternatively, you might find that a book’s accuracy can be predicted better based on external characteristics than doing a few spot checks, if individual spot checks are comparatively noisy.) But the potential upside is much larger because it could help you save time deciding what to read on any subject.
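To illustrate the training/test idea, a continuation of the hypothetical sketch above (still made-up data and column names): split the books into a training set and a held-out test set, and only trust indicators that still predict accuracy on books the model never saw.
```python
# Continuing the hypothetical sketch above: hold out a test set so that
# whatever "sticks" on the training books has to generalize to unseen books.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

books = pd.read_csv("spot_checked_books.csv")  # same hypothetical file as above
features = ["author_h_index", "citation_density", "publication_year"]
X, y = books[features], books["passed_spot_check"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))  # generalization check
```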
Anyway, just an idea.
What’s the difference between John’s suggestion and amplifying ESCs with prediction markets? (not rhetorical)
I don’t immediately see how they’re related. Are you thinking people participating in the markets are answering based on proxies rather than truly relevant information?
I’m thinking that if there were liquid prediction markets for amplifying ESCs, people could code bots to do exactly what John suggests and potentially make money. This suggests to me that there’s no principled difference between the two ideas, though I could be missing something (maybe you think the bot is unlikely to beat the market?)
I think I’d feel differently about John’s list if it contained things that weren’t goodhartable, such as… I don’t know, most things are goodhartable. For example, citation density probably does have an impact (not just a correlation) on credence score. But giving truth or credibility points for citations is extremely gameable. A score based on citation density is worthless as soon as it becomes popular, because people will do what they would have anyway and throw some citations in on top. Popular authors may not even have to do that themselves. The difference between what John suggested and a prediction market with a citation-count bot is that if that gaming starts to happen, the citation-count bot will begin failing (which is an extremely useful signal, so I’d be happy to have the citation-count bot participating).
Put another way: in a soon-to-air podcast, an author described how reading epistemic spot checks gave them a little shoulder-Elizabeth when writing their own book, pushing them to be more accurate and more justified. That’s a fantastic outcome that I’m really proud of, although I’ll hold the real congratulations for after I read the book. I don’t think a book would be made better by giving the author a shoulder-citation bot, or even a shoulder-complex multivariable function. I suspect some of that is because epistemic spot checks are not a score, they’re a process, and demonstrating a process people can apply themselves, rather than a score they can optimize, leads to better epistemics.
A follow-up question is “would shoulder-prediction markets be as useful?” I think they could be, but that would depend on the prediction market being evaluated by something like the research I do, not a function like John suggests. The prediction markets involve multiple people doing and sometimes sharing research; Ozzie has talked about them as a tool for collaborative learning as opposed to competition (I’ve pinged him and he can say more on that if he likes).
Additionally, John’s suggested metrics are mostly correlated with traditional success in academia, and if I thought traditional academic success was a good predictor of truth I wouldn’t be doing all this work. That’s a testable hypothesis and tests of it might look something like what John suggests, but I would view it as “testing academia”, not “discovering useful metrics”.
This question has spurred some really interesting and useful thoughts for me, thank you for asking it.
On their being for “collaborative learning”: the specific thing I was thinking of was how good prediction systems should really encourage introspectability and knowledge externalities in order to be maximally cost-effective. I wrote a bit about this here.
Just chiming in here:
I agree with Liam that amplifying ESCs with prediction markets would be a lot like John’s suggestion. I think an elegant approach would be something like setting up prediction markets, and then allowing users to set up their own data science pipelines as they see fit. My guess is that this would be essential if we wanted to predict a lot of books; say, 200 to 1 million books.
If predictors did a decent job at this, then I’d be excited on the whole for it to become known and for authors to try to perform better on it, because I believe it would reveal more signal than noise (as long as the prediction was done decently, for a vague definition of “decently”).
My guess is that a strong analysis would treat “number of citations” as a rather minor feature. If it became evident that authors were actively trying to munchkin[1] things, then predictors should pick up on that and introduce features for things like “known munchkiner”, which would make gaming quite difficult. The timescales for authors to update and write books seem much longer than the timescales for predictors to recognize what’s going on.
[1] I realize that “munchkining” is a pretty uncommon word, but I like it a lot, and it feels more relevant than powergaming. Please let me know if there’s a term you prefer. I think “Goodhart” is too generic, especially if things like “correlational Goodhart” count.