You’re thinking about this in terms of forecasting. This is not forecasting, this is historical studies.
Consider the hard sciences equivalent: you take, say, some geneticists and try to figure out whether their estimates of which genes cause what are any good by asking them questions about quantum physics to “check how they are calibrated”.
You’re thinking about this in terms of forecasting.
No. Bayesian estimate calibration is most often used in forecasting, but it’s effective in any domain in which there’s uncertainty, including hard sciences. In fact, calibration training is often done with either numerical trivia, using 90% credible intervals, or with true-or-false questions using a single percentage estimate. I recommend checking out “How to Measure Anything” for a more in-depth treatment.
Consider the hard sciences equivalent: you take, say, some geneticists and try to figure out whether their estimates of which genes cause what are any good by asking them questions about quantum physics to “check how they are calibrated”.
Yes, that’s essentially how it works, except that you then give them feedback to see if they’re over- or underconfident. They’d have to be relatively easy questions though; otherwise all the estimates would cluster around fifty percent and it wouldn’t be very useful training for high-resolution answers.
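As a rough illustration of the feedback step described above, here is a minimal sketch for the true-or-false style of calibration quiz. The function name `calibration_table` and the quiz results are assumptions made up for the example; they do not come from any particular training program.

```python
from collections import defaultdict


def calibration_table(answers):
    """Group true/false answers by stated confidence.

    answers: list of (stated_probability, was_correct) pairs.
    Returns {stated_probability: (number_of_answers, observed_hit_rate)}.
    """
    buckets = defaultdict(list)
    for stated, was_correct in answers:
        # Round to one decimal so 0.9 and 0.90001 share a bucket.
        buckets[round(stated, 1)].append(1 if was_correct else 0)
    return {
        stated: (len(hits), sum(hits) / len(hits))
        for stated, hits in sorted(buckets.items())
    }


# Hypothetical quiz results, invented for illustration only.
answers = (
    [(0.9, True)] * 6 + [(0.9, False)] * 4
    + [(0.6, True)] * 3 + [(0.6, False)] * 2
)
table = calibration_table(answers)

for stated, (count, hit_rate) in table.items():
    # Positive gap: claimed more confidence than the hit rate supports.
    gap = stated - hit_rate
    print(f"said {stated:.0%}, right {hit_rate:.0%} of {count} (gap {gap:+.0%})")
```

In this made-up data the person is right only 60% of the time on answers they rated at 90%, which is the kind of overconfidence the feedback is meant to expose.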
it’s effective in any domain in which there’s uncertainty, including hard sciences
Citation needed.
Not all uncertainty is created equal. If uncertainty comes from e.g. measurement limitations, the Bayesian calibration is useless.
Note that science is mostly about creating results that can be replicated by anyone regardless of how well or badly calibrated they are.
Yes, that’s essentially how it works
That’s how you imagine it to work, since I don’t expect anyone to actually be doing this. But let’s see, assume we have successfully run the calibration exercises with our group of geneticists. What do you expect them to change in their studies of which genes do what? We can get even more specific, let’s say we’re talking about one of the twin studies where the author tracked a set of twins, tested them on some phenotype feature X, and is reporting the results that the twins correlate Y% while otherwise similar general population is correlated Z%. What results would better calibration affect?
That was an overconfident statement, but for more on how Calibration is useful in places other than Forecasting, check out “How to Measure Anything” as mentioned in the last comment.
But let’s see, assume we have successfully run the calibration exercises with our group of geneticists. What do you expect them to change in their studies of which genes do what? We can get even more specific, let’s say we’re talking about one of the twin studies where the author tracked a set of twins, tested them on some phenotype feature X, and is reporting the results that the twins correlate Y% while otherwise similar general population is correlated Z%. What results would better calibration affect?
Once calibrated, they can make estimates on how sure they are of certain hypotheses, and of how likely treatments based on those hypotheses would lead to lives saved. This in turn can allow them to quantify what experiment to run next using value of information calculations.
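To make “value of information calculations” concrete, here is a minimal sketch of the expected value of perfect information for a single yes/no hypothesis. The probability, the two actions, and the payoffs (in arbitrary units such as lives saved) are all hypothetical, and a real analysis would usually involve imperfect information and more than two states.

```python
def expected_value_of_perfect_information(p_true, payoffs):
    """Expected gain from learning for certain whether the hypothesis holds.

    p_true: calibrated probability that the hypothesis is true.
    payoffs: {action: (payoff_if_true, payoff_if_false)}.
    """
    def expected(payoff):
        return p_true * payoff[0] + (1 - p_true) * payoff[1]

    # Best we can do by committing to one action now.
    best_without_info = max(expected(payoff) for payoff in payoffs.values())

    # Best we can do if we learn the truth first, then choose.
    best_if_true = max(payoff[0] for payoff in payoffs.values())
    best_if_false = max(payoff[1] for payoff in payoffs.values())
    best_with_info = p_true * best_if_true + (1 - p_true) * best_if_false

    return best_with_info - best_without_info


# Hypothetical numbers, for illustration only.
payoffs = {
    "develop_treatment": (1000, -200),
    "do_nothing": (0, 0),
}
evpi = expected_value_of_perfect_information(0.3, payoffs)
print(f"An experiment that settled the question is worth up to {evpi:.0f} units")
```

The result is an upper bound on what a decisive experiment would be worth, which is one way to rank candidate experiments against their cost.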
Furthermore, by taking a survey of many of these calibrated genetic experts then extremizing their results, you can get an idea of how likely certain hypotheses are to turn out being correct.
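One common way to extremize, sketched below, is to average the experts’ estimates in log-odds space and then multiply the result by a factor greater than 1, which pushes the aggregate away from 50%. The factor of 2.0 and the survey answers are assumptions chosen for illustration; in practice the factor would be tuned on past data.

```python
import math


def extremized_aggregate(probabilities, factor=2.0):
    """Pool probabilities in log-odds space, then push away from 0.5.

    factor=1.0 gives the plain log-odds average; factor > 1 extremizes.
    Inputs must be strictly between 0 and 1.
    """
    log_odds = [math.log(p / (1 - p)) for p in probabilities]
    mean_log_odds = sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-factor * mean_log_odds))


# Hypothetical survey: three experts each say 70%.
survey = [0.7, 0.7, 0.7]
pooled = extremized_aggregate(survey, factor=1.0)
extremized = extremized_aggregate(survey, factor=2.0)
print(f"pooled: {pooled:.3f}, extremized: {extremized:.3f}")
```

The usual rationale is that each expert holds only part of the evidence, so agreement among them justifies more confidence than any one of them states.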
Once calibrated, they can make estimates on how sure they are of certain hypotheses
I don’t know if you read scientific papers, but they don’t “make estimates on how sure they are of certain hypotheses”. They present the data and talk about the conclusions and implications that follow from the data presented. The potential hypotheses are evaluated on the basis of data, not on the basis of how well-calibrated a particular researcher feels.
Calibration is good for guesstimates, it’s not particularly valuable for actual research.
how likely treatments based on those hypotheses would lead to lives saved …
That’s forecasting. Remember, we’re not talking about forecasting.
I don’t know if you read scientific papers, but they don’t “make estimates on how sure they are of certain hypotheses”. They present the data and talk about the conclusions and implications that follow from the data presented. The potential hypotheses are evaluated on the basis of data, not on the basis of how well-calibrated a particular researcher feels.
I’m not really sure how to answer this because I think you misunderstand calibration.
Science moves forward through something called scientific consensus. How does scientific consensus work right now? Well, we just kind of use guesswork. Expert calibration is a more useful way to understand what the scientific consensus actually is.
That’s forecasting. Remember, we’re not talking about forecasting.
No, it’s a decision model. The decision model uses a forecast “How many lives can be saved”, but it also uses calibration of known data “Based on the data you have, how sure are you that this particular fact is true”.
Science moves forward through something called scientific consensus.
No. This is absolutely false. Science moves forward through being able to figure out better and better how reality works. Consensus is really irrelevant to the process. The ultimate arbiter is reality regardless of what a collection of people with advanced degrees can agree on.
The decision model uses a forecast “How many lives can be saved”, but it also uses calibration of known data “Based on the data you have, how sure are you that this particular fact is true”.
That has nothing to do with calibration. “How many lives can be saved” is properly called a point forecast which provides an estimate of the center of the distribution. These are very popular but also limited because a much more useful forecast would come with an expected error and, ideally, would specify the shape of the distribution as well.
“Based on the data you have, how sure are you that this particular fact is true” is properly a question about the standard error of the estimate and it has nothing to do with subjective beliefs (well-calibrated or not) of the author.
I only care about someone’s calibration if I’m asking him to guess. If the answer is “based on the data”, it is based on the data and calibration is irrelevant.
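For reference, the “standard error of the estimate” mentioned above can be computed directly from the data, with no subjective input. A minimal sketch for the simplest case, the standard error of a sample mean, using made-up measurements:

```python
import math
import statistics


def mean_with_standard_error(sample):
    """Return (point estimate, standard error) for the mean of a sample."""
    mean = statistics.mean(sample)
    # Sample standard deviation (n - 1 in the denominator).
    spread = statistics.stdev(sample)
    return mean, spread / math.sqrt(len(sample))


# Hypothetical measurements, for illustration only.
measurements = [2, 4, 4, 4, 5, 5, 7, 9]
estimate, std_error = mean_with_standard_error(measurements)
print(f"estimate {estimate:.2f} +/- {std_error:.2f} (one standard error)")
```

This covers only sampling error within the stated model; it says nothing about whether the model itself is right.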
No. This is absolutely false. Science moves forward through being able to figure out better and better how reality works.
While this is completely true, and consensus only plays a minor role in science, it’s not true that consensus is irrelevant. Given no other information about a certain hypothesis other than that the majority of scientists believe it to be true, the rational course of action would be to adjust belief in the hypothesis upward. Of course, evidence contradicting the hypothesis would nullify this consensus effect. Even a small amount of evidence trumps a large consensus.
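The adjustment described above can be sketched with Bayes’ rule in odds form. The likelihood ratios below are hypothetical: they assume that majority agreement is 3 times more likely if the hypothesis is true, and that a contradicting experimental result is 20 times more likely if it is false.

```python
def update(prior, likelihood_ratio):
    """Bayes' rule in odds form.

    likelihood_ratio = P(evidence | hypothesis) / P(evidence | not hypothesis).
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Hypothetical numbers, for illustration only.
prior = 0.5
after_consensus = update(prior, 3.0)                  # weak evidence in favour
after_experiment = update(after_consensus, 1 / 20.0)  # strong evidence against

print(f"after consensus: {after_consensus:.2f}")
print(f"after contradicting experiment: {after_experiment:.2f}")
```

Under these assumed ratios a single contradicting experiment more than cancels the consensus, which matches the claim that direct evidence outweighs agreement.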
No. This is absolutely false. Science moves forward through being able to figure out better and better how reality works. Consensus is really irrelevant to the process. The ultimate arbiter is reality regardless of what a collection of people with advanced degrees can agree on.
No, that’s the popular conception of science, but unfortunately it’s not an oracle that proves reality true or false. What observation and experiments give us are varying levels of evidence that can falsify some hypotheses and point towards the truth of other hypotheses. We then use human reasoning to put all this evidence together and let humans decide how sure they are of something. If they have lots and lots of evidence that thing can become a “theory” based on the consensus that there’s quite a lot of it and it’s really good, and even more evidence that’s even better makes that thing a “law”. But it’s based on a subjective sense of “how good these data are.”
“Based on the data you have, how sure are you that this particular fact is true” is properly a question about the standard error of the estimate and it has nothing to do with subjective beliefs (well-calibrated or not) of the author.
Not quite. It also has to do with all the other previous experiments done, your certainty in the model itself, your ideas about how reality works, and a lot of other things.
That has nothing to do with calibration. “How many lives can be saved” is properly called a point forecast which provides an estimate of the center of the distribution. These are very popular but also limited because a much more useful forecast would come with an expected error and, ideally, would specify the shape of the distribution as well.
Yes, ideally this would be a credible interval with an estimated distribution, but even a credible interval assuming a uniform distribution would be very useful for this purpose.
In terms of calibration: if someone gives credible intervals with 90% confidence, then the better calibrated they are, the more sure you can be that if they make 100 such estimates, around 90 of the true values will lie within the intervals they gave.
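The check described above can be written down directly: count how often the true value landed inside the stated 90% interval, then ask how surprising that count would be for a perfectly calibrated estimator. This is a minimal sketch with invented intervals; `binomial_tail_at_most` is a helper defined here, not a library function.

```python
import math


def interval_hit_rate(estimates):
    """estimates: list of (low, high, true_value). Returns fraction of hits."""
    hits = sum(1 for low, high, truth in estimates if low <= truth <= high)
    return hits / len(estimates)


def binomial_tail_at_most(hits, trials, p=0.9):
    """P(X <= hits) if each interval independently contains the truth with probability p."""
    return sum(
        math.comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(hits + 1)
    )


# Hypothetical results: 10 stated 90% intervals, 6 of which contain the truth.
estimates = [(0, 10, 5)] * 6 + [(0, 10, 15)] * 4
hit_rate = interval_hit_rate(estimates)
tail = binomial_tail_at_most(6, 10)

print(f"hit rate: {hit_rate:.0%}")
print(f"chance of 6 or fewer hits if truly calibrated at 90%: {tail:.3f}")
```

A small tail probability suggests the intervals are too narrow, that is, the estimator is overconfident.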
I only care about someone’s calibration if I’m asking him to guess. If the answer is “based on the data”, it is based on the data and calibration is irrelevant.
Well-calibrated people will base their guesses on data; poorly calibrated people will not. Your understanding of calibration isn’t in line with research done by Douglas Hubbard, Philip Tetlock, and others who research human judgement.
Heh. Do you mean that’s a conception of science held by not-too-smart uneducated people? X-)
an oracle that proves reality true or false
Sense make not. Reality is always true.
Speaking generally, you seem to treat science as people asserting certain things and so, to decide on how much to trust them, you need to know how calibrated those people are. That seems very different from my perception of science which is based on people saying “This is so, you can test it yourself if you want”.
Under your approach, the goal is achieving consensus. Under my system, the goal is to provide replicability and show that it actually works.
Data does not depend on calibration of particular people.
This is so, you can test it yourself if you want
Under your approach, the goal is achieving consensus. Under my system, the goal is to provide replicability and show that it actually works.
I think we have to separate two ideas here.
There’s the data you get from an experiment
There’s the conclusions you can draw from that data.
I would agree that the data does not depend on the calibration of particular people. But the conclusions you get from that data DO need to be calibrated. Furthermore, other scientists may want to do experiments based on those conclusions… their decision to do that will really be based on how likely they think the conclusions are accurate. The process of science is building new conclusions on the basis of those old conclusions—if it’s just about gathering the data, you never gain a deeper understanding of reality.
There’s the conclusions you can draw from that data.
In the word “conclusions” you conflate two different things which I wish to keep separate.
One of them is subjective opinion/guesstimate/evaluation/conclusion of a person. I agree that the calibration of the person whose opinion we care about is relevant.
The other is objective facts/observations/measurements/conclusions that do not depend on anyone in particular. That’s not just “data” from your first point. That’s also conclusions that follow from the data in an explicit, non-subjective way. A study can perfectly well come to some conclusions by showing how the data leads to them without depending on anyone’s calibration.
The answer to doubts about the first kind of conclusions is “trust me because I know what I’m talking about”. The answer to doubts about the second kind of conclusions is “you don’t have to trust me, see for yourself”.
The process of science is building new conclusions on the basis of those old conclusions
I continue to disagree. In your concept of science the idea of testing against reality is somewhere in the back row. What’s important is achieving consensus and being well-calibrated. I don’t think this is what science is about.
In your concept of science the idea of testing against reality is somewhere in the back row. What’s important is achieving consensus and being well-calibrated. I don’t think this is what science is about.
Let’s stop using the word “science” because I don’t really care how we define that specific word.
Let’s change it instead to “the process of learning things about reality” because that’s what I’m talking about. I think it’s what you’re talking about as well, but traditionally science can also mean “the process of running experiments”—and if we defined it that way, then I’d agree that calibration isn’t needed.
The other is objective facts/observations/measurements/conclusions that do not depend on anyone in particular. That’s not just “data” from your first point. That’s also conclusions that follow from the data in an explicit, non-subjective way.
I can’t think of an example where conclusions are proven true from data in a specific, non-subjective way. Science works on falsification—you can prove things false in a specific, non-subjective way (assuming you trust completely in the protocol and the people running it), but you can’t prove things true, because there’s still ANOTHER experiment someone could run in different conditions that could theoretically falsify your current hypothesis. Furthermore, you may get the correlation right, but may misunderstand the causation.
Don’t get too caught up on this example, because it’s just a silly illustration of a general point, but say you made a hypothesis that “An object falling due to gravity accelerates at a rate of 9.8 meters/second squared”. You could run many experiments with data that fit your hypothesis, but it’s always possible that an alternative hypothesis is the true one, such as “Objects accelerate at 9.8 meters/second squared—except on Tuesdays when it’s a full moon”. Unless you had specifically tested that scenario, that hypothesis has some infinitesimal chance of being right—and the thing is, there’s no way to test ALL of the potential scenarios.
That’s where calibration comes in—you don’t have certainty that objects accelerate at that rate due to gravity in every situation, but as you prove it in more and more situations, you (and the scientific community) become more and more certain that it’s the correct hypothesis. But even then, someone like Einstein can come along, find some random edge case involving the speed of light where the hypothesis doesn’t hold, and present a better one.
We just had different goal posts. You learned science as “running an experiment”—I learned science as “Doing background research, determining likely outcomes, running experiments, sharing results back with the community”. That’s why I tabooed the word, to make sure we were on the same page.
Are we in agreement about the basic concept, if we agree that we have two different definitions of science?
Do tell. Where and how did you “learn science” this way?
Throughout elementary and middle school (early education here in the US) through textbooks with diagrams like this
What is the “basic concept”?
That experiments can give you mostly non-subjective data about one experiment, but to draw broader conclusions about how the world works you have to combine the data from many experiments into a subjective estimate about how likely a hypothesis is.
That does not strike me as an adequate basis for deciding what science is or is not.
Words mean different things to different people… as I said, I’m not interested in arguing over the “proper” definition of this word. I’m interested in clarifying the process through which experiments lead to new knowledge about the world. You can call this process “not science” and I won’t argue—it’s not an interesting argument to me.
So, are you saying that the outcome of science is a set of subjective estimates that most people agree with?
I’m not sure… what do you mean by “the outcome of science?”
That seems very different from my perception of science
Aren’t both these views of science oversimplifications? I mean, in practice most of the people making use of the work scientists have done aren’t really testing the scientists’ work for themselves (they’re kinda doing it implicitly by making use of that work, but the whole point is that they are confident it’s not going to fail).
Reality certainly is the ultimate arbiter, but regrettably we don’t get to ask Reality directly whether our theories are correct; all we can do is test them somewhat (in some cases it’s not even clear how to begin doing that; I’m looking at you, string theory) and that testing is done by fallible people using fallible equipment, and in many cases it’s very difficult to do in a way that actually lets you separate the signal from the noise, and most of us aren’t well placed to evaluate how fallibly it’s been done in any given case, and in practice usually we have to fall back on something like “scientific consensus” after all.
I think you and MattG are at cross purposes about the role he sees for calibration in science. The process by which actual primary scientific work becomes useful to people who aren’t specialists in the field goes something like this:
Alice does some work where she exposes laboratory rats to bad journalism and measures the rate at which they get cancer. (So do Alex, Amanda, Aloysius, et Al.)
She forms some opinions about this stuff; we could, in LW style, represent these opinions as some kind of probability distribution over relationships between bad journalism and cancer. Both her point estimates and her estimates of the distribution around them are strongly constrained by the work she’s done, but of course there are probably things she’s failed to think of. If she’s sensible, her opinions will include explicit allowance for having (maybe) made mistakes and missed things. Such considerations will probably not appear explicitly in the articles she publishes.
Bob talks to Alice (and Alex, Amanda, …) or reads the articles they publish.
As a result, Bob too forms opinions about this stuff, which again we can represent in probabilistic terms. Bob’s knowledge of the actual work is less direct than Alice’s, and his opinions are going to depend not only on Alice’s observed risk ratios and sample sizes and p-values and whatnot but also on how much he trusts Alice (having read her papers) to have done good work. And of course he will be trying to integrate what he learns from Alice with what he learns from Alex, Amanda et Al.
Bob may actually also be a primary researcher in the field, but here we’re considering him in his role as someone who has looked at the primary researchers’ work and drawn some conclusions.
Bob and Bill and Beth and Bert and all the other journo-oncologists (some of whom are in fact Alice and Alex etc.) all read more or less the same articles, and talk to one another at conferences, and write articles commenting on other people’s work. Over the next few years, journo-oncological opinion converges to a rough consensus that reading the Daily Mail probably does cause cancer, that further work might pin that down further, but that the field has higher research priorities.
Carol, a non-specialist who wants to know whether reading the Daily Mail causes cancer, talks to some experts in the field or reads a popular book on the subject or even gets into the journals and finds a review article or two.
As a result, Carol also forms opinions about journo-oncology. If she has the necessary skills she may also look cursorily at some of the primary literature and get some idea of how rigorous that work is, how big the sample sizes are, whether the research was funded by Rupert Murdoch, etc., but on the whole she’s dependent on what Bob and the other Bs tell her. So her opinions are going to be mostly shaped by what Bob says and what she thinks of Bob’s accuracy on this point.
Calibration (in the sense we’re talking about here) isn’t of much relevance to Alice when she’s doing the primary research. She will report that the Daily Mail is positively associated with brain cancer in rats (RR=1.3, n=50, CI=[1.1,1.5], p=0.01, etc., etc., etc.) and that’s more or less it. (I take it that’s the point you’ve been making.)
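For readers unfamiliar with the kind of figures Alice reports, here is a minimal sketch of how a relative risk and its 95% confidence interval are commonly computed from counts, using the standard log-scale approximation. The rat counts are invented and are not meant to reproduce the numbers quoted above.

```python
import math


def relative_risk_ci(exposed_cases, exposed_total, control_cases, control_total, z=1.96):
    """Relative risk with an approximate confidence interval.

    Uses the usual normal approximation on the log scale; z=1.96 gives 95%.
    Requires at least one case in each group.
    """
    risk_exposed = exposed_cases / exposed_total
    risk_control = control_cases / control_total
    rr = risk_exposed / risk_control
    se_log_rr = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / control_cases - 1 / control_total
    )
    low = math.exp(math.log(rr) - z * se_log_rr)
    high = math.exp(math.log(rr) + z * se_log_rr)
    return rr, low, high


# Hypothetical counts: 13 of 25 exposed rats and 10 of 25 control rats develop cancer.
rr, low, high = relative_risk_ci(13, 25, 10, 25)
print(f"RR = {rr:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Everything in this calculation follows mechanically from the counts, which is why calibration plays little role at this stage.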
But Bob’s opinion about the carcinogenicity of the Daily Mail (having read Alice’s papers) is an altogether slipperier thing; and the opinion to which he and Beth and the others converge is slipperier still. It’ll depend on their assessment of how likely it is that Alice made a mistake, how likely it is that Aloysius’s results are fraudulent given that he took a large grant from the DMG Media Propaganda Fund, etc.; and on how strongly Bob is influenced when he hears Bill say “… and of course we all know what a shoddy operation Alex’s lab is.”
It is in these later stages that better calibration could be valuable, and that I think Matt would like to see more explicit reference to it. He would like Bob and Bill and Beth and the rest to be explicit about what they think and why and how confidently, and he would like the consensus-generating process to involve weighing people’s opinions more or less heavily when they are known to be better or worse at the sort of subjective judgement required to decide how completely to mistrust Aloysius because of his funding.
I’m not terribly convinced that that would actually help much, for what it’s worth. But I don’t think what Matt’s saying is invalidated by pointing out that Alice’s publications don’t talk about (this kind of) calibration.
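One way the weighting of opinions described above could be made explicit is to score each expert’s past probability judgements and weight their current opinion accordingly. The sketch below uses the Brier score for the track record; the weighting rule (1 minus the Brier score), the track records, and the opinions are all assumptions made up for illustration, not a method anyone in this thread has proposed in detail.

```python
def brier_score(track_record):
    """Mean squared error of past probability judgements; lower is better.

    track_record: list of (stated_probability, outcome) with outcome 0 or 1.
    """
    return sum((p - outcome) ** 2 for p, outcome in track_record) / len(track_record)


def weighted_opinion(opinions, track_records):
    """Average current opinions, weighting each expert by 1 - Brier score.

    opinions: {name: probability}; track_records: {name: list of (p, outcome)}.
    Assumes at least one expert has a Brier score below 1.
    """
    weights = {name: 1 - brier_score(track_records[name]) for name in opinions}
    total = sum(weights.values())
    return sum(weights[name] * opinions[name] for name in opinions) / total


# Hypothetical data, for illustration only.
track_records = {
    "Bob": [(0.9, 1), (0.8, 1), (0.2, 0)],   # judgements mostly borne out
    "Bill": [(0.9, 0), (0.9, 0), (0.1, 1)],  # confident and wrong
}
opinions = {"Bob": 0.8, "Bill": 0.2}

consensus = weighted_opinion(opinions, track_records)
print(f"weighted consensus: {consensus:.2f} (unweighted mean would be 0.50)")
```

Whether a track record on other questions transfers to the judgement at hand is exactly the open question in this exchange.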
I mean, in practice most of the people making use of the work scientists have done aren’t really testing the scientists’ work for themselves (they’re kinda doing it implicitly by making use of that work, but the whole point is that they are confident it’s not going to fail).
First, I think the “implicitly” part is very important. That glowing gizmo with melted-sand innards in front of me works. By working it verifies, very directly, a whole lot of science.
And “working in practice” is what leads to confidence, not vice versa. When a sailor took the first GPS unit on a cruise, he didn’t say “Oh, science says it’s going to work, so that’s all going to be fine”. He took it as a secondary or, probably, a tertiary navigation device. Now, after years of working in practice sailors take the GPS as a primary device and most often, a second GPS as a secondary.
Note, by the way, that we want useful science and useful science leads to practical technologies that we test and use all the time.
Calibration (in the sense we’re talking about here) isn’t of much relevance to Alice when she’s doing the primary research.
Oh, good, we agree.
But Bob’s opinion … is an altogether slipperier thing; and the opinion to which he and Beth and the others converge is slipperier still.
Sure, that’s fine. Bob and Beth are not scientists and are not doing science. Allow me to quote myself: “Calibration is good for guesstimates, it’s not particularly valuable for actual research.” Bob and Bill and Beth and Bert are not doing actual research. They are trying to use published results to form some opinions, some guesstimates and, as I agree, their calibration matters for the quality of their guesstimates. But, again, that’s not science.
Bob and Beth are not scientists and are not doing science.
Bob and Beth are scientists (didn’t I make it clear enough in my gedankenexperiment that they are intended to be journo-oncologists just as much as Alice et al, it’s just that we’re considering them in a different role here?). And they are forming their opinions in the course of their professional activities. Doing science is not only about doing experiments and working out knotty theoretical problems; when two scientists discuss their work, they are doing science; when a scientist attends a conference presentation given by another, they are doing science; when a scientist sits and thinks about what might be a good problem to attack next, they are doing science.
Doing actual research is a more “central” scientific activity than those other things. But the other things are real, they are things scientists actually do, they are things scientists need to do, and I don’t see any reason to deny that doing them is part of how science (the whole collective enterprise) functions.
when a scientist sits and thinks about what might be a good problem to attack next, they are doing science.
Sure, and you’ve expanded the definition of “doing science” into uselessness. “Doodling on paper napkins is doing science!”—well, yeah, if you want it so, what next?
I’m not talking about the large variety of things scientists do in the course of their professional lives. I’m talking about the core concept of science and whether it, as MattG believes, “moves forward through something called scientific consensus”.
In particular, I would like to distinguish between “doing science” (discovering how the world works) and “applying science” (changing the world based on your beliefs about how it works).
Let’s distinguish two things. (1) The core activities of science are, for sure, things like doing carefully designed experiments and applying mathematics to make quantitative predictions based on precisely formulated theories. These activities, indeed, don’t proceed by consensus, but no one claimed otherwise; even to ask whether they do is a type error. (2) How scientific knowledge actually advances. This is not only a matter of #1; if we had nothing but #1 then science wouldn’t advance at all, because in order for science to advance each scientist’s work needs to be based in, or at least aware of, the work of their predecessors. And #2, as it happens, does involve something like consensus, and it’s reasonable to wonder whether being more explicitly and carefully rational about #2 would help science to advance more effectively. And that is what (AIUI) MattG is proposing.
I do believe MattG claimed otherwise. At least that was the most straightforward reading of what he said.
in order for science to advance each scientist’s work needs to be based in, or at least aware of, the work of their predecessors.
That is true, the scientists do trust what’s considered “solved”, but that trust is conditional. One little ugly fact can blow up a lot of consensus sky-high.
I think one of the core issues here is resistance to cargo cult science. Consensus is dangerous because it enables cargo cults, but the sceptical “show me” attitude is invaluable here.
more explicitly and carefully rational about #2 would help science to advance more effectively
What do you mean by “carefully rational”? How is that better than the baseline “show me”?
I think you can only reach that conclusion by applying your preferred definition of “science” to MattG’s statement about science. That’s a mistake unless you know he’s not using a substantially different definition.
that trust is conditional
Yes, of course. (Did anyone suggest it’s not?)
For the avoidance of doubt, I am not for a minute suggesting blind or unquestioning trust of scientific consensus; at least, not for scientists. (It is possible that below some threshold of scientific competence blind trust is in fact the best available strategy.)
What do you mean by “carefully rational”?
I mean what happens if the Bobs in my thought experiment, rather than arriving at their opinions informally and qualitatively, think explicitly about what they’ve heard and read and about how much evidence each thing they’ve heard or read provides, and determine their own opinions by deliberate reflection on that (not necessarily by actual calculation, but with that always available in cases of doubt).
This might well not be an improvement (e.g., because System 1 has hardware support that System 2 doesn’t) but it’s not obvious that it isn’t.
How is that better than the baseline “show me”?
“Carefully rational” isn’t a proposed replacement for “show me”, it’s a proposed replacement for things like “I’ve read about this in a few papers so I’ll assume it’s true” (which probably doesn’t get said explicitly very often, of course).
“Show me” is always there (usually in the background) as an option. Most scientists, most of the time, don’t go banging on other scientists’ lab doors demanding further evidence for what’s in their papers. Most scientists, most of the time, don’t attempt to replicate other scientists’ results before (at least provisionally) accepting them.
(One reason is that replication and door-banging take effort. This is also an argument against the more explicit “carefully rational” approach I think MattG is advocating.)
In the absence of any more information than that you “fail to discern [my] point”, I don’t know what I can usefully say to help. In ascending order of cynicism:
If nothing in my previous comment conveyed any meaning to you at all, then it seems like we have a big impedance mismatch and fixing the problem (whatever it is) seems likely to be more trouble than it’s worth.
If you just can’t be bothered to say with any specificity what the problem is, then I suppose that indicates that you think your time is much more valuable than mine, a position I cordially decline to share.
If you’re just being generally dismissive because that’s rhetorically more effective than engagement, I’m not interested in discussion on those terms.
(I’m sorry if you find my style uncongenially cautious. This deep into a tangential discussion like this one, I’d expect much of what’s said to be clarifications and edge-nibbling, and in particular it seems peculiar to (1) ask questions of the form “what do you mean by X and why is it better than Y?” and then (2) complain that you’re getting clarification and edge-nibbling in response.)
I mean it literally. I can’t see a coherent position behind your criticisms, there is no overarching framework which backs them up. I don’t understand what is the core of your disagreement amongst all the clarifications.
I don’t know that my disagreement has a single core. It looks to me as if you are making a number of separate (but related) mistakes.
I think you are defining “science” narrowly, to include only actual experimentation and analysis, then interpreting MattG’s comments as if he is using a similarly narrow definition of “science” (which he has said he isn’t). This is a mistake because of course what someone says is liable to come out wrong when you give their words different meanings from the ones they had in mind.
I think you are defining “science” narrowly, to include only actual experimentation and analysis, in a discussion of whether knowledge would advance more effectively if scientists explicitly represented their beliefs about scientific theories in probabilistic terms, did something like Bayes-rule updates on learning new things, and attempted to monitor the reliability of other scientists using notions like “calibration”. This is a mistake because the question at issue is not about actual experimentation and analysis.
I think you are writing as if the only important things scientists do in their capacity as scientists are actual experimentation and analysis. This is a mistake because science is in fact a collective endeavour whose success in advancing knowledge depends on scientists’ communication with other scientists, and evaluation of their work.
Perhaps this is the core: I do not think that, in this discussion, it is helpful for you to insist on a narrow definition of what counts as “science”. I think your suggestion upthread that the only alternative is to say that absolutely anything is “science” is ridiculous. I don’t have any objection to a narrow definition of “science” as such; there are surely contexts in which it’s better than a broad one; but I don’t think this discussion is such a context.
Perhaps this is the core: I do not think that, in this discussion, it is helpful for you to insist on a narrow definition of what counts as “science”
Interesting. I don’t perceive this subthread as mostly about definitions, I think of it as being about the balance between two approaches to claims about reality: the hard one (“show me”, see also this) and the soft one (“let’s construct a subjective probability assessment on the basis of opinions of experts”).
Notably, this subthread started with MattG saying “Science moves forward through something called scientific consensus” and me going “Whaaaa...?”
I also don’t think the discussion is about definitions, but I think it’s being made needlessly more difficult by differences in definitions.
It is (I think) a simple matter of empirical fact that most of the time scientists get information from one another without saying “show me!”. That doesn’t mean that “show me!” isn’t always there in the background—it is—but only that the actual practice of science-broadly-conceived (by which I don’t mean “science-narrowly-conceived plus fake science”, I mean “science-narrowly-conceived plus the other things scientists do without which science as a whole would make much less progress”) does in fact involve subjective probability assessments on the basis of experts’ opinions.
It is (I think) a simple matter of empirical fact that most of the time scientists get information from one another without saying “show me!”.
Actually, I will disagree with that. There is a reason published papers consist mostly of detailed descriptions of what was done and what happened. If what you are saying were true, executive summaries would suffice: We have discovered that frobnicating frotzed blivets leads to emission of magic smoke. The End.
Certainly, large parts of scientific knowledge have passed into the “just accept it’s true” realm, but any new claims are required to be supported by fairly large amounts of “show me”.
If what you are saying were true, executive summaries would suffice
I don’t see why. The details are there for the following reasons, none of which appears to me to be invalidated by anything I’ve said. (1) They are interesting for their own sake (to those immersed in the field, at least). (2) They clarify what useful opportunities there may be for followup work (“Hmm, all their blivets were frotzed with titanium chloride. What happens if we use uranium nitride instead?”). (3) They provide a way to do “show me!”-like checks for those relatively few who want to without needing to interrogate the authors (replicating the analysis is easier than replicating the experiment). (4) They provide, in principle, the information needed for a more thorough “show me!” check (outright replication) for those even fewer who want to do that.
If you’ve got the impression that I don’t agree that independent experimental test is the nearest thing we have to an ultimate arbiter of scientific truth, then I’ve been unclear or you’ve been obtuse or both; I do agree with that. Most of the time, though, scientists don’t go all the way to the ultimate arbiter.
But this thread has drifted far from reality. It began with Lumifer’s comment about estimates of historical poverty:
The charts posted claim to reflect the entire world and they go back to early XIX century. Whole-world data at that point is nothing but a collection of guesstimates.
To which MattG replied:
My understanding is you basically get a bunch of economists in the room to break down the problem into relevant parts, then get a bunch of historians in the room, calibrate them, get them to give credible intervals for the relevant data, and plug it all in to the model.
Lumifer:
Is this how you think it works or is this how you think it should work?
MattG:
It’s how I think it works.
And the conversation drifted into the stratosphere with no further discussion of where those numbers actually came from.
You’re thinking about this in terms of forecasting. This is not forecasting, this is historical studies.
Consider the hard sciences equivalent: you take, say, some geneticists and try to figure out whether their estimates of which genes cause what are any good by asking them questions about quantum physics to “check how they are calibrated”.
No. Bayesian estimate calibration is most often used in forecasting, but it’s effective in any domain in which there’s uncertainty, including hard sciences. In fact, calibration training is often done with either numerical trivia, using 90% credible intervals, or with true or false questions using a single percentage estimate. I recommend checking out “How to Measure Anything” for a more in-depth treatment.
Yes, that’s essentially how it works, except that you then give them feedback to see if they’re over- or underconfident. They’d have to be relatively easy questions though, otherwise all the estimates would cluster around fifty percent and it wouldn’t be very useful training for high resolution answers.
Citation needed.
Not all uncertainty is created equal. If uncertainty comes from e.g. measurement limitations, the Bayesian calibration is useless.
Note that science is mostly about creating results that can be replicated by anyone regardless of how well or badly calibrated they are.
That’s how you imagine it to work, since I don’t expect anyone to actually be doing this. But let’s see, assume we have successfully run the calibration exercises with our group of geneticists. What do you expect them to change in their studies of which genes do what? We can get even more specific, let’s say we’re talking about one of the twin studies where the author tracked a set of twins, tested them on some phenotype feature X, and is reporting the results that the twins correlate Y% while otherwise similar general population is correlated Z%. What results would better calibration affect?
That was an overconfident statement, but for more on how Calibration is useful in places other than Forecasting, check out “How to Measure Anything” as mentioned in the last comment.
Once calibrated, they can estimate how sure they are of certain hypotheses, and how likely treatments based on those hypotheses are to lead to lives saved. This in turn lets them decide which experiment to run next using value of information calculations.
Furthermore, by taking a survey of many of these calibrated genetic experts and then extremizing their results, you can get an idea of how likely certain hypotheses are to turn out to be correct.
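For anyone unsure what “extremizing” a survey means in practice, here is a minimal sketch of one common recipe: average the experts’ log-odds, then scale the result away from 50%. The function names, the survey numbers, and the scaling factor `a` are all assumptions made up for illustration; nothing in this thread specifies them.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x: float) -> float:
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def extremized_pool(probs, a=2.5):
    """Average the experts' log-odds, then multiply by `a`.

    a = 1 gives a plain log-odds average; a > 1 pushes the pooled
    estimate away from 0.5. The default 2.5 is an arbitrary
    illustrative choice, not a recommended value.
    """
    mean_log_odds = sum(logit(p) for p in probs) / len(probs)
    return inv_logit(a * mean_log_odds)

# Hypothetical survey: four experts' probabilities for one hypothesis.
survey = [0.6, 0.7, 0.65, 0.75]
plain = extremized_pool(survey, a=1.0)   # no extremizing
pushed = extremized_pool(survey, a=2.5)  # pushed further from 0.5
```

The idea is that each expert holds only part of the evidence, so the pooled view can reasonably be more confident than any one of them. How hard to push is a judgement call that would need to be fitted on past questions.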
I don’t know if you read scientific papers, but they don’t “make estimates on how sure they are of certain hypotheses”. They present the data and talk about the conclusions and implications that follow from the data presented. The potential hypotheses are evaluated on the basis of data, not on the basis of how well-calibrated a particular researcher feels.
Calibration is good for guesstimates, it’s not particularly valuable for actual research.
That’s forecasting. Remember, we’re not talking about forecasting.
I’m not really sure how to answer this because I think you misunderstand calibration.
Science moves forward through something called scientific consensus. How does scientific consensus work right now? Well, we just kind of use guesswork. Expert calibration is a more useful way to understand what the scientific consensus actually is.
No, it’s a decision model. The decision model uses a forecast “How many lives can be saved”, but it also uses calibration of known data “Based on the data you have, how sure are you that this particular fact is true”.
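To make the “decision model” concrete, here is a rough sketch of the simplest value of information calculation: the expected value of perfect information for a single yes/no hypothesis. The actions, payoffs, and the 30% probability are invented for illustration, and a real model would have many more uncertain inputs.

```python
def expected_value_of_perfect_information(p_true, payoffs):
    """EVPI for one binary hypothesis.

    payoffs maps each action name to a pair
    (payoff_if_hypothesis_true, payoff_if_hypothesis_false).
    """
    # Best expected payoff if we must act now on current beliefs.
    best_now = max(p_true * t + (1 - p_true) * f for t, f in payoffs.values())
    # Expected payoff if we learn the truth first, then pick the best action.
    best_if_true = max(t for t, _ in payoffs.values())
    best_if_false = max(f for _, f in payoffs.values())
    with_info = p_true * best_if_true + (1 - p_true) * best_if_false
    return with_info - best_now

# Hypothetical numbers, in arbitrary "lives saved" units.
payoffs = {
    "fund_treatment": (100.0, -20.0),  # helps if the hypothesis holds, wasted effort if not
    "do_nothing": (0.0, 0.0),
}
evpi = expected_value_of_perfect_information(0.3, payoffs)
```

With these made-up inputs the answer is about 14 units. An experiment that would settle the question is worth running only if it costs less than that. The subjective probability is an input to the calculation, which is why its calibration matters here.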
No. This is absolutely false. Science moves forward through being able to figure out better and better how reality works. Consensus is really irrelevant to the process. The ultimate arbiter is reality regardless of what a collection of people with advanced degrees can agree on.
That has nothing to do with calibration. “How many lives can be saved” is properly called a point forecast which provides an estimate of the center of the distribution. These are very popular but also limited because a much more useful forecast would come with an expected error and, ideally, would specify the shape of the distribution as well.
“Based on the data you have, how sure are you that this particular fact is true” is properly a question about the standard error of the estimate and it has nothing to do with subjective beliefs (well-calibrated or not) of the author.
I only care about someone’s calibration if I’m asking him to guess. If the answer is “based on the data”, it is based on the data and calibration is irrelevant.
While this is completely true, and consensus only plays a minor role in science, it’s not true that consensus is irrelevant. Given no other information about a certain hypothesis other than that the majority of scientists believe it to be true, the rational course of action would be to adjust belief in the hypothesis upward. Of course, evidence contradicting the hypothesis would nullify this consensus effect. Even a small amount of evidence trumps a large consensus.
No, that’s the popular conception of science, but unfortunately it’s not an oracle that proves reality true or false. What observation and experiments give us are varying levels of evidence that can falsify some hypotheses and point towards the truth of other hypotheses. We then use human reasoning to put all this evidence together and let humans decide how sure they are of something. If they have lots and lots of evidence that thing can become a “theory” based on the consensus that there’s quite a lot of it and it’s really good, and even more evidence that’s even better makes that thing a “law”. But it’s based on a subjective sense of “how good these data are.”
Not quite. It also has to do with all the other previous experiments done, your certainty in the model itself, your ideas about how reality works, and a lot of other things.
Yes, ideally this would be a credible interval with an estimated distribution, but even a credible interval assuming a uniform distribution would be very useful for this purpose.
In terms of calibration, if someone gives credible intervals with 90% confidence, then the better calibrated they are, the more sure you can be that if they make 100 such estimates, around 90 of the true values will lie within the intervals they gave.
Well calibrated people will base their guesses on data, poorly calibrated people will not. Your understanding of calibration isn’t in line with research done by Douglas Hubbard, Philip Tetlock, and others who research human judgement.
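The 90% interval claim above is easy to check mechanically once the true answers are known. A minimal sketch, with made-up intervals and answers (the function name is also just for illustration):

```python
def interval_hit_rate(intervals, truths):
    """Fraction of true values that fall inside the stated (low, high) intervals."""
    hits = sum(low <= x <= high for (low, high), x in zip(intervals, truths))
    return hits / len(truths)

# Hypothetical 90% credible intervals from one estimator,
# paired with the answers revealed afterwards.
stated = [(10, 20), (0, 5), (100, 300), (1, 2), (40, 60)]
actual = [12, 7, 250, 1.5, 41]

rate = interval_hit_rate(stated, actual)
# A well-calibrated estimator's hit rate should be near 0.9 over many
# questions; well below that suggests overconfidence (intervals too
# narrow), well above suggests underconfidence.
```

Five questions is far too few to say anything; the spread of a hit rate at this sample size is large, so a real check needs dozens of intervals at least.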
Heh. Do you mean that’s a conception of science held by not-too-smart uneducated people? X-)
Sense make not. Reality is always true.
Speaking generally, you seem to treat science as people asserting certain things and so, to decide on how much to trust them, you need to know how calibrated those people are. That seems very different from my perception of science which is based on people saying “This is so, you can test it yourself if you want”.
Under your approach, the goal is achieving consensus. Under my system, the goal is to provide replicability and show that it actually works.
Data does not depend on calibration of particular people.
I think we have to separate two ideas here.
There’s the data you get from an experiment
There’s the conclusions you can draw from that data.
I would agree that the data does not depend on the calibration of particular people. But the conclusions you get from that data DO need to be calibrated. Furthermore, other scientists may want to do experiments based on those conclusions… their decision to do that will really be based on how likely they think the conclusions are accurate. The process of science is building new conclusions on the basis of those old conclusions—if it’s just about gathering the data, you never gain a deeper understanding of reality.
In the word “conclusions” you conflate two different things which I wish to keep separate.
One of them is subjective opinion/guesstimate/evaluation/conclusion of a person. I agree that the calibration of the person whose opinion we care about is relevant.
The other is objective facts/observations/measurements/conclusions that do not depend on anyone in particular. That’s not just “data” from your first point. That’s also conclusions that follow from the data in an explicit, non-subjective way. A study can perfectly well come to some conclusions by showing how the data leads to them without depending on anyone’s calibration.
The answer to doubts about the first kind of conclusions is “trust me because I know what I’m talking about”. The answer to doubts about the second kind of conclusions is “you don’t have to trust me, see for yourself”.
I continue to disagree. In your concept of science the idea of testing against reality is somewhere in the back row. What’s important is achieving consensus and being well-calibrated. I don’t think this is what science is about.
Let’s stop using the word “science” because I don’t really care how we define that specific word.
Let’s change it instead to “the process of learning things about reality” because that’s what I’m talking about. I think it’s what you’re talking about as well, but traditionally science can also mean “the process of running experiments”—and if we defined it that way, then I’d agree that calibration isn’t needed.
I can’t think of an example where conclusions are proven true from data in a specific, non-subjective way. Science works on falsification—you can prove things false in a specific, non-subjective way (assuming you trust completely in the protocol and the people running it), but you can’t prove things true, because there’s still ANOTHER experiment someone could run in different conditions that could theoretically falsify your current hypothesis. Furthermore, you may get the correlation right, but may misunderstand the causation.
Don’t get too caught up on this example, because it’s just a silly illustration of a general point, but say you made a hypothesis that “An object falling due to gravity accelerates at a rate of 9.8 meters/second squared”. You could run many experiments with data that fit your hypothesis, but it’s always possible that an alternative hypothesis is the right one, such as “Objects accelerate at 9.8 meters/second squared, except on Tuesdays when it’s a full moon”. Unless you had specifically tested that scenario, that hypothesis has some infinitesimal chance of being right, and the thing is, there’s no way to test ALL of the potential scenarios.
That’s where calibration comes in: you don’t have certainty that objects accelerate at that rate due to gravity in every situation, but as you confirm it in more and more situations, you (and the scientific community) become more and more certain that it’s the correct hypothesis. But even then, someone like Einstein can come along, find some random edge case involving the speed of light where the hypothesis doesn’t hold, and present a better one.
“The process of learning things about reality” is much MUCH larger and more varied than science.
That ain’t where goalposts used to be :-/
We just had different goal posts. You learned science as “running an experiment”—I learned science as “Doing background research, determining likely outcomes, running experiments, sharing results back with the community”. That’s why I tabooed the word, to make sure we were on the same page.
Are we in agreement about the basic concept, if we agree that we have two different definitions of science?
Do tell. Where and how did you “learn science” this way?
What is the “basic concept”?
Throughout elementary and middle school (early education here in the US) through textbooks with diagrams like this
That experiments can give you mostly non-subjective data about one experiment, but to draw broader conclusions about how the world works you have to combine the data from many experiments into a subjective estimate about how likely a hypothesis is.
That does not strike me as an adequate basis for deciding what science is or is not.
So, are you saying that the outcome of science is a set of subjective estimates that most people agree with?
Words mean different things to different people… as I said, I’m not interested in arguing over the “proper” definition of this word. I’m interested in clarifying the process through which experiments lead to new knowledge about the world. You can call this process “not science” and I won’t argue—it’s not an interesting argument to me.
I’m not sure… what do you mean by “the outcome of science?”
Aren’t both these views of science oversimplifications? I mean, in practice most of the people making use of the work scientists have done aren’t really testing the scientists’ work for themselves (they’re kinda doing it implicitly by making use of that work, but the whole point is that they are confident it’s not going to fail).
Reality certainly is the ultimate arbiter, but regrettably we don’t get to ask Reality directly whether our theories are correct; all we can do is test them somewhat (in some cases it’s not even clear how to begin doing that; I’m looking at you, string theory) and that testing is done by fallible people using fallible equipment, and in many cases it’s very difficult to do in a way that actually lets you separate the signal from the noise, and most of us aren’t well placed to evaluate how fallibly it’s been done in any given case, and in practice usually we have to fall back on something like “scientific consensus” after all.
I think you and MattG are at cross purposes about the role he sees for calibration in science. The process by which actual primary scientific work becomes useful to people who aren’t specialists in the field goes something like this:
Alice does some work where she exposes laboratory rats to bad journalism and measures the rate at which they get cancer. (So do Alex, Amanda, Aloysius, et Al.)
She forms some opinions about this stuff; we could, in LW style, represent these opinions as some kind of probability distribution over relationships between bad journalism and cancer. Both her point estimates and her estimates of the distribution around them are strongly constrained by the work she’s done, but of course there are probably things she’s failed to think of. If she’s sensible, her opinions will include explicit allowance for having (maybe) made mistakes and missed things. Such considerations will probably not appear explicitly in the articles she publishes.
Bob talks to Alice (and Alex, Amanda, …) or reads the articles they publish.
As a result, Bob too forms opinions about this stuff, which again we can represent in probabilistic terms. Bob’s knowledge of the actual work is less direct than Alice’s, and his opinions are going to depend not only on Alice’s observed risk ratios and sample sizes and p-values and whatnot but also on how much he trusts Alice (having read her papers) to have done good work. And of course he will be trying to integrate what he learns from Alice with what he learns from Alex, Amanda et Al.
Bob may actually also be a primary researcher in the field, but here we’re considering him in his role as someone who has looked at the primary researchers’ work and drawn some conclusions.
Bob and Bill and Beth and Bert and all the other journo-oncologists (some of whom are in fact Alice and Alex etc.) all read more or less the same articles, and talk to one another at conferences, and write articles commenting on other people’s work. Over the next few years, journo-oncological opinion converges to a rough consensus that reading the Daily Mail probably does cause cancer, that further work might pin that down further, but that the field has higher research priorities.
Carol, a non-specialist who wants to know whether reading the Daily Mail causes cancer, talks to some experts in the field or reads a popular book on the subject or even gets into the journals and finds a review article or two.
As a result, Carol also forms opinions about journo-oncology. If she has the necessary skills she may also look cursorily at some of the primary literature and get some idea of how rigorous that work is, how big the sample sizes are, whether the research was funded by Rupert Murdoch, etc., but on the whole she’s dependent on what Bob and the other Bs tell her. So her opinions are going to be mostly shaped by what Bob says and what she thinks of Bob’s accuracy on this point.
Calibration (in the sense we’re talking about here) isn’t of much relevance to Alice when she’s doing the primary research. She will report that the Daily Mail is positively associated with brain cancer in rats (RR=1.3, n=50, CI=[1.1,1.5], p=0.01, etc., etc., etc.) and that’s more or less it. (I take it that’s the point you’ve been making.)
But Bob’s opinion about the carcinogenicity of the Daily Mail (having read Alice’s papers) is an altogether slipperier thing; and the opinion to which he and Beth and the others converge is slipperier still. It’ll depend on their assessment of how likely it is that Alice made a mistake, how likely it is that Aloysius’s results are fraudulent given that he took a large grant from the DMG Media Propaganda Fund, etc.; and on how strongly Bob is influenced when he hears Bill say “… and of course we all know what a shoddy operation Alex’s lab is.”
It is in these later stages that better calibration could be valuable, and that I think Matt would like to see more explicit reference to it. He would like Bob and Bill and Beth and the rest to be explicit about what they think and why and how confidently, and he would like the consensus-generating process to involve weighing people’s opinions more or less heavily when they are known to be better or worse at the sort of subjective judgement required to decide how completely to mistrust Aloysius because of his funding.
I’m not terribly convinced that that would actually help much, for what it’s worth. But I don’t think what Matt’s saying is invalidated by pointing out that Alice’s publications don’t talk about (this kind of) calibration.
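For concreteness, one way the “weigh people’s opinions by how good their judgement has been” step could be sketched is below. The weighting rule (inverse Brier score on past resolved questions) is an assumption picked for simplicity, not something Matt or anyone else in this thread proposed. Note also that the Brier score rewards overall accuracy, which mixes calibration with other things. The names and numbers are invented.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

def track_record_pool(current, history, outcomes, eps=1e-6):
    """Weighted average of current probabilities, weight = 1 / (past Brier score + eps).

    eps only guards against division by zero for a perfect record.
    """
    weights = {
        name: 1.0 / (brier_score(history[name], outcomes) + eps)
        for name in current
    }
    total = sum(weights.values())
    return sum(weights[name] * current[name] for name in current) / total

# Hypothetical track records on three past questions that have since resolved.
outcomes = [1, 0, 1]
history = {"Bob": [0.9, 0.1, 0.8], "Bill": [0.5, 0.5, 0.5]}

# Their current probabilities that the Daily Mail causes cancer.
current = {"Bob": 0.8, "Bill": 0.4}

pooled = track_record_pool(current, history, outcomes)
```

Here Bob’s better record pulls the pooled figure much closer to his 0.8 than a plain average of 0.6 would be. Whether this kind of bookkeeping beats the informal process is exactly the open question.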
First, I think the “implicitly” part is very important. That glowing gizmo with melted-sand innards in front of me works. By working it verifies, very directly, a whole lot of science.
And “working in practice” is what leads to confidence, not vice versa. When a sailor took the first GPS unit on a cruise, he didn’t say “Oh, science says it’s going to work, so that’s all going to be fine”. He took it as a secondary or, probably, a tertiary navigation device. Now, after years of working in practice sailors take the GPS as a primary device and most often, a second GPS as a secondary.
Note, by the way, that we want useful science and useful science leads to practical technologies that we test and use all the time.
Oh, good, we agree.
Sure, that’s fine. Bob and Beth are not scientists and are not doing science. Allow me to quote myself: “Calibration is good for guesstimates, it’s not particularly valuable for actual research.” Bob and Bill and Beth and Bert are not doing actual research. They are trying to use published results to form some opinions, some guesstimates and, as I agree, their calibration matters for the quality of their guesstimates. But, again, that’s not science.
Bob and Beth are scientists (didn’t I make it clear enough in my gedankenexperiment that they are intended to be journo-oncologists just as much as Alice et al, it’s just that we’re considering them in a different role here?). And they are forming their opinions in the course of their professional activities. Doing science is not only about doing experiments and working out knotty theoretical problems; when two scientists discuss their work, they are doing science; when a scientist attends a conference presentation given by another, they are doing science; when a scientist sits and thinks about what might be a good problem to attack next, they are doing science.
Doing actual research is a more “central” scientific activity than those other things. But the other things are real, they are things scientists actually do, they are things scientists need to do, and I don’t see any reason to deny that doing them is part of how science (the whole collective enterprise) functions.
Sure, and you’ve expanded the definition of “doing science” into uselessness. “Doodling on paper napkins is doing science!”—well, yeah, if you want it so, what next?
I’m not talking about the large variety of things scientists do in the course of their professional lives. I’m talking about the core concept of science and whether it, as MattG believes, “moves forward through something called scientific consensus”.
In particular, I would like to distinguish between “doing science” (discovering how the world works) and “applying science” (changing the world based on your beliefs about how it works).
Let’s distinguish two things. (1) The core activities of science are, for sure, things like doing carefully designed experiments and applying mathematics to make quantitative predictions based on precisely formulated theories. These activities, indeed, don’t proceed by consensus, but no one claimed otherwise; even to ask whether they do is a type error. (2) How scientific knowledge actually advances. This is not only a matter of #1; if we had nothing but #1 then science wouldn’t advance at all, because in order for science to advance each scientist’s work needs to be based in, or at least aware of, the work of their predecessors. And #2, as it happens, does involve something like consensus, and it’s reasonable to wonder whether being more explicitly and carefully rational about #2 would help science to advance more effectively. And that is what (AIUI) MattG is proposing.
I do believe MattG claimed otherwise. At least that was the most straightforward reading of what he said.
That is true, the scientists do trust what’s considered “solved”, but that trust is conditional. One little ugly fact can blow up a lot of consensus sky-high.
I think one of the core issues here is resistance to cargo cult science. Consensus is dangerous because it enables cargo cults, but the sceptical “show me” attitude is invaluable here.
What do you mean by “carefully rational”? How is that better than the baseline “show me”?
I think you can only reach that conclusion by applying your preferred definition of “science” to MattG’s statement about science. That’s a mistake unless you know he’s not using a substantially different definition.
Yes, of course. (Did anyone suggest it’s not?)
For the avoidance of doubt, I am not for a minute suggesting blind or unquestioning trust of scientific consensus; at least, not for scientists. (It is possible that below some threshold of scientific competence blind trust is in fact the best available strategy.)
I mean what happens if the Bobs in my thought experiment, rather than arriving at their opinions informally and qualitatively, think explicitly about what they’ve heard and read and about how much evidence each thing they’ve heard or read provides, and determine their own opinions by deliberate reflection on that (not necessarily by actual calculation, but with that always available in cases of doubt).
This might well not be an improvement (e.g., because System 1 has hardware support that System 2 doesn’t) but it’s not obvious that it isn’t.
“Carefully rational” isn’t a proposed replacement for “show me”, it’s a proposed replacement for things like “I’ve read about this in a few papers so I’ll assume it’s true” (which probably doesn’t get said explicitly very often, of course).
“Show me” is always there (usually in the background) as an option. Most scientists, most of the time, don’t go banging on other scientists’ lab doors demanding further evidence for what’s in their papers. Most scientists, most of the time, don’t attempt to replicate other scientists’ results before (at least provisionally) accepting them.
(One reason is that replication and door-banging take effort. This is also an argument against the more explicit “carefully rational” approach I think MattG is advocating.)
I fail to discern your point. There are a lot of clarifications, adjustments, and edge-nibbling, but what is it that you want to say?
In the absence of any more information than that you “fail to discern [my] point”, I don’t know what I can usefully say to help. In ascending order of cynicism:
If nothing in my previous comment conveyed any meaning to you at all, then it seems like we have a big impedance mismatch and fixing the problem (whatever it is) seems likely to be more trouble than it’s worth.
If you just can’t be bothered to say with any specificity what the problem is, then I suppose that indicates that you think your time is much more valuable than mine, a position I cordially decline to share.
If you’re just being generally dismissive because that’s rhetorically more effective than engagement, I’m not interested in discussion on those terms.
(I’m sorry if you find my style uncongenially cautious. This deep into a tangential discussion like this one, I’d expect much of what’s said to be clarifications and edge-nibbling, and in particular it seems peculiar to (1) ask questions of the form “what do you mean by X and why is it better than Y?” and then (2) complain that you’re getting clarification and edge-nibbling in response.)
Consensus is the result, not the means.
But this thread has drifted far from reality. It began with Lumifer’s comment about estimates of historical poverty:
To which MattG replied:
Lumifer:
MattG:
And the conversation drifted into the stratosphere with no further discussion of where those numbers actually came from.
Consensus is the result, not the means.