The fact that Big Pharma has to lay off a lot of scientists is a real-world indication that the model of finding a drug target, screening thousands of compounds against it, running those compounds through clinical trials to find out whether they are any good, and then coming out with drugs that cure important illnesses at the other end has stopped producing results.
This seems like extremely weak evidence. Diminishing marginal returns is a common thing in many areas. For example, engineering better trains happened a lot in the second half of the 19th century and the early 20th century. That slowed down, not because of some lack of theoretical background, but because the technology reached maturity. Now, improvements in train technology do occur, but slowly.
Saying that there’s a file drawer problem is quite easy. That, however, is not a solution. I think your problem is that you can’t imagine a theory that would solve the problem. That’s typical. If it were easy to imagine a theoretical breakthrough beforehand, it wouldn’t be much of a breakthrough.
On the contrary. We have ways of handling the file drawer problem, and they aren’t theory-based. Pre-registration of studies works. It isn’t even clear to me what it would mean to have a theoretical solution to the file drawer problem, given that it is a problem of scientific culture, and a type of problem that exists in any field. It makes about as much sense as talking about how having better theory could somehow solve Type I errors.
Look at a theoretical breakthrough like moving from the model of numbers as IV+II=VI to 4+2=6. If you had talked with Pythagoras, he probably couldn’t have imagined a theoretical breakthrough like that.
The ancient Greeks used the Babylonian number system and the Greek system. They did not use Roman numerals.
It isn’t even clear to me what it would mean to have a theoretical solution to the file drawer problem, given that it is a problem of scientific culture, and a type of problem that exists in any field.
The file drawer problem is about the estimate of an effect. If you can estimate exactly how large the effect is when you look at the question of whether to take a certain drug, you solve the problem, because you can just run the numbers.
On the contrary. We have ways of handling the file drawer problem, and they aren’t theory-based. Pre-registration of studies works.
The concept of the file drawer problem first appeared in 1976, if I can trust Google Ngrams.
How much money do you think it cost to run the experiments to come up with the concept of the file drawer problem and the concept of pre-registration of studies?
I don’t think that’s knowledge that got created by running expensive experiments. It came from people engaging in theoretical thinking.
It makes about as much sense as talking about how having better theory could somehow solve Type I errors.
Type I errors are a feature of frequentist statistics. If you don’t use null hypotheses, you don’t make Type I errors. Bayesians don’t make Type I errors because they don’t have null hypotheses.
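To make the claimed distinction concrete, here is a toy sketch (made-up trial numbers, flat Beta(1,1) priors assumed; nothing here is real trial data): the frequentist route ends in a p-value destined for a reject/don’t-reject rule, while the Bayesian route ends in a posterior probability.

```python
import random
from math import erf, sqrt

# Toy two-arm trial (made-up numbers): 30/50 respond on the drug, 20/50 on placebo.
drug_n, drug_s = 50, 30
plac_n, plac_s = 50, 20

# Frequentist route: a two-proportion z-test yields a p-value, which then
# feeds a binary reject/don't-reject rule -- the step where a Type I error lives.
p1, p2 = drug_s / drug_n, plac_s / plac_n
p_pool = (drug_s + plac_s) / (drug_n + plac_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / drug_n + 1 / plac_n))
z = (p1 - p2) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided

# Bayesian route: with flat Beta(1,1) priors the posteriors are Beta
# distributions; estimate P(drug rate > placebo rate) by Monte Carlo.
random.seed(0)
draws = 100_000
post = sum(
    random.betavariate(1 + drug_s, 1 + drug_n - drug_s)
    > random.betavariate(1 + plac_s, 1 + plac_n - plac_s)
    for _ in range(draws)
) / draws

print(f"two-sided p-value: {p_value:.3f}")
print(f"posterior P(drug better): {post:.3f}")
```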
How much money do you think it cost to run the experiments to come up with the concept of the file drawer problem and the concept of pre-registration of studies? I don’t think that’s knowledge that got created by running expensive experiments. It came from people engaging in theoretical thinking.
After some background about NHST on page 1, Sterling immediately begins tallying tests of significance in a year’s worth of 4 psychology journals, on page 2, and discovers that, e.g., of 106 tests, 105 rejected the null hypothesis. On page 3, he discusses how this bias could come about.
So at least in this very early discussion of publication bias, it was driven by people engaged in empirical thinking.
After some background about NHST on page 1, Sterling immediately begins tallying tests of significance in a year’s worth of 4 psychology journals, on page 2, and discovers that, e.g., of 106 tests, 105 rejected the null hypothesis. On page 3, he discusses how this bias could come about.
I think doing a literature review is engaging with other people’s data. For the sake of this discussion, JoshuaZ claimed that Einstein was doing theoretical work when he worked with other people’s data.
If I want to draw information from a literature review to gather insights, I don’t need expensive equipment. JoshuaZ claimed that you need expensive equipment to gather new insights in biology. I claim that’s not true.
I claim that there is enough published information that’s not well organised into theories that you can make major advances in biology without needing to buy any equipment.
As far as I understand, you don’t run experiments on participants to see whether Dual n-back works. You simply gathered Dual n-back data from other people and tried it yourself to know what it feels like.
That’s not expensive. You don’t need to write large grants to get a lot of money to do that kind of work.
You do need some money to pay your bills. Einstein made that money by being a patent clerk. I don’t know how you make your money to live. Of course you don’t have to tell, and I respect it if that’s private information.
For all I know you could be making money by being a patent clerk like Einstein.
A scientist who can’t work on his grant projects because of the government shutdown could use his free time to do the kind of work that you are doing.
If you don’t like the label “theoretic”, that’s fine. If you want to propose a different label that distinguishes your approach from the fancy-expensive-experiments approach, I’m open to using another label.
I think in the last decades we had an explosion in the amount of data in biology. I think that organising that data into theories lags behind. I think it takes less effort to advance biology by organising that data into theories and doing a bit of phenomenology than by pushing further for knowledge produced by expensive equipment.
I claim that there is enough published information that’s not well organised into theories that you can make major advances in biology without needing to buy any equipment.
This can be true but also suboptimal. I’m sure that given enough cleverness and effort, we could extract a lot of genetic causes out of existing SNP databases—but why bother when we can wait a decade and sequence everyone for $100 a head? People aren’t free, and equipment both complements and substitutes for them.
As far as I understand, you don’t run experiments on participants to see whether Dual n-back works. You simply gathered Dual n-back data from other people and tried it yourself to know what it feels like. That’s not expensive. You don’t need to write large grants to get a lot of money to do that kind of work.
I assume you’re referring to my DNB meta-analysis? Yes, it’s not gathering primary data—I did think about doing that early on, which is why I carefully compiled all anecdotes mentioning IQ tests in my FAQ, but I realized that between the sheer heterogeneity, lack of a control group, massive selection effects, etc, the data was completely worthless.
But I can only gather the studies into a meta-analysis because people are running these studies. And I need a lot of data to draw any kind of conclusion. If n-back studies had stopped in 2010, I’d be out of luck, because with the studies up to 2010, I can exclude zero as the net effect, but I can’t make a rigorous statement about the effect of passive vs active control groups. (In fact, it’s only with the last 3 or 4 studies that the confidence intervals for the two groups stopped overlapping.) And these studies are expensive. I’m corresponding with one study author to correct the payment covariate, and it seems that on average participants were paid $600, and there were 40, so they blew $24,000 just on paying the subjects, never mind paying for the MRI machine, the grad students, the professor time, publication, etc. At this point, the total cost of the research must be well into the millions of dollars.
It’s true that it’s a little irritating that no one has published a meta-analysis on DNB, and that it’s not that difficult for a random person like myself to do it, since it requires little in the way of resources—but that doesn’t change the fact that I still needed these dozens of professionals to run all these very expensive experiments to provide grist for the mill.
To go way up to Einstein, he was drawing on a lot of expensive data like that which showed the Mercury anomaly, and then was verified by very expensive data (I shudder to think how much those expeditions must have cost in constant dollars). Without that data, he would just be another… string theorist. Not Einstein.
You do need some money to pay your bills. Einstein made that money by being a patent clerk. I don’t know how you make your money to live. Of course you don’t have to tell, and I respect it if that’s private information. For all I know you could be making money by being a patent clerk like Einstein.
Not by being a patent clerk, no. :)
A scientist who can’t work on his grant projects because of the government shutdown could use his free time to do the kind of work that you are doing.
To a very limited extent. There have to be enough studies to productively review, and there have to be no existing reviews you’re duplicating. To give another example: suppose I had been furloughed and wanted to work on a creatine meta-analysis. I get as far as I have gotten now—not that hard, maybe 10 hours of work—and I realize there are only 3 studies. Now what? Well, what I am doing is waiting a few months for 2 scientists to reply, and then I’ll wait another 5 or 10 years for governments to fund more psychology studies which happen to use creatine. But in no way can I possibly “finish” this even given months of government-shutdown-time.
I think in the last decades we had an explosion in the amount of data in biology. I think that organising that data into theories lags behind. I think it takes less effort to advance biology by organising that data into theories and doing a bit of phenomenology than by pushing further for knowledge produced by expensive equipment.
I don’t think that’s a stupid or obviously incorrect claim, but I don’t think it’s right. Equipment is advancing fast (if not always as fast as my first example of genotyping/sequencing), so it’d be surprising to me if you could do more work by ignoring potential new data and reprocessing old work, and more generally, even though stuff like meta-analysis is accessible to anyone for free (case in point: myself), we don’t see anyone producing any impressive discoveries. Case in point: more than a few researchers already believed n-back might be an artifact of the control groups before I started my meta-analysis—my results are a welcome confirmation, not a novel discovery; or to use your vitamin D example, yes, it’s cool that we found an effect of vitamin D on sleep (I certainly believe it), but the counterfactual of “QS does not exist” is not “vitamin D’s effect on sleep goes unknown” but “Gominak discovers the effect on her patients and publishes a review paper in 2012 arguing that vitamin D affects sleep”.
Type I errors are a feature of frequentist statistics. If you don’t use null hypotheses, you don’t make Type I errors. Bayesians don’t make Type I errors because they don’t have null hypotheses.
LOL. That’s, um, not exactly true.
Let’s take a new drug trial. You want to find out whether the drug has certain (specific, detectable) effects. Could you please explain how a Bayesian approach to the results of the trial would make it impossible to make a Type I error, that is, a false positive: decide that the drug does have effects while in fact it does not?
The output of a Bayesian analysis isn’t a truth value but a probability.
So is the output of a frequentist analysis.
However, real life is full of step functions which translate probabilities into binary decisions. The FDA needs to either approve the drug or not approve the drug.
Saying “I will never make a Type I error because I will never make a hard decision” doesn’t look good as evidence for the superiority of Bayes...
However, real life is full of step functions which translate probabilities into binary decisions.
Decisions are not the result of statistical tests but of utility functions.
A Bayesian takes the probability that he gets from his statistics and puts it into his utility function.
Type I errors are a feature of statistical tests, not of decision functions.
There’s a difference between asking yourself “Does this drug work better than other drugs?” and then deciding, based on the answer to that question, whether or not to approve the drug, and asking “What’s the probability that the drug works?” and making a decision based on that.
In practice, the FDA does ask its statistical tools “Does this drug work better than other drugs?” and then decides on that basis whether to approve the drug.
Why is that a problem? Take an issue like developing new antibiotics. Antibiotics are an area where there is a consensus that not enough money goes into developing new ones. The special need comes from the fact that bacteria can develop resistance to drugs.
A Bayesian FDA could just change the utility factor that goes into calculating the value of approving a new antibiotic, skipping the whole “Does this drug work?” question and instead focusing on the question “What’s the expected utility of approving the drug?”
The Bayesian FDA could get a probability that the drug works from the trial, and another number quantifying the seriousness of side effects. Those numbers can go together into a utility function for making a decision.
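The kind of calculation this implies can be sketched in a few lines (all the utility weights here are made up for illustration; nothing below is actual FDA methodology):

```python
def approval_utility(p_works, side_effect_severity, unmet_need=1.0):
    """Expected utility of approving a drug: benefit weighted by the
    probability it works, minus a penalty for side effects, scaled by
    how badly new drugs of this class are needed."""
    benefit, harm = 10.0, 4.0  # made-up utility units
    return unmet_need * benefit * p_works - harm * side_effect_severity

# The same trial evidence, scored as an ordinary drug vs. as an antibiotic
# (where the unmet-need factor is raised):
ordinary = approval_utility(p_works=0.7, side_effect_severity=0.5)
antibiotic = approval_utility(p_works=0.7, side_effect_severity=0.5, unmet_need=1.5)
print(ordinary, antibiotic)
```

Raising the unmet-need factor is all it takes to make the same evidence clear the bar for an antibiotic but not for an ordinary drug.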
Developing a good framework which the FDA could use to make such decisions would be theoretical work: the kind of work into which not enough intellectual effort goes, because scientists would rather play with fancy equipment.
If the FDA published utility values for the drugs that it approves, that would also help insurance companies.
An insurance company could sell you, for a certain price, insurance that pays for drugs which exceed a certain utility value.
You could simply factor the file drawer effect into such a model. If a company preregisters a trial and doesn’t publish it, the utility score of the drug goes down.
Preregistered trials count more towards the utility of the drug than trials that aren’t preregistered, so you create an incentive for registration.
You can do all sorts of things when you think about designing a utility function that goes beyond “Does this drug work better than existing ones?” (Yes/No) and “Is it safe?” (Yes/No).
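One toy way to score that preregistration incentive (the weights are invented for illustration, not a worked-out proposal):

```python
def evidence_weight(trials):
    """Each trial is a (preregistered, published) pair. Preregistered and
    published counts fully; published-only counts less; a preregistered
    trial left unpublished actively lowers the drug's score."""
    score = 0.0
    for prereg, published in trials:
        if prereg and published:
            score += 1.0
        elif published:           # published, but not preregistered
            score += 0.5
        elif prereg:              # preregistered, then left in the file drawer
            score -= 1.0
    return score

print(evidence_weight([(True, True), (False, True), (True, False)]))
```

Under such a scheme a company that hides a preregistered trial loses more score than it could gain by publishing it, which is the incentive being described.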
You can even ask whether the FDA should do approval at all. You could just allow all drugs but say that insurance only pays for drugs with a certain demonstrated utility score. Just pay Big Pharma more for drugs that have a high demonstrated utility.
There you have a model of an FDA that wouldn’t make any Type I errors.
In an afternoon, I sketched the basics of a solution to a theoretical problem that JoshuaZ considered unsolvable.
*I would add that if you want to end the war on drugs, this proposal matters a lot. (Details left as an exercise for the reader.)
Consider Alice and Bob. Alice is a mainstream statistician, aka a frequentist. Bob is a Bayesian.
We take our clinical trial results and give them to both Alice and Bob.
Alice says: the p-value for the drug effectiveness is X. This means that there is X% probability that the results we see arose entirely by chance while the drug has no effect at all.
Bob says: my posterior probability for the drug being useless is Y. This means Bob believes that there is (1-Y)% probability that the drug is effective and Y% probability that it has no effect.
Given that both are competent and Bob doesn’t have strong priors, X should be about the same as Y.
Do note that both Alice and Bob provided a probability as the outcome.
Now after that statistical analysis someone, let’s call him Trent, needs to make a binary decision. Trent says “I have a threshold of certainty/confidence Z. If the probability of the drug working is greater than Z, I will make a positive decision. If it’s lower, I will make a negative decision”.
Alice comes forward and says: here is my probability of the drug working, it is (1-X).
Bob comes forward and says: here is my probability of the drug working, it is (1-Y).
So, you’re saying that if Trent relies on Alice’s number (which was produced in the frequentist way) he is in danger of committing a Type I error. But if Trent relies on Bob’s number (which was produced in the Bayesian way) he cannot possibly commit a Type I error. Yes?
And then you start to fight the hypothetical and say that Trent really should not make a binary decision. He should just publish the probability and let everyone make their own decisions. Maybe—that works in some cases and doesn’t work in others. But Trent can publish Alice’s number, and he can publish Bob’s number—they are pretty much the same and both can be adequate inputs into some utility function. So where exactly is the Bayesian advantage?
Given that both are competent and Bob doesn’t have strong priors, X should be about the same as Y.
Why? X is P(results >= what we saw | effect = 0), whereas Y is P(effect < costs | results = what we saw). I can see no obvious reason those would be similar, not even if we assume costs = 0; p(results = what we saw | effect = 0) = p(effect = 0 | results = what we saw) iff p_{prior}(result = what we saw) = p_{prior}(effect = 0) (where the small p’s are probability densities, not probability masses), but that’s another story.
You have two samples: one was given the drug, the other was given the placebo. You have some metric for the effect you’re looking for, a value of interest.
The given-drug sample has a certain distribution of the values of your metric which you model as a random variable. The given-placebo sample also has a distribution of these values (different, of course) which you also model as a random variable.
The statistical questions are whether these two random variables are different, in which way, and how confident you are of the answers.
For simple questions like that (and absent strong priors) the frequentists and the Bayesians will come to very similar conclusions and very similar probabilities.
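As a quick numerical check of this (toy trial numbers, flat priors, normal-approximation test; a sketch, not a proof), the one-sided p-value and the posterior probability of the drug being no better do land close together:

```python
import random
from math import erf, sqrt

# Toy trial, made-up numbers: 30/50 respond on the drug vs 20/50 on placebo.
n, drug_s, plac_s = 50, 30, 20
p1, p2 = drug_s / n, plac_s / n
p_pool = (drug_s + plac_s) / (2 * n)
se = sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p1 - p2) / se
one_sided_p = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # frequentist, one-sided

# Bayesian posterior probability that the drug is NOT better, flat Beta(1,1) priors.
random.seed(0)
wrong_sign = sum(
    random.betavariate(1 + drug_s, 1 + n - drug_s)
    <= random.betavariate(1 + plac_s, 1 + n - plac_s)
    for _ in range(100_000)
) / 100_000

print(f"one-sided p-value:            {one_sided_p:.3f}")
print(f"posterior P(drug not better): {wrong_sign:.3f}")
```

With flat priors and a simple two-sample question, the two numbers come out within a fraction of a percentage point of each other, even though they answer formally different questions.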
For simple questions like that (and absent strong priors) the frequentists and the Bayesians will come to very similar conclusions and very similar probabilities.
Yes, but the p-value and the posterior probability aren’t even the same question, are they?
Alice says: the p-value for the drug effectiveness is X. This means that there is X% probability that the results we see arose entirely by chance while the drug has no effect at all.
No. You don’t understand null hypothesis testing. It doesn’t measure whether the results arose entirely by chance. It measures whether a specific null hypothesis can be rejected.
I hate to disappoint you, but I do understand null hypothesis testing. In this particular example the specific null hypothesis is that the drug has no effect and therefore all observable results arose entirely by chance.
You are really determined to fight the hypothetical, aren’t you? :-) Let me quote myself with the relevant part emphasized: “You want to find out whether the drug has certain (specific, detectable) effects.”
I could simply run n=1 experiments
And how would they help you? There is the little issue of noise. You cannot detect any effects below the noise floor and for n=1 that floor is going to be pretty high.
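The noise-floor point can be made quantitative with a standard power calculation (a rough normal-approximation sketch, with a made-up medium effect size of 0.5 standard deviations):

```python
from math import erf, sqrt

def power_two_sample(effect_sd_units, n, alpha_z=1.96):
    """Normal-approximation power of a two-sample test to detect a mean
    difference of `effect_sd_units` standard deviations, n subjects per arm."""
    z_effect = effect_sd_units * sqrt(n / 2)
    # P(Z > alpha_z - z_effect) under the alternative hypothesis
    return 1 - 0.5 * (1 + erf((alpha_z - z_effect) / sqrt(2)))

# A medium (0.5 sd) effect, chosen for illustration:
print(f"n=1 per arm:  power ~ {power_two_sample(0.5, 1):.2f}")
print(f"n=64 per arm: power ~ {power_two_sample(0.5, 64):.2f}")
```

At n=1 the chance of detecting even a medium effect is barely above the false-positive rate itself; it takes dozens of subjects per arm before the effect reliably clears the noise.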
“You want to find out whether the drug has certain (specific, detectable) effects.”
A p-value isn’t the probability that a drug has certain (specific, detectable) effects. 1-p isn’t either.
You are really determined to fight the hypothetical, aren’t you?
No, I’m accepting it. The probability of a drug having zero effects is 0. If your statistics give you a probability other than 0 that a drug has zero effects, your statistics are wrong.
I think your answer suggests the idea that an experiment might provide actionable information.
And how would they help you? There is the little issue of noise. You cannot detect any effects below the noise floor and for n=1 that floor is going to be pretty high.
But you still claim that every experiment provides an actionable probability when interpreted by a frequentist.
If you give a Bayesian your priors and then get a posterior probability from the Bayesian, that probability is in every case actionable.
Again: the probability that a drug has no specific, detectable effects is NOT zero.
I don’t care about detectability when I take a drug. I care about whether it helps me.
I want a number that tells me the probability of the drug helping me. I don’t want the statistician to answer a different question.
Detectability depends on the power of a trial.
If a frequentist gives you some number after he analysed an experiment, you can’t just fit that number into a decision function.
You have to think about issues such as whether the experiment had enough power to pick up an effect.
If a Bayesian gives you a probability, you don’t have to think about such issues, because the Bayesian has already integrated your prior knowledge. The probability that the Bayesian gives you can be used directly.
Drug trials are neither designed to, nor capable of answering questions like this.
Whether a drug will help you is a different probability that comes out of a complicated evaluation for which the drug trial results serve as just one of the inputs.
If a Bayesian gives you a probability, you don’t have to think about such issues
Whether a drug will help you is a different probability that comes out of a complicated evaluation for which the drug trial results serve as just one of the inputs.
That evaluation is by its nature Bayesian. Bayes’ rule is about combining different probabilities.
At the moment there is no systematic way of going about it. That’s where theory development is needed. I would like someone like the FDA to write down all their priors and then provide a computer analysis tool that actually calculates that probability.
I am sorry, you’re speaking nonsense.
If the priors are correct, then a correct Bayesian analysis provides me exactly the probability I should believe after I read the study.
The earliest citation in the Rosenthal paper that coined the term ‘file drawer’ is to a 1959 paper by one Theodore Sterling; I jailbroke this to “Publication Decisions and Their Possible Effects on Inferences Drawn from tests of Significance—or Vice Versa”.
I think in the last decades we had an explosion in the amount of data in biology. I think that organising that data into theories lags behind. I think it takes less effort to advance biology by organising that data into theories and doing a bit of phenomenology than by pushing further for knowledge produced by expensive equipment.
If I phrase it that way, would you agree?
Let’s take a new drug trial. You want to find out whether the drug has certain (specific, detectable) effects. Could you please explain how a Bayesian approach to the results of the trial would make it impossible to make a Type I error, that is, a false positive: decide that the drug does have effects while in fact it does not?
I don’t. A real Bayesian doesn’t. The Bayesian wants to know the probability with which the drug will improve the well-being of a patient.
It’s a huge theoretical advance to move from Aristotelianism to Bayesianism. Maybe reading http://slatestarcodex.com/2013/08/06/on-first-looking-into-chapmans-pop-bayesianism/ might help you.
I doubt it. I already did and clearly it didn’t help :-P
There’s a difference between asking yourself “Does this drug work better than other drugs?” and then deciding, based on the answer to that question, whether or not to approve the drug, versus asking “What’s the probability that the drug works?” and making a decision based on that.
In practice the FDA does ask their statistical tools “Does this drug work better than other drugs?” and then decides on that basis whether to approve the drug.
Why is that a problem? Take an issue like developing new antibiotics. Antibiotics are an area where there’s a consensus that not enough money goes into developing new ones. The special need comes from the fact that bacteria can develop resistance to drugs.
A Bayesian FDA could just change the utility factor that goes into calculating the value of approving a new antibiotic, skipping the whole “Does this drug work?” question and instead focusing on the question “What’s the expected utility of approving the drug?”
The Bayesian FDA could get a probability value that the drug works from the trial, and another number quantifying the seriousness of side effects. Those numbers can go together into a utility function for making a decision.
Developing a good framework which the FDA could use to make such decisions would be theoretical work. The kind of work into which not enough intellectual effort goes, because scientists would rather play with fancy equipment.
If the FDA published utility values for the drugs it approves, that would also help insurance companies. An insurance company could sell you, for a certain price, a policy that pays for any drug exceeding a certain utility value.
You could simply factor the file drawer effect into such a model. If a company preregisters a trial and doesn’t publish it, the utility score of the drug goes down. Preregistered trials count more towards the utility of the drug than trials that aren’t preregistered, so you create an incentive for registration. You can do all sorts of things when you design a utility function that goes beyond “Does this drug work better than existing ones?” (Yes/No) and “Is it safe?” (Yes/No).
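To make the idea concrete, here’s a toy sketch of such a utility function. Every weight, penalty, and number below is my own invention for illustration, not anything a real regulator uses:

```python
# Toy utility score for a drug (all weights and penalties are invented):
# each trial contributes an estimated probability of benefit, preregistered
# trials get full weight, unregistered ones are discounted, and
# preregistered-but-unpublished trials actively lower the score.

def drug_utility(trials, missing_prereg=0, side_effect_cost=0.0):
    """trials: list of (prob_benefit, preregistered) tuples.
    missing_prereg: count of preregistered but unpublished trials.
    Returns a single score a decision function could threshold on."""
    weighted, total_weight = 0.0, 0.0
    for prob_benefit, preregistered in trials:
        w = 1.0 if preregistered else 0.5  # discount non-preregistered evidence
        weighted += w * prob_benefit
        total_weight += w
    score = weighted / total_weight if total_weight else 0.0
    score -= 0.1 * missing_prereg          # file drawer penalty
    score -= side_effect_cost              # seriousness of side effects
    return score

# Two preregistered trials, one unregistered trial, one trial that vanished:
print(drug_utility([(0.8, True), (0.7, True), (0.9, False)],
                   missing_prereg=1, side_effect_cost=0.05))
```

The point of the sketch is only the structure: a continuous score instead of a Yes/No gate, with the registration incentive built in because hiding a preregistered trial costs the company more utility than the trial’s result likely would have.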
You can even ask whether the FDA should do approval at all. You could just allow all drugs but say that insurance only pays for drugs with a certain demonstrated utility score. Just pay Big Pharma more for drugs that have high demonstrated utility.
There you have a model of an FDA that wouldn’t make any Type I errors. In an afternoon I sketched the basis of a solution to a theoretical problem that JoshuaZ considered unsolvable.
*I would add that if you want to end the war on drugs, this proposal matters a lot. (Details left as an exercise for the reader)
Consider Alice and Bob. Alice is a mainstream statistician, aka a frequentist. Bob is a Bayesian.
We take our clinical trial results and give them to both Alice and Bob.
Alice says: the p-value for the drug effectiveness is X. This means that there is X% probability that the results we see arose entirely by chance while the drug has no effect at all.
Bob says: my posterior probability for the drug being useless is Y. This means Bob believes that there is (1-Y)% probability that the drug is effective and Y% probability that it has no effect.
Given that both are competent and Bob doesn’t have strong priors X should be about the same as Y.
Do note that both Alice and Bob provided a probability as the outcome.
Now after that statistical analysis someone, let’s call him Trent, needs to make a binary decision. Trent says “I have a threshold of certainty/confidence Z. If the probability of the drug working is greater than Z, I will make a positive decision. If it’s lower, I will make a negative decision”.
Alice comes forward and says: here is my probability of the drug working, it is (1-X).
Bob comes forward and says: here is my probability of the drug working, it is (1-Y).
So, you’re saying that if Trent relies on Alice’s number (which was produced in the frequentist way) he is in danger of committing a Type I error. But if Trent relies on Bob’s number (which was produced in the Bayesian way) he cannot possibly commit a Type I error. Yes?
And then you start to fight the hypothetical and say that Trent really should not make a binary decision. He should just publish the probability and let everyone make their own decisions. Maybe—that works in some cases and doesn’t work in others. But Trent can publish Alice’s number, and he can publish Bob’s number—they are pretty much the same and both can be adequate inputs into some utility function. So where exactly is the Bayesian advantage?
Why? X is P(results >= what we saw | effect = 0), whereas Y is P(effect < costs | results = what we saw). I can see no obvious reason those would be similar, not even if we assume costs = 0; p(results = what we saw | effect = 0) = p(effect = 0 | results = what we saw) iff p_{prior}(result = what we saw) = p_{prior}(effect = 0) (where the small p’s are probability densities, not probability masses), but that’s another story.
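The gap between those two quantities can be made concrete with a toy two-point-hypothesis calculation (all numbers invented for illustration). With equal priors on the two hypotheses, the one-sided p-value and the posterior probability of the null come out very different:

```python
from math import erf, exp, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Toy trial: n = 25 patients, outcome sd = 1, so the standard error is 0.2.
n, sd = 25, 1.0
se = sd / sqrt(n)
xbar = 0.4                                 # observed mean effect

# Alice's X: one-sided p-value under H0 (effect = 0)
p_value = 1 - norm_cdf(xbar / se)

# Bob's Y: posterior P(H0 | data) with two point hypotheses,
# H0: effect = 0 vs H1: effect = 0.5, equal priors (assumed numbers).
f0 = exp(-0.5 * (xbar / se) ** 2)          # likelihood under H0 (up to a constant)
f1 = exp(-0.5 * ((xbar - 0.5) / se) ** 2)  # likelihood under H1
posterior_h0 = f0 / (f0 + f1)

print(round(p_value, 3), round(posterior_h0, 3))   # ≈ 0.023 vs ≈ 0.133
```

So in this (contrived) setup the same data give X ≈ 0.023 but Y ≈ 0.133: the p-value conditions on the null while the posterior conditions on the data, and they need not be close.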
You have two samples: one was given the drug, the other was given the placebo. You have some metric for the effect you’re looking for, a value of interest.
The given-drug sample has a certain distribution of the values of your metric which you model as a random variable. The given-placebo sample also has a distribution of these values (different, of course) which you also model as a random variable.
The statistical questions are whether these two random variables are different, in which way, and how confident you are of the answers.
For simple questions like that (and absent strong priors) the frequentists and the Bayesians will come to very similar conclusions and very similar probabilities.
Yes, but the p-value and the posterior probability aren’t even the same question, are they?
No, they are not.
However for many simple cases—e.g. where we are considering only two possible hypotheses—they are sufficiently similar.
No. You don’t understand null hypothesis testing. It doesn’t measure whether the results arose entirely by chance. It measures whether a specific null hypothesis can be rejected.
I hate to disappoint you, but I do understand null hypothesis testing. In this particular example the specific null hypothesis is that the drug has no effect and therefore all observable results arose entirely by chance.
Almost no drug has no effect. Most drugs change the patient and produce either a slight advantage or a slight disadvantage.
If what you’re saying were correct, I could simply run n=1 experiments.
You are really determined to fight the hypothetical, aren’t you? :-) Let me quote myself with the relevant part emphasized: “You want to find out whether the drug has certain (specific, detectable) effects.”
And how would they help you? There is the little issue of noise. You cannot detect any effects below the noise floor and for n=1 that floor is going to be pretty high.
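That noise floor can be put in numbers with the standard one-sided z-test power formula (effect size and sd below are made-up illustration values): at n=1 the chance of detecting a modest effect is barely above the false-positive rate, while at n=100 it is high.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(effect, sd, n, z_crit=1.645):
    """One-sided z-test power: probability of detecting `effect`
    with n subjects at the 5% significance level (z_crit = 1.645)."""
    return 1 - norm_cdf(z_crit - effect * sqrt(n) / sd)

# A modest effect of 0.3 sd against unit noise (invented numbers):
print(round(power(0.3, 1.0, 1), 3))     # n = 1: power is tiny
print(round(power(0.3, 1.0, 100), 3))   # n = 100: power is high
```

With one subject you would almost always fail to detect the effect even though it is real, which is exactly the “noise floor” point.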
A p-value isn’t the probability that a drug has certain (specific, detectable) effects. 1-p isn’t either.
No, I’m accepting it. The probability of a drug having zero effects is 0. If your statistics tell you that the probability of a drug having zero effects is something other than 0, your statistics are wrong.
I think your answer suggests the idea that an experiment might provide actionable information.
But you still claim that every experiment provides an actionable probability when interpreted by a frequentist.
If you give a Bayesian your priors and then get a posterior probability from him, that probability is in every case actionable.
Again: the probability that a drug has no specific, detectable effects is NOT zero.
Huh? What? I don’t even… Please quote me.
What do you call an “actionable” probability? What would be an example of a “non-actionable” probability?
I don’t care about detectability when I take a drug. I care about whether it helps me. I want a number that tells me the probability of the drug helping me. I don’t want the statistician to answer a different question.
Detectability depends on the power of a trial.
If a frequentist gives you some number after he has analysed an experiment, you can’t just feed that number into a decision function. You have to think about issues such as whether the experiment had enough power to pick up an effect.
If a Bayesian gives you a probability, you don’t have to think about such issues, because the Bayesian has already integrated your prior knowledge. The probability that the Bayesian gives you can be used directly.
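One way to see how a posterior “integrates” power: a toy Bayesian update after a trial fails to reach significance, with all numbers invented for illustration. A null result from an underpowered trial barely moves the posterior, while the same null result from a well-powered trial moves it a lot, so the reader of the posterior never has to reason about power separately.

```python
# P(drug works | trial did NOT reach significance), by Bayes' rule.
# prior: P(drug works) before the trial; power: P(significant | works);
# alpha: P(significant | no effect). All values here are assumed.

def posterior_after_null_result(prior, power, alpha=0.05):
    p_ns_given_works = 1 - power    # the trial misses a real effect
    p_ns_given_null = 1 - alpha     # correct non-rejection under the null
    num = p_ns_given_works * prior
    return num / (num + p_ns_given_null * (1 - prior))

prior = 0.5
low = posterior_after_null_result(prior, power=0.10)   # underpowered trial
high = posterior_after_null_result(prior, power=0.95)  # well-powered trial
print(round(low, 3), round(high, 3))
```

With these numbers the underpowered null result leaves the posterior near 0.49 (almost no update), while the well-powered one drops it to about 0.05: the power of the trial is baked into the single number that comes out.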
Drug trials are neither designed to, nor capable of answering questions like this.
Whether a drug will help you is a different probability that comes out of a complicated evaluation for which the drug trial results serve as just one of the inputs.
I am sorry, you’re speaking nonsense.
That evaluation is Bayesian in nature. Bayes’ rule is about combining different probabilities.
At the moment there’s no systematic way of going about it. That’s where theory development is needed. I would like someone like the FDA to write down all their priors and then provide a computer analysis tool that actually calculates that probability.
If the priors are correct, then a correct Bayesian analysis gives me exactly the probability I should assign after reading the study.