I’m not sure about this observation vs experiment design dichotomy you’re thinking of. I think of power analysis as something which can be done both before an experiment to design it and understand what the data could tell one, and post hoc, to understand why you did or did not get a result and to estimate things for designing the next experiment.
Well, I think of statistical power as the ability to distinguish signal from noise. If you expect signal of a particular strength you need to find ways to reduce the noise floor to below that strength (typically through increasing sample size).
However, my standard way of thinking about this is: we have data, we build a model, we evaluate how good the model output is. Building a model, say, via some sort of maximum likelihood, gives you “the” fitted model with specific chances to commit a Type I or a Type II error. But can you trade off chances of Type I errors against chances of Type II errors other than through crudely adding bias to the model output?
But can you trade off chances of Type I errors against chances of Type II errors other than through crudely adding bias to the model output?
Model-building seems like a separate topic. Power analysis is for particular approaches, where I certainly can trade off Type I against Type II. Here’s a simple example for a two-group t-test, where I accept a higher Type I error rate and immediately see my Type II go down (power go up):
R> power.t.test(n=40, delta=0.5, sig.level=0.05)

     Two-sample t test power calculation

              n = 40
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.5981
    alternative = two.sided

NOTE: n is number in *each* group

R> power.t.test(n=40, delta=0.5, sig.level=0.10)

     Two-sample t test power calculation

              n = 40
          delta = 0.5
             sd = 1
      sig.level = 0.1
          power = 0.7163
    alternative = two.sided

NOTE: n is number in *each* group
In exchange for accepting a 10% Type I error rate rather than 5%, I see my Type II error rate fall from 1-0.60 = 40% to 1-0.72 = 28%. Ta-da, I have traded off the errors, and as far as I know the t-test remains exactly as unbiased as it ever was.
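The same calculation can be sketched outside R as well. Here is a rough Python equivalent of the `power.t.test` calls above, using SciPy's noncentral t distribution; the function name `t_test_power` is just an illustrative choice, not a library API:

```python
# Sketch of the two-sample t-test power calculation from the R session
# above, to show the alpha/power trade-off numerically.
from scipy import stats

def t_test_power(n, delta, sd=1.0, alpha=0.05):
    """Power of a two-sided two-sample t-test, n per group."""
    df = 2 * n - 2
    ncp = delta / (sd * (2.0 / n) ** 0.5)    # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    # P(reject) under the alternative: tail mass of the noncentral t
    return (1 - stats.nct.cdf(t_crit, df, ncp)
            + stats.nct.cdf(-t_crit, df, ncp))

print(t_test_power(40, 0.5, alpha=0.05))  # ~0.598, matching power.t.test
print(t_test_power(40, 0.5, alpha=0.10))  # ~0.716: looser alpha, higher power
```

Relaxing `alpha` moves the critical value inward, so more of the alternative's distribution falls in the rejection region: Type I up, Type II down.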
I am not explaining myself well. Let me try again.
To even talk about Type I / II errors you need two things: a hypothesis or a prediction (generally, the output of a model, possibly implicit) and reality (unobserved at prediction time). Let’s keep things very simple and deal with binary variables. Say we have an object foo and we want to know whether it belongs to class bar (or does not belong to it). We have a model, maybe simple and even trivial, which, when fed the object foo, outputs the probability of it belonging to class bar. Let’s say this probability is 92%.
Now, at this point we are still in probability land. Saying that “foo belongs to class bar with a probability of 92%” does not subject us to Type I / II errors. It’s only when we commit to a binary outcome and say “foo belongs to class bar, full stop” that they appear.
The point is that in probability land you can’t trade off Type I errors against Type II: you just have the probability (or a full distribution in the more general case). It’s the commitment to a certain outcome on the basis of an arbitrarily picked threshold that gives rise to them. If so, it is that threshold (e.g. the traditional 5%) that determines the trade-off between the errors. Changing the threshold changes the trade-off, but this doesn’t affect the model or its output; it’s all post-prediction interpretation.
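A toy illustration of this point (made-up probabilities and labels, not from any real model): the model's outputs stay fixed, and only the decision threshold moves, which is what trades the two error rates against each other.

```python
# Toy illustration (fabricated data): the model's probabilities are fixed;
# only the decision threshold changes, trading Type I against Type II.

# P(object belongs to class bar) from some model, plus the later-observed truth.
probs = [0.92, 0.80, 0.65, 0.55, 0.40, 0.30, 0.15, 0.05]
truth = [1,    1,    0,    1,    0,    1,    0,    0]

def error_rates(threshold):
    """Return (Type I rate, Type II rate) at a given decision threshold."""
    preds = [p >= threshold for p in probs]
    false_pos = sum(1 for p, t in zip(preds, truth) if p and not t)  # Type I
    false_neg = sum(1 for p, t in zip(preds, truth) if not p and t)  # Type II
    return false_pos / truth.count(0), false_neg / truth.count(1)

for thr in (0.5, 0.9):
    t1, t2 = error_rates(thr)
    print(f"threshold {thr}: Type I = {t1:.2f}, Type II = {t2:.2f}")
```

Raising the threshold from 0.5 to 0.9 drives the Type I rate down and the Type II rate up, while the probabilities themselves never change.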
So you’re trying to talk about overall probability distributions in a Bayesian framework? I haven’t ever done power analysis with that approach, so I don’t know what would be analogous to Type I and II errors and whether one can trade them off; in fact, the only paper I can recall discussing how one does it is Kruschke’s paper (starting on pg11) - maybe he will be helpful?
Not necessarily in the Bayesian framework, though it’s kinda natural there. You can think in terms of complete distributions within the frequentist framework perfectly well, too.
The issue that we started with was of statistical power, right? While it’s technically defined in terms of the usual significance (=rejecting the null hypothesis), you can think about it in broader terms. Essentially it’s the capability to detect a signal (of certain effect size) in the presence of noise (in certain amounts) with a given level of confidence.
Thanks for the paper; I’ve seen it before but didn’t have a handy link to it.
You can think in terms of complete distributions within the frequentist framework perfectly well, too.
Does anyone do that, though?
Essentially it’s the capability to detect a signal (of certain effect size) in the presence of noise (in certain amounts) with a given level of confidence.
Well, if you want to think of it like that, you could probably formulate all of this in information-theoretic terms and speak of needing a certain number of bits; then the sample size & effect size interact to say how many bits each n contains. So a binary variable contains a lot less than a continuous variable, a shift in a rare observation like 90/10 is going to be harder to detect than a shift in a 50/50 split, etc. That’s not stuff I know a lot about.
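One way to cash out the bits-per-observation intuition (a sketch, not the full detection problem being discussed) is Shannon entropy: a 50/50 binary variable carries a full bit per observation, while a 90/10 one carries much less.

```python
# Sketch of the bits-per-observation intuition: Shannon entropy of a
# binary variable peaks at a 50/50 split and shrinks toward the extremes.
from math import log2

def entropy_bits(p):
    """Entropy of a Bernoulli(p) variable, in bits per observation."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(entropy_bits(0.5))  # 1.0 bit
print(entropy_bits(0.9))  # ~0.47 bits
```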
Well, sure. The frequentist approach, aka mainstream statistics, deals with distributions all the time and the arguments about particular tests or predictions being optimal, or unbiased, or asymptotically true, etc. are all explicitly conditional on characteristics of underlying distributions.
Well, if you want to think of it like that, you could probably formulate all of this in information-theoretic terms and speak of needing a certain number of bits;
Yes, something like that. Take a look at Fisher information, e.g. “The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends.”
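For a concrete instance of that definition, here is a small sketch for a Bernoulli observation, computing the Fisher information as the expected squared score and checking it against the closed form I(θ) = 1/(θ(1−θ)); the function names are illustrative, not a library API:

```python
# Sketch: Fisher information of a single Bernoulli(theta) observation,
# computed as the expected squared score and via the closed form
# I(theta) = 1 / (theta * (1 - theta)).
def score(x, theta):
    """d/dtheta of log f(x; theta) for f = theta^x * (1-theta)^(1-x)."""
    return x / theta - (1 - x) / (1 - theta)

def fisher_info(theta):
    """Expected squared score, averaging over x in {0, 1} weighted by P(x)."""
    return ((1 - theta) * score(0, theta) ** 2
            + theta * score(1, theta) ** 2)

for theta in (0.1, 0.3, 0.5):
    print(theta, fisher_info(theta), 1 / (theta * (1 - theta)))
```

The information per observation sets the best achievable precision for estimating θ (the Cramér–Rao bound), which is one formal route from "amount of data" to "ability to detect an effect".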