Given that we are on LessWrong, you missed a core one: academic social science uses a bunch of frequentist statistics that perform well if your goal is to prove that your thesis has “statistical significance”, but that aren’t useful for learning what’s true. Machine learning algorithms don’t give you p-values.
Do they even give you Bayesian forms of summary values like Bayes factors?
(This is actually a relevant concern for me now: my magnesium self-experiment has finished, and the results are really surprising. To check my linear model, I tried looking at what a random forest might say; it mostly agrees with the analysis… except it also places a lot of importance on a covariate which, in the linear model, is better discarded entirely, since the fit improves without it. What does this mean? I dunno. There’s no statistic like a p-value I can use to interpret this.)
You can turn any kind of analysis (which returns a scalar) into a p-value by generating a zillion fake data sets assuming the null hypothesis, analysing them all, and checking what fraction of the fake data sets yield a statistic exceeding the one for the real data set.
This doesn’t sound true to me. How do you know the underlying distribution of the null when it’s just something like “these variables are independent”?
If you’re working with composite hypotheses, replace “your statistic” with “the supremum of your statistic over the relevant set of hypotheses”.
If there are infinitely many hypotheses in the set then the algorithm in the grandparent doesn’t terminate :).
What I was saying was sort of vague, so I’m going to formalize here.
Data is coming from some random process X(θ,ω), where θ parameterizes the process and ω captures all the randomness. Let’s suppose that for any particular θ, living in the set Θ of parameters where the model is well-defined, it’s easy to sample from X(θ,ω). We don’t put any particular structure (in particular, cardinality assumptions) on Θ. Since we’re being frequentists here, nature’s parameter θ′ is fixed and unknown. We only get to work with the realization of the random process that actually happens, X’ = X(θ′,ω′).
We have some sort of analysis t(⋅) that returns a scalar; applying it to the random data gives us the random variables t(X(θ,ω)), which is still parameterized by θ and still easy to sample from. We pick some null hypothesis Θ0 ⊂ Θ, usually for scientific or convenience reasons.
We want some measure of how weird/surprising the value t(X’) is if θ′ were actually in Θ0. One way to do this, if we have a simple null hypothesis Θ0 = { θ0 }, is to calculate the p-value p(X’) = P(t(X(θ0,ω)) ≥ t(X’)). This can clearly be approximated using samples from t(X(θ0,ω)).
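To make the simple-null case concrete, here is a minimal sketch of that Monte Carlo approximation; the statistic, the null distribution, and the “observed” data are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def t(x):
    # Example statistic: absolute value of the sample mean.
    return abs(x.mean())

def sample_null(n):
    # Simple null theta_0 (invented for the example): data are i.i.d. standard normal.
    return rng.standard_normal(n)

x_obs = rng.standard_normal(50) + 0.3   # stand-in for the real data X'
t_obs = t(x_obs)

# Approximate p(X') = P(t(X(theta_0, omega)) >= t(X')) by simulating under theta_0.
n_sims = 100_000
t_null = np.array([t(sample_null(len(x_obs))) for _ in range(n_sims)])
p_value = (t_null >= t_obs).mean()
print(p_value)
```

With enough simulated data sets this fraction converges to p(X’) for the chosen θ0.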
For composite null hypotheses, I guessed that using p(X’) = sup{θ0 ∈ Θ0} P(t(X(θ0,ω)) ≥ t(X’)) would work. Paraphrasing jsteinhardt, if Θ0 = { θ01, …, θ0n }, you could approximate p(X’) using samples from t(X(θ01,ω)), …, t(X(θ0n,ω)), but it’s not clear what to do when Θ0 has infinite cardinality. I see two ways forward.

One is approximating p(X’) by doing the above computation over a finite subset of points in Θ0, chosen by gridding or at random. This should give an approximate lower bound on the p-value, since it might miss a θ under which the observed data look unexceptional. If the approximate p-value leads you to fail to reject the null, you can believe it; if it leads you to reject the null, you might be less sure and might want to continue trying more points in Θ0. Maybe this is what jsteinhardt means by saying it “doesn’t terminate”?

The other way forward might be to use features of t and Θ0, which we do have some control over, to simplify the expression sup{θ0 ∈ Θ0} P(t(X(θ0,ω)) ≥ c). Say, if t(X(θ,ω)) is convex in θ for any ω and Θ0 is a convex bounded polytope living in some Euclidean space, then the supremum only depends on how P(t(X(θ0,ω)) ≥ c) behaves at a finite number of points.
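Here is a similarly hedged sketch of the gridding approach above; the model, the grid over Θ0, and the data are all made up for illustration, and taking the maximum over a finite grid is exactly why this only lower-bounds the supremum:

```python
import numpy as np

rng = np.random.default_rng(1)

def t(x):
    # Example statistic: the sample mean.
    return x.mean()

def sample_from(theta, n):
    # X(theta, omega), invented for the example: Normal(theta, 1) data.
    return rng.normal(loc=theta, scale=1.0, size=n)

x_obs = rng.normal(loc=0.4, scale=1.0, size=40)   # stand-in for the real data X'
t_obs = t(x_obs)

# Composite null Theta_0 = [-0.2, 0.2], approximated by a finite grid of points.
theta_grid = np.linspace(-0.2, 0.2, 21)
n_sims = 20_000

def mc_p_value(theta):
    sims = np.array([t(sample_from(theta, len(x_obs))) for _ in range(n_sims)])
    return (sims >= t_obs).mean()

# Approximate lower bound on sup over theta_0 in Theta_0 of P(t(X(theta_0, omega)) >= t(X')).
p_value = max(mc_p_value(theta) for theta in theta_grid)
print(p_value)
```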
So yeah, things are far more complicated than I claimed, as I realize now working through it. But you can do sensible things even with a composite null.
Yup I agree with all of that. Nice explanation!
I don’t have knowledge of random forests in particular, but I did learn a little bit about machine learning in bioinformatics classes.
As far as I understand, you can train your machine learning algorithm on one set of data and then see how well it predicts the values of a different set of data. That gives you values for the sensitivity and specificity of your model, and you can build a receiver operating characteristic (ROC) plot with them. You can also do things like checking whether you get a different model if you build it on a different subset of your data; that can tell you whether your model is robust.
The idea of p-values is to decide whether or not your model is true. In general, that’s not what machine learning folks are concerned with: they know that their model is a model and not reality, and they care about the receiver operating characteristic.
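For what it’s worth, here is a minimal sketch of that train/test-and-ROC workflow using scikit-learn; the synthetic data set and the particular classifier are arbitrary choices for the example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, invented for the example.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Train on one set of data, evaluate predictions on a held-out set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# ROC curve: true-positive rate (sensitivity) vs. false-positive rate (1 - specificity).
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```

Refitting on different subsets of the data (e.g. via cross-validation) and comparing the resulting models is one way to check the robustness mentioned above.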
You don’t know what you are talking about.
The grandchild comment suggests that he does, at least to the level of a typical user (though not a researcher or developer) of these methods.