Are you just stating that the upper limit is not 4? Interesting—what is it, and please estimate how rare the ability is. Is there anything else unusual about these people?
They seem to self-select into prominent academic positions (of course I deal with academic folks, not folks with “straight jobs”—likely there are such people everywhere). I am not sure how rare the ability is, because I think most people do not work up to their genetic limits—usually akrasia gets them first. Academics often have an easier akrasia problem because (a) their work is interesting/rewarding and (b) collaborators/deadlines help carry them across motivation lulls.
People in startups are another example. Motivated startup people, especially single-digit employees, often work long hours, intensely and productively.
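Good for you! :)
I actually was not talking about myself (self-evaluation is noisy), but people I know.
Don’t tell gwern …
Although something like “hours worked in a productive manner” should be well quantifiable, for most professions.
Self-experimentation is causal inference with a sample size of 1 and selection bias, i.e. silly and doomed.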
Self-experimentation with within-subject design can be internally valid (I make sure mine are well-powered, even, which is more than some psychologists can say), but this does nothing about external validity or selection bias.
Which ironically makes self-experimentation somewhat analogous to quantum suicide experiments: because of selection bias, observers of my self-experiments will rationally learn little even as I learn much more. Someone watching quantum suicides will expect to see lots of survivals, and someone watching self-experiments will expect to see lots of positive results, even if quantum suicide doesn’t work and self-experiments measure nothing but null effects.
(Except perhaps if the observer had some reason to believe they would have learned about my experiments regardless of the results… say, because they became interested in my writings for non-experiment reasons? I wonder.)
When you say “can be internally valid” what do you mean? What about interactions from repeated treatments? I mean, correlation can equal causation, too. But that’s a pretty weak standard to meet.
Also, how do you know the selection bias does not create non-causal explanations for observed dependence? For example, in case-control studies you select based on a child of the outcome:
T → Y → S, with unobserved U1 a parent of T and U2 a parent of Y (U1 and U2 possibly dependent, creating unobserved confounding).
If we select on S FIRST, and THEN try to randomize T (conditioning and do(.) do not commute), then we create a dependence between T and U2 due to “explaining away.” Randomizing on T cuts the arc from U1 to T (good—we get rid of some unobserved confounding), but does nothing about this new dependence between T and U2 introduced by the selection procedure.
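A minimal simulation sketch of the “explaining away” point (the linear model, coefficients, and selection cutoff are all invented for illustration; here the selection is applied to already-randomized data, a simpler variant of the same mechanism): even with T randomized, keeping only the records with S = 1, where S is a child of Y, induces a dependence between T and the unobserved U2 and biases the naive effect estimate.

```python
# Sketch: selection on S, a child of the outcome Y, biases the effect of a randomized T.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_effect = 1.0

T = rng.integers(0, 2, n)            # randomized treatment
U2 = rng.normal(size=n)              # unobserved cause of Y
Y = true_effect * T + 2.0 * U2 + rng.normal(size=n)
S = Y > 1.0                          # selection is a child of the outcome

print("corr(T, U2), all data:  %.3f" % np.corrcoef(T, U2)[0, 1])          # ~0
print("corr(T, U2), selected:  %.3f" % np.corrcoef(T[S], U2[S])[0, 1])    # negative
print("effect estimate, all:      %.3f" % (Y[T == 1].mean() - Y[T == 0].mean()))              # ~1.0
print("effect estimate, selected: %.3f" % (Y[S & (T == 1)].mean() - Y[S & (T == 0)].mean()))  # well below 1.0
```

http://en.wikipedia.org/wiki/Internal_validity
?
I don’t understand your hypothetical. Could you give a concrete example?
http://www.maths.bris.ac.uk/~maxvd/didelez_etal_StatSci_final.pdf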
20 pages of theorems on do-calculus are not helpful, and the examples they use, like retrospective pregnancies or case-control studies, do not seem to apply to self-experiments.
Maybe I should be clearer: can you give an example of a real self-experiment, preferably one which was done blind & randomized, which is plausibly affected by your selection bias? Because I still don’t understand what you are getting at.
With repeated measures design the problem is whether the “washout period” is sufficient.
For simple two-level experiments, reasonable block lengths plus counterbalancing from randomization deals with that.
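Actually gwern, I think you gave me an idea for a paper :).
[shortened a long reply]: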
I guess the fundamental question is, say you do your self-experiment, and find an effect. Say it’s nootropic pills or something. The question is, what can we conclude from this. I can conclude either nothing or very little indeed about nootropics and myself (since as you correctly point out, all data is on you, not on me).
What can you conclude about nootropics and yourself from this experiment? The worry is washout length, and lack of exchangeability between different “copies” of you spread across time (you mention this when talking about good/bad days when you were doing some sort of nicotine test).
What is the strongest effect you ever found in this way?
If you read about statistical power and experimental design for fun, reading about do-calculus will probably not be an awful idea or a waste of time (not that paper though..)
The worry is washout length, and lack of exchangeability between different “copies” of you spread across time
As I said, I don’t think washout length is a concern for blocked 2-level experiments. Suppose I block as single days, and unbeknownst to me, I screwed up the literature search and the substance actually lasts 2 days; some of the possible sequences will be messed up and show no difference while other possible sequences will be fine. (A sequence like 10 will be screwed up and look like 11, but a sequence like 01 will show the true difference.) Since the blocks are randomized, there will be a mix of confounded and accurate blocks: the apparent effect will be weaker than the true effect. I have lost power, but not introduced bias.
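A sketch of that argument, under the assumption that the carryover is monotone (the substance is simply still active the day after a dose; all numbers invented): the naive difference-in-means shrinks toward zero, i.e. lost power, but keeps the sign of the true effect.

```python
# Sketch: daily randomization with a one-day monotone carryover attenuates, but does
# not reverse, the naive difference-in-means.
import numpy as np

rng = np.random.default_rng(0)
n_days = 100_000
true_effect = 1.0

T = rng.integers(0, 2, n_days)                 # randomized daily assignment
carryover = np.roll(T, 1); carryover[0] = 0    # yesterday's dose
active = np.maximum(T, carryover)              # monotone-carryover assumption
Y = true_effect * active + rng.normal(size=n_days)

est = Y[T == 1].mean() - Y[T == 0].mean()
print("true effect: %.2f   naive estimate: %.2f" % (true_effect, est))   # ~1.00 vs ~0.50
```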
(you mention this when talking about good/bad days when you were doing some sort of nicotine test).
I suppose that’s one way to think about u-curve responses. Lack of exchangeability sounds pessimistic to me, though: if there really is a u-curve effect, then you can improve the model and make the datapoints comparable by measuring the person to learn where they are on the u-curve before administering the intervention. (At least, I think that’s how it went. I thought I read some psychology papers using some sort of methodology like that at some point...)
If you read about statistical power and experimental design for fun, reading about do-calculus will probably not be an awful idea or a waste of time
Certainly, but I have a hard time with math and so learning do-calculus would be a big time investment, one I am loath to make right this month just to understand someone’s opaque objection to self-experiments which they probably could easily make clear.
Since the blocks are randomized, there will be a mix of confounded and accurate blocks: the apparent effect will be weaker than the true effect. I have lost power, but not introduced bias.
I don’t think this is a good way to think about confounding. For one thing, you are implicitly assuming the effect is monotonic. Perhaps this is true with nootropics (how do you know though?). Monotonicity is not true in general, though. Maybe treatments and unwashed-out partial treatments interact in weird/random ways. In general, if you are adding up unconfounded and confounded days, your sum is garbage, not a weaker version of the true sum.
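A counter-sketch of this objection, with an interaction invented purely for illustration: if a dose taken on top of yesterday’s unwashed-out dose behaves differently, the naive difference-in-means is not a shrunken version of the clean-day effect; here the clean-day effect is +1 but the estimate comes out near zero.

```python
# Sketch: a non-monotone carryover (back-to-back doses backfire) masks the clean-day effect.
import numpy as np

rng = np.random.default_rng(0)
n_days = 100_000
T = rng.integers(0, 2, n_days)
prev = np.roll(T, 1); prev[0] = 0

# clean-day effect = +1, but a dose on top of yesterday's dose backfires (-1)
effect = np.where((T == 1) & (prev == 0), 1.0,
          np.where((T == 1) & (prev == 1), -1.0, 0.0))
Y = effect + rng.normal(size=n_days)

est = Y[T == 1].mean() - Y[T == 0].mean()
print("clean-day effect: +1.00   naive estimate: %.2f" % est)   # ~0.00
```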
I suppose that’s one way to think about u-curve responses.
A u-curve response is just one type of non-monotonic response. There could be others. I don’t think it’s entirely scientific to assume either the function is monotonic or it has a monotonic first derivative.. What if there is no simple way to describe the response?
Actually I am not even talking about the response to the treatment. Suppose you were a werewolf, and the outcome you were measuring was a physical test. Now, every few days out of 28 you would measure off the charts completely independently of whatever physical enhancement treatment you were taking, just because you were half-wolf during those days. So you might conclude there is an effect under the null. Now werewolves do not exist, but are you sure this sort of thing doesn’t happen with you? How do you know?
one I am loath to make right this month just to understand someone’s opaque objection to self-experiments which they probably could easily make clear.
I think that’s a curious attitude for someone who is into self-experimentation (independently of whether the opaque objection can be made clear or not). In some sense, do-calculus is the math behind identifying causal effects from data. I am not sure how you can talk about these things with any confidence without reading up on the math. It’s like being a practicing consequentialist without knowing some decision theory. You can’t just rely on intuition.
I think at the very least you should write down all the assumptions you are making in order to have your conclusions be internally valid.
What is the strongest effect you ever found in this way?
I haven’t compiled my results into a table or anything but IIRC, I think the largest effect size so far was taking vitamin D at bedtime, with d ≈ −0.7. (Roughly in line with psychology meta-analyses: effect sizes drop off sharply past |0.6|.)
I don’t think this is a good way to think about confounding. For one thing, you are implicitly assuming the effect is monotonic. Perhaps this is true with nootropics (how do you know though?)
The background research and published experiments don’t seem to include unusual adjustments for non-monotonicity (not really sure what that means in this context).
Monotonicity is not true in general, though.
In general? Do you have a meta-analysis over hundreds of different kinds of experiments showing this?
Actually I am not even talking about the response to the treatment. Suppose you were a werewolf, and the outcome you were measuring was a physical test. Now, every few days out of 28 you would measure off the charts completely independently of whatever physical enhancement treatment you were taking, just because you were half-wolf during those days. So you might conclude there is an effect under the null. Now werewolves do not exist, but are you sure this sort of thing doesn’t happen with you? How do you know?
Wouldn’t this be covered by randomization? If I randomize each day to this treatment, half of the wolf-days will fall on treatment days and half on control days. They’ll inflate the standard deviation and I’ll be much less likely to reject the null.
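A sketch of this, under the assumption the reply below rejects, namely that the wolf-days shift the outcome but do not interact with the treatment: the estimate stays near the true effect while the standard error grows, i.e. lost power rather than bias.

```python
# Sketch: periodic treatment-independent surges inflate the noise but leave the
# difference-in-means centered on the true effect.
import numpy as np

rng = np.random.default_rng(0)
n_days = 50_000
true_effect = 0.3

T = rng.integers(0, 2, n_days)
day = np.arange(n_days)
wolf = (day % 28) < 3                          # 3 "off the charts" days per 28
Y = true_effect * T + 5.0 * wolf + rng.normal(size=n_days)

def summarize(y, t):                           # difference in means and its standard error
    est = y[t == 1].mean() - y[t == 0].mean()
    se = np.sqrt(y[t == 1].var() / (t == 1).sum() + y[t == 0].var() / (t == 0).sum())
    return est, se

print("with wolf days:    est=%.3f  se=%.4f" % summarize(Y, T))
print("without wolf days: est=%.3f  se=%.4f" % summarize(Y - 5.0 * wolf, T))
# both estimates sit near 0.3; the first standard error is noticeably larger
```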
I think that’s a curious attitude for someone who is into self-experimentation (independently of whether the opaque objection can be made clear or not).
From the sound of it, you’re largely making the theoretician’s objection: “but there are a billion ways your simple design could go wrong! How can you do any experiments if you don’t understand in detail every underlying tool or theorem?” Well, yes, it’s true that neither I nor other experimenters can rule out becoming a werewolf on every 5th Tuesday, or setting up an experiment with completely wrong blocks or washouts, nor can we be sure that induction will continue to work tomorrow and we will not be eaten by grues or bleens, but nevertheless...
(not really sure what that means in this context).
I am just saying that confounding could make your effect weaker (if there is cancellation of paths), or stronger (if there is some sort of interaction with the treatment), or weaker sometimes and stronger other times. You just don’t know. Confounding doesn’t just increase the variance of your effect estimate, it creates bias in the estimate. That is, if you add up some confounded bits to your estimate, you are adding up garbage.
Wouldn’t this be covered by randomization?
No. The werewolf example is a clear case of the copies not being exchangeable. Different versions of you could react to (randomized!) treatment differently, and you won’t know how without more assumptions. For instance, if you were a woman, you would have a different hormonal composition due to the monthly cycle, etc. etc. etc.
From the sound of it, you’re largely making the theoretician’s objection: “but there are a billion ways your simple design could go wrong!”
Look, what I am saying is not very complicated. I am not asking you to become a mathematician. You are looking for causal effects. That’s great! It is not my goal to discourage you! Just report your assumptions. All of them. Say you assume monotonicity, exchangeability of copies, etc. If you don’t know what assumptions you need to make, maybe read up on them. Reporting assumptions is good science, right? It’s standard practice in the stats literature.
No, see. The burden of proof is not on me. If you make an assumption, the burden of proof that it holds (or at the very least the burden of reporting) is on you. Causal mechanisms in general are not monotonic... Just report your assumptions. All of them. Say you assume monotonicity, exchangeability of copies, etc. If you don’t know what assumptions you need to make, maybe read up on them.
This is an example of what I mean by saying you are taking a wildly impractical theoretical approach. Have you ever seen an experiment in which every assumption is reported with a proof? No, because such a paper would not be an experiment but an exercise in pure mathematics or statistics, and no one would ever get anything done if they tried to actually apply your suggestions, since they would spend all their time reading up on various statistical frameworks and going ‘well, I guess I should specify this and that assumption, but wait, don’t I also assume independence from who the current Justice of the Supreme Court is?’ etc.
But don’t just assume some random thing you came up with after reading some slice of the literature that happened to catch your fancy will give you the effect you want.
I hate to break it to you, but that’s pretty much how it works. People read a slice of the literature, apply simple common models, which yield reasonable answers, and only start delving into the foundations and examining closely the methods if someone makes a good case that a hidden assumption or a method’s limitation is important. This should not dismay you any more than a philosopher of science should be dismayed that scientists spend their days in the lab and he is only consulted to deal with borderline cases like Intelligent Design.
Reporting assumptions is standard practice. For example in causal inference literature the mantra is often “we assume SUTVA (stable unit treatment value assumption), and conditional ignorability.” You can’t prove them all (in fact many are untestable). Reporting is still a good idea (for sensitivity analysis, replication, arguing about their reasonableness, etc.)
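That’s reporting some assumptions, and presumably ones that have earned being specifically singled out.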
Exchangeability of copies and monotonicity are pretty important. People always report monotonicity (because you get identification when you could not before). But anyways, I shouldn’t be the one to have to tell you this.
Also, it’s not some, it’s all assumptions needed to get your answer from the data. Even if exchangeability holds for you, it might not hold for someone else who might want to try your design. If you don’t write down what you assume, how should they know if your design will carry over?
Anyways, this is just the Scruffy AI mistake all over again. Actually it’s worse than that. The scientific attitude is to try to falsify, i.e. look for reasons your model might fail. You are assuming as a default that your model is reasonable, and not even leaving a paper trail.
Dozens of fields are concerned with “identifying causal effects from data”; pretty much all the natural sciences and all their myriad subspecializations can be viewed through such a lens. That’s the crux: can be viewed as such. Yet, I doubt you’ll find all that many medical studies, physical experiments, etc. invoking, understanding, or even being aware of do-calculus. That does not void their results; there are ways of interpreting the results that do not rely on grasping—or even being aware of—the math behind the curtain.
A biologist can make valid observations about a meadow without being concerned about wave functions; gwern can do internally valid studies without being concerned about the math of do-calculus. Thankfully, or else nothing would get done. Like, ever.
It’s nice to be enthusiastic about what you do, but be careful of an apotheosis of your specific field of study.
Dozens of fields are concerned with “identifying causal effects from data”; pretty much all the natural sciences and all their myriad subspecializations can be viewed through such a lens.
Indeed.
That’s the crux: can be viewed as such. Yet, I doubt you’ll find all that many medical studies, physical experiments, etc. invoking do-calculus. That does not void their results; there are ways of interpreting the results that do not rely on grasping—or even being aware of—the math behind the curtain.
“That’s just like, your opinion, man.”
See, you don’t get to say that. When people talk about causal effects from randomization (à la what Fisher talked about), effects of interventions is what they mean. That is the math behind what they want, just as complex-valued matrices are the math behind quantum mechanics, or the Peano axioms the math behind doing arithmetic. Not everyone uses the language of do(.) (some use potential-outcome language, which is equivalent). But either their language is equivalent to do(.), or they are essentially doing garbage (and I assure you, there is a lot of garbage out there). In fields like epidemiology, what they often have is the data people (who know about HIV, say, or cancer), and the methods people (who know how not to get garbage from the data).
The fact of the matter is, there are all sorts of gotchas about doing causal inference that being careless and relying on intuitions makes you vulnerable to. I can give endless examples:
(a) People doing longitudinal causal inference basically failed at time-varying confounders until 1986, when the right method was developed. So they would report garbage causal effects from longitudinal studies, because they thought they just needed to adjust for these confounders. No. Wrong. You have to use the equivalent of g-computation.
(b) People try to use coefficients of regressions as mediated causal effects, even when this is not warranted (that is, the coefficient doesn’t correspond to anything causal). No. Wrong. This fails if you have discrete mediators. This fails with interaction terms. This fails under certain natural modeling choices. This fails if you have unobserved confounding. In general a mediated effect is a complicated function of the observed data, not a regression coefficient.
(c) People try to test for the causal null, even when their model does not permit the null to happen (the “null paradox”).
(d) Don Rubin (famous Harvard statistician, one of the people who wrote down the EM algorithm, and one of the people behind potential outcomes) once said that you should adjust for all covariates. He was just trying to be a good Bayesian (have to use all the data, right?). No. Wrong. You only adjust for what you need to block all non-causal paths, while not opening any non-causal paths.
(e) An example from something written at lesswrong: a Bayesian network is a causal model. No. Wrong. A Bayesian network is a statistical model (a set of densities) defined by conditional independence. In order to have a causal model you need to talk about how interventions relate to observations (essentially you need to say, formally, that parents are direct causes). A minimal sketch of this point follows below.
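A minimal sketch of point (e), with an invented data-generating process: two Markov-equivalent networks, X → Y and Y → X, fit the same observational joint equally well, yet answer the interventional query do(X = 1) differently, so the fitted joint alone cannot settle the causal question.

```python
# Sketch: Markov-equivalent graphs agree on observations but disagree on interventions.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
X = rng.integers(0, 2, n)
Y = np.where(rng.random(n) < 0.8, X, 1 - X)   # ground truth here: X causes Y, 20% noise

# Graph X -> Y: do(X=1) leaves P(Y|X) intact, so it predicts P(Y=1 | X=1).
# Graph Y -> X: do(X=1) cuts the only edge into X, so it predicts the marginal P(Y=1).
print("P(Y=1 | do(X=1)) per graph X -> Y: %.2f" % Y[X == 1].mean())   # ~0.80 (matches the truth here)
print("P(Y=1 | do(X=1)) per graph Y -> X: %.2f" % Y.mean())           # ~0.50 (does not)
```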
Actually the list is so long, I am trying to put it in a paper format.
This stuff is not simple, and even very smart people can be confused! So if you want to do causal inference, you know, read up on it.. I am surprised this is a controversial point. To quote Miguel Hernan, the g-formula (expressing do(.) in terms of observed data) is not a causal method, it is the causal method.
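For reference, in the single-time-point case the g-formula reduces to the familiar adjustment formula: P(Y = y | do(T = t)) = Σ_c P(Y = y | T = t, C = c) P(C = c), where C is a set of covariates blocking all non-causal (back-door) paths from T to Y; the time-varying g-formula iterates this over treatment times.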
If you don’t want to read Pearl, you can read Robins, or Dawid, or the potential outcomes people who learned from Rubin. The formalism is the same.
This is not true.
Guys, the reasonable default is not “yeah this sounds similar to what EY said once,” but “I don’t believe you.”