Thank you, this is important feedback. I’ll think about whether it may be possible to incorporate it.
I know there are problems with the article, which is the reason I did not post to main. Part of the reason for publishing at all was to figure out how extensive the problems are, where readers get stuck etc. I tried to focus as much as possible on the issues that I got hung up on when I learned the material, but obviously this may differ from reader to reader. I understand that there may be parts of the theory that my mind glosses over, leading to a false sense of understanding. It is very helpful for me when people point it out.
I certainly invoke Crocker’s Rule in order to get more accurate feedback.
I feel that I owe you a longer explanation of what I mean, especially in light of a longish comment you wrote and then retracted (I thought it was fine, but never mind). I invoke Crocker’s Rule for myself, too.
I took another look the other day at your introduction paragraph, to better understand what was bugging me so much about it. Meanwhile you edited it down to this:
“Whenever someone tells you about a new framework for describing empirical research, the first question you should ask yourself is whether the new framework allows you to correctly represent the important aspects of phenomena you are studying.”
So here’s the thing. The vagueness of this “empirical research” kept bugging me, and suddenly it dawned on me to try and see if this is a thing in epidemiology, which you said the article was originally rooted in. And turns out it is. Epidemiology journals, books etc. talk all the time about “empirical research” and “empirical research methods” etc. etc. As far as I could understand it—and I trust you’ll correct me if I’m wrong—these refer to studies that collect relevant data about large numbers of people “in the wild”—perhaps cohorts, perhaps simply populations or subpopulations—and try to infer all kinds of stuff from the data. And this is contrasted with, for example, studying one particular person in a clinic, or studying how diseases spread on a molecular level, or other things one could do.
So, suddenly it’s all very much clearer! I understand now what you were saying about complicated models etc. - you have a large dataset collected from a large population, you can do all kinds of complicated statistical/machine learning’y stuff with it, try to fit it to any number of increasingly complicated models and such, and you feel that while doing so, people often get lost in the complicated models, and you think causal inference is nice because it never loses track of things people on the ground actually want to know about the data, which tends to be stuff like “what causes what” or “what happens when I do this”. Did I get this right?
OK, now consider this—I’m a computer programmer with a strong math background and strong interest in other natural sciences. I seem to be in the intended audience for your article—I deal with large datasets all the time and I’m very keen to understand causal inference better. When I see the phrase “empirical research”, it doesn’t tell me any of that stuff I just wrote. The closest phrase to this that I have in my normal vocabulary is “empirical sciences” which is really all natural sciences besides math. The only reasonable guess I have for “empirical research” is “things one finds out by actually studying something in the real world, and not just looking at the ceiling and thinking hard”. So for example all of experimental physics comes under this notion of “empirical research”, all chemistry done in a lab is “empirical research”. All the OTHER kinds of epidemiological research that are NOT “empirical research” according to the way the phrase is used in epidemiology, I would still consider “empirical research” under this wide notion. Since the notion is so vague, I have no idea what kinds of “models” or “frameworks” to handle it you might possibly be thinking about. And it puzzles me that you even speak about something so wide and vague, and also it isn’t clear what kinds of relevance it might have for causal inference anyway.
By now I have probably spent 30 times longer talking about that sentence than you spent writing it, but it’s a symptom of a larger problem. You have a crisp and clear picture in your head: empirical research, that thing where you take N humans and record answers to K questions of interest for every one of them, and then try to see what this data can reveal. But when you put it into words, you fail to read those words with the simulated mind of your intended audience. Your intended audience in this case is not epidemiologists. They will not see the crisp and clear picture you get when you invoke the phrase “empirical research”. It’s on you to clearly understand that; it may be the most important thing you have to work on as an explainer of things to a non-specialist audience. But in your text this happens again and again. AND, to add to the problem, you DON’T use examples to make clear what you’re talking about.
As a result, I spent upwards of 2 hours trying to understand just what two words in one sentence in your article mean. This is clearly suboptimal!
Now, I’ll try to finish this already overlong comment quickly and point out several more examples of the same sort of problem in the article.
[in the next comment since this one is getting rejected as too long]
[cont’d from the parent comment]
1: when you first introduce “God’s Table”, it’s really hard to understand what’s going on, and the chief reason may be that you don’t explicitly explain to the reader that the rows in the table are individuals for which you record the data. Again, this is something that’s probably crystal clear to you, but I, the reader, at this point haven’t seen any EXAMPLE of what you mean by “dataset”, anything beyond vague talk of “observed variables”. The rows in your table are only identified by “Id” of 1,2,3,4, which doesn’t tell me anything useful, it can be just automatic numbering of rows. And since you have two variables of interest, A and Y, and four rows in the table, it’s REALLY easy to just mistake the table for one which lists all possible combinations of values of A and Y and then adds more data dependent on those in the hypothetical columns you add. This interpretation of the table doesn’t really make sense (the set of values is wrong for one thing) and it probably never occurred to you, but it took me a while to dig into your article and try to make sense of it to see that it doesn’t make sense. It should be crystal clear to the reader on the first reading just what this table represents, and it just isn’t. Again, if only you had a simple example running through the article! If only you wrote, I dunno, “John”, “Mary”, “Kim” and “Steven” as the row IDs to make it clear they were actual people, and “treatment” and “effect” (or better yet, “got better”) instead of A and Y. It would have been so much easier to understand your meaning.
(then there’re little things: why oh why do you write a=0 and a=1 in the conditioned variables instead of A=0, A=1? It just makes the reader wonder unhappily if there’s some lower-case ‘a’ they missed before in the exposition. “A” and “a” are normally two different variable names in math, even if they’re not usually used together! Again, I need to spend a bunch of time to puzzle out that you really mean A=0 and A=1. And why A and Y anyway, if you call them treatment and effect, why not T and E? Why not make it easier on us? The little things)
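To make the suggestion in point 1 concrete, here is a minimal sketch of what a God’s Table with named rows could look like, and how the observed table follows from it. The names, column labels and values are all made up for illustration; they are not taken from the article. The sketch also assumes that each person’s observed outcome equals their counterfactual outcome under the treatment they actually received.

```python
# Hypothetical God's Table: one row per person (all values are invented).
# "y_if_untreated" plays the role of Y(a=0), "y_if_treated" the role of Y(a=1).
gods_table = [
    {"name": "John",   "treated": 1, "y_if_untreated": 0, "y_if_treated": 1},
    {"name": "Mary",   "treated": 0, "y_if_untreated": 1, "y_if_treated": 1},
    {"name": "Kim",    "treated": 1, "y_if_untreated": 0, "y_if_treated": 0},
    {"name": "Steven", "treated": 0, "y_if_untreated": 0, "y_if_treated": 1},
]


def observed_table(table):
    """Keep only what can actually be recorded for each person.

    Assumption: the observed outcome is the counterfactual outcome under
    the treatment the person actually received.
    """
    rows = []
    for person in table:
        if person["treated"] == 1:
            got_better = person["y_if_treated"]
        else:
            got_better = person["y_if_untreated"]
        rows.append(
            {
                "name": person["name"],
                "treated": person["treated"],
                "got_better": got_better,
            }
        )
    return rows


observed = observed_table(gods_table)
for row in observed:
    print(row)
```

With rows keyed by name, it is hard to misread the table as a list of all combinations of A and Y: each row is one person, and the two counterfactual columns are the part of the table that is never fully observed.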
2: You start a section with “What do we mean by Causality?” Then you NEVER ANSWER THIS. I reread this section three times just to make sure. Nope. You introduce counterfactual variables, God’s table and then say “The goal of causal inference is to learn about God’s Table using information from the observed table (in combination with a priori causal knowledge)”. I thought the goal of causal inference was to answer questions such as “what caused what”. Sure, the link between the two may be very simple but it’s a crucial piece of the puzzle and you explain more simple things elsewhere. You just explained what a “counterfactual” is to a person who possibly never heard the term before, so you can probably spare another two sentences (better yet, two sentences and an example!) on explaining just how we define causation via counterfactuals. If you title a section with the question “What do we mean by causality?” don’t leave me guessing, ANSWER IT.
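For reference, one standard way to state the missing link between counterfactuals and causation is sketched below. This is the usual textbook formulation, written in the Y(a) notation used in this thread; it is an assumption that the article intends exactly this definition.

$$
A \text{ has a causal effect on } Y \text{ for individual } i \iff Y_i(a=1) \neq Y_i(a=0)
$$

$$
\text{average causal effect} = E[Y(a=1)] - E[Y(a=0)]
$$

Both quantities are defined in terms of the columns of God’s Table, which is why learning about that table amounts to answering “what caused what”.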
3: First you give a reasonably generic example of confounding and finish with “This is called confounding”. Then later you call sex a confounder and helpfully add “We will give a definition of confounding in Part 2 of this sequence.” OK, but what was the thing before then?
4: The A->L->Y thing. God, that’s so confusing. At this point, again because there’ve been no examples, I have no clue what A,L,Y could conceivably stand for. Then you keep talking about “random functions” which to me are functions with random values; a “random function with a deterministic component dependent only on A” sounds like nonsense. “Probabilistic function” would be fine. “Random variable” would be fine if you assume basic knowledge in probability. “Random function” is just confusing but I took it to mean “probabilistic function”.
Then you say “No matter how many people you sample, you cannot tell the graphs apart, because any joint distribution of L, A and Y that is consistent with graph 1, could also have been generated by graph 2”.
Why?
It’s obvious to you? It isn’t obvious to me. I understand sort of vaguely why it might be true, and it certainly looks like a good demonstration of why joint distribution isn’t enough, if it’s true. Why is it true? The way you write seems like it should be the most obvious thing in the world to any reader. Well, it’s not. Maybe if you had a running EXAMPLE (that probably sounds like a broken record by now)...
So I’m trying to puzzle this out. What if the values for A,L,Y are binary, and in the diagram on the left, L->A and L->Y always just copy values from L to those other two deterministically; while in the diagram on the right, let’s say that A is always 1, L is randomly chosen to be 0 or 1 (so that its dependence on A is vacuous), while Y is a copy of L. Then the joint distribution generated by graph 1 will be, in order ALY, 100 or 111 with equal probability, and it cannot be generated by graph 2, because in any distribution generated by graph 2, A=Y in all samples.
Does that make sense, or did I miss something obvious? If I’m right, the example is wrong, and if I’m wrong, perhaps the example is not as crystal clear as it could have been, since it let me argue myself into such a mistake?
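The two mechanisms in this attempted counterexample can be written out exactly. The sketch below is only an illustration of the comment’s construction; the names `fork_copy` and `chain` are hypothetical labels for “L->A and L->Y copy L” and “A is always 1, L is a fair coin, Y copies L”, and “randomly chosen” is taken to mean a fair coin.

```python
from collections import defaultdict
from fractions import Fraction


def joint(weighted_outcomes):
    """Collect (probability, (A, L, Y)) pairs into an exact joint distribution."""
    dist = defaultdict(Fraction)
    for prob, aly in weighted_outcomes:
        dist[aly] += prob
    return dict(dist)


half = Fraction(1, 2)

# Mechanism with arrows L -> A and L -> Y: L is a fair coin, A and Y copy L.
fork_copy = joint((half, (l, l, l)) for l in (0, 1))

# Mechanism with arrows A -> L -> Y: A is always 1, L is a fair coin that
# ignores A, and Y copies L.
chain = joint((half, (1, l, l)) for l in (0, 1))

print("fork (copy mechanism):", fork_copy)
print("chain:", chain)
print("same distribution?", fork_copy == chain)
```

The two distributions differ, so this particular copy mechanism cannot reproduce the 100/111 data. What the sketch does not settle is whether some other mechanism with arrows L->A and L->Y could.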
OK, how should I finish this. You may have gotten the impression from my previous comment that I was looking for a more rounded-off philosophical discussion of the relevant issues, but that’s really not the case. My problem was not that you didn’t spend 10 paragraphs on summarizing what we might mean by causality and what other approaches there are. It’s fine to have a practical approach that goes straight to discussing data and what we do with it. The problem is that your article isn’t readable by someone who isn’t already familiar with the field. I feel that most of the problems could be solved by a combination of: a) a VERY careful rereading of the article from the p.o.v. of someone who’s a subject-matter expert, but is completely ignorant of epidemiology, causal inference or any but the most basic notions in probability, and merciless rewriting/expanding of the text to make everything lucid for that hypothetical reader; b) adding a simple example that would run through the text and have every new definition or claim tested on it.
Thank you, this is very high-quality feedback, with a lot of clear advice on how I can improve the post. I will do my best to make improvements over the next few days. I greatly appreciate that you took the time to draw my attention to these things. Among many other things, you have convinced me that a lot of things about the set-up are not obvious to non-epidemiologists. The article may need extensive restructuring to fix this.
You are obviously completely right about my abuse of the term “empirical research”. I will fix it to something like “observational correlation studies” tomorrow.
“So I’m trying to puzzle this out. What if the values for A,L,Y are binary, and in the diagram on the left, L->A and L->Y always just copy values from L to those other two deterministically; while in the diagram on the right, let’s say that A is always 1, L is randomly chosen to be 0 or 1 (so that its dependence on A is vacuous), while Y is a copy of L. Then the joint distribution generated by graph 1 will be, in order ALY, 100 or 111 with equal probability, and it cannot be generated by graph 2, because in any distribution generated by graph 2, A=Y in all samples.”
I agree that this part of the post needs more work. I think what is happening is that you have data on a probability distribution that was generated by graph 1, and are then asking if it could have been generated by a particular mechanism that can be described by graph 2. However, the point I wanted to make is that you would have been able to come up with some mechanism described by graph 2 that could account for the data. I realize this is not clear, and I will work on it over the next few days.
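A minimal sketch of this reply’s point, assuming the two graphs are the chain A->L->Y and the fork with arrows L->A and L->Y as described in the parent comment: starting from the 100/111 distribution, read off P(L), P(A|L) and P(Y|L) and use them as a fork mechanism. The function names here are hypothetical helpers, not anything from the article.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# Joint distribution over (A, L, Y) from the parent comment:
# A is always 1, L is a fair coin, Y copies L.
chain = {(1, 0, 0): half, (1, 1, 1): half}


def marginal(dist, keep):
    """Sum out every position of (A, L, Y) that is not listed in `keep`."""
    out = {}
    for aly, p in dist.items():
        key = tuple(aly[i] for i in keep)
        out[key] = out.get(key, Fraction(0)) + p
    return out


def refit_as_fork(dist):
    """Rebuild the joint as P(L) * P(A|L) * P(Y|L), i.e. a fork mechanism."""
    p_l = marginal(dist, [1])
    p_al = marginal(dist, [0, 1])
    p_ly = marginal(dist, [1, 2])
    rebuilt = {}
    for a, l, y in product((0, 1), repeat=3):
        if (l,) not in p_l:
            continue  # this value of L never occurs
        p_a_given_l = p_al.get((a, l), Fraction(0)) / p_l[(l,)]
        p_y_given_l = p_ly.get((l, y), Fraction(0)) / p_l[(l,)]
        p = p_l[(l,)] * p_a_given_l * p_y_given_l
        if p:
            rebuilt[(a, l, y)] = p
    return rebuilt


fork = refit_as_fork(chain)
print(fork == chain)
```

The fork mechanism found this way is: L is a fair coin, A is 1 regardless of L, and Y copies L. It has the arrows of the fork graph but a different mechanism from the “copy everything” one in the counterexample.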
“why oh why do you write a=0 and a=1 in the conditioned variables instead of A=0, A=1?”
When I use lower case a, I am referring to a specific value that the random variable A can take. Obviously, I agree that I should have spelled this out. For example, the counterfactual Y(a) describes what would have happened had we intervened to set A to a, where a can be either 0 or 1. The distinction between upper case and lower case is necessary.
“However, the point I wanted to make is that you would have been able to come up with some mechanism described by graph 2 that could account for the data.”
Thanks. I should have realized that, and I think I did at some point but later lost track of this. With this understood properly I can’t think of any counterexample, and I feel more confident now that this is true, but I’m still not sure whether it ought to be obvious.
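For what it is worth, the claim can be checked in one line, assuming the two graphs are the chain A->L->Y and the fork with arrows L->A and L->Y. Any joint distribution generated by the chain factorizes as the first expression below, and the product rule turns it into the fork factorization:

$$
P(a, l, y) = P(a)\,P(l \mid a)\,P(y \mid l) = P(a, l)\,P(y \mid l) = P(l)\,P(a \mid l)\,P(y \mid l)
$$

Reading the chain of equalities from right to left gives the converse. So the two graphs allow exactly the same joint distributions, namely those in which A and Y are independent given L, and no amount of sampling can separate them.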