The primary weakness of longitudinal studies, compared with studies that include a control group
Longitudinal studies can and should include control groups. The difference from RCTs is that the control group is not randomized. Instead, you select from a population which is as similar as possible to the treatment group; for example, a group of people who were interested but couldn’t attend because of scheduling conflicts. There is also the option of a placebo substitute, such as sending them generic self-help tips.
ETA: “Longitudinal” is also ambiguous here. It means only that data were collected over time, so it could describe one of several study types (RCTs are also longitudinal, by some definitions). I think you want to call this a cohort study, except that without controls it is more like two different cross-sectional studies of the same population.
Instead, you select from a population which is as similar as possible to the treatment group
They did this with an earlier batch (I was part of that control group) and they haven’t reported that data. I found this disappointing, and it makes me trust this round of data less.
On Sunday, Sep 8, 2013 Dan at CFAR wrote:
Last year, you took part in the first round of the Center for Applied Rationality’s study on the benefits of learning rationality skills. As we explained then, there are two stages to the survey process: first an initial set of surveys in summer/fall 2012 (an online Rationality Survey for you to fill out about yourself, and a Friend Survey for your friends to fill out about you), and then a followup set of surveys one year later in 2013 when you (and your friends) would complete the surveys again so that we could see what has changed.
You’re right, we should’ve posted the results of our previous study. I’ll put those numbers together in a comprehensible format and have them posted soon.
The brief explanation of why we didn’t take the time to write them up earlier is that the study was underpowered and we thought that the results weren’t that informative. In retrospect, that decision was a mistake.
I’ve put a list of the workshop surveys that we’ve done in a separate comment.
We looked into the possibility of including a nonrandomized comparison group. In order to get a large enough sample size, we’d have to be much less selective than your example (people who were accepted to a workshop but weren’t able to attend for several months). One option that we considered was surveying Less Wrongers. Another option was to ask for volunteers from the people who had shown an interest in CFAR (e.g., people who have subscribed to the CFAR newsletter, people who have applied to workshops and been turned down). We decided not to use either of those comparison groups in this study, but we might use them in future research.
Would you have much more confidence in these results if we had included one of those groups as a comparison, and found that they showed little or no change on these variables?
(RE terminology: studies with this design are often just called “longitudinal.” Hopefully the methodology section clears up any ambiguity, and the opening of the post also points readers’ thoughts in the right direction.)
People with an interest in CFAR would probably work. That would account for possibilities like the population being drawn from people interested in self-improvement, since they could get that elsewhere.
I can’t say how much confidence I’d have without seeing the data. The evidence for whether it’s a good control mainly comes from checking the differences between the groups at baseline, not from whether the controls changed; judging the control by whether it changed is a common pitfall. Even if the treatment group changes significantly and the control group doesn’t, that doesn’t mean the difference between treatment and control is significant.
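To illustrate that last sentence with a quick made-up simulation (hypothetical numbers, nothing to do with CFAR’s data): a within-group test can clear p < .05 in a larger treatment group and miss it in a smaller control group even when the two changes are nearly identical, and the direct comparison of the changes is the test that actually answers the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up change-from-baseline scores: both groups improve by a similar
# amount, but the control group is smaller, so its within-group test is
# less likely to reach p < .05.
treatment_change = rng.normal(loc=0.3, scale=1.0, size=50)
control_change = rng.normal(loc=0.25, scale=1.0, size=20)

print(stats.ttest_1samp(treatment_change, 0.0))  # within-group test, treatment
print(stats.ttest_1samp(control_change, 0.0))    # within-group test, control
print(stats.ttest_ind(treatment_change, control_change))  # the comparison that matters
```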
Also, to clarify, the comparison at baseline isn’t limited to the outcome variables. It should include all the data on potential confounders, including things like age and gender. This is all presented in Table 1 in most studies of cause and effect in populations. A few differences don’t invalidate the study, but they should be accounted for in the analysis.
RE terminology: Agreed it works as a shorthand and the methodology has enough detail to tell us what was done. It just seems unusual to use it as a complete formal description.
Another question: could you say more about what you did regarding potential confounders? Using age as an example, you only wrote about testing for significant correlations. That doesn’t rule out age as a confounder, so did you do anything else that didn’t make it into the writeup?
Could you give an example of an additional analysis that you think should be run?
If the study included a comparison group which differed on some demographic variables (like gender), then I understand the value of running analyses that control for those variables (e.g., did the treatment group have a larger increase in conscientiousness than the comparison group while controlling for gender?). But that wasn’t the study design, so we can’t just run a regression with demographic controls.
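For concreteness, a minimal sketch of the kind of analysis described above, on made-up data with a hypothetical comparison group and a hypothetical "female" indicator; it only shows the form of the model, since the actual study had no comparison group to feed into it.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200

# Hypothetical data: group = 1 for workshop participants, 0 for the comparison group.
df = pd.DataFrame({
    "group": rng.integers(0, 2, n),
    "female": rng.integers(0, 2, n),
})
# Made-up change-in-conscientiousness scores.
df["consc_change"] = 0.3 * df["group"] + 0.1 * df["female"] + rng.normal(0, 1, n)

# Did the treatment group improve more than the comparison group, controlling for gender?
fit = smf.ols("consc_change ~ group + female", data=df).fit()
print(fit.params)
```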
You want adjusted effect sizes to check for confounding. It’s not a matter of the variables differing between treatment and controls, but of not knowing whether they affected your treatment group. You could stratify on the potential confounder (e.g., by age group) and take a weighted average of the stratum-specific effect sizes (“effect size” defined as change from baseline, as in the writeup). However, you might not have a large enough sample size in every stratum, you can’t adjust for many variables at once, and it’s inferior to regression.
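A rough sketch of that stratified approach on made-up single-arm data, with age bands chosen arbitrarily; weighting by the sample’s own stratum sizes just reproduces the crude mean, so the informative parts are the per-stratum effect sizes and any weighting toward a reference population.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 120

# Hypothetical single-arm data: each participant's age and change from baseline.
# Made-up data-generating process in which older participants improve less.
df = pd.DataFrame({"age": rng.integers(18, 60, n)})
df["change"] = 5 - 0.08 * df["age"] + rng.normal(0, 4, n)

# Stratify on the potential confounder (age bands chosen arbitrarily) and
# compute the change-from-baseline effect size within each stratum.
df["age_band"] = pd.cut(df["age"], bins=[17, 30, 45, 60])
per_stratum = df.groupby("age_band", observed=True)["change"].agg(["mean", "size"])
print(per_stratum)

# Weighted average of the stratum-specific effect sizes. Using the sample's own
# stratum sizes as weights reproduces the crude mean; weights from a reference
# population (hypothetical numbers here) can give a different answer.
reference_weights = [0.2, 0.3, 0.5]
print("Crude estimate:   ", df["change"].mean())
print("Adjusted estimate:", np.average(per_stratum["mean"], weights=reference_weights))
```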
If correlation was your primary method of checking for confounding, there are two problems: a) confounding depends on the correlations with both the independent and dependent variables, but you only have data for the latter (everyone in your sample attended the workshop, so there is no variation in the independent variable to correlate against); b) the concept of significance can’t be applied to confounding in a straightforward way, since significance is affected by sample size and variance but confounding isn’t.
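A small simulation of point (b), using an invented age effect: the same underlying relationship between age and the change score yields very different p-values at different sample sizes, while the amount of confounding it could produce is unchanged.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def simulate(n):
    """Same made-up confounding structure, evaluated at a given sample size."""
    age = rng.normal(40, 10, n)
    # Age has a real, fixed effect on the change score in this toy model.
    change = 2.0 - 0.05 * age + rng.normal(0, 5, n)
    r, p = stats.pearsonr(age, change)
    return r, p

for n in (30, 3000):
    r, p = simulate(n)
    print(f"n={n:4d}  correlation={r:+.2f}  p-value={p:.3f}")
# The p-value changes dramatically with sample size; the underlying
# age effect (and hence any confounding it could cause) does not.
```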
The main complication is the missing control group. I’m undecided on how to interpret this study, because I can’t think of any reason to avoid controls and I’m still trying to figure out the implications. If the RCT was done well, this makes the evidence a little bit stronger because it’s a replication. But by itself, I still haven’t thought of any way to draw useful conclusions from these data. There’s some good information, but it’s like two cross-sections, which are usually used only to find hypotheses for new research.
That’s not the correct definition of confounding (standard counterexample: M-bias).
Re: missing controls, can try to find similar people who didn’t take the course, and match on something sensible.
Not sure what this means; people have been using bootstrap CIs for the ACE (average causal effect) for ages.
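A minimal sketch of the after-the-fact matching suggested above, on made-up data: greedy 1:1 nearest-neighbour matching on a couple of hypothetical covariates (age and gender), not a full propensity-score analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical pools: workshop participants and non-participants, with a
# couple of covariates we might match on.
participants = pd.DataFrame({"age": rng.integers(20, 50, 30),
                             "female": rng.integers(0, 2, 30)})
pool = pd.DataFrame({"age": rng.integers(18, 65, 300),
                     "female": rng.integers(0, 2, 300)})

# For each participant, pick the nearest unused non-participant on
# (standardized age, gender), with a heavy penalty for a gender mismatch.
used = set()
matches = []
for _, row in participants.iterrows():
    dist = (np.abs(pool["age"] - row["age"]) / pool["age"].std()
            + 10 * np.abs(pool["female"] - row["female"]))
    for idx in dist.sort_values().index:
        if idx not in used:
            used.add(idx)
            matches.append(idx)
            break

matched_controls = pool.loc[matches]
print(matched_controls.describe())
```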
You’ll have to clarify those points. For the first part, M-bias is not confounding. It’s a kind of selection bias, and it arises when the adjusted-for variable has no causal relation to the independent or dependent variables (it may still be correlated with them), specifically when you try to adjust for confounding that doesn’t exist. The collider can be a confounder, but it doesn’t have to be. From the second link: “some authors refer to this type of [M-bias] as confounding... but this extension has no practical consequences”.
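A toy simulation of the M-structure under discussion (the data-generating process here is invented, not taken from the linked papers): treatment and outcome are unconfounded, and adjusting for the collider is what introduces the bias.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 50_000

# Toy M-structure: u1 -> a, u1 -> m, u2 -> m, u2 -> y.
# a has no effect on y, and m has no causal relation to a or y.
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)
a = u1 + rng.normal(size=n)        # "treatment"
m = u1 + u2 + rng.normal(size=n)   # the collider
y = u2 + rng.normal(size=n)        # "outcome", unaffected by a

# Unadjusted: the estimated effect of a on y is near zero (correct).
print(sm.OLS(y, sm.add_constant(np.column_stack([a]))).fit().params)

# Adjusting for the collider m opens the path a <- u1 -> m <- u2 -> y
# and biases the estimated coefficient on a away from zero.
print(sm.OLS(y, sm.add_constant(np.column_stack([a, m]))).fit().params)
```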
I don’t think you can get a good control group after the fact, because you need their outcomes at both timepoints, with a year in between. None of the options that come to mind are very good: you could ask them what they would have answered a year ago, you could start collecting data now and ask them in a year’s time, or you could throw out the temporal data and use only a single cross-section.
Yes, M-bias is an example of a situation where a variable depends on treatment and outcome, but is not a confounder. Hence I was confused by your statement:
confounding depends on the correlations with both the independent and dependent variables
Confounding is not about that at all.
I used “depends” informally, so I didn’t mean to say that variables that depend on treatment and outcome are always confounders. I was answering the implication that a variable with no detectable correlation with the outcome is not likely to be a source of confounding. I assumed they were using a correlational definition of confounding, so I answered in that context.
Should be careful with that, might confuse people, see also:
https://en.wikipedia.org/wiki/Confounding
which gets it wrong.
A variable with no detectable correlation with the outcome might still be a confounder, of course: you might have unfaithful things going on, or the dependence might be non-linear. “Unlikely” usually implies “with respect to some model” you have in mind. How do you know that model is right? What if the true model is highly unfaithful for some reason? etc. etc.
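A quick made-up example of the non-linear case: a variable that confounds treatment and outcome through its square has essentially zero linear correlation with the outcome, so a correlation check would miss it.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 20_000

# Invented confounder whose influence is purely non-linear: u drives both
# a and y through u**2, so its *linear* correlation with y is roughly zero
# even though it confounds the a-y relationship.
u = rng.normal(size=n)
a = u**2 + rng.normal(size=n)
y = u**2 + rng.normal(size=n)       # a has no causal effect on y

print(stats.pearsonr(u, y))         # ~0: the correlation check sees nothing
print(sm.OLS(y, sm.add_constant(a)).fit().params)  # biased "effect" of a
print(sm.OLS(y, sm.add_constant(np.column_stack([a, u**2]))).fit().params)  # adjusting for u**2 removes it
```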
edit: I don’t mean to jump on you specifically, but it sort of is unfortunate that it somehow is a social norm to say wrong things in statistics “informally.” To me, that’s sort of like saying “don’t worry, when I said 2+2=5, I was being informal.”
Very true. This is something I’ll try to change.
Cheers! If you know what M-bias is, we must have hung out in similar circles. Where did you learn “the causal view of epi”?
If Wikipedia gets it wrong it might be high leverage to correct it.
We talked about this before. I disagree with Wikipedia’s philosophy, and don’t have time to police edits there. Wikipedia doesn’t have a process in place to recognize that the opinion of someone like me on a subject like confounding is worth considerably more than the opinion of a randomly sampled internet person. I like *overflow a lot better.
One somewhat subtle point in that article is that it is titled “confounding” (which is easy to define), but then tries to define “a confounder” which is much harder, and might not be a well-defined concept according to some people.