The sample size issue etc is why I talk about Bayes. You get important info from single data points all the time in life. There’s just a fetish against doing so in science due to bad epistemology trying and failing to counter other bad epistemology.
You certainly derived your belief that your procedure would work from a theory. You hadn’t actually even seen it work, so nothing but a theoretical basis could explain your attempt.
I don’t think it’s second-order good epistemology trying and succeeding to counter bad epistemology.
Let’s say we run a study with 30 people, and we conclude ZM’s method is the best, with p = .55 (sorry, I don’t think in Bayesian terms when I have my psychology-experimentation cap on), which is realistic for that kind of sample and the variability we can expect. Now what?
We could come up with some kind of hokey prior, like a 33% chance that each of our techniques is best, then apply it and end up with maybe a 38% chance ZM’s is best and a 31% chance each for mine and Pjeby’s (no, I didn’t actually do the math there). But this runs into several problems.

First, that prior is hokey. Pjeby’s a professional anti-procrastination expert, and we’re giving him the same prior as me and Z.M. Davis?

Second, we still don’t really know what “best” means, and it’s entirely possible that different methods are best for different people in complex ways.

Third, I don’t trust anyone, including myself, to know what to do with a 7% shift. I like my method better; should I give that up just because a very small study shifted the probabilities 7% toward ZM?

Fourth, we still wouldn’t know how to apply this to picoeconomics as a theory: any technique will increase success through the placebo effect alone, we have several techniques that all use picoeconomics to different degrees, and we would have to handwave new numbers into existence to calculate anything, probably ending up with something like a .1% or .2% shift in probabilities.
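To make the “hokey prior” arithmetic concrete, here is a minimal sketch of that kind of update. Every number in it is hypothetical: a uniform 1/3 prior over the three techniques, and made-up likelihoods for how probable the study outcome would be under each hypothesis, chosen only to land near the 38%/31% figures mentioned above.

```python
# Hypothetical Bayesian update over "which technique is best".
# All numbers are illustrative, not derived from any real study.
prior = {"ZM": 1 / 3, "mine": 1 / 3, "pjeby": 1 / 3}

# Assumed likelihoods: probability of seeing ZM's method come out
# on top in a noisy 30-person study, given each technique is best.
likelihood = {"ZM": 0.45, "mine": 0.35, "pjeby": 0.35}

# Bayes' rule: posterior ∝ prior × likelihood, normalized.
evidence = sum(prior[t] * likelihood[t] for t in prior)
posterior = {t: prior[t] * likelihood[t] / evidence for t in prior}

for t, p in posterior.items():
    print(f"{t}: {p:.2f}")
```

With these assumed likelihoods the posterior shifts only a few percentage points away from uniform, which is the point: even a clean result from a small study moves the needle modestly.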
And this is all assuming perfect study design, no confounders, and so on. It would take a lot of work. The best-case scenario is that all that work buys a single-digit probability shift; the realistic case is that there’s a flaw somewhere in the process, or we simply misinterpret the result (my guess is that people can’t handle a 2% shift correctly and just think “now there’s evidence,” counting the theory as a little more confirmed), in which case we’ll actually be giving ourselves negative knowledge.
I’m not saying Bayes isn’t useful, but it’s useful when we have a lot of numbers, when we’re willing to put in a very large amount of work, and where there’s something clear and mathematical we can do with the output.
I recently read The Cult of Statistical Significance. I realize it’s de rigueur to quote significance, but Ziliak and McCloskey insist that I ask: what’s the hypothesized size of the effect?
If we run three conditions, end up with 4, 5, and 6 people showing some improvement, and calculate statistical significance, we obscure the fact that the difference is in the noise. If the same tests end up with 2, 4, and 8 people improving on some metric, we have stronger reason to suspect something is going on. Size matters. It’s usually more interesting than statistical significance.
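As a rough illustration of that contrast, here is a hand-rolled chi-square statistic for the two hypothetical outcomes, assuming 10 subjects per condition and treating improvement as a binary outcome (both assumptions are mine, not from the discussion above):

```python
# Chi-square statistic for k conditions with a binary
# improved / not-improved outcome, n subjects per condition.
def chi_square(improved, n_per_group):
    groups = len(improved)
    not_improved = [n_per_group - x for x in improved]
    # Expected counts under the null: all conditions share one rate.
    exp_imp = sum(improved) / groups
    exp_not = n_per_group - exp_imp
    stat = 0.0
    for obs_i, obs_n in zip(improved, not_improved):
        stat += (obs_i - exp_imp) ** 2 / exp_imp
        stat += (obs_n - exp_not) ** 2 / exp_not
    return stat

print(chi_square([4, 5, 6], 10))  # small statistic: noise
print(chi_square([2, 4, 8], 10))  # larger: the 2/4/8 pattern stands out
```

With 2 degrees of freedom, the 4/5/6 pattern is nowhere near any conventional threshold, while the 2/4/8 pattern exceeds the .05 critical value (about 5.99), matching the intuition that the spread, not the ritual of significance testing, carries the information.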
And there’s a worse confounding factor, which is that people tend to interpret instructions in terms of whatever prior model they have. (That’s actually why I object so strenuously to a couple of aspects of your “oath” model—they’re not so much intrinsically harmful, as harmful to people with certain prior models.)
Testing the distinction between your method and mine would require pretty stringent behavioral control of subjects in a large experiment, because you’d need to validate that the subject actually considered each situation and consequence. (Writing those things out is a good way to verify it, which is why I think your success was actually a side-effect of the thinking you had to do in order to design and write your oaths.)
However, if you just grab a bunch of volunteers and tell them to do either your version or mine of that process, I predict that a substantial number will not actually follow the directions, and will simply tell themselves they’ve already thought it through enough after considering maybe 1 or 2 situations, and then proceed to do whatever it is they already do to initiate change effects, sprinkled with a bit of flavor from whatever method they’re supposed to be testing.
This is a major confounding factor in testing any cognitive behavior model, be it a self-help technique, time management system, or anything else. People tend to process virtually all new inputs through whatever mental strategies they already have, and lop off the parts that don’t fit.
All we can feasibly get is the intent-to-treat effect. Estimating actual treatment effects is possible but not practical.
Is that a fair summary of the parent?
I’m just saying it’s hard, and that informal means won’t work very well. Well-designed experiments in psychology tend to be designed to trick people into doing or thinking the thing that’s being tested, in order to avoid some of these effects.
I mean it’s not practical for the LW community in the way MichaelVassar would like to see happen.