I actually would have said the opposite: medicine is one of the areas most likely to conduct well-powered studies designed for aggregation into meta-analyses (having accomplished the shift away from p-value fetishism toward a focus on effect sizes and confidence intervals), and so has the least problem with random error and hence the most with systematic error.
Certainly you can’t compare medicine with fields like psychology; in the former, you’re going to get plenty of results from studies with hundreds or thousands of participants, and in the latter you’re lucky if you ever get one with n>100.
Interesting. I did not know that medicine currently mostly does a good job of having well-powered studies.
It would actually be pretty interesting to go through many fields and determine how important systematic vs random error is.
Here are some guesses based on nothing more than my intuition:
Physics − 50:1 systematic
Engineering − 10:1
Psychology − 1:2
What would you need to do this better? A sample of studies from prestigious journals for each field with N, size of random error, lower bound on effect size considered interesting.
Does engineering use these sorts of statistics? I read so few engineering papers I’m not really sure what statistics I would expect them to look like or how systematic vs random error would play out there.
A sample of studies from prestigious journals for each field with N, size of random error, lower bound on effect size considered interesting.
That would be an interesting approach; typically power studies just look at estimating beta, not beta and bounds on effect size.
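To make the "beta and bounds on effect size" idea concrete, here is a rough sketch of how the two could be tied together. It assumes a two-sided, two-sample z-test on a standardized effect size, which is a simplification and not necessarily what any given field uses; the function names and the default alpha and target power are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided two-sample z-test.

    d is the standardized effect size; n_per_group is the size of each arm.
    Uses the normal approximation, so it is slightly optimistic for small n.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # Probability of rejecting in either tail when the true effect is d.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def min_n_for_power(d_min, target_power=0.8, alpha=0.05):
    """Smallest per-group n that reaches target_power at the smallest
    effect size considered interesting (d_min)."""
    n = 2
    while power_two_sample(d_min, n, alpha) < target_power:
        n += 1
    return n


if __name__ == "__main__":
    # Compare a study's actual N against what its field's "interesting" bound demands.
    print(power_two_sample(0.5, 64))
    print(min_n_for_power(0.5))
```

Given a sample of published studies with N and a stated lower bound on interesting effect size, comparing each study's N with `min_n_for_power` would give a crude per-field picture of how much random error matters.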
If you read any narrow/weak/specific/whatever AI papers, then I’d say you do read engineering papers—that’s how I mostly think of my field, computational linguistics, anyway.
The “experiments” I’m doing at the moment are attempts to engineer a better statistical parser of English. We have some human-annotated data, and we divide it up into a training section, a development section, and an evaluation section. I write my system and use the training portion for learning, and evaluate my ideas on the development section. When I’m ready to publish, I produce a final score on the evaluation section.
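For anyone unfamiliar with that setup, the three-way division can be sketched roughly as follows. This is only an illustration: it assumes a random split with made-up fractions, whereas a real corpus may well be divided along fixed section boundaries, and `split_corpus` is a hypothetical helper name.

```python
import random


def split_corpus(sentences, dev_frac=0.1, eval_frac=0.1, seed=0):
    """Divide annotated sentences into training, development, and evaluation parts.

    Assumption: a seeded random shuffle; the fractions are arbitrary defaults.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_dev = int(len(items) * dev_frac)
    n_eval = int(len(items) * eval_frac)
    dev = items[:n_dev]
    evaluation = items[n_dev:n_dev + n_eval]
    train = items[n_dev + n_eval:]
    return train, dev, evaluation


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(1000)]
    train, dev, evaluation = split_corpus(corpus)
    # Learn on train, try ideas on dev, and touch evaluation only for the final score.
    print(len(train), len(dev), len(evaluation))
```

The point of keeping the evaluation part untouched until the end is that repeated tuning against it would leak information and inflate the final score.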
In this case, my experimental error is the extent to which the accuracy figures I produce do not correlate with the accuracy that someone really using my system will see.
Both systematic and random error abound in these “experiments”. I’d say a really common source of systematic error comes from the linguistic annotation we’re trying to replicate. We evaluate on data annotated by the same people according to the same standards as we trained on, and the scientific standards of the linguistics behind that are poor. If some aspects of the annotation are suboptimal for applications of the system, that won’t be reflected in my results.
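The random part of that error is at least easy to put a number on. As a rough sketch, assuming each evaluation item is scored independently as right or wrong (which real parser metrics only approximate), a normal-approximation interval on the accuracy looks like this; `accuracy_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def accuracy_interval(n_correct, n_total, confidence=0.95):
    """Normal-approximation (Wald) interval for an accuracy estimate.

    Captures only sampling noise from the finite evaluation set. Any bias
    shared by the training and evaluation annotation is invisible here.
    """
    p = n_correct / n_total
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p * (1 - p) / n_total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    print(accuracy_interval(900, 1000))
    print(accuracy_interval(9000, 10000))  # more data: narrower interval, same bias
```

The interval shrinks as the evaluation set grows, but a flaw in the annotation standard shifts the training and evaluation data together, so no amount of extra evaluation data reveals it.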