First Law of Experiment Design: you are not measuring what you think you are measuring.
In a way, this is just a corollary to the good, old-fashioned “correlation does not imply causation” principle. I guess the important difference is that this is a warning directed at experimenters who tend to assume that they know better.
Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring.
Before COVID, my manager’s manager would semi-regularly put on a multi-day training course where he would teach everyone in his organization about the process of “Design of Experiments” (DOE) (https://en.wikipedia.org/wiki/Design_of_experiments). I work with other data scientists, statisticians, biochemists, and engineers, often performing experiments with medical devices, so this is relevant to all of us.
The idea behind a DOE is to create a multifactorial experimental design that tests the effects of many factors at once. The first step is to brainstorm all possible factors that may influence the experimental results (often using a fishbone diagram [https://en.m.wikipedia.org/wiki/Ishikawa_diagram] to focus on factors arising from Materials, Method, Measurement, Machine, Man, or “Mother Nature”). Each factor is then labelled as either control (C, factors fixed throughout all parts of the experiment), noise (N, random factors that cannot/won’t be controlled for but may cause some variation in the results), or experimental (X, factors to be varied systematically to test for their effects).
Next, high and low settings are chosen for each X factor, and all possible combinations of settings are arranged in a hypercube. Instead of experimenting on one factor at a time with enough repetitions to build up statistical significance, you can perform just a few repetitions at each corner of the hypercube. You can still use the results to look at single-factor contributions to the experimental effect by aggregating data from all corners of the low side and high side of the hypercube along one dimension. But you can also look at the impact of factor interactions and the nonlinearities that result, which would have acted as hidden confounders in more traditional single-factor experiments.
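To make the hypercube idea concrete, here is a minimal sketch in Python of a 2-level, 3-factor design. The factor names and the simulated response function are invented for illustration (they are not from the course); the point is just that each corner gets a few replicates, and a main effect falls out of contrasting the high side against the low side along one dimension.

```python
# Minimal sketch of a 2-level, 3-factor (2^3) factorial design.
# The factors and the response function are made up for illustration.
import itertools
import random
import statistics

FACTORS = ["temperature", "flow_rate", "operator"]  # hypothetical X factors

def run_experiment(settings, noise=1.0):
    """Stand-in for a real measurement: main effects, one interaction, plus noise."""
    t, f, o = settings  # each coded as -1 (low) or +1 (high)
    return 10 + 3 * t + 1.5 * f + 0.5 * o + 2 * t * f + random.gauss(0, noise)

# Every corner of the hypercube: all combinations of low/high settings.
corners = list(itertools.product([-1, +1], repeat=len(FACTORS)))

# A few replicates per corner, rather than many replicates varying one factor at a time.
results = {c: [run_experiment(c) for _ in range(3)] for c in corners}

def main_effect(factor_index):
    """Mean response on the high side minus the low side, aggregated over all corners."""
    high = [y for c, ys in results.items() if c[factor_index] == +1 for y in ys]
    low = [y for c, ys in results.items() if c[factor_index] == -1 for y in ys]
    return statistics.mean(high) - statistics.mean(low)

for i, name in enumerate(FACTORS):
    print(f"estimated main effect of {name}: {main_effect(i):+.2f}")
```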
I liked how this method gave a systematic way of thinking about multifactorial experimental effects. Such experiments tend to uncover a lot more information about a system than you would see otherwise. To actually tease out the underlying causal mechanisms at work, though, would require deeper statistics and modeling than we ever got into in that course.
This concept reminds me of the problem of planning software tests: I want to exercise all behaviors of the code under test, but actually testing the Cartesian product of input conditions often means writing a test so generic that it duplicates the code under test (unless there is a more naïve algorithm the test can use), and that is hard to evaluate for its own correctness. Instead, I end up writing a selected set of cases intended to cover interesting combinations of inputs, but then the problem is deciding which inputs are worth testing. When bugs are discovered, they often involve combinations of inputs that were not thought of (or parameters we didn’t think of testing at all, i.e. ones implicitly put in the “control” category, or specific edge-case values of parameters we did test).
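For concreteness, here is roughly what that hand-picked-cases style looks like as a pytest sketch. The padded_size function and the chosen cases are hypothetical, invented just to have something to test:

```python
# Sketch of hand-picked test cases for a hypothetical function.
import pytest

def padded_size(a, b, block=8):
    """Toy code under test: combined size of two payloads, rounded up to a whole block."""
    total = a + b
    return ((total + block - 1) // block) * block

@pytest.mark.parametrize(
    "a, b, block, expected",
    [
        (3, 4, 8, 8),    # ordinary case: fits in one block
        (8, 8, 8, 16),   # exact multiple: no padding expected
        (0, 0, 8, 0),    # all-zero edge case
        (7, 2, 4, 12),   # non-default block size
    ],
)
def test_padded_size_selected_cases(a, b, block, expected):
    # Each case spells out the exact answer we expect.
    assert padded_size(a, b, block) == expected
```

Every combination not listed here is, implicitly, untested.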
An alternative to hand-written testing of specific cases is to write a property test, like “is input A + input B always ≤ output C, under a wide-ranging selection of inputs”. This feels analogous to measuring correlations in that hypercube — and the part of the actual output that you’re not checking precisely (in my example, the value A + B − C) is the part of the test that is “noise” rather than “control” because we’ve decided it is more practical to ignore that information than to control it (write a test that contains or computes the exact answer to expect).
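Here is the same hypothetical padded_size sketched as a property test, using the Hypothesis library. Instead of exact expected values, it checks relationships between inputs and output over a wide range of generated inputs; the exact amount of padding is the part we deliberately leave unchecked:

```python
# Sketch of a property-based test with Hypothesis for the same hypothetical function.
from hypothesis import given, strategies as st

def padded_size(a, b, block=8):
    """Toy code under test: combined size of two payloads, rounded up to a whole block."""
    total = a + b
    return ((total + block - 1) // block) * block

@given(a=st.integers(min_value=0, max_value=10**9),
       b=st.integers(min_value=0, max_value=10**9))
def test_padded_size_properties(a, b):
    c = padded_size(a, b)
    # Check relationships rather than exact values:
    assert a + b <= c        # the "A + B <= C" style of property
    assert c < a + b + 8     # never pads by more than one block
    assert c % 8 == 0        # always a whole number of blocks
    # The exact padding, c - (a + b), is the information we choose to ignore ("noise").
```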