You Are Not Measuring What You Think You Are Measuring
Eight years ago, I worked as a data scientist at a startup, and we wanted to optimize our sign-up flow. We A/B tested lots of different changes, and occasionally found something which would boost (or reduce) click-through rates by 10% or so.
Then one week I was puzzling over a discrepancy in the variance of our daily signups. Eventually I scraped some data from the log files, and found that during traffic spikes, our server latency shot up to multiple seconds. The effect on signups during these spikes was massive: even just 300 ms was enough that click-through dropped by 30%, and when latency went up to seconds the click-through rates dropped by over 80%. And this happened multiple times per day. Latency was far and away the most important factor which determined our click-through rates. [1]
Going back through some of our earlier experiments, it was clear in hindsight that some of our biggest effect-sizes actually came from changing latency—for instance, if we changed the order of two screens, then there’d be an extra screen before the user hit the one with high latency, so the latency would be better hidden. Our original interpretations of those experiments—e.g. that the user cared more about the content of one screen than another—were totally wrong. It was also clear in hindsight that our statistics on all the earlier experiments were bunk—we’d assumed that every user’s click-through was statistically independent, when in fact they were highly correlated, so many of the results which we thought were significant were in fact basically noise.
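As a concrete illustration of that last statistics point, here's a minimal sketch (with made-up numbers, in the same spirit as the footnote) of how correlated click-throughs break a test that assumes independent users. Both variants below have exactly the same true click-through rate; the only thing going on is shared hour-level noise, standing in for latency spikes.

```python
# Minimal sketch, made-up numbers: two identical variants, but all users in a
# given hour share a common shock (a stand-in for latency spikes). A test that
# treats every click as independent then reports "significant" far too often.
import numpy as np

rng = np.random.default_rng(0)
n_hours, users_per_hour, n_sims = 200, 50, 1000
n_users = n_hours * users_per_hour
false_positives = 0

for _ in range(n_sims):
    # Same true click-through rate (30%) in both arms; only the shared
    # hour-level noise differs.
    p_a = np.clip(0.3 + rng.normal(0, 0.1, n_hours), 0, 1)
    p_b = np.clip(0.3 + rng.normal(0, 0.1, n_hours), 0, 1)
    clicks_a = rng.binomial(users_per_hour, p_a).sum()
    clicks_b = rng.binomial(users_per_hour, p_b).sum()

    # Naive two-proportion z-test assuming independent users.
    p_pool = (clicks_a + clicks_b) / (2 * n_users)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n_users)
    z = (clicks_a - clicks_b) / n_users / se
    if abs(z) > 1.96:  # nominal 5% false-positive threshold
        false_positives += 1

# If users were truly independent this would be ~5%; it comes out far higher.
print(f"false positive rate: {false_positives / n_sims:.0%}")
```

(One standard fix is to treat the correlated cluster, here the hour, as the unit of analysis rather than the individual click.)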
Main point of this example: we were not measuring what we thought we were measuring. We thought we were testing hypotheses about what information the user cared about, or what order things needed to be presented in, or whether users would be more likely to click on a bigger and shinier button. But in fact, we were mostly measuring latency.
When I look back on experiments I’ve run over the years, in hindsight the very large majority of cases are like the server latency example. The large majority of the time, experiments did not measure what I thought they were measuring. I’ll call this the First Law of Experiment Design: you are not measuring what you think you are measuring.
Against One-Bit Experiments
A one-bit experiment is an experiment designed to answer a yes/no question. It’s the prototypical case from high school statistics: which of two mouse diets results in lower bodyweight? Which of two button designs on a website results in higher click-through rates? Does a new vaccine design protect against COVID better than an old design (or better than no vaccine at all)? Can Muriel Bristol tell whether milk or tea was added first to her teacup? Will a neural net trained to navigate to a coin at the end of a level still go to the coin if it’s no longer at the end of a level? Can a rat navigate a maze just by smell?
There’s an obvious criticism of such experiments: at best, they yield one bit of information. (Of course the experimenter probably observes a lot more than one bit of information over the course of the experiment, but usually people are trained to ignore most of that useful information and just report a p-value on the original yes/no question.) The First Law of Experiment Design implies that the situation is much worse: in the large majority of cases, a one-bit experiment yields approximately zero information about the thing the experimenter intended to measure. It inevitably turns out that mouse bodyweight, or Muriel Bristol’s tea-tasting, or a neural net’s coinrun performance, in fact routes through something entirely different from what we expected.
Corollary To The First Law: If You Are Definitely Not Measuring Anything Besides What You Think You Are Measuring, You Are Probably Not Measuring Anything
Ok, but aren’t there experiments where we in fact understand what’s going on well enough that we can actually measure what we thought we were measuring? Like the vaccine test, or maybe those experiments from physics lab back in college?
Yes. And in those cases, we usually have a pretty damn good idea of what the experiment’s outcome will be. When we understand what’s going on well enough to actually measure the thing we intended to measure, we usually also understand what’s going on well enough to predict the result. And if we already know the result, then we gain zero information—in a Bayesian sense, we measure nothing.
Take the physics lab example: in physics lab classes, we know what the result “should” be, and if we get some other result then we messed up the experiment. In other words: either we know what the result is (and therefore gain zero information), or we accidentally measure something other than what we intended. (Well… I say “accidentally”, but my college did have a physics professor who would loosen the screws on the pendulum in the freshman physics lab.) Either way, we’re definitely not measuring the thing we intended to measure—either we measure something else, or we measure nothing at all.
… though I suppose one could argue that the physics lab experiment result tells us whether or not we’ve messed up the experiment. In other words, we can test whether we’re measuring the thing we thought we were measuring. So if we know the First Law of Experiment Design, then at least we can measure whether or not the corollary applies to the case at hand.
Anyway, for the rest of this post I’ll assume we’re in a domain where we don’t already know what the answer is (or “should” be).
Solution: Measure Lots of Things
In statistics jargon, the problem is confounders. We never measure what we think we are measuring because there are always confounders, all the time. We can’t control for the confounders because in practice we never know what they are, or which potential confounders actually matter, or which confounders are upstream vs downstream. Classical statistics has lots to say about significance and experiment size and so forth, but when we don’t even know what the confounders are there’s not much to be done.
… or at least that used to be the case. Modern work on causality (e.g. Pearl) largely solves that problem—if we measure enough stuff. One of the key insights of causality is that, while we can’t determine causation from correlation of two variables, we can sometimes determine causation from correlation of three or more variables—and the more variables, the better we can nail down causality. Similarly, if we measure enough stuff, we can often back out any latent variables and figure out how they causally link up to everything else. In other words, we can often deal with confounders if we measure enough stuff.
That’s the theoretical basis for what I’ll call The Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring.
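To make the three-or-more-variables point concrete, here's a toy sketch (my own illustration, not taken from Pearl): any single pairwise correlation is compatible with many causal stories, but once all three variables are measured, the pattern of conditional (in)dependence distinguishes a chain X → Y → Z from a collider X → Y ← Z.

```python
# Toy sketch: patterns of conditional independence across three variables
# distinguish causal structures that pairwise correlation alone cannot.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def partial_corr(a, b, given):
    """Correlation of a and b after regressing `given` out of both."""
    beta_a = np.cov(a, given)[0, 1] / np.var(given)
    beta_b = np.cov(b, given)[0, 1] / np.var(given)
    return np.corrcoef(a - beta_a * given, b - beta_b * given)[0, 1]

# Chain: X -> Y -> Z. X and Z are correlated, but the correlation
# vanishes once we condition on Y.
x = rng.normal(size=n)
y = x + rng.normal(size=n)
z = y + rng.normal(size=n)
print("chain:    corr(X,Z) =", round(np.corrcoef(x, z)[0, 1], 2),
      " corr(X,Z | Y) =", round(partial_corr(x, z, y), 2))

# Collider: X -> Y <- Z. X and Z are independent, but conditioning on Y
# *creates* a (negative) correlation: the "explaining away" effect.
x2, z2 = rng.normal(size=n), rng.normal(size=n)
y2 = x2 + z2 + rng.normal(size=n)
print("collider: corr(X,Z) =", round(np.corrcoef(x2, z2)[0, 1], 2),
      " corr(X,Z | Y) =", round(partial_corr(x2, z2, y2), 2))
```

These are exactly the kinds of conditional-independence patterns which causal discovery algorithms (e.g. the PC algorithm) exploit to recover structure from observational data, given enough measured variables.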
Feynman’s story about rat-mazes is a good example here:
He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.
The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? Obviously there was something about the door that was different from the other doors. So he painted the doors very carefully, arranging the textures on the faces of the doors exactly the same. Still the rats could tell. Then he thought maybe the rats were smelling the food, so he used chemicals to change the smell after each run. Still the rats could tell. Then he realized the rats might be able to tell by seeing the lights and the arrangement in the laboratory like any commonsense person. So he covered the corridor, and, still the rats could tell.
He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand.
Measure enough different stuff, and sometimes we can figure out what’s actually going on.
The biggest problem with one-bit experiments (or low-bit experiments more generally) is that we’re not measuring what we think we’re measuring, and we’re not measuring enough stuff to figure out what’s actually going on. When designing experiments, we want a firehose of bits, not just yes/no. Watching something through a microscope yields an enormous number of bits. Looking through server logs yields an enormous number of bits. That’s the sort of thing we want—a firehose of information.
Measurement Devices
What predictions might we make, from the two Laws of Experiment Design?
Here’s one: new measurement devices or applications of measurement devices, especially high-bit measurement devices, are much more likely than individual experiments to be bottlenecks to the progress of science. For instance, the microscope is more of a bottleneck than Jenner’s controlled trial of the first vaccine. Jenner’s experiment enabled only that one vaccine, and it was almost a century before anybody developed another. When the next vaccine came along, it came from Pasteur’s work watching bacteria under a microscope—and that method resulted in multiple vaccines in rapid succession, as well as “Pasteurization” as a method of disinfection.
We could make similar predictions for particle accelerators, high-throughput sequencing, electron microscopes, mass spectrometers, etc. In the context of AI/ML, we might predict that interpretability tools are a major bottleneck.
Betting Markets
For the same reasons that an experiment is usually not measuring what we think it’s measuring, a fully operationalized prediction is usually not predicting the thing we think it is predicting.
For instance, maybe what I really want to predict is something about qualitative shifts in political influence in Russia. I can operationalize that into a bunch of questions about Putin, the war in Ukraine, specific laws/policies, etc. Probably it will turn out that none of those questions actually measure the qualitative shift in political influence which I’m trying to get at. On the other hand, with a whole bunch of questions, I could maybe do some kind of principal component analysis and back out whatever main factors the questions do measure. For the same reasons that we can sometimes figure out what an experiment actually measures if we measure enough stuff, we can sometimes figure out what questions on a prediction market are actually asking about if we set up markets on enough different questions.
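Here's a minimal sketch of that idea using entirely made-up data (the variable names and numbers are my own invention): twenty question prices that each noisily track a single latent factor, observed over a bunch of daily snapshots. No single question pins the factor down, but the first principal component of the price matrix recovers it quite well.

```python
# Minimal sketch, entirely made-up data: many operationalized questions whose
# prices each noisily track one latent factor. The first principal component
# of the price matrix recovers the factor better than any single question.
import numpy as np

rng = np.random.default_rng(0)
n_snapshots, n_questions = 500, 20   # e.g. daily price snapshots of 20 markets

latent = rng.normal(size=n_snapshots)                # the factor we actually care about
loadings = rng.uniform(0.3, 1.0, size=n_questions)   # how strongly each question tracks it
prices = latent[:, None] * loadings[None, :] + rng.normal(size=(n_snapshots, n_questions))

# First principal component via SVD of the centered price matrix.
centered = prices - prices.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]

best_single = max(abs(np.corrcoef(prices[:, j], latent)[0, 1]) for j in range(n_questions))
print("corr(PC1, latent factor):", round(abs(np.corrcoef(pc1, latent)[0, 1]), 2))
print("best single question:    ", round(best_single, 2))
```

In practice we wouldn't know ahead of time what the recovered component corresponds to; the point is just that, with enough questions, it can be backed out at all.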
Reading Papers
Of course the Laws of Experiment Design also apply when reading the experiment designs and results of others.
As an example, here’s a recent abstract off biorxiv:
In this study, we examined whether there is repeatability in the activity levels of juvenile dyeing poison frogs (Dendrobates tinctorius). [...] We did not find individual behaviour to be repeatable, however, we detected repeatability in activity at the family level, suggesting that behavioural variation may be explained, at least partially, by genetic factors in addition to a common environment.
Just based on the abstract, I’m going to go out on a limb here and guess that this study did not, in fact, measure “genetic factors”. Probably they measured some other confounder, like e.g. family members growing up near each other. (Or maybe the whole result was noise + p-hacking; there’s always that possibility.)
Ok, time to look at the paper… well, the experiment size sure is suspiciously small, they used a grand total of 17 frogs and tested 4 separate behaviors. That sure does sound like a statistical nothingburger! On the other hand, the effect size was huge and their best p-value was p < 0.001, so maaaaaybe there’s something here? I’m skeptical, but let’s give the paper the benefit of the doubt on the statistics for now.
Did they actually measure genetic effects? Well, they sure didn’t rule out non-genetic effects. The “husbandry” section of the Methods actually has a whole spiel about how the father-frogs “exhibit an elaborate parental care behaviour” toward their tadpoles: “Recently-hatched tadpoles are transported on their father’s back from terrestrial clutches to water-filled plant structures at variable heights”. Boy, that sure does sound like a family of tadpoles growing up in a single environment which is potentially different from the environment of another family of tadpoles. The experimenters do talk about their efforts to control the exact environment in which they ran the tests themselves… but they don’t seem to have made much effort to control for variables impacting the young frogs before the test began. So, yeah, there’s ample room for non-genetic correlations between each family of tadpoles.
This is a pretty typical paper: the authors didn’t systematically control for confounders, and the experiment is sufficiently low-bit that we can’t tell what factors actually mediated the correlations between sibling frogs (assuming those correlations weren’t just noise in the first place). Probably the authors weren’t measuring what they thought they were measuring; certainly they didn’t rule out other things they might have been measuring.
Takeaways
Let’s recap the two laws of experiment design:
First Law of Experiment Design: you are not measuring what you think you are measuring.
Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring.
The two laws have a lot of consequences for designing and interpreting experiments. When designing experiments, assume that the experiment will not measure the thing you intend. Include lots of other measurements, to check as many other things as you can. If possible, use instruments which give a massive firehose of information, instruments which would let you notice a huge variety of things you might not have considered, like e.g. a microscope.
Similarly, when interpreting others’ experiments, assume that they were not measuring what they thought they were measuring. Ignore the claims and p-values in the abstract, go look at the graphs and images and data, cross-reference with other papers measuring other things, and try to put together enough different things to figure out what the experimenters actually measured.
[1] The numbers in the latency story are pulled out of my ass; I don’t remember what they actually were, other than that the latency effects were far larger than anything else we’d seen. Consider the story qualitatively true, but fictional in the quantitative details.
Comments

This post didn’t feel particularly important when I first read it.
Yet I notice that I’ve been acting on the post’s advice since reading it. E.g. being more optimistic about drug companies that measure a wide variety of biomarkers.
I wasn’t consciously doing that because I had updated on the post. I’m unsure to what extent the post changed me via subconscious influence, versus how much I derived the ideas independently.
Hmmmm.
So when I read this post I initially thought it was good. But on second thought I don’t think I actually get that much from it. If I had to summarise it, I’d say it contains:
- a few interesting anecdotes about experiments where measurement was misleading or difficult
- some general talk about “low-bit experiments” and how hard it is to control for confounders
The most interesting claim I found was the Second Law of Experiment Design. To quote: “The Second Law of Experiment Design: if you measure enough different stuff, you might figure out what you’re actually measuring.” But even here I didn’t get much clarity or new info. The argument seemed to boil down to “if you measure more things, you may find the actual underlying important variable”, which is true, I guess, but doesn’t seem particularly novel, and it also introduces other risks, e.g. that the more variables you measure, the higher the chance that at least some of them will correlate just due to chance. There’s a pointer to a book which the author claims sheds more light on the topic, and on modern statistical methods around experiment design more generally, but that’s it.
I think I also have a broader problem here, namely that the article feels a bit fuzzy in a way that makes it hard to pin down what the central claims are.
So yeah, I enjoyed it but on reflection I’m a bit less of a fan than I thought.