Science in a High-Dimensional World

Claim: the usual explanation of the Scientific Method is missing some key pieces about how to make science work well in a high-dimensional world (e.g. our world). Updating our picture of science to account for the challenges of dimensionality gives a different model for how to do science and how to recognize high-value research. This post will sketch out that model, and explain what problems it solves.
The Dimensionality Problem
Imagine that we are early scientists, investigating the mechanics of a sled sliding down a slope. What determines how fast the sled goes? Any number of factors could conceivably matter: angle of the hill, weight and shape and material of the sled, blessings or curses laid upon the sled or the hill, the weather, wetness, phase of the moon, latitude and/or longitude and/or altitude, etc. For all the early scientists know, there may be some deep mathematical structure to the world which links the sled’s speed to the astrological motions of stars and planets, or the flaps of the wings of butterflies across the ocean, or vibrations from the feet of foxes running through the woods.
Takeaway: there are literally billions of variables which could influence the speed of a sled on a hill, as far as an early scientist knows.
So, the early scientists try to control as much as they can. They use a standardized sled, with standardized weights, on a flat smooth piece of wood treated in a standardized manner, at a standardized angle. Playing around, they find that they need to carefully control a dozen different variables to get reproducible results. With those dozen pieces carefully kept the same every time… the sled consistently reaches the same speed (within reasonable precision).
At first glance, this does not sound very useful. They had to exercise unrealistic levels of standardization and control over a dozen different variables. Presumably their results will not generalize to real sleds on real hills in the wild.
But stop for a moment to consider the implications of the result. A consistent sled-speed can be achieved while controlling only a dozen variables. Out of literally billions. Planetary motions? Irrelevant, after controlling for those dozen variables. Flaps of butterfly wings on the other side of the ocean? Irrelevant, after controlling for those dozen variables. Vibrations from foxes’ feet? Irrelevant, after controlling for those dozen variables.
The amazing power of achieving a consistent sled-speed is not that other sleds on other hills will reach the same predictable speed. Rather, it’s knowing which variables are needed to predict the sled’s speed. Hopefully, those same variables will be sufficient to determine the speeds of other sleds on other hills—even if some experimentation is required to find the speed for any particular variable-combination.
Determinism
How can we know that all other variables in the universe are irrelevant after controlling for a handful? Couldn’t there always be some other variable which is relevant, no matter what empirical results we see?
The key to answering that question is determinism. If the system’s behavior can be predicted perfectly, then there is no mystery left to explain, no information left which some unknown variable could provide. Mathematically, information theorists use the mutual information I(X;Y) to measure the information which X contains about Y. If Y is deterministic—i.e. we can predict Y perfectly—then I(X;Y) is zero no matter what variable X we look at. Or, in terms of correlations: a deterministic variable always has zero correlation with everything else. If we can perfectly predict Y, then there is no further information to gain about it.
In this case, we’re saying that sled speed is deterministic given some set of variables (sled, weight, surface, angle, etc). So, given those variables, everything else in the universe is irrelevant.
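To spell this out (a quick derivation, writing Z for the controlled variables, Y for the sled's speed, and X for any other variable in the universe): if Z pins down Y exactly, then the conditional entropy H(Y|Z) is zero, and

$$0 \;\le\; I(X;Y \mid Z) \;=\; H(Y\mid Z) - H(Y \mid X, Z) \;\le\; H(Y\mid Z) \;=\; 0,$$

so I(X;Y|Z) = 0: given the controls, no other variable in the universe carries any further information about the speed.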
Of course, we can’t always perfectly predict things in the real world. There’s always some noise—certainly at the quantum scale, and usually at larger scales too. So how do we science?
The first thing to note is that “perfect predictability implies zero mutual information” plays well with approximation: approximately perfect predictability implies approximately zero mutual information. If we can predict the sled’s speed to within 1% error, then any other variables in the universe can only influence that remaining 1% error. Similarly, if we can predict the sled’s speed 99% of the time, then any other variables can only matter 1% of the time. And we can combine the two: if, 99% of the time, we can predict the sled’s speed to within 1% error, then other variables can only influence the remaining 1% error, except on the 1% of sled-runs where they might have a larger effect.
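A toy simulation makes the approximate version concrete. Everything below is hypothetical (the dozen control variables, the residual worth about 1% of the variance, and the "butterfly index" are all made up for illustration); the point is just that whatever any other variable could explain is capped by the part our controls leave unexplained.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Toy model: sled speed is almost fully determined by a dozen controlled
# variables, plus a small residual worth ~1% of the total variance.
controls = rng.normal(size=(n, 12))
weights = rng.normal(size=12)
weights /= np.linalg.norm(weights)                        # signal variance ~ 1
speed = controls @ weights + 0.1 * rng.normal(size=n)     # residual variance ~ 0.01

# One of the billions of "other" variables -- a made-up butterfly-wing index.
butterfly = rng.normal(size=n)

# Predict speed from the controls alone; whatever is left over is the most any
# other variable in the universe could possibly account for.
prediction = controls @ weights
leftover = speed - prediction

print("unexplained fraction of variance:", leftover.var() / speed.var())            # ~0.01
print("corr(butterfly, leftover):       ", np.corrcoef(butterfly, leftover)[0, 1])  # ~0
```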
More generally, if we can perfectly predict any specific variable, then everything else in the universe is irrelevant to that variable—even if we can’t perfectly predict all aspects of the system’s trajectory. For instance, if we can perfectly predict the first two digits of the sled’s speed (but not the less-significant digits), then we know that nothing else in the universe is relevant to those first two digits (although all sorts of things could influence the less-significant digits).
As a special case of this, we can also handle noise using repeated experiments. If I roll a die, I can’t predict the outcome perfectly, so I can’t rule out influences from all the billions of variables in the universe. But if I roll a die a few thousand times, then I can approximately-perfectly predict the distribution of die-rolls (including the mean, variance, etc). So, even though I don’t know what influences any one particular die roll, I do know that nothing else in the universe is relevant to the overall distribution of repeated rolls (at least to within some small error margin).
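In code, the die-roll version of the argument is a minimal sketch like this:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=5_000)   # a few thousand rolls of a fair die

# No single roll is predictable, but the overall distribution very nearly is:
frequencies = np.bincount(rolls, minlength=7)[1:] / len(rolls)
print("empirical frequencies:", frequencies)    # each ~ 1/6
print("empirical mean:", rolls.mean())          # ~ 3.5
```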
Replication
This does still leave one tricky problem: what if we accidentally control some variable? Maybe air pressure influences sled speed, but it never occurred to us to test the sled in a vacuum or high-pressure chamber, so the air pressure was roughly the same for all of our experiments. We are able to deterministically predict sled speed, but only because we accidentally keep air pressure the same every time.
This is a thing which actually does happen! Sometimes we test something in conditions never before tested, and find that the usual rules no longer apply.
Ideally, replication attempts catch this sort of thing. Someone runs the same experiment in a different place and time, a different environment, and hopefully whatever things were accidentally kept constant will vary. (You’d be amazed what varies by location—I once had quite a surprise double-checking the pH of deionized water in Los Angeles.)
Of course, like air pressure, some things may happen to be the same even across replication attempts.
On the other hand, if a variable is accidentally controlled across multiple replication attempts, then it will likely be accidentally controlled outside the lab too. If every lab tests sled-speed at atmospheric pressure, and nobody ever accidentally tries a different air pressure, then that’s probably because sleds are almost always used at atmospheric pressure. When somebody goes to predict a sled’s speed in space, some useful new scientific knowledge will be gained, but until then the results will generally work in practice.
The Scientific Method In A High-Dimensional World
Scenario 1: a biologist hypothesizes that adding hydroxyhypotheticol to their yeast culture will make the cells live longer, and the cell population will grow faster as a result. To test this hypothesis, they prepare one batch of cultures with the compound and one without, then measure the increase in cell density after 24 hours. They statistically compare the final cell density in the two batches to see whether the compound had a significant effect.
This is the prototypical Scientific Method: formulate a hypothesis, test it experimentally. Control group, p-values, all that jazz.
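For concreteness, the analysis in scenario 1 might look roughly like this. The density numbers are invented and the choice of a two-sample t-test is an assumption for illustration, but the shape of the workflow is the point:

```python
import numpy as np
from scipy import stats

# Hypothetical cell-density readings after 24 hours (numbers made up for illustration).
with_compound    = np.array([1.31, 1.45, 1.22, 1.50, 1.38, 1.41])
without_compound = np.array([1.18, 1.29, 1.12, 1.33, 1.25, 1.21])

# The one-shot, black-box question: is the difference statistically significant?
t_stat, p_value = stats.ttest_ind(with_compound, without_compound)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```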
Scenario 2: a biologist observes that some of their clonal yeast cultures flourish, while others grow slowly or die out altogether, despite seemingly-identical preparation. What causes this different behavior? They search for differences, measuring and controlling for everything they can think of: position of the dishes in the incubator, order in which samples were prepared, mutations, phages, age of the initial cell, signalling chemicals in the cultures, combinations of all those… Eventually, they find that using initial cells of the same replicative age eliminates most of the randomness.
This looks less like the prototypical Scientific Method. There are probably some hypothesis-formation and testing steps in the middle, but it’s less about hypothesize-test-iterate, and more about figuring out which variables are relevant.
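The scenario-2 hunt doesn't compress into a single test, but one recurring step does: for each candidate variable, check how much of the unexplained variation disappears once that variable is held fixed. Here is a crude sketch with made-up variable names and data; in this toy, replicative age is the answer by construction:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical records of clonal cultures (all names and values invented).
df = pd.DataFrame({
    "incubator_position": rng.integers(0, 4, n),
    "prep_order":         rng.integers(0, 10, n),
    "replicative_age":    rng.integers(0, 5, n),
})
# In this toy world, growth is (almost) determined by replicative age alone.
df["growth"] = 1.0 / (1 + df["replicative_age"]) + 0.02 * rng.normal(size=n)

# For each candidate, how much growth variance remains within groups that share
# its value? A small leftover fraction marks a variable worth controlling.
for var in ["incubator_position", "prep_order", "replicative_age"]:
    leftover = df.groupby(var)["growth"].var().mean() / df["growth"].var()
    print(f"{var:20s} leftover variance fraction: {leftover:.2f}")
```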
In a high-dimensional world, effective science looks like scenario 2. This isn’t mutually exclusive with the Scientific-Method-as-taught-in-high-school; there’s still some hypothesizing and testing, but there’s a new piece and a different focus. The main goal is to hunt down sources of randomness, figure out exactly what needs to be controlled in order to get predictable results, and thereby establish which of the billions of variables in the universe are actually relevant.
Based on personal experience and reading lots of papers, this matches my impression of which scientific research offers lots of long-term value in practice. The one-shot black-box hypothesis tests usually aren’t that valuable in the long run, compared to research which hunts down the variables relevant to some previously confusing (a.k.a. unpredictable) phenomenon.
Everything Is Connected To Everything Else (But Not Directly)
What if there is no small set of variables which determines the outcome of our experiment? What if there really are billions of variables, all of which matter?
We sometimes see a claim like this made about biological systems. As the story goes, you can perform all sorts of interventions on a biological system—knock out a gene, add a drug, adjust diet or stimulus, etc—and any such intervention will change the level of most of the tens-of-thousands of proteins or metabolites or signalling molecules in the organism. It won’t necessarily be a large change, but it will be measurable. Everything is connected to everything else; any change impacts everything.
Note that this is not at all incompatible with a small set of variables determining the outcome! The problem of science-in-a-high-dimensional-world is not to enumerate all variables which have any influence. The problem is to find a set of variables which determine the outcome, so that no other variables have any influence after controlling for those.
Suppose sled speed is determined by the sled, slope material, and angle. There may still be billions of other variables in the world which impact the sled, the slope material, and the angle! But none of those billions of variables are relevant after controlling for the sled, slope material, and angle; other variables influence the speed only through those three. Those three variables mediate the influence of all the billions of other variables.
In general, the goal of science in a high dimensional world is to find sets of variables which mediate the influence of all other variables on some outcome.
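A tiny simulation illustrates what "mediates" buys us. The setup is made up for illustration: assume the weather influences the sled's speed only by changing the slope angle, so the angle mediates the weather's entire influence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

weather = rng.normal(size=n)
angle   = 0.5 * weather + rng.normal(size=n)          # weather nudges the angle
speed   = 2.0 * angle + 0.01 * rng.normal(size=n)     # speed ~ determined by angle

# Unconditionally, the weather is clearly correlated with sled speed...
print("corr(weather, speed):        ", np.corrcoef(weather, speed)[0, 1])      # ~0.45

# ...but subtract off the mediator's (known) contribution and nothing is left
# for the weather to explain.
residual = speed - 2.0 * angle
print("corr(weather, speed | angle):", np.corrcoef(weather, residual)[0, 1])   # ~0
```

The weather still "matters" in the sense that intervening on it would change the speed; it just doesn't matter once the angle is known.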
In some sense, the central empirical finding of All Of Science is that, in practice, we can generally find small sets of variables which mediate the influence of all other variables. Our universe is “local”—things only interact directly with nearby things, and only so many things can be nearby at once. Furthermore, our universe abstracts well: even indirect interactions over long distances can usually be summarized by a small set of variables. Interactions between stars across galactic distances mostly just depend on the total mass of each star, not on all the details of the plasma roiling inside.
Even in biology, every protein is connected to every other protein through some path in the network, but the vast majority of protein pairs do not interact directly—the graph of biochemical interactions is connected, but extremely sparse. The interesting problem is to figure out the structure of that graph—i.e. which variables interact directly with which other variables. If we pick one particular “outcome” variable, then the question is which variables are its neighbors in the graph—i.e. which variables mediate the influence of all the other variables.
Summary
Let’s put it all together.
In a high-dimensional world like ours, there are billions of variables which could influence an outcome. The great challenge is to figure out which variables are directly relevant—i.e. which variables mediate the influence of everything else. In practice, this looks like finding mediators and hunting down sources of randomness. Once we have a set of control variables which is sufficient to (approximately) determine the outcome, we can (approximately) rule out the relevance of any other variables in the rest of the universe, given the control variables.
A remarkable empirical finding across many scientific fields, at many different scales and levels of abstraction, is that a small set of control variables usually suffices. Most of the universe is not directly relevant to most outcomes most of the time.
Ultimately, this is a picture of “gears-level science”: look for mediation, hunt down sources of randomness, rule out the influence of all the other variables in the universe. This sort of research requires a lot of work compared to one-shot hypothesis tests, but it provides a lot more long-run value: because all the other variables in the universe are irrelevant, we only need to measure/control the control variables each time we want to reuse the model.