Commensurable Scientific Paradigms; or, computable induction
Epistemic status—could use more rigor.
Thomas Kuhn’s 1962 book “The Structure of Scientific Revolutions” (SSR) is a foundational work in the sociology and history of science.
SSR makes several important claims:
Scientific disciplines (e.g. astrophysics) undergo a cycle of paradigmatic ‘normal science’ and non-paradigmatic ‘revolutionary science’.
During ‘normal’ periods, scientists in that field operate within a shared scientific paradigm (e.g. the Ptolemaic geocentric universe): a common body of models, analysis techniques, and evidence-gathering tools.
As anomalies—discrepancies between observations and the paradigm—accrue, the paradigm enters a crisis. Visionary scientists start developing new paradigms (e.g. Copernican heliocentric universe), initiating the revolutionary period. Eventually one of these wins out, ushering in a new normal period.
(most controversially) These paradigms are incommensurable; there is no single metric by which two paradigms can be judged as to which is a better map of reality. (He actually describes several other types of incommensurability as well, but I find these far less convincing, at least to anyone who has read the Sequences.)
In a later work Kuhn provided a set of five criteria that should be considered in judging the quality of a paradigm.
Like all models, Kuhn’s is wrong. But is it useful? Many social scientists seem to think so: according to Google Scholar, SSR has been cited over 136 thousand times.
In this article/sequence, I will argue that despite issues with all these claims, paradigms are a useful construct. I will attempt to formalize the concept of a paradigm, within the context of Solomonoff induction, and demonstrate how these formalized paradigms can be used to aid in induction.
I will then demonstrate that this new formalism provides us with a metric that makes all paradigms commensurable, and that this metric (mostly) aligns with Kuhn’s five criteria.
Stephen Toulmin had an eminently reasonable critique of SSR, namely that ‘normal’ science has plenty of smaller paradigm shifts that differ more in degree than in kind from Kuhn’s ‘revolutionary’ shifts. The formalized paradigms agree with Toulmin here, as they are flexible enough to admit an infinitude of sub- and super-paradigms that allow for greater and lesser changes in scope.
Finally, I’ll attempt to explain why this is important, both for understanding the course of science, and for the development of inductive AI.
Solomonoff Induction
(For a more complete review of this subject, see here.)
Induction is the problem of generalizing a hypothesis, model, or theory from a set of specific data points. While it’s a critically important part of the scientific method, philosophers and scientists have long struggled to understand how it works and under what conditions it is valid, the so-called “Problem of Induction.”
Around 1960, Ray Solomonoff developed a generalized method for induction, although it’s unclear how widely publicized it was at the time. Basically, employing Solomonoff induction means trying all the potential hypotheses. Not “all the plausible hypotheses”. Not “all the hypotheses you can think of”. Not even “all the valid hypotheses”. Literally all of them.
More formally, consider a sensor that has observed a bit stream $Z \in S$, where $S$ is the set of all bitstreams of any length. For any hypothesis $H \in S$, Solomonoff induction assigns a prior probability $P(H) \propto 2^{-|H|}$. The likelihood is then computed via a Universal Turing Machine $U: S \to S$. If $U(H)$ is at least $|Z|$ bits long, and the first $|Z|$ bits of $U(H)$ equal $Z$, then the likelihood $P(Z|H)$ equals 1; otherwise it equals 0. Via Bayes’ law, the posterior probability of $H$ is therefore:
$$P(H|Z) = \frac{P(Z|H)\,P(H)}{\sum_{h \in S} P(Z|h)\,P(h)}$$

While this form seems at first glance to allow only completely deterministic, all-knowing hypotheses, it is easy to demonstrate how a hypothesis can accommodate uncertainty. Consider a stochastic hypothesis $H$ that wishes to remain agnostic about two of the bits in $Z$. We can create four deterministic hypotheses $H_{00}, H_{01}, H_{10}, H_{11}$, each consisting of $H$ plus the appropriate two-bit code. When these deterministic hypotheses are run, $H$ places the two-bit code in the desired location. Of course, only one of the four will be correct, and each has two extra bits relative to $H$. Setting
$$P(Z|H)\,P(H) = P(Z|H_{\text{good}})\,P(H_{\text{good}}) = 1 \times 2^{-2}\,P(H)$$

yields $P(Z|H) = 0.25$, exactly as expected.
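To make the recipe concrete, here is a minimal, heavily truncated sketch in Python. It enumerates every hypothesis up to a fixed length and applies the prior and likelihood rules above. The `toy_machine` stand-in (which just repeats its input forever) is my own simplification: a real $U$ would be a universal Turing machine, which is exactly why genuine Solomonoff induction is uncomputable.

```python
from itertools import product

def toy_machine(H: str, n_out: int) -> str:
    # Stand-in for the universal machine U: "running H" here just means
    # repeating the bits of H forever.  A real U would be a universal
    # Turing machine, which is what makes true Solomonoff induction
    # uncomputable.
    return (H * (n_out // len(H) + 1))[:n_out] if H else ""

def solomonoff_posterior(Z: str, max_len: int = 10) -> dict:
    # Truncated Solomonoff-style induction: enumerate every hypothesis up
    # to max_len bits, prior P(H) proportional to 2^(-|H|), likelihood 1
    # iff the machine's output begins with Z.
    joint = {}
    for L in range(1, max_len + 1):
        for bits in product("01", repeat=L):
            H = "".join(bits)
            if toy_machine(H, len(Z)) == Z:
                joint[H] = 2.0 ** (-len(H))
    total = sum(joint.values())
    return {H: p / total for H, p in joint.items()} if total else {}

print(solomonoff_posterior("010101"))   # most of the mass lands on H = "01"
```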
There are several objections to Solomonoff Induction:
Because U is Turing complete, Solomonoff Induction is actually uncomputable! Solomonoff himself proved this.
Although it is consistent, the prior probability assignment is somewhat arbitrary. Why not use another one?
Which universal Turing machine do you use?
How does Z correspond to actual sensory inputs?
Paradigmatic induction, introduced below, will partially address these objections.
Paradigmatic Induction
Paradigmatic induction is nearly identical to Solomonoff induction, except that both the prior probability distribution AND the requirement for a Universal Turing Machine are relaxed. For notational convenience, we express the posterior probability of a hypothesis within paradigm Π as:
$$P(H|Z,\Pi) = \frac{P(Z|H,\Pi)\,P(H|\Pi)}{\sum_{h \in S} P(Z|h,\Pi)\,P(h|\Pi)}$$

where the prior $P(\cdot|\Pi)$ is set according to the paradigm, and the likelihood $P(Z|H,\Pi)$ is calculated using a (not necessarily Turing-complete) computer $\pi: S \to S$, likewise set by the paradigm.
Note how much flexibility this gives us in what constitutes a paradigm. A paradigm can be equivalent to Solomonoff induction, when π=U, and P(H|Π)=P(H).
A paradigm can also be equivalent to a single hypothesis H′ : π=U(H′), P(∅|Π)=1.
A paradigm in the manner of SSR will naturally lie somewhere between these two extremes. A paradigm of the appropriate scale will include some of the following qualities:
Scope—The paradigm machine π might decide that no hypothesis has any information about certain bits within Z. Essentially, those observations are outside of the paradigm’s scope. It has nothing to say about them. It may also be able to subdivide its scope such that multiple independent hypotheses/sub-paradigms can account for different bits within Z.
Precision—π can define arbitrary likelihood distributions, corresponding to more or less precision in its constituent hypotheses.
Potential for anomalies—Certain potential observations $Z'$ should have zero (or extremely low) aggregate likelihood within the paradigm, $P(Z'|\Pi)$. If such a $Z'$ is actually observed, then the paradigm is in crisis.
What sorts of things qualify as a paradigm? Our formalization requires only two things: a computation π and the paradigmatic prior $P(\cdot|\Pi)$. This can include things like a simulation engine, a class of models, a specific model within that class, or a specific instance of a class, as long as they produce an output (perhaps by including a sensor in the model) and have a defined prior probability distribution over the parameter space.
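As a sketch of how little this requires, here is a minimal, hypothetical interface in Python. The names (`Paradigm`, `posterior`, and so on) are mine, not from any existing library, and the likelihood rule below is the simple deterministic one from the Solomonoff setup; a real paradigm could supply an arbitrary likelihood instead.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

@dataclass
class Paradigm:
    """A paradigm is just two things: a machine pi and a prior over hypotheses."""
    pi: Callable[[str], str]                               # the paradigm machine π
    hypotheses: Callable[[], Iterable[Tuple[str, float]]]  # yields (H, P(H|Π)) pairs

def posterior(paradigm: Paradigm, Z: str) -> dict:
    # P(H|Z,Π), using the deterministic likelihood rule: 1 iff π(H) starts with Z.
    joint = {}
    for H, prior in paradigm.hypotheses():
        if paradigm.pi(H)[: len(Z)] == Z:
            joint[H] = prior
    total = sum(joint.values())
    return {H: p / total for H, p in joint.items()} if total else {}

# Example: the "single hypothesis" extreme, where π ignores its input and
# the prior puts all mass on the empty hypothesis string.
constant = Paradigm(pi=lambda _H: "0101", hypotheses=lambda: [("", 1.0)])
print(posterior(constant, "0101"))   # {'': 1.0}
```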
Coin flip paradigm
As a very simple example, consider $\Pi_{\text{coin}}$, the paradigm of flipping a coin with an unknown bias, resulting in the observed flips $Z$. Each hypothesis $H_i$ consists of the odds ratio $o_i$ (where $o_i = 1$ corresponds to a fair coin) followed by a bitstring $s_i$ of predicted flips. We define:
$$\pi_{\text{coin}}(H_i) = s_i$$

and
$$P(H_i|\Pi_{\text{coin}}) = P(o_i)\,P(s_i|o_i).$$

For any hypothesis where $s_i = Z$ (the observed results of flipping the coin), we have a posterior probability:
$$P(H_i|Z,\Pi_{\text{coin}}) = \frac{P(o_i)\,P(Z|o_i)}{\sum_j P(o_j)\,P(Z|o_j)}.$$

One of the primary benefits of a modestly scoped paradigm is that it can dramatically reduce the computational costs of induction.
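Here is a small numerical sketch of $\Pi_{\text{coin}}$. For simplicity it parameterizes hypotheses by the heads probability rather than the odds ratio, and assumes a uniform prior over a discrete grid of candidate biases; both choices are mine, not part of the formalism.

```python
import numpy as np

Z = "11010111"                       # observed flips, 1 = heads
k, n = Z.count("1"), len(Z)

biases = np.linspace(0.05, 0.95, 19)             # candidate P(heads) values
prior = np.full_like(biases, 1 / len(biases))    # uniform P(o_i) over the grid

# Each hypothesis predicts the exact observed sequence, so its likelihood is
# p^k (1-p)^(n-k), i.e. the P(s_i|o_i) term of the paradigm evaluated at s_i = Z.
likelihood = biases**k * (1 - biases)**(n - k)

posterior = prior * likelihood
posterior /= posterior.sum()

for p, w in zip(biases, posterior):
    print(f"P(heads)={p:.2f}  posterior={w:.3f}")
```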
Comparing Paradigms
Paradigms can be straightforwardly compared using their joint probability with the observation:
$$P(\Pi \wedge Z) = P(Z|\Pi)\,P(\Pi) = P(\Pi) \sum_{H \in S} P(Z|H,\Pi)\,P(H|\Pi)$$

For the sake of convenience, we will divide $Z$ into $Z_1$ (in scope for the paradigm) and $Z_0$ (out of scope for the paradigm), which gives the metric the form:
$$P(\Pi \wedge Z) = P(\Pi)\,2^{-|Z_0|} \sum_{H \in S} P(Z_1|H,\Pi)\,P(H|\Pi)$$

I believe this metric represents most of Kuhn’s five criteria for a good paradigm:
Accuracy—represented by the term $P(Z_1|H,\Pi)$, the paradigm’s ability to predict the observed data within its scope.
Consistency—I’m less sure about this one, but I believe in this formalization any ‘inconsistent’ hypotheses would just fail to execute, and thus provide zero probability mass.
Scope—a paradigm with a larger scope will have a larger $2^{-|Z_0|}$.
Simplicity—both the simplicity of the paradigm’s computation and of constructing hypotheses within it are represented by $P(\Pi)$ and $P(H|\Pi)$, respectively.
The fifth and final criterion is ‘fruitfulness’, which seems to roughly correspond to elements of the paradigm being reusable in other paradigms with different scopes. This might correspond to things like the ‘modularity’ of π, or it might just be a function of how interpretable π is. The concept is sufficiently nebulous that I’ll just leave it unaddressed for now.
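As an illustration of the metric, here is a toy comparison of two paradigms on the same data. The assumptions are mine throughout: the paradigm priors $P(\Pi)$, the bias grid, and treating every bit as in scope, so the $2^{-|Z_0|}$ factor is 1. It pits the biased-coin paradigm above against a narrower single-hypothesis "fair coin" paradigm.

```python
import numpy as np

Z = "11010111"                        # every bit is in scope for both paradigms
k, n = Z.count("1"), len(Z)

# P(Z|Π) = sum over H of P(Z|H,Π) P(H|Π) for the biased-coin paradigm Π_coin
biases = np.linspace(0.05, 0.95, 19)
prior = np.full_like(biases, 1 / len(biases))
evidence_coin = float(np.sum(prior * biases**k * (1 - biases)**(n - k)))

# The "fair coin" paradigm has a single hypothesis, so P(Z|Π) = 0.5^n
evidence_fair = 0.5**n

# Illustrative paradigm priors P(Π); in the post these would come from a
# super-paradigm or a Solomonoff-style 2^(-|Π|) assignment.
P_coin, P_fair = 0.5, 0.5

print("P(Π_coin ∧ Z) =", P_coin * evidence_coin)
print("P(Π_fair ∧ Z) =", P_fair * evidence_fair)
```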
Super- and Sub-paradigms
In the previous section, the paradigm’s metric used the prior $P(\Pi)$. Now, this could be the Solomonoff prior of the paradigm, i.e. $2^{-|\Pi|}$. But it doesn’t have to be! You could instead posit a super-paradigm $\Pi'$, with its own prior distribution and computer $\pi'$.
This computer splits its input hypothesis $H'$ into a sub-paradigm string $c_\Pi$ and a sub-hypothesis $H$. It computes $\pi$ from $c_\Pi$, then outputs $\pi(H)$. Furthermore, to qualify as a super-paradigm, the super-paradigm’s hypothesis priors must be decomposable:
$$P(H'|\Pi') = P(c_\Pi \wedge H|\Pi') = P(H|c_\Pi,\Pi')\,P(c_\Pi|\Pi') = P(H|\Pi)\,P(c_\Pi|\Pi')$$

We can use these terms to rewrite the metric for paradigm $\Pi$ (encoded by $c_\Pi$) within super-paradigm $\Pi'$:
$$P(c_\Pi \wedge Z|\Pi') = P(c_\Pi|\Pi')\,2^{-|Z_0|} \sum_{H \in S} P(Z_1|H,\Pi)\,P(H|\Pi)$$

This relationship between sub- and super-paradigms provides a natural way of relating more general and more specific models. It also suggests a natural way of constructing paradigms, starting from a general paradigm and adding successive specifications until the desired scope is attained.
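A minimal sketch of such a super-paradigm machine, with a made-up one-bit encoding for $c_\Pi$; the encoding and names here are purely illustrative.

```python
def pi_prime(H_prime: str, n_out: int = 8) -> str:
    """A toy super-paradigm machine π′: the first bit of H′ is the
    sub-paradigm code c_Π, the rest is the sub-hypothesis H."""
    c, H = H_prime[0], H_prime[1:]
    if c == "0":
        # Sub-paradigm 0: everything is zero, regardless of H
        return "0" * n_out
    # Sub-paradigm 1: repeat the sub-hypothesis bits forever
    return (H * n_out)[:n_out] if H else ""

print(pi_prime("101"))   # c_Π = "1", H = "01"  ->  "01010101"
print(pi_prime("0"))     # c_Π = "0", H = ""    ->  "00000000"

# The decomposable prior would then be P(H'|Π') = P(c_Π|Π') * P(H|Π),
# e.g. a half-half split over the two sub-paradigm codes.
```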
Relationship to InfraBayesian Induction
If I ever get past the basic theory, I’ll be sure to include it here!
Why should we care?
Induction is useful. Unfortunately, the most widely known general induction method is uncomputable, and may even be malign. Paradigmatic induction is a generalization of Solomonoff induction that relaxes both the prior and the need for a Turing complete computer. By restricting their hypothesis space, many paradigms allow for computable induction, and can eliminate malign priors entirely. They are also sufficiently well formed that we can compare multiple paradigms to one another.