When I was trying to make sense of Peter Watts’ Echopraxia, it occurred to me that there may be two vastly different but both viable kinds of epistemology.
First is the classical hypothesis-driven epistemology, promoted by the positivists and Popper, and generalized by Bayesian epistemology and Solomonoff induction. In its most general version, you have to come up with a set of hypotheses with assigned probabilities, and then look for the information that would change the entropy of this set the most. It’s a good idea. It formalizes what is science and what is not; it provides a framework for research; and, given infinite computing power (a hypercomputer), it extracts the theoretical maximum of utility from sensory information. The main problem is that it doesn’t provide an algorithmic way to come up with hypotheses, and the suggestion to test infinitely many of them (aleph-1 of them, as far as I can tell) isn’t very helpful either.
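To make “look for the information that would change the entropy of this set the most” concrete, here is a minimal sketch of greedy Bayesian experiment selection; the hypotheses, experiments, and likelihoods are invented purely for illustration:

```python
# Pick the experiment whose result is expected to shrink the entropy of the
# hypothesis set the most. Hypotheses, experiments and likelihoods are toy values.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Three toy hypotheses with prior probabilities.
prior = np.array([0.5, 0.3, 0.2])

# likelihoods[e][h, o] = P(outcome o | hypothesis h) for experiment e.
likelihoods = {
    "experiment_A": np.array([[0.9, 0.1],
                              [0.5, 0.5],
                              [0.1, 0.9]]),
    "experiment_B": np.array([[0.6, 0.4],
                              [0.5, 0.5],
                              [0.4, 0.6]]),
}

def expected_posterior_entropy(prior, lik):
    p_outcome = prior @ lik                      # P(outcome) = sum_h P(h) P(outcome | h)
    expected = 0.0
    for o, p_o in enumerate(p_outcome):
        posterior = prior * lik[:, o] / p_o      # Bayes' rule
        expected += p_o * entropy(posterior)
    return expected

best = min(likelihoods, key=lambda e: expected_posterior_entropy(prior, likelihoods[e]))
print(best)  # experiment_A: its outcomes discriminate the hypotheses more sharply
```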
On the other hand, you can imagine a data-driven epistemology, where you don’t really formulate any hypotheses. You just have a lot of pattern-matching power, completely agnostic of the knowledge domain, and you use it to try to find any regularities, predictability, clustering, etc. in the sensory data. Then you just check whether any of the discovered knowledge is useful. This approach can barely (if at all) distinguish correlation from causation, it does not really distinguish scientific from non-scientific beliefs, and it doesn’t even guarantee that the findings will be meaningful. However, it does work algorithmically, even with finite resources.
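As a toy illustration of that workflow (find structure first, check usefulness second), here is a sketch on synthetic data; the data and the stand-in “useful outcome” are invented purely for illustration:

```python
# Data-driven route: look for structure (here, clusters) in unlabeled data first,
# and only afterwards ask whether the structure is useful for anything.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Unlabeled "sensory data": two hidden regimes we did not hypothesize in advance.
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2)),
])

# Step 1: domain-agnostic pattern finding.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# Step 2: only now check whether the discovered regularity is useful,
# e.g. whether cluster membership predicts some outcome we care about.
outcome = (data[:, 0] + data[:, 1] > 3.0).astype(int)   # a made-up "useful" signal
for c in (0, 1):
    print(f"cluster {c}: mean outcome = {outcome[labels == c].mean():.2f}")
```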
They actually go together rather nicely, with data-driven epistemology serving as the source of hypotheses for hypothesis-driven epistemology. However, Watts seems to be arguing that, given enough computing power, you’d be better off spending it on data-driven pattern matching than on generating and testing hypotheses. And since brains are generally good at pattern matching, System 1, slightly tweaked with yet-to-be-invented technologies, could potentially vastly outperform System 2 running hypothesis-driven epistemology. I wonder to what extent this may actually be true.
Reminds me of “The Cactus and the Weasel”.
The philosopher Isaiah Berlin originally proposed a (tongue-in-cheek) classification of people into “hedgehogs”, who have a single big theory that explains everything and view the world in that light, and “foxes”, who have a large number of smaller theories that they use to explain parts of the world. Later on, the psychologist Philip Tetlock found that people who were closer to the “fox” end of the spectrum tended to be better at predicting future events than the “hedgehogs”.
In “The Cactus and the Weasel”, Venkat constructs an elaborate hypothesis about the kinds of belief structures that “foxes” and “hedgehogs” have and how they work, describing how a belief can be grounded in a small number of fundamental elements (typical for hedgehogs) or in an intricate web of other beliefs (typical for foxes). The whole essay is worth reading, but here are a few excerpts related to what you just wrote:
Where does [the fox prediction advantage], let’s call it the Tetlock edge, come from? I have a speculative answer.
It comes from eschewing abstraction and preferring the unreliable world of System 1 tools: metaphor, analogy and narrative; tools that all depend on pattern recognition of one sort or the other, rather than classification into clean schema. Fox brains are in effect constantly doing meta-analyses with unstructured ensembles, rather than projecting from abstract models.
That’s where the advantage comes from: eschewing abstraction.
Abstraction creates meta-knowledge via inductive generalization, and can grow into doctrinaire world views. The way this happens is that you try to formalize the interdependencies among all your generalized beliefs. Your one big idea as a hedgehog is an idea that covers everything, the whole T-box, so to speak. Abstraction provides you with ways to compute beliefs and actions in domains you haven’t even encountered yet, thereby coloring your judgment of the novel before the fact.
Pattern recognition creates meta-knowledge through linkages among weak views in multiple domains. The many things you know start getting densely connected in a messy web of ad hoc associations. Your collection of little ideas, densely connected, does not cover everything, since there are fewer abstractions. So you can only form beliefs about new domains once you encounter some data about them (which means you have an inclusion bias). And you cannot act decisively in those domains, since you lack strong metanorms. This means pattern recognition leaves you with a fundamentally more open mind (or less strongly colored preconceptions about what you do not yet know).
The way you slowly gain a Tetlock advantage, if you live long enough to collect a lot of examples and a very densely connected mind full of little ideas, is as follows: The more you see instances of a belief in various guises, the better you get at recognizing new instances. This is because the chances that a new instance will be recognizably close to an existing instance in your collection increase, and also because patterns color the unknown less strongly than abstractions.
As you age, your mind becomes a vessel for accumulating a growing global context to aid in the appreciation of novelty.
Abstraction offers you a satisfyingly consistent and clean world view, but since you generally stop collecting new instances (and might even discard ones you have) once you have enough to form an abstract belief through inductive generalization, it is harder to make any real use of new information as it comes in. There is already a strongly colored opinion in place and guides to action that don’t rely on knowing things. Your abstractions also accumulate metanorms, and give you an increasing array of reasons to not include new information in your world view. [...]
Foxes are fundamentally Big Data native people. They operate on the assumption that it is cheaper to store new information than to decide what to do with it. Hedgehogs are fundamentally not Big Data native. If they can’t structure it, they can’t store it, and have to throw it away. If they can structure it with an abstraction, they don’t need to store most of it. Only a few critical details to fit the Procrustean bed of their abstraction.
Because foxes resist the temptation of abstraction (and therefore the temptation to throw away examples of patterns once an inductive generalization and/or metanorm has been arrived at, or to stop collecting), they slowly gain an advantage over time, as the data accumulates: the Tetlock edge.
We can restate the Archilochus definition in a geeky way: The fox has one big, unstructured dataset; the hedgehog has many small structured datasets.
But this takes a long time and a lot of stamp collecting, and foxes have to learn to survive in the meantime. Young foxes can be particularly intimidated by old hedgehogs, since the latter are likely to have accumulated more data in absolute terms.
That is very interesting and definitely worth reading. One thing, though: it seems to me that a rationalist hedgehog should be capable of discarding their beliefs if the incoming information seems to contradict them.
On the other hand, you can imagine a data-driven epistemology, where you don’t really formulate any hypotheses. You just have a lot of pattern-matching power, completely agnostic of the knowledge domain
When you say “pattern-matching,” what do you mean? Because when I imagine pattern-matching, I imagine that one has a library of patterns, which are matched against sensory data, and that library of patterns constitutes the ‘hypotheses.’
But where does this library come from? It seems to be something along the lines of “if you see it once, store it as a pattern, and increase the relevance as you see it more times / decrease or delete if you don’t see it enough” which looks like an approximation to “consider all hypotheses, updating their probability upward when you see them and try to keep total probability roughly balanced.”
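To pin that loop down, here is a minimal sketch; the class name, decay rule, and thresholds are invented purely for illustration:

```python
# "Store it once, reinforce or forget": frequency-weighted pattern storage
# that behaves like a crude probability update over hypotheses.
from collections import defaultdict

class PatternLibrary:
    def __init__(self, decay=0.95, forget_below=0.05):
        self.weights = defaultdict(float)    # pattern -> relevance weight
        self.decay = decay
        self.forget_below = forget_below

    def observe(self, pattern):
        # Everything fades a little; the observed pattern gets reinforced.
        for p in list(self.weights):
            self.weights[p] *= self.decay
            if self.weights[p] < self.forget_below:
                del self.weights[p]          # "delete if you don't see it enough"
        self.weights[pattern] += 1.0         # "store it / increase the relevance"

    def beliefs(self):
        # Normalized weights play the role of (rough) hypothesis probabilities.
        total = sum(self.weights.values()) or 1.0
        return {p: w / total for p, w in self.weights.items()}

lib = PatternLibrary()
for obs in ["dawn", "birdsong", "dawn", "dawn", "rain"]:
    lib.observe(obs)
print(lib.beliefs())   # "dawn" ends up with the largest share
```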
That is, I think we agree; but I think when we use phrases like “pattern-matching” it helps to be explicit about what we’re talking about. Distinguishing between patterns and hypotheses is dangerous!
Probably a better term would be “unsupervised learning”. For example, deep learning and various clustering algorithms allow us to figure out whether the data has any sorts of non-temporal regularities. Or we may try to see if the data predicts itself—if we see X, in Y seconds we’ll see Z. That doesn’t seem to be equivalent to considering infinitely many hypotheses. In Solomonoff induction, a hypothesis is an algorithm capable of generating the data, and based on the new incoming information, we can decide whether the algorithm fits the data or not. In unsupervised learning, on the other hand, we don’t necessarily have an underlying model, or the model may not be generative.
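For instance, here is a sketch of the “data predicts itself” check, run on a synthetic signal invented purely for illustration:

```python
# Scan a signal for a lag at which past values predict future values,
# without positing any generative model.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
signal = rng.normal(size=n)
signal[7:] += 0.8 * signal[:-7]        # hidden regularity: X now predicts Z seven steps later

def lag_correlations(x, max_lag=20):
    return {lag: float(np.corrcoef(x[:-lag], x[lag:])[0, 1])
            for lag in range(1, max_lag + 1)}

corrs = lag_correlations(signal)
best_lag = max(corrs, key=lambda k: abs(corrs[k]))
print(best_lag, round(corrs[best_lag], 2))   # recovers the 7-step dependence
```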
For example, deep learning and various clustering algorithms allow us to figure out whether the data has any sorts of non-temporal regularities. … That doesn’t seem to be equivalent to considering infinitely many hypotheses.
I think it’s useful to think of the parameter-space for your model as the hypothesis-space. Saying “our parameter-space is R^600” instead of “our parameter-space is all possible algorithms” is way more reasonable and computable, but what it would mean for an unsupervised learning algorithm to have no hypotheses would be that it has no parameters (which would be worthless!). Remember that we need to seed our neural nets with random parameters so that different parts develop differently, and our clustering algorithms need to be seeded with different cluster centers.
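A minimal sketch of that framing, with data invented for illustration: the current parameter vector (two cluster centers in R^2) plays the role of the working hypothesis, seeded at random and then refined against the data.

```python
# Parameter-space as hypothesis-space: the current parameters (two cluster
# centers, i.e. a point in R^4) are the model's working hypothesis.
import numpy as np

rng = np.random.default_rng(2)
data = np.vstack([rng.normal([0, 0], 0.5, (100, 2)),
                  rng.normal([4, 4], 0.5, (100, 2))])

# Random seeding: start the "hypothesis" at two randomly chosen data points.
centers = data[rng.choice(len(data), size=2, replace=False)].copy()

for _ in range(20):                    # plain k-means (Lloyd's algorithm)
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    for k in range(2):
        if np.any(assignment == k):
            centers[k] = data[assignment == k].mean(axis=0)

print(np.round(centers, 2))   # for blobs this well separated, ends up near (0, 0) and (4, 4)
```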
Does it mean then that neural networks start with a completely crazy model of the real world, and slowly modify this model to better fit the data, as opposed to jumping between model sets that fit the data perfectly, as Solomonoff induction does?
Does it mean then that neural networks start with a completely crazy model of the real world, and slowly modify this model to better fit the data
This seems like a good description to me.
as opposed to jumping between model sets that fit the data perfectly, as Solomonoff induction does?
I’m not an expert in Solomonoff induction, but my impression is that each model set is a subset of the model set from the last step. That is, you consider every possible output string (implicitly) by considering every possible program that could generate those strings, and I assume stochastic programs (like ‘flip a coin n times and output 1 for heads and 0 for tails’) are expressed by some algorithmic description followed by the random seed (so that the algorithm itself is deterministic, but the set of algorithms for all possible seeds meets the stochastic properties of the definition).
As we get a new piece of the output string—perhaps we see it move from “1100” to “11001”—we rule out any program that would not have output “11001,” which includes about half of our surviving coin-flip programs and about 90% of our remaining 10-sided die programs. So the class of models that “fit the data perfectly” is a very broad class of models, and you could imagine neural networks as estimating the mean of that class directly, instead of enumerating every instance of the class and then taking the mean.
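A toy sketch of that ruling-out step; the tiny family of “programs” below (seeded coin-flip generators and repeating patterns) is invented for illustration and only stands in for the real program enumeration:

```python
# Enumerate a small family of deterministic "programs", then rule out every
# program whose output disagrees with the observed prefix.
import random

def coin_program(seed, length=8):
    rng = random.Random(seed)
    return "".join(str(rng.randint(0, 1)) for _ in range(length))

def repeat_program(pattern, length=8):
    return (pattern * length)[:length]

programs = {f"coin(seed={s})": coin_program(s) for s in range(50)}
programs.update({f"repeat({p!r})": repeat_program(p) for p in ["1", "10", "110", "1100"]})

def consistent(observed, programs):
    return {name: out for name, out in programs.items() if out.startswith(observed)}

survivors = consistent("1100", programs)
survivors = consistent("11001", survivors)     # the new bit rules out more programs
print(len(survivors), sorted(survivors))
```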