On the other hand, you can imagine data-driven epistemology, where you don’t really formulate any hypotheses. You just have a lot of pattern-matching power, completely agnostic of the knowledge domain.
When you say “pattern-matching,” what do you mean? Because when I imagine pattern-matching, I imagine that one has a library of patterns which are matched against sensory data, and the patterns in that library are the ‘hypotheses.’
But where does this library come from? It seems to be something along the lines of “if you see it once, store it as a pattern, and increase the relevance as you see it more times / decrease or delete if you don’t see it enough” which looks like an approximation to “consider all hypotheses, updating their probability upward when you see them and try to keep total probability roughly balanced.”
That is, I think we agree; but I think when we use phrases like “pattern-matching” it helps to be explicit about what we’re talking about. Distinguishing between patterns and hypotheses is dangerous!
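To make the “store it as a pattern, and increase the relevance as you see it more times” idea above concrete, here is a minimal sketch (my own illustration in Python; the example observations are made up) of a count-based pattern library whose normalized counts behave like an approximate posterior over hypotheses:

```python
# A count-based pattern library: every observed pattern gets a count, counts
# are bumped on each new sighting, and normalizing the counts yields
# something like a posterior over patterns.
from collections import Counter

def update_library(library: Counter, observation: str) -> None:
    """Store the observation as a pattern, or raise its relevance if already known."""
    library[observation] += 1

def pattern_probabilities(library: Counter) -> dict:
    """Normalize counts so the total 'probability' stays roughly balanced."""
    total = sum(library.values())
    return {pattern: count / total for pattern, count in library.items()}

library = Counter()
for obs in ["cat", "dog", "cat", "cat", "bird"]:
    update_library(library, obs)

print(pattern_probabilities(library))
# {'cat': 0.6, 'dog': 0.2, 'bird': 0.2}
```

Bumping one pattern’s count and renormalizing shrinks every other pattern’s share, which is the “keep total probability roughly balanced” part.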
Probably a better term would be “unsupervised learning”. For example, deep learning and various clustering algorithms allow us to figure out whether the data has any sort of non-temporal regularities. Or we may try to see whether the data predicts itself: if we see X, then in Y seconds we’ll see Z. That doesn’t seem to be equivalent to considering infinitely many hypotheses. In Solomonoff induction, a hypothesis is an algorithm capable of generating the data, and based on the new incoming information, we can decide whether the algorithm fits the data or not. In unsupervised learning, on the other hand, we don’t necessarily have an underlying model, or the model may not be generative.
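As a rough illustration of “the data predicts itself” (the symbol stream and lag below are assumptions for the example, not anything from the discussion), here is a sketch that just tallies lagged co-occurrences, with no generative model of the source:

```python
# From a stream of symbols, count which symbol tends to follow which after a
# fixed lag -- "if we see X, in Y steps we'll see Z" -- without modeling how
# the stream was generated.
from collections import Counter, defaultdict

def lagged_transitions(stream, lag=1):
    """Count how often each symbol appears `lag` steps after each other symbol."""
    table = defaultdict(Counter)
    for i in range(len(stream) - lag):
        table[stream[i]][stream[i + lag]] += 1
    return table

stream = "ABABABCABAB"
table = lagged_transitions(stream, lag=1)
print(dict(table["A"]))  # after 'A' we always see 'B' here: {'B': 5}
```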
For example, deep learning and various clustering algorithms allow us to figure out whether the data has any sort of non-temporal regularities. … That doesn’t seem to be equivalent to considering infinitely many hypotheses.
I think it’s useful to think of the parameter-space for your model as the hypothesis-space. Saying “our parameter-space is R^600” instead of “our parameter-space is all possible algorithms” is way more reasonable and computable, but for an unsupervised learning algorithm to have no hypotheses would mean that it has no parameters (which would make it worthless!). Remember that we need to seed our neural nets with random parameters so that different parts develop differently, and our clustering algorithms need to be seeded with different cluster centers.
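A small sketch of that point, with made-up layer sizes and cluster counts: the randomly seeded parameter vector is itself a point in the hypothesis-space, both for a tiny neural net and for k-means-style clustering.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer net: its hypothesis-space is R^(4*8 + 8 + 8*1 + 1) = R^49.
params = {
    "W1": rng.normal(size=(4, 8)),  # random seeding so different units develop differently
    "b1": np.zeros(8),
    "W2": rng.normal(size=(8, 1)),
    "b2": np.zeros(1),
}
n_parameters = sum(p.size for p in params.values())
print(f"initial hypothesis is a point in R^{n_parameters}")  # R^49

# Clustering is the same story: the hypothesis is the set of cluster centers,
# seeded here by picking random data points as the initial centers.
data = rng.normal(size=(100, 2))
initial_centers = data[rng.choice(len(data), size=3, replace=False)]
print(initial_centers.shape)  # (3, 2) -- a point in R^6
```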
Does it mean then that neural networks start with a completely crazy model of the real world, and slowly modify this model to better fit the data, as opposed to jumping between model sets that fit the data perfectly, as Solomonoff induction does?
Does it mean then that neural networks start with a completely crazy model of the real world, and slowly modify this model to better fit the data
This seems like a good description to me.
as opposed to jumping between model sets that fit the data perfectly, as Solomonoff induction does?
I’m not an expert in Solomonoff induction, but my impression is that each model set is a subset of the model set from the last step. That is, you consider every possible output string (implicitly) by considering every possible program that could generate those strings, and I assume stochastic programs (like ‘flip a coin n times and output 1 for heads and 0 for tails’) are expressed by some algorithmic description followed by the random seed (so that the algorithm itself is deterministic, but the set of algorithms for all possible seeds meets the stochastic properties of the definition).
As we get a new piece of the output string (perhaps we see it move from “1100” to “11001”), we rule out any program that would not have output “11001,” which includes about half of our surviving coin-flip programs and about 90% of our remaining 10-sided-die programs. So the class of models that “fit the data perfectly” is a very broad class, and you could imagine a neural network as estimating the mean of that class directly, rather than enumerating every instance of the class and then taking the mean of them.
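Here is a toy sketch of that filtering step (a deliberate simplification: each “program” is reduced to the literal string it would output, and the string lengths are made up), which reproduces the roughly-half and roughly-90% numbers above:

```python
# Each "program" is an algorithm plus a random seed; here that is collapsed to
# the literal output string it would produce. Observing one more symbol of the
# data rules out every program whose output disagrees with it.
from itertools import product

def surviving(programs, observed):
    """Keep only the programs whose output starts with the observed string."""
    return [p for p in programs if p.startswith(observed)]

length = 5
coin_programs = ["".join(bits) for bits in product("01", repeat=length)]
die_programs = ["".join(digits) for digits in product("0123456789", repeat=length)]

for observed in ["1100", "11001"]:
    coins = surviving(coin_programs, observed)
    dice = surviving(die_programs, observed)
    print(observed, len(coins), len(dice))
# "1100"  -> 2 coin-flip programs and 10 die programs remain
# "11001" -> 1 of each remains: half the coin programs and 90% of the die
#            programs were ruled out by the new bit
```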