Does the fact that naive neural nets almost always fail when applied to out-of-sample data constitute a strong general argument against the anti-universalizing approach?
I think this demonstrates the problem rather well. In the end, the phenomenon you are trying to model has some level of complexity N. You want your model (neural network, theory, or whatever) to have the same level of complexity, no more and no less. So the fact that naive neural nets fail on out-of-sample data for a given problem shows that the neural network did not reach sufficient complexity. That most naive neural networks fail shows that most problems have at least a bit more complexity than that embodied in the simplest neural networks.
As for how to approach the problem in view of all this, consider the following: for any particular problem of complexity N, there are N − 1 levels of complexity below it, any of which may fail to make accurate predictions due to oversimplification. And then there is an infinity of complexity levels above N, which may fail to make accurate predictions due to overfitting. So it makes sense to start with simple theories, keep adding complexity as new observations arrive, and gradually improve the predictions we make, until we arrive at the simplest theory that still produces low errors when predicting new observations.
I say low errors because to truly match all observations would certainly be overfitting! So there at the end we face the same problem again: trading off accuracy on current data against overfitting errors on future data. Simple (higher errors) versus complex (higher overfitting)… In the end, only empiricism can help us find the theory that produces the lowest error on future data!
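That procedure can be sketched concretely. A minimal sketch, assuming "complexity" means polynomial degree and "new observations" means a held-out sample; the cubic phenomenon and noise level are made up for illustration (numpy only):

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up "phenomenon": a cubic signal plus noise.
def phenomenon(x):
    return x**3 - 2 * x + rng.normal(0, 0.5, size=x.shape)

x_train = rng.uniform(-2, 2, 100)
y_train = phenomenon(x_train)
x_new = rng.uniform(-2, 2, 100)   # the "new observations"
y_new = phenomenon(x_new)

# Start simple and keep adding complexity (here, polynomial degree),
# tracking the error on new observations.
for degree in range(10):
    coeffs = np.polyfit(x_train, y_train, degree)
    err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree}: error on new data = {err:.3f}")
# The error typically drops until the degree matches the signal (3 here),
# then flattens or worsens as the extra degrees start fitting noise.
```

Picking the smallest degree whose error on new data is near the minimum is exactly the "simplest theory which still produces low errors" stopping rule described above.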
So the fact that naive neural nets fail on out-of-sample data for a given problem shows that the neural network did not reach sufficient complexity.
This is one possibility. Another, MUCH more common in practice, is that your NN overfitted the in-sample data and so trivially failed at out-of-sample forecasting.
To figure out the complexity of the process you’re trying to model, you first need to be able to separate features of that process from noise, and this is far from a trivial exercise.
This is more along the lines of what I was thinking. Most instances of complexity that look like improvements are, in practice, going to be versions of overfitting to noise. Or, perhaps stated more concisely and powerfully, noise and simplicity are opposites (information entropy), thus if we dislike noise we should like simplicity. Does this seem like a reasonable perspective?
noise and simplicity are opposites (information entropy), thus if we dislike noise we should like simplicity. Does this seem like a reasonable perspective?
Not quite. Noise and simplicity are not opposites. I would say that the amount of noise in the data (along with the amount of data) imposes a limit, an upper bound, on the complexity that you can credibly detect.
Basically, if your data is noisy you are forced to consider only low-complexity models.
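A hedged continuation of the earlier polynomial sketch makes the point: the same cubic signal at two noise levels, with the complexity (degree) chosen by error on held-out data. The noise levels here are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)

def detectable_degree(noise_sd, n=100, max_degree=9):
    """Degree with the lowest held-out error for a cubic signal in noise."""
    x_tr, x_ho = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
    y_tr = x_tr**3 - 2 * x_tr + rng.normal(0, noise_sd, n)
    y_ho = x_ho**3 - 2 * x_ho + rng.normal(0, noise_sd, n)
    errs = [np.mean((np.polyval(np.polyfit(x_tr, y_tr, d), x_ho) - y_ho) ** 2)
            for d in range(max_degree + 1)]
    return int(np.argmin(errs))

print(detectable_degree(noise_sd=0.1))  # low noise: the cubic structure is detectable
print(detectable_degree(noise_sd=5.0))  # heavy noise: often only a lower degree survives
```

The noisier the data, the less complexity the selection can credibly support, which is the upper bound in action.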
Can you elaborate on why you think it’s a boundary rather than an opposite? I still feel like it’s an opposite. My impression, from self-study, is that randomness in information theory means the best way to describe, e.g., a sequence of coin flips is to copy the sequence exactly; there is no algorithm or heuristic that allows you to describe the random information more efficiently, like “all heads” or “heads, tails, heads, tails, etc.” That sort of efficient description of information seems identical to simplicity to me. If randomness is defined as the absence of simplicity...
I guess maybe all of this is compatible with an upper-bound understanding, though. What is there that distinguishes the upper-bound understanding from my “opposites” understanding, and that decides things your way?
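The incompressibility intuition above can be checked directly. A small sketch (Python standard library; the sequence length and the H/T byte encoding are illustrative choices):

```python
import random
import zlib

random.seed(0)
n = 10_000

all_heads = b"H" * n
alternating = b"HT" * (n // 2)
coin_flips = bytes(random.choice(b"HT") for _ in range(n))

for name, seq in [("all heads", all_heads),
                  ("alternating", alternating),
                  ("random flips", coin_flips)]:
    print(f"{name}: {n} bytes -> {len(zlib.compress(seq))} bytes compressed")
# The patterned sequences shrink to almost nothing; the random one can be
# packed down to roughly one bit per flip (its entropy) but no further.
```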
Noise is not randomness. What is “noise” depends on the context, but generally it means the part of the signal that we are not interested in and do not care about other than that we’d like to get rid of it.
But we may be talking in different frameworks. If you define simplicity as the opposite (or inverse) of Kolmogorov complexity, and if you define noise as something that increases the Kolmogorov complexity, then yes, they are kinda opposite by definition.
I don’t think we’re talking in different frameworks, really; I think my choice of words was just dumb/misinformed/sloppy/incorrect. If I had originally stated “randomness and simplicity are opposites” and then pointed out that randomness is a type of noise (perhaps even the average of all possible noisy biases, because all biases should cancel?), would that have been a reasonable argument, judged in your paradigm?
We still need to figure out the framework.

In a modeling framework (and we started in the context of neural nets, which are models), “noise” is generally interpreted as model residuals: the part of the data that you are unwilling or unable to model. In the same context, “simplicity” usually means that the model has few parameters and an uncomplicated structure. As you can see, they are not opposites at all.
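To make that usage concrete, a sketch under those definitions, with made-up numbers: a two-parameter line fit where the “noise” is simply whatever the line leaves behind, however large it is:

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.uniform(0, 10, 200)
y = 3 * x + 1 + rng.normal(0, 4, 200)    # strong signal, lots of noise

slope, intercept = np.polyfit(x, y, 1)   # a very simple model: two parameters
residuals = y - (slope * x + intercept)  # the "noise", under this definition

print(f"model: y = {slope:.2f}x + {intercept:.2f}")
print(f"residual std: {residuals.std():.2f}")  # noisy data, simple model: not opposites
```

Here a maximally simple model coexists with large noise, so the two notions vary independently rather than inversely.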
In the information/entropy framework, simplicity usually means low Kolmogorov complexity, and I am not sure what “noise” would mean.
When you say “randomness is a type of noise”, can you define the terms you are using?
Let me start over.

Randomness is maximally complex, in the sense that a true random output cannot easily be predicted or efficiently described. Simplicity is minimally complex, in that a simple process is easy to describe and its output easy to predict. Sometimes, part of the complexity of a complex explanation will be the result of “exploited” randomness. Randomness cannot be exploited for long, however; after all, it’s not randomness if it is predictable. Thus a neural net might overfit its data only to fail at out-of-sample predictions, or a human brain might see faces in the clouds. If we want to avoid this, we should favor simple explanations over complex explanations, all else being equal. Simplicity’s advantage is that it minimizes our vulnerability to random noise.
The reason that complexity is more vulnerable to random noise is that complexity involves more pieces of explanation and consequently is more flexible and sensitive to random changes in input, while simplicity uses large, important concepts. In this, we can see that the fact that complex explanations are easier to use than simple explanations when rationalizing failed theories is not a mere accident of human psychology; it emerges naturally from the general superiority of simple explanations.
I am not sure this is a useful way to look at things. Randomness can be of very different kinds. All random variables are random in some way, but calling all of them “maximally complex” isn’t going to get you anywhere.
Outside of quantum physics, I don’t know what “a true random output” is. Let’s take a common example: stock prices. Are they truly random? According to which definition of true randomness? Are they random to a superhuman AI?
it’s not randomness if it is predictable
Let’s take a random variable ~N(0,1), that is, normally distributed with a mean of zero and a standard deviation of 1. Is it predictable? Sure. Predictability is not binary, anyway.
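One way to cash that out, as a sketch (squared error is my assumed criterion here, not something fixed by the discussion): for draws from N(0,1), the constant forecast 0 is optimal, and it strictly beats an uninformed guess, so the variable is partially rather than binarily predictable:

```python
import numpy as np

rng = np.random.default_rng(3)
draws = rng.normal(0, 1, 100_000)   # samples from N(0, 1)

# Forecasting the mean (0) achieves mean squared error ~ the variance (1);
# no forecast of an independent draw can do better.
print(np.mean((draws - 0.0) ** 2))  # ~1.0

# A forecast that ignores the distribution does worse, e.g. always guessing 2:
print(np.mean((draws - 2.0) ** 2))  # ~5.0 (variance plus squared bias)
```

Knowing the distribution buys a strictly better forecast even though each individual draw stays uncertain.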
we should favor simple explanations over complex explanations
That’s just Occam’s Razor, isn’t it?
Simplicity’s advantage is that it minimizes our vulnerability to random noise.
How do you know what is random before trying to model it? Usually simplicity doesn’t minimize your vulnerability; it just accepts it. It is quite possible for the explanation to be too simple, in which case you treat as noise (and so are vulnerable to) things which you could have modeled by adding some complexity.
complexity … is more flexible and sensitive to random changes in input
I don’t know about that. This is more a function of your modeling structure and the whole modeling process. To give a trivial example, specifying additional limits and boundary conditions adds complexity to a model, but reduces its flexibility and sensitivity to noise.
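As a sketch of that kind of constraint, with ridge regularization standing in for the “additional limits” (my choice of stand-in, not the commenter’s): the penalized model carries extra specification, yet its fitted coefficients are far less sensitive to noise:

```python
import numpy as np

rng = np.random.default_rng(4)

# Nearly collinear inputs make unconstrained least squares twitchy under noise.
n = 50
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)    # almost a copy of x1
X = np.column_stack([x1, x2])

def fit(X, y, lam):
    # Ridge solution (X'X + lam*I)^(-1) X'y; lam = 0 gives ordinary least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam, label in [(0.0, "unconstrained"), (1.0, "ridge-constrained")]:
    coefs = [fit(X, x1 + x2 + rng.normal(0, 0.5, n), lam) for _ in range(200)]
    print(label, "coefficient std across noise draws:", np.std(coefs, axis=0).round(2))
# The constrained model is more fully specified, yet its fitted
# coefficients wobble far less under fresh draws of the noise.
```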
general superiority of simple explanations
That’s a meaningless expression until you specify how simple. As I mentioned, it’s clearly possible for explanations and models to be too simple.
After giving myself some time to think about this, I think you are right and my argument was flawed. On the other hand, I still think there’s a sense in which simplicity in explanations is superior to complexity, even though I can’t produce any good arguments for that idea.

I would probably argue that the complexity of explanations should match the complexity of the phenomenon you’re trying to describe.
After a couple more months of thought, I still feel as though there should be some more general sense in which simplicity is better. Maybe it is because it’s easier to find simple explanations that approximately match complex truths than to find complex explanations that approximately match simple truths, so even when you’re dealing with a domain filled with complex phenomena it’s better to use simplicity. On the other hand, perhaps the notion that approximations matter, or can be meaningfully compared across domains of different complexity, is begging the question somehow.