The idea of PAC learning always rubbed me the wrong way, since it contains the silent assumption that there are no spurious correlations in the dataset (or, to word it in a non-causal way, that if no latent variables are required to fit a model from the training data, no latent variables will be required to predict test data with the same accuracy), and people run with the idea and make claims like:
With unbounded computing power, this approach would work wonderfully. In principle, everything (well, at least every binary classification task) is learnable from a relatively small set of examples, certainly fewer examples than what is often available in practice. It doesn’t seem completely unreasonable to consider this some (small) amount of evidence on the importance of talent vs data, and perhaps as decent evidence on the inference power of a really smart AI?
Which is inherently untrue in any field other than maybe particle physics, and even then only if experiments were run with god-like instruments that had no errors.
the silent assumption that there are no spurious correlations in the dataset
Isn’t that the i.i.d. assumption?
We model the environment, which creates our labeled domain points (x,y), by a probability distribution D over X×Y, as well as the i.i.d. assumption (not bolded since it’s not ML-specific), which states that all elements are independently and identically distributed according to D.
If so it’s not silent – it’s a formal part of the model. The statements about PAC learnability are mathematical proofs, so there’s no room to argue with those, there’s only room to argue that the model is not realistic.
Although I admit I didn’t mention the i.i.d. assumption in the paragraph that you quoted.
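For concreteness, here is a sketch of the formal setup being referenced, in standard PAC notation rather than quoted from the book: the i.i.d. sampling assumption, the true risk, and the usual realizable-case sample complexity bound for a finite hypothesis class, which is what makes “learnable from relatively few examples” precise.

```latex
% A sketch in standard PAC notation (not the book's exact statement).
\[
  S = \big((x_1, y_1), \dots, (x_m, y_m)\big) \sim \mathcal{D}^m
  \qquad \text{(the i.i.d. assumption)}
\]
\[
  L_{\mathcal{D}}(h) = \Pr_{(x, y) \sim \mathcal{D}}\!\big[h(x) \neq y\big]
\]
\[
  m_{\mathcal{H}}(\varepsilon, \delta) \;\le\;
  \left\lceil \frac{\log\!\big(|\mathcal{H}| / \delta\big)}{\varepsilon} \right\rceil
  \quad \text{(finite } \mathcal{H}\text{, realizable case)}
\]
```

Every guarantee downstream of that bound is conditioned on S really being drawn i.i.d. from the same D the test point comes from, which is exactly the assumption being contested in this thread.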
Well, the i.i.d. assumption, or the CLT, or whatever variation you want to go to is, in my opinion, rather pointless. It aims at modeling exactly the kind of simplistic, idealized systems that don’t exist in the real world.
If you look at most important real-world systems, from biological organisms to stock markets to traffic, the variables are basically the opposite of i.i.d.: they are all correlated, and even worse, the correlations aren’t linear.
You can model traffic all you want and try to drive in an “optimal” way, but then you will reach the “driving like an utter asshole” edge case, which has the unexpected result of “tailgating”.
You can model the immune system based on over 80 years of observations in an organism and try to tweak it just a tiny bit to help fight an infection, and the infinitesimal tweak will cause the “cytokine storm” edge case (which will never have been observed before, since it’s usually fatal).
Furthermore, the criticisms above don’t even mention the idea that, again, you could have a process where all the variables are i.i.d. (or can be modeled as such), but you just happen to have missed some variables which are critical for some states but not for others. So you get a model that’s good for a large number of states and then over-fits on the states where you are missing the critical variables (e.g. this is the problem with things like predicting earthquakes or the movement of the Earth’s magnetic pole).
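To make the missing-variable / spurious-correlation failure concrete, here is a minimal toy sketch (my own illustration, not from the book or the comments above; the feature names and the 0.95 / 0.50 correlation levels are made up): a logistic regression is fit on data where a “shortcut” feature tracks the label, then evaluated on data where that correlation is gone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, spurious_corr):
    # Binary label, a weak causal feature, and a "shortcut" feature that
    # agrees with the label with probability spurious_corr.
    y = rng.integers(0, 2, n)
    causal = y + rng.normal(0, 1.0, n)
    agrees = rng.random(n) < spurious_corr
    spurious = np.where(agrees, y, 1 - y) + rng.normal(0, 0.1, n)
    return np.column_stack([causal, spurious]), y

# The shortcut holds almost perfectly in training, then breaks at test time.
X_train, y_train = make_data(5000, spurious_corr=0.95)
X_test, y_test = make_data(5000, spurious_corr=0.50)

clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # high, thanks to the shortcut
print("test accuracy:", clf.score(X_test, y_test))      # drops to what the causal feature supports
```

Each dataset is still “i.i.d.” internally; the failure comes from the training distribution not matching the one the model is later used on, which is the sense in which the formal guarantees quietly stop applying.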
All problems where i.i.d. reasonably fits the issue are either:
a) Solved (e.g. predicting tides reasonably accurately)
b) Solvable from a statistical perspective but suffering from data-gathering issues (e.g. a lot of problems in physics, where running the experiments is the hard part)
c) Boring and/or imaginary
Right. I just took issue with the “unsaid” part, because it makes it sound like the book makes statements that are untrue, when in fact it can at worst make statements that aren’t meaningful (“if this unrealistic assumption holds, then stuff follows”). You can call it pointless, but not silent, because, well, it’s not.
I’m of course completely unqualified to judge how realistic the i.i.d. assumption is, having never used ML in practice. I edited the paragraph you quoted to add a disclaimer that it is only true if the i.i.d. assumption holds.
But I’d point out that this is a textbook, so even if correlations are as problematic as you say, it is still a reasonable choice to present the idealized model first and then later discuss ways to model correlations in the data. No idea if this actually happens at some point.
This seems much too strong; lots of interesting unsolved problems can be cast as i.i.d. Video classification, for example, can be cast as i.i.d. where the distribution is over different videos, not individual frames.
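As a concrete sketch of what “the distribution is over different videos, not individual frames” means in practice (my own illustration; the video_ids, per-frame features, and labels below are hypothetical stand-ins), the i.i.d. unit is the video, so any train/test split has to keep all frames of a video on the same side:

```python
# Sketch: splitting frame-level data by video so the i.i.d. unit is the video.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_frames = 10_000
rng = np.random.default_rng(0)
video_ids = rng.integers(0, 200, n_frames)   # which video each frame came from
X = rng.normal(size=(n_frames, 16))          # stand-in per-frame features
y = rng.integers(0, 2, n_frames)             # stand-in per-frame labels

# Frames from the same video never straddle the train/test boundary,
# so the sample of *videos* can be treated as i.i.d. draws from D.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=video_ids))

assert set(video_ids[train_idx]).isdisjoint(video_ids[test_idx])
```

Splitting at the frame level instead would leak near-duplicate frames across the boundary and make the i.i.d. claim about videos false.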