“big data” refers to situations with so much training data you can get away with weak priors
The most powerful recent advances in machine learning, such as neural networks, all use big data.
This is only partially true. Consider some image classification dataset, say MNIST, CIFAR10 or ImageNet. Consider some convolutional ReLU network architecture, say conv2d → relu → conv2d → relu → conv2d → relu → conv2d → relu → fullyconnected, with some chosen kernel sizes and numbers of channels, and consider some configuration W_CNN of its weights. Now consider the multilayer perceptron architecture fullyconnected → relu → fullyconnected → relu → fullyconnected → relu → fullyconnected → relu → fullyconnected. Clearly, there exist hyperparameters of the multilayer perceptron (numbers of neurons in the hidden layers) and a weight configuration W_MLP such that the multilayer perceptron with W_MLP implements exactly the same function as the convolutional architecture with W_CNN. Therefore the space of functions implementable by the convolutional network (with fixed kernel sizes and channel counts) is a subset of the space of functions implementable by the multilayer perceptron (with suitably chosen numbers of neurons). Hence training the convolutional ReLU network is updating on evidence under a relatively strong prior, while training the multilayer perceptron is updating on the same evidence under a relatively weak prior.
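To make the inclusion concrete, here is a minimal sketch of that claim in code (assuming PyTorch, which neither of us specified): a conv2d layer is affine in its input, so its exact behaviour can be copied into the weight matrix and bias of a fully connected layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small convolutional layer with some weight configuration W_CNN.
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1)

# Recover the equivalent dense weight matrix by feeding the conv layer the
# standard basis of the flattened input space.
H = W = 8                      # illustrative input size
in_dim = 3 * H * W
basis = torch.eye(in_dim).reshape(in_dim, 3, H, W)
with torch.no_grad():
    bias_response = conv(torch.zeros(1, 3, H, W)).flatten()       # contribution of the bias alone
    responses = conv(basis).reshape(in_dim, -1) - bias_response   # row i = conv's linear response to basis vector i

# A fully connected layer carrying the dense weight matrix assembled from those responses.
fc = nn.Linear(in_dim, responses.shape[1])
with torch.no_grad():
    fc.weight.copy_(responses.T)   # shape (out_dim, in_dim), as nn.Linear expects
    fc.bias.copy_(bias_response)

# The two layers now implement the same function on any input.
x = torch.randn(5, 3, H, W)
with torch.no_grad():
    assert torch.allclose(conv(x).flatten(1), fc(x.flatten(1)), atol=1e-5)
```

The same construction applies layer by layer (ReLU is pointwise, so it commutes with flattening), which is all the subset claim needs.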
Experimentally, if you train the networks described above, the convolutional ReLU network will learn to classify images well, or at least okay-ish. The multilayer perceptron will not: its accuracy will be much worse. Therefore the data is not enough to wash away the multilayer perceptron’s prior, so by your definition it can’t be called big data. Here I must note that ImageNet is the biggest publicly available dataset for training image classification, so if anything is big data, it should be.
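For reference, a rough sketch of the kind of comparison I have in mind (again assuming PyTorch and torchvision, with CIFAR10 rather than ImageNet to keep it small; the channel counts, hidden widths, epochs and learning rate are illustrative guesses, not tuned values):

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

transform = T.ToTensor()
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256)

# conv2d → relu → conv2d → relu → conv2d → relu → conv2d → relu → fullyconnected
cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1, stride=2), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 4 * 4, 10),
)

# fullyconnected → relu → ... → fullyconnected (five linear layers, four relus)
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

def train_and_eval(model, epochs=5):
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    correct = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x.to(device)).argmax(dim=1).cpu() == y).sum().item()
    return correct / len(test_set)

print("CNN test accuracy:", train_and_eval(cnn))
print("MLP test accuracy:", train_and_eval(mlp))
```

On settings like these the convolutional network reliably ends up well ahead of the multilayer perceptron, which is the gap I am pointing at.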
--
Big data uses weak priors. Correcting for bias is a prior. Big data approaches to machine learning therefore have no built-in method of correcting for bias.
This looks like a formal argument, a demonstration or dialectics as Bacon would call it, which uses shabby definitions. I disagree with the conclusion, i.e. with the statement “modern machine learning approaches have no built-in method of correcting for bias”. I think in modern machine learning people are experimenting with various inductive biases, and with various ad-hoc fixes and techniques which help correct for all kinds of biases.
--
In your example with a non-converging sequence, I think you have a typo—there should be MSD(t) rather than MSD(x_t).
I think in modern machine learning people are experimenting with various inductive biases, and with various ad-hoc fixes and techniques which help correct for all kinds of biases.
The conclusion of my post is that these fixes and techniques are ad-hoc because they are written by the programmer, not by the ML system itself. In other words, the creation of ad-hoc fixes and techniques is not automated.
For the longest time, I would have used the convolutional architecture as an example of one of the few human-engineered priors that were still necessary in large-scale machine learning tasks.
But in 2021, the Vision Transformer paper included the following excerpt:
When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias.
Taking the above as given, maybe ImageNet really just wasn’t big enough, despite being the biggest publicly available dataset around at the time.
--
Fixed. Thank you for the correction.