For the longest time, I would have used the convolutional architecture as an example of one of the few human-engineered priors that was still necessary in large-scale machine learning tasks.
But in 2021, the Vision Transformer paper included the following excerpt:
When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias.
To take the above as a given is to say that maybe ImageNet really just wasn't big enough, despite it being the largest publicly available image dataset around at the time.
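The inductive biases the excerpt refers to can be made concrete with a small sketch (not from the ViT paper, just an illustration with a hypothetical conv2d_valid helper): a convolution commutes with shifting the input (translation equivariance) and each output depends only on a small window of pixels (locality), whereas a dense layer on flattened pixels has neither constraint.

import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation: each output pixel depends only on a
    local kernel-sized window of the input (locality)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = rng.standard_normal((16, 16))
kernel = rng.standard_normal((3, 3))
shift = 4

# Translation equivariance: shifting the input and then convolving matches
# convolving and then shifting the output (away from the circular wrap-around).
shifted_then_conv = conv2d_valid(np.roll(image, shift, axis=1), kernel)
conv_then_shifted = np.roll(conv2d_valid(image, kernel), shift, axis=1)
print(np.allclose(shifted_then_conv[:, shift:], conv_then_shifted[:, shift:]))  # True

# A dense layer on flattened pixels bakes in no such prior: shifting the input
# does not shift its output.
W = rng.standard_normal((image.size, image.size))

def dense(x):
    return (W @ x.ravel()).reshape(x.shape)

print(np.allclose(dense(np.roll(image, shift, axis=1)),
                  np.roll(dense(image), shift, axis=1)))  # False (almost surely)

A Transformer's attention layers are closer to the dense case: any patch can attend to any other, so nothing like translation equivariance is built in, and the paper's claim is that enough data lets the model learn whatever structure it needs anyway.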