The results show a linear scaling law relating the number of labels to task complexity. The CNN typically has a lower \hat{\lambda} than the MLP, which matches the intuition that some of the complexity is “stored” in the architecture: the convolutions apply a useful prior toward functions that are good at image-recognition tasks.
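A minimal sketch of what estimating \hat{\lambda} as the slope of that linear law might look like. The complexity values here are synthetic placeholders, not the actual measurements from the experiment:

```python
import numpy as np

# Hypothetical illustration: estimate lambda-hat as the slope of a
# linear fit of measured complexity against number of task labels.
# The "measurements" below are made up for the sketch.
n_labels = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10])
complexity = 0.8 * n_labels + 1.5  # pretend complexity measurements

# polyfit returns coefficients highest-degree first: (slope, intercept)
lam_hat, intercept = np.polyfit(n_labels, complexity, deg=1)
print(lam_hat)  # recovers the slope 0.8 up to floating-point error
```

Comparing the fitted slopes for the CNN and the MLP is then just comparing the two \hat{\lambda} values.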
Is there any intuition or claim about why they look like they’re only diverging after 5 classes/task-labels?
My first guess was that it’s noise from the label ordering (some of the digits must be harder to learn than others). Ran it 10 times with the labels shuffled each time:
If you’re able to shift the crossover just by resampling more, yeah, that suggests the slight inversion is a minor artifact. Maybe the hyperparameters are slightly better tuned for MLPs than for CNNs at the small end, or the MLPs don’t have enough regularization to stay near the CNNs as you scale, which exaggerates the difference (regularization is often a key ingredient in MLP papers). Something boring like that...
Still unsure.