The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit.
I think this is a reasonable prediction, but ends up being incorrect:
It decreases far faster than it should; on the top-1 theory, it should be ~flatlined for this whole graph (since for all α>0 the strict majority of labels are still correct). Certainly top-5 should not be decreasing.
This is in the data-constrained case, right?
Maybe noise makes training worse because the model can’t learn to just ignore it due to insufficient data? (E.g., making training more noisy means convergence/compute efficiency is lower.)
Also, does this decrease the size of the dataset by a factor of 5 in the uniform noise case? (Or did they normalize this by using a fixed set of labeled data and then just adding additional noise labels?)
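To make the top-1 argument quoted above concrete, here is a minimal sketch of the idealized case. It is not the experiment's actual setup: `p_correct` (the chance a label is right) is a hypothetical parameterization I'm introducing rather than the α from the graph, the wrong labels are assumed uniform over the other nine digits, and the "model" is assumed to match the empirical label distribution exactly instead of being a trained network. Under those assumptions, top-1 prediction stays correct as long as the true digit is the single most common label, while sampling from the predictive distribution degrades in step with the noise.

```python
import random
from collections import Counter

NUM_CLASSES = 10  # digits 0-9


def noisy_label(true_label, p_correct, rng):
    """Return the true label with probability p_correct, else a uniformly random wrong digit."""
    if rng.random() < p_correct:
        return true_label
    wrong = [d for d in range(NUM_CLASSES) if d != true_label]
    return rng.choice(wrong)


def top1_and_sampling_accuracy(true_label, p_correct, n_labels=20000, seed=0):
    """Compare two idealized predictors fit to the noisy labels of a single input.

    top-1: always predict the most frequent noisy label.
    sampling: predict by drawing from the empirical label distribution.
    """
    rng = random.Random(seed)
    counts = Counter(noisy_label(true_label, p_correct, rng) for _ in range(n_labels))
    top1_prediction = counts.most_common(1)[0][0]
    top1_correct = top1_prediction == true_label
    # A sampler is right exactly as often as the true label appears in the data.
    sampling_accuracy = counts[true_label] / n_labels
    return top1_correct, sampling_accuracy


for p in (0.9, 0.5, 0.2):
    top1_ok, sample_acc = top1_and_sampling_accuracy(true_label=7, p_correct=p)
    print(f"p_correct={p}: top-1 correct={top1_ok}, sampling accuracy~{sample_acc:.2f}")
```

With infinite data, this is the sense in which the top-1 theory predicts a flat curve. The questions above are about how far a finite, data-constrained training run departs from that idealization.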