I agree that this is progress (now that I understand it better), though:
If SGD is MAP, then it seems plausible that e.g. SGD + random initial conditions, or simulated annealing, would give you something like the top N posterior models.
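To make that concrete, here is a minimal sketch (mine, not anything from the discussion; the toy data, architecture, and hyperparameters are all arbitrary) of reading "SGD + random initial conditions" as a crude way to collect the top N posterior models: run SGD from many seeds and keep the lowest-loss solutions, with final training loss standing in for negative log posterior density under the SGD-as-MAP reading.

```python
import torch
import torch.nn as nn

def train_one_model(X, y, seed, epochs=200, lr=0.1):
    torch.manual_seed(seed)  # different seed -> different initial conditions
    model = nn.Sequential(nn.Linear(X.shape[1], 32), nn.Tanh(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    with torch.no_grad():
        final_loss = loss_fn(model(X), y).item()
    return model, final_loss

# Toy regression data: y = sin(3x) + noise.
torch.manual_seed(0)
X = torch.rand(256, 1) * 2 - 1
y = torch.sin(3 * X) + 0.1 * torch.randn_like(X)

# Run SGD from many random initial conditions and keep the N lowest-loss
# solutions -- the "top N posterior models" in this reading, with final
# training loss as a stand-in for negative log posterior.
runs = [train_one_model(X, y, seed) for seed in range(20)]
top_n = sorted(runs, key=lambda r: r[1])[:5]
print([round(loss, 4) for _, loss in top_n])
```

This is essentially the deep-ensembles recipe, just with the ensemble members ranked and truncated rather than averaged.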
I think there is strong evidence that the behaviors of models trained via the same basic training process are likely to be highly correlated. This sort of correlation is related to low variance in the bias-variance tradeoff sense, and there is evidence not only that massive neural networks tend to have pretty low variance, but also that this variance is likely to continue to decrease as networks become larger.
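For what it's worth, here is a minimal sketch of how one could measure "variance in the bias-variance tradeoff sense" directly (my framing; the widths, data, and library choices are arbitrary, not the commenter's): train the same architecture on many independently drawn training sets and look at how much its predictions at fixed test inputs vary across runs. If the low-variance claim holds, that number should shrink as the network gets wider.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def sample_dataset(n=200):
    # Fresh draw of a training set each time, so across-run variation
    # reflects both data resampling and initialization.
    X = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)
    return X, y

X_test = np.linspace(-1, 1, 50).reshape(-1, 1)

def prediction_variance(width, n_runs=20):
    # Train n_runs copies of the same architecture on independent datasets,
    # then average the across-run variance of predictions over test points.
    preds = []
    for seed in range(n_runs):
        X, y = sample_dataset()
        model = MLPRegressor(hidden_layer_sizes=(width,), max_iter=2000,
                             random_state=seed)
        model.fit(X, y)
        preds.append(model.predict(X_test))
    return np.var(np.stack(preds), axis=0).mean()

# If the low-variance claim holds, the wider network's predictions should be
# more tightly clustered across runs, i.e. this number should be smaller.
for width in (8, 256):
    print(width, prediction_variance(width))
```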
Hmm, added to reading list, thank you.