And to top that off, they found that even in networks where they artificially increased ICS, performance barely suffered.
All networks, or just ones with batch normalization?
That’s a good point of clarification, and it perhaps weakens the point I was making there. From the paper:

“adding the same amount of noise to the activations of the standard (non-BatchNorm) network prevents it from training entirely.”
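For concreteness, here is a minimal sketch of the kind of noise injection the paper uses to artificially induce ICS: time-varying noise with nonzero mean and non-unit variance, resampled at every training step and added to the activations after each BatchNorm layer. This is written in PyTorch; the architecture, layer sizes, and noise magnitudes are illustrative assumptions, not the paper’s exact setup.

```python
import torch
import torch.nn as nn

class NoisyBatchNorm1d(nn.Module):
    """BatchNorm followed by time-varying random noise, to artificially
    re-introduce distributional shift ("ICS") into the activations.
    The noise magnitudes here are illustrative assumptions."""

    def __init__(self, num_features, scale_std=1.0, shift_std=0.5):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features)
        self.scale_std = scale_std
        self.shift_std = shift_std

    def forward(self, x):
        x = self.bn(x)
        if self.training:
            # Resample a random per-feature scale and shift on every forward
            # pass, so the activation distribution changes from step to step.
            scale = 1.0 + self.scale_std * torch.randn(1, x.size(1), device=x.device)
            shift = self.shift_std * torch.randn(1, x.size(1), device=x.device)
            x = x * scale + shift
        return x

# A hypothetical small fully-connected net using the noisy layer. Applying
# the same noise without the BatchNorm layer is the "standard network"
# comparison the quote refers to.
model = nn.Sequential(
    nn.Linear(784, 128),
    NoisyBatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
```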