This paper offers a fairly intuitive explanation for why flatter minima generalize better: suppose the training and test losses have distinct, but nearby, minima. Then the curvature around the training minimum acts as the second-order term in a Taylor expansion that approximates the expected test loss for models near the training minimum.
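Concretely, here’s a minimal sketch of that argument in my own notation (θ_train, θ_test, δ, and H are my labels, not necessarily the paper’s):

\[
L_{\text{test}}(\theta_{\text{train}}) \;\approx\; L_{\text{test}}(\theta_{\text{test}}) \;+\; \tfrac{1}{2}\,\delta^{\top} H \,\delta, \qquad \delta = \theta_{\text{train}} - \theta_{\text{test}},
\]

where the first-order term vanishes because θ_test minimizes the test loss, and H is the curvature there, which, if the two minima really are nearby and similarly shaped, is roughly the curvature around the training minimum. A flatter H means the same offset δ costs less extra test loss.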
I feel like this explanation is just restating the question. Why are the minima of the training and test losses often close to each other? What makes reality work that way?
You can come up with some explanation involving mumble mumble fine-tuning, but I feel like that just leaves us where we started.
My intuition: small changes to most parameters don’t influence behavior that much, especially if you’re in a flat basin. The local region in parameter space thus contains many possible small variations in model behavior. The behavior that solves the training data is similar to the behavior that solves the test data, because both are drawn from the same distribution. It’s thus likely that a nearby point in parameter space is a minimum for the test data.
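To convince myself of the “same distribution, so nearby minima” step, here’s a toy numpy sketch (entirely my own construction, nothing from the paper): fit a linear least-squares model on two independent splits drawn from the same generating process and check how far apart the two empirical minima land.

```python
# Toy sketch (my own construction, not from the paper): empirical loss
# minima for train and test sets drawn from the same distribution tend
# to land close together, because both are noisy estimates of the same
# population minimum.
import numpy as np

rng = np.random.default_rng(0)

def sample_split(n):
    # Same data-generating process for every split: y = 2x + 1 + noise.
    x = rng.uniform(-1, 1, size=n)
    y = 2 * x + 1 + 0.1 * rng.normal(size=n)
    X = np.column_stack([x, np.ones(n)])
    return X, y

def least_squares_minimum(X, y):
    # Exact minimizer of the squared loss for this linear model.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

theta_train = least_squares_minimum(*sample_split(500))
theta_test = least_squares_minimum(*sample_split(500))

print("train minimum:", theta_train)
print("test minimum: ", theta_test)
print("offset ||delta||:", np.linalg.norm(theta_train - theta_test))
# Both minima sit near the population optimum (2, 1), so the offset is small.
```

With enough samples both empirical minima concentrate around the same population optimum, so the offset δ between them is small, which is exactly the regime where the curvature argument above applies.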