In particular, it is the singularities of these minimum-loss sets — points at which the tangent is ill-defined — that determine generalization performance.
To clarify: there is not necessarily a problem with the tangent, right? E.g., the function f(x) = x^4 has a singularity at 0 because the second derivative vanishes there, but the tangent is defined. I think for the same reason, some of the pictures may be misleading to some readers.
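To spell the example out (my own working, not from the post): for

$$ f(x) = x^4, \qquad f'(0) = 0, \qquad f''(0) = 0, $$

the origin is a degenerate critical point, yet the tangent line there is simply $y = 0$, perfectly well-defined. So the "singularity" here is a degeneracy of the second derivative (in the SLT setting, of the Hessian / Fisher information), not a failure of the tangent to exist.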
A model, p(x|w), parametrized by weights w∈W⊂Rd, where W is compact;
Why do we want compactness? Neural networks are parameterized over a non-compact set (all of R^d). (Though I guess usually, if things go well, the weights don't blow up, so modeling W as compact may be reasonable.)
The empirical Kullback-Leibler divergence is just a rescaled and shifted version of the negative log likelihood.
I think it is only shifted, and not also rescaled, if I’m not missing something.
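To make the shift-vs-rescale question concrete, here is a small numerical check with a toy Gaussian model of my own choosing (not from the post): the empirical KL divergence differs from the *per-sample average* NLL by a w-independent constant (shift only), but differs from the *total* NLL by both a 1/n rescaling and a shift. So which claim is right depends on which NLL convention the post uses.

```python
import math

# Toy setup (hypothetical, for illustration): true distribution q = N(0, 1),
# model p(x | w) = N(w, 1), a fixed sample of size n.

def log_normal_pdf(x, mu):
    # log density of N(mu, 1)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

xs = [-1.3, 0.2, 0.7, -0.5, 1.1]  # fixed "data" for the check
n = len(xs)

def empirical_kl(w):
    # K_n(w) = (1/n) * sum_i log( q(x_i) / p(x_i | w) )
    return sum(log_normal_pdf(x, 0.0) - log_normal_pdf(x, w) for x in xs) / n

def nll_total(w):
    # total negative log likelihood: -sum_i log p(x_i | w)
    return -sum(log_normal_pdf(x, w) for x in xs)

def nll_avg(w):
    return nll_total(w) / n

# Against the AVERAGE NLL, the difference is a constant independent of w
# (namely (1/n) * sum_i log q(x_i)): a pure shift.
shifts = [empirical_kl(w) - nll_avg(w) for w in (-2.0, 0.0, 0.5, 3.0)]
print(all(abs(s - shifts[0]) < 1e-12 for s in shifts))  # → True
```

Against the total NLL one would instead get empirical_kl(w) = nll_total(w) / n + const, i.e. a rescaling and a shift.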
But these predictions of “generalization error” are actually a contrived kind of theoretical device that isn’t what we mean by “generalization error” in the typical ML setting.
Why is that? I.e., in what way is this generalization error different from what ML people care about? Is it because real ML models don't predict using an updated posterior over the parameter space? (I was just wondering if there is a different reason I'm missing.)
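For reference, the Bayesian generalization error in SLT is usually defined (standard definition, my notation) as

$$ G_n = \mathbb{E}_X\!\left[ \log \frac{q(X)}{p^*(X)} \right], \qquad p^*(x) = \int p(x \mid w)\, p(w \mid D_n)\, dw, $$

i.e. predictions come from the posterior predictive distribution $p^*$, whereas a trained network predicts from a single point estimate of $w$. That posterior-predictive-vs-point-estimate gap seems to be at least one concrete candidate for the difference being asked about here.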