I asked the authors for feedback on my summary of the distributional generalization paper, and Preetum responded with the following (copied with his permission):
I agree with everything you’ve said in this summary, so my feedback below is mostly commentary / minor points.
- One intuitive way to think about Feature Calibration is that f(x) is “close to” a sample from p(y|x), where the quality of the “closeness” depends on the power of the classifier.
- Re. “classifiers which do not fit their train set”: As you say, our paper mostly focuses on Distributional Generalization (DG) for interpolating models. But I am hopeful that DG actually holds much more generally, and we should really be thinking of generalization as saying “test and train behaviors are close *as distributions*”. Though we don’t formalize this yet for non-interpolating models, there are some suggestive experiments in Section 7 (eg: the confusion matrix of a model on the test set remains close to its confusion matrix on the train set, throughout the training process. As you start to fit noise on the train set, you see exactly this noise start to appear on the test set. Regularization which prevents fitting noise on the train set also prevents this noise from appearing at test time). [A sketch of how one might check this confusion-matrix comparison appears after the quote.]
- For me, one of the most interesting implications of DG/feature-calibration is that it gives a separation between overparameterized and underparameterized regimes (in the scaling limits of large models/data). With enough data, large enough underparameterized models will converge to Bayes-optimal classifiers, whereas overparameterized models will not (assuming DG). That is, interpolation is not always “benign”; it can actually hurt. [A worked version of this point appears after the quote.]
- You may like the discussion we added on these issues in the short version of our paper: Section 1.3 (“Related Work and Significance”) here: https://mltheory.org/dg_short.pdf (there is no new material in this pdf, outside the Related Work).
- Also, we have a number of supporting experiments for Feature Calibration in the appendix that didn’t make it into the body (eg: more tasks for decision trees, and experiments with “bad” image classifiers like MLPs and RBF kernels).
- Sidenote: The “agreement property” has been bugging me for a while since it seems kind of magical. My current view is that “agreement” may be a special case of a stronger (but less magical) property: The joint distribution (f(x), y) is statistically close to (f(x), f’(x)) on the test set, where f’ is an independently-trained classifier. This can also be seen as an instance of DG, and it implies the agreement property. I sketched this conjecture in this tweet: https://twitter.com/PreetumNakkiran/status/1385741115211530241 (But this is speculative: not in the paper and hasn’t been rigorously tested). [The short argument for why it implies the agreement property is spelled out after the quote.]
- I included this figure in a talk on DG recently—point being that DG is a general definition, which includes both classical generalization and our new conjectures as special cases (and could include other yet-undiscovered behaviors).
- As mentioned at the end of our paper, there are *many* open questions remaining (and I would be very happy to see more work in this area).
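To unpack a few of the technical points above, here are some additions of my own. First, the claim that interpolation can actually hurt. Here is the standard back-of-the-envelope calculation, under the idealized assumption (my simplification, not a statement from the paper) that Feature Calibration holds in the strong pointwise form where f(x) behaves like an independent sample from p(y|x). For binary labels with q(x) = Pr[y = 1 | x], the interpolating classifier’s pointwise error is

$$
\Pr[f(x) \neq y \mid x] = q(x)\bigl(1 - q(x)\bigr) + \bigl(1 - q(x)\bigr)\,q(x) = 2\,q(x)\bigl(1 - q(x)\bigr) \;\geq\; \min\bigl(q(x),\, 1 - q(x)\bigr),
$$

with strict inequality whenever 0 < q(x) < 1 and q(x) ≠ 1/2. The right-hand side is the pointwise Bayes error, which a large enough underparameterized model can approach given enough data; the left-hand side, which can be up to twice as large, is where an interpolating model satisfying this idealization gets stuck.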
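Second, a rough sketch of how one might probe the confusion-matrix observation from Section 7. This is only my reading of the experiment, not the paper’s protocol: the dataset, model, noise pattern, and the choice to measure both confusion matrices against the original (pre-noise) labels are placeholders of mine.

```python
# Sketch (mine, not from the paper): track how a model's train- and test-set
# confusion matrices evolve as it starts to fit injected label noise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Inject structured label noise: relabel 20% of train examples of class 0 as class 1.
rng = np.random.default_rng(0)
y_noisy = y_tr.copy()
flip = (y_tr == 0) & (rng.random(len(y_tr)) < 0.2)
y_noisy[flip] = 1

# warm_start + max_iter=1 lets us inspect the model after each training pass.
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=1, warm_start=True,
                    random_state=0)
for epoch in range(40):
    clf.fit(X_tr, y_noisy)
    # Both matrices are measured against the original (pre-noise) labels.
    cm_train = confusion_matrix(y_tr, clf.predict(X_tr), normalize="true")
    cm_test = confusion_matrix(y_te, clf.predict(X_te), normalize="true")
    # The conjecture predicts these stay close: once the model starts fitting
    # the 0 -> 1 flips on the train set, the (0, 1) entry of the test
    # confusion matrix should grow in step with the train one.
    print(f"epoch {epoch:2d}  train 0->1: {cm_train[0, 1]:.2f}  "
          f"test 0->1: {cm_test[0, 1]:.2f}  "
          f"max gap: {np.abs(cm_train - cm_test).max():.2f}")
```

Regularization (e.g. early stopping or a smaller network) that keeps the model from fitting the flipped labels should, per the quote, also keep the corresponding entry of the test confusion matrix from growing.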
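Finally, the short argument (my gloss) for why the stronger conjecture in the sidenote implies the agreement property: “f(x) = y” is just an event under the joint distribution of (f(x), y), so if that distribution is statistically close to that of (f(x), f’(x)) on the test set, the two events have nearly the same probability:

$$
\Pr_{x}\bigl[f(x) = y\bigr] \;=\; \Pr_{(a,b)\sim(f(x),\,y)}\bigl[a = b\bigr] \;\approx\; \Pr_{(a,b)\sim(f(x),\,f'(x))}\bigl[a = b\bigr] \;=\; \Pr_{x}\bigl[f(x) = f'(x)\bigr],
$$

where the middle approximation is off by at most the total-variation distance between the two joint distributions. The left-hand side is f’s test accuracy and the right-hand side is the rate at which two independently trained classifiers agree, which is exactly the agreement property.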