[Linkpost] Generalization in diffusion models arises from geometry-adaptive harmonic representation
This is a linkpost for https://arxiv.org/abs/2310.02557.
we show that two denoisers trained on non-overlapping training sets converge to essentially the same denoising function. As a result, when used for image generation, these networks produce nearly identical samples. These results provide stronger and more direct evidence of generalization than standard comparisons of average performance on train and test sets. The fact that this generalization is achieved with a small train set relative to the network capacity and the image size implies that the network’s inductive biases are well-matched to the underlying distribution of photographic images.
Here, we showed empirically that diffusion models can achieve a strong form of generalization, converging to a unique density model that is independent of the specific training samples, with an amount of training data that is small relative to the size of the parameter or input spaces. The convergence exhibits a phase transition between memorization and generalization as training data grows. The amount of data needed to cross this phase transition depends on both the image complexity and the neural network capacity (Yoon et al., 2023), and it is of interest to extend both the theory and the empirical studies to account for these. The framework we introduced to assess memorization versus generalization may be applied to any generative model.
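To make the setup concrete, here is a minimal sketch of the kind of comparison the excerpt describes: train two denoisers on disjoint subsets drawn from the same distribution, then measure how close their outputs are on the same noisy test images. The tiny convolutional net, random data, noise level, and cosine-similarity metric below are stand-ins I'm assuming for illustration, not the paper's actual architecture, dataset, or metrics.

```python
# Sketch: do two denoisers trained on non-overlapping data converge to
# (approximately) the same denoising function?
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    # Placeholder network; the paper uses a UNet-style architecture.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def train_denoiser(images, sigma=0.5, steps=500, lr=1e-3):
    """Train to predict clean images from noisy ones with an L2 loss."""
    model = TinyDenoiser()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        noisy = images + sigma * torch.randn_like(images)
        loss = ((model(noisy) - images) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Disjoint training sets from the same distribution (random tensors here,
# just to keep the sketch self-contained; the paper uses photographic images).
data = torch.rand(200, 1, 32, 32)
set_a, set_b = data[:100], data[100:]
model_a = train_denoiser(set_a)
model_b = train_denoiser(set_b)

# Compare the two denoising functions on identical noisy held-out images.
test = torch.rand(32, 1, 32, 32)
noisy_test = test + 0.5 * torch.randn_like(test)
with torch.no_grad():
    out_a, out_b = model_a(noisy_test), model_b(noisy_test)

cos = torch.nn.functional.cosine_similarity(
    out_a.flatten(1), out_b.flatten(1), dim=1
)
print(f"mean cosine similarity between the two denoisers: {cos.mean().item():.3f}")
```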
After skimming this paper I don't feel that impressed. Maybe someone who read it in detail could correct me.
There’s a boring claim and an exciting claim here:
The boring claim is that diffusion models learn to generalize beyond their exact training set: models trained on different training sets drawn from the same distribution will both be pretty good at denoising unseen images from that distribution, and because they're both pretty good, they'll overlap in their suggested denoisings.
The exciting claim is that diffusion models trained on non-overlapping data from the same dataset learn nearly the same algorithm, which can be seen because they produce suggested denoisings that are similar in ways that would be vanishingly unlikely if they didn't overlap mechanistically.
AFAICT, they show the boring claim that everyone already knew, and imply the exciting claim but don’t support it at all.
I haven't read it in detail, but Fig. 2 seems to me to support the exciting claim (also because these are overparameterized models with 70k trainable parameters)?
Okay, sure, I kind of buy it. Generated images from the two models are closer to each other than either is to the nearest image in its training set. And the denoisers learn similar heuristics, like “do averaging” and “there’s probably a face in the middle of the image.”
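On the “closer to each other than to the nearest training image” point, here is roughly what that check looks like as code. The tensors are hypothetical placeholders, assuming you already have paired samples from the two models (generated from the same initial noise) and one model's training set flattened into rows; the paper's actual data and distances are not reproduced here.

```python
# Sketch of the memorization check: are paired generated samples from the
# two models closer to each other than to their nearest training image?
import torch

def nearest_distance(samples, reference):
    """For each sample, L2 distance to its nearest row in `reference`."""
    dists = torch.cdist(samples, reference)  # shape: (n_samples, n_reference)
    return dists.min(dim=1).values

# Hypothetical stand-ins: rows are flattened images.
samples_a = torch.rand(64, 32 * 32)   # samples generated by model A
samples_b = torch.rand(64, 32 * 32)   # samples generated by model B, same noise seeds
train_a = torch.rand(100, 32 * 32)    # model A's training set

# Generalization in the paper's sense would show up as the cross-model
# distance being much smaller than the distance to the closest training image.
cross_model = (samples_a - samples_b).norm(dim=1)
to_train = nearest_distance(samples_a, train_a)
print(f"median cross-model distance: {cross_model.median().item():.3f}")
print(f"median distance to nearest training image: {to_train.median().item():.3f}")
```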
I still don’t really feel excited, but maybe that’s me and not the paper.