These are good questions, and some of the points that suggest we don’t really understand what GANs do or why they work. They are something I’ve previously highlighted in my writeups: https://www.gwern.net/Faces#discriminator-ranking * & https://github.com/tensorfork/tensorfork/issues/28 respectively.

The D memorization is particularly puzzling when you look at improvements to GANs: most recently, BigGAN got (fixed) data augmentation & SimCLR losses. One can understand why spatial distortions & SimCLR might help D under the naive theory that D learns the realism and structure of real images in order to penalize errors by G, but then how do we explain D guessing at chance on ImageNet validation...?
Further, how do we explain the JFT-300M stability either, given that it seems unlikely that D is ‘memorizing datapoints’ when the batch sizes suggest that the JFT-300M runs in question ran for only a few epochs at most? (mooch generally runs with minibatches of at most n=2048, so even 500k iterations is only ~3.4 epochs.)
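Spelling that arithmetic out (a back-of-the-envelope sketch; the figures are the approximate ones quoted above, not exact run configurations):

```python
# Rough epoch count for the JFT-300M BigGAN runs, using the approximate
# figures above (500k iterations, n=2048 minibatches, ~300M images).
iterations = 500_000
batch_size = 2_048
dataset_size = 300_000_000  # ~number of images in JFT-300M

epochs = iterations * batch_size / dataset_size
print(f"{epochs:.1f} epochs over the dataset")  # -> 3.4 epochs
```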
Note that the discriminator has far fewer parameters than there are bytes to memorize, so it necessarily is performing some sort of (lossy) compression to do well on the training set.
Eh. “compression” is not a helpful concept here because every single generative model trained in any way is “compressing”. (Someone once put up a website for using GPT-2 as a text compressor, because any model that emits likelihoods conditional on a history can be plugged into an arithmetic encoder and is immediately a lossless compressor/decompressor.)
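(A minimal sketch of that equivalence, with a hypothetical `toy_model_prob` standing in for the real model’s conditional probabilities: the ideal code length under any such model is the sum of -log2 p per symbol, which an arithmetic coder driven by the same probabilities achieves losslessly to within a couple of bits.)

```python
import math

def toy_model_prob(token, history):
    """Hypothetical stand-in for a model's p(token | history);
    here, just a uniform distribution over the 256 byte values."""
    return 1.0 / 256

def ideal_code_length_bits(tokens):
    """Shannon code length of the sequence under the model: sum of -log2 p.
    An arithmetic coder fed these probabilities achieves this total
    to within a couple of bits, and is fully reversible (lossless)."""
    bits = 0.0
    for i, tok in enumerate(tokens):
        bits += -math.log2(toy_model_prob(tok, tokens[:i]))
    return bits

data = list(b"hello world")
print(ideal_code_length_bits(data))  # 88.0 bits under the uniform model; a model
                                     # like GPT-2 puts higher p on plausible text,
                                     # so the same string costs far fewer bits.
```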
Based on some other papers I don’t have handy now, I’ve hand-waved that perhaps what a GAN’s D does is learn fuzzy patterns in image-space ‘around’ each real datapoint, while G spirals around each point, trying to approach it and collapse down to emitting the exact datapoint, but is repelled by D; as training progresses, D repels G from increasingly smaller regions around each datapoint. Because G spends its time traversing the image manifold and neural networks are biased towards simplicity, G inadvertently learns a generalizable generative model, even though it ‘wants’ to do nothing but memorize & spit out the original data (as the most certain Nash-equilibrium way to defeat the D—obviously, D cannot possibly discriminate beyond 50-50 if given two identical copies of a real image). This is similar to the view of decision forests and neural networks as adaptive nearest-neighbor interpolators.
They don’t mention whether this also increases discriminator generalization or decreases training set accuracy, which I’d be interested to know.
mooch is pretty good about answering questions. You can ask him on Twitter. (I would bet the answer is probably that the equivalent test was not done on the JFT-300M models. His writeup is very thorough and I would expect him to have mentioned it if that had been done; in general, my impression is that the JFT-300M runs were done with very little time to spare and not nearly as thoroughly, since he spent all his time trying tweaks on BigGAN to get it to work at all.)
* One caveat I haven’t had time to update my writeup with: I found that D ranking worked in a weird way which I interpreted as consistent with D memorization; however, I was recently informed that I had implemented it wrong and it works much better when fixed; but on the gripping hand, they find that the D ranking still doesn’t really match up with ‘realism’ so maybe my error didn’t matter too much.
Thanks for sharing thoughts and links: discriminator ranking, SimCLR, CR, and BCR are all interesting and I hadn’t run into them yet. My naive thought was that you’d have to use differentiable augmenters to fit in generator augmentation.
You can ask him on Twitter.
I’m averse to using Twitter, but I will consider being motivated enough to sign up and ask. Thanks for pointing this out.
“compression” is not a helpful concept here because every single generative model trained in any way is “compressing”
I am definitely using this concept too vaguely, although I was gesturing at compression in the discriminator instead of the generator. Thinking of the discriminator as a lossy compressor in this way would be… positing a mapping f: discriminator weights → distributions, which for trained weights does not fully recapture the training distribution? We could see G as attempting to match this imperfect distribution (since it doesn’t directly receive the training examples), and D as modifying weights to simultaneously 1. try to capture the training distribution as f(D), and 2. try to have f(D) avoid the output of G. Hence why I was thinking D might be “obfuscating”—in this picture, I think f(D) is pressured to be a more complicated manifold while sticking close to the training distribution, making it more difficult for G to fit it.
Is such an f implicit in the discriminator outputs? I think it is, just by normalizing across the whole space, although that’s computationally infeasible. I’d be interested in work that attempts to recover the training distribution from D alone.
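To make that concrete: for the original GAN loss, a discriminator trained to optimality would satisfy the standard identity

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)} \quad\Longleftrightarrow\quad \frac{D^*(x)}{1 - D^*(x)} = \frac{p_{\text{data}}(x)}{p_G(x)},$$

so a trained D implicitly defines something like $\hat{p}(x) \propto p_G(x)\, D(x)/(1 - D(x))$. This is the idealized picture (a real, non-optimal D only approximates it), and turning it into an actual distribution requires both a density for $p_G$ (which G only defines implicitly) and normalization over the whole input space, which is exactly the computationally infeasible part.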
My naive thought was that you’d have to use differentiable augmenters to fit in generator augmentation.
I believe the data augmentations in question are all differentiable, so you can backprop from the augmented images to G. (Which is not to say they are easy: the reason that Zhao et al 2020 came out before we got SimCLR working on our own BigGAN is that lucidrains & Shawn Presser got SimCLR working—we think—except it only works on GPUs, which we don’t have enough of to train BigGAN on, and TPU CPUs, where it memory-leaks. Very frustrating, especially now that Zhao shows that SimCLR would have worked for us.)
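To illustrate what backpropagating from the augmented images to G looks like, here is a minimal sketch (illustrative PyTorch, not the actual Zhao et al 2020 implementation; the particular augmentations are stand-ins):

```python
import torch

def diff_augment(x):
    """Differentiable augmentations applied to both real and generated images;
    every op here is differentiable in x, so generator gradients flow through."""
    # random per-image brightness shift
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5) * 0.5
    # random translation via a circular shift (cheap stand-in for pad-and-crop)
    dx, dy = torch.randint(-4, 5, (2,)).tolist()
    x = torch.roll(x, shifts=(dx, dy), dims=(2, 3))
    return x

# In the training loop, D only ever sees augmented images:
#   d_real = D(diff_augment(real_images))
#   d_fake = D(diff_augment(G(z)))
# Because diff_augment is differentiable, the generator's loss on d_fake
# backpropagates through the augmentation into G's parameters, so G cannot
# satisfy D with images that only look plausible before augmentation.
```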
I’m averse to using Twitter, but I will consider being motivated enough to sign-up and ask.
I assume he has email; he also hangs out on our Discord and answers questions from time to time.
I think it’s decently likely I’m confused here.
It’s definitely a confusing topic. Most GAN researchers seem to sort of shrug and… something something the Nash equilibrium minimizes the Jensen–Shannon divergence something something converges with decreasing learning rate in the limit, well, it works in practice, OK? Nothing like likelihood or VAE or flow-based models, that’s for sure. (On the other hand, nobody’s ever trained those on something like JFT-300M, and the compute requirements for something like OpenAI Jukebox are hilarious—what is it, 17 hours on a V100 to generate a minute of audio?)
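To spell out the folklore result being gestured at (the idealized analysis from the original GAN paper, which assumes D is fully optimized at every step):

$$\max_D V(D, G) = -\log 4 + 2\,\mathrm{JSD}\!\left(p_{\text{data}} \,\|\, p_G\right),$$

so the global optimum of the minimax game is exactly $p_G = p_{\text{data}}$; the gap between that and whatever alternating SGD with a non-optimal D on finite samples actually converges to is precisely where the hand-waving lives.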