Thanks for sharing thoughts and links: discriminator ranking, SimCLR, CR, and BCR are all interesting and I hadn’t run into them yet. My naive thought was that you’d have to use differentiable augmenters to fit augmentation into the generator’s training loop.
You can ask him on Twitter.
I’m averse to using Twitter, but I might be motivated enough to sign up and ask. Thanks for pointing this out.
“compression” is not a helpful concept here because every single generative model trained in any way is “compressing”
I am definitely using this concept too vaguely, although I was gesturing at compression in the discriminator rather than the generator. Thinking of the discriminator as a lossy compressor in this way would be… positing a mapping f: discriminator weights → distributions, which for trained weights does not fully recapture the training distribution? We could see G as attempting to match this imperfect distribution (since it never directly receives the training examples), and D as updating its weights to simultaneously (1) capture the training distribution as f(D) and (2) keep f(D) away from G’s outputs. Hence my suspicion that D might be “obfuscating”: in this picture, f(D) is pressured to become a more complicated manifold while staying close to the training distribution, making it harder for G to fit.
Is such an f implicit in the discriminator outputs? I think it is, just by normalizing D’s outputs across the whole input space, although that’s computationally infeasible. I’d be interested in work that attempts to recover the training distribution from D alone.
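If I spell out the textbook version (Goodfellow et al. 2014) to make this concrete: for a fixed generator, the optimal discriminator is

$$D^\ast(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)},$$

which inverts to

$$p_{\text{data}}(x) \propto \frac{D^\ast(x)}{1 - D^\ast(x)}\, p_G(x),$$

so in the idealized fully-trained case the training density is recoverable from D, though only up to exactly the normalization problem above, and only jointly with the generator’s density. (This density ratio D/(1 − D) is also what discriminator rejection sampling, Azadi et al. 2019, exploits.)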
My naive thought was that you’d have to use differentiable augmenters to fit augmentation into the generator’s training loop.
I believe the data augmentations in question are all differentiable, so you can backprop from the augmented images to G. (Which is not to say they are easy: the reason that Zhao et al. 2020 came out before we got SimCLR working on our own BigGAN is that lucidrains & Shawn Presser got SimCLR working—we think—except it only works on GPUs, which we don’t have enough of to train BigGAN on, and TPU CPUs, where it memory-leaks. Very frustrating, especially now that Zhao shows that SimCLR would have worked for us.)
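A minimal sketch of why differentiability is the whole trick, in PyTorch-style code (the particular augmentations and the G/D/optimizer objects here are illustrative stand-ins, not Zhao et al.’s actual DiffAugment policy):

```python
import torch

def diff_augment(x):
    # Augmentations expressed purely as differentiable tensor ops:
    # a random brightness shift and a random translation. (Illustrative;
    # Zhao et al. 2020 also use saturation, contrast, and cutout.)
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5)
    dx, dy = torch.randint(-2, 3, (2,)).tolist()
    return torch.roll(x, shifts=(dx, dy), dims=(2, 3))

def generator_step(G, D, opt_G, z):
    # D only ever sees augmented images, but because diff_augment is
    # built from differentiable ops, the gradient flows through it
    # back into G.
    fake = G(z)
    loss = -D(diff_augment(fake)).mean()  # hinge-style generator loss
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```

The same diff_augment gets applied to the real images in D’s update, so D never learns to treat “augmented” as a tell for “fake”.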
I’m averse to using Twitter, but I might be motivated enough to sign up and ask.
I assume he has email; he also hangs out on our Discord and answers questions from time to time.
I think it’s decently likely I’m confused here.
It’s definitely a confusing topic. Most GAN researchers seem to sort of shrug and… something something the Nash equilibrium minimizes the Jensen–Shannon divergence something something converges with decreasing learning rate in the limit, well, it works in practice, OK? Nothing like likelihood-based, VAE, or flow models, that’s for sure. (On the other hand, nobody’s ever trained those on something like JFT-300M, and the compute requirements for something like OpenAI Jukebox are hilarious—what is it, 17 hours on a V100 to generate a minute of audio?)
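For the record, the result being hand-waved at is from the original GAN paper (Goodfellow et al. 2014): at the inner optimum of

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],$$

the generator is effectively minimizing

$$C(G) = -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_G),$$

which is minimized exactly when p_G = p_data; the shrugging is over whether alternating SGD on two finite networks ever actually reaches that equilibrium.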