My naive thought was that you’d have to use differentiable augmenters to make generator-side augmentation work at all.
I believe the data augmentations in question are all differentiable, so you can backprop from the augmented images to G. (Which is not to say they are easy to get working: the reason Zhao et al 2020 came out before we got SimCLR working on our own BigGAN is that lucidrains & Shawn Presser did get SimCLR working—we think—except it only runs on GPUs (which we don’t have enough of to train BigGAN on) or on TPU CPUs (where it memory-leaks). Very frustrating, especially now that Zhao shows that SimCLR would have worked for us.)
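To make the “differentiable” part concrete, here is a minimal PyTorch-style sketch of generator-side differentiable augmentation (the `T`/`G`/`D` names and the particular ops are mine for illustration, not taken from Zhao et al’s code): the augmentation is applied to both real and generated images before the discriminator, and because it is built purely out of differentiable tensor ops, the generator loss backpropagates through the augmentation into G.

```python
# A minimal sketch (assuming PyTorch; T/G/D and the specific ops are illustrative).
# The key point: T is built only from differentiable tensor ops, so the generator
# loss D(T(G(z))) backpropagates through T into G.
import torch
import torch.nn.functional as F

def T(x):
    """Differentiable augmentation: random brightness shift + random translation."""
    # Brightness: add a per-sample scalar in [-0.2, 0.2].
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5) * 0.4
    # Translation: shift each image by up to ~12% via an affine sampling grid.
    n = x.size(0)
    theta = torch.zeros(n, 2, 3, device=x.device)
    theta[:, 0, 0] = 1.0
    theta[:, 1, 1] = 1.0
    theta[:, :, 2] = (torch.rand(n, 2, device=x.device) - 0.5) * 0.25
    grid = F.affine_grid(theta, x.shape, align_corners=False)
    return F.grid_sample(x, grid, padding_mode="reflection", align_corners=False)

def d_loss(D, G, reals, z):
    # Both real and fake images pass through T before the discriminator.
    fakes = G(z).detach()
    return (F.softplus(-D(T(reals))) + F.softplus(D(T(fakes)))).mean()

def g_loss(D, G, z):
    # Gradients flow D(T(G(z))) -> T -> G; no special handling needed.
    return F.softplus(-D(T(G(z)))).mean()
```

Random translation, cutout, and color jitter can all be written as tensor ops like this, so the standard augmentation suite stays end-to-end differentiable.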
I’m averse to using Twitter, but I will consider being motivated enough to sign up and ask.
I assume he has email; he also hangs out on our Discord and answers questions from time to time.
I think it’s decently likely I’m confused here.
It’s definitely a confusing topic. Most GAN researchers seem to sort of shrug and… something something the Nash equilibrium minimizes the Jensen–Shannon divergence something something converges with decreasing learning rate in the limit, well, it works in practice, OK? Nothing like likelihood or VAE or flow-based models, that’s for sure. (On the other hand, nobody’s ever trained those on something like JFT-300M, and the compute requirements for something like OpenAI Jukebox are hilarious—what is it, 17 hours on a V100 to generate a minute of audio?)
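For the record, the “something something” does have a real theorem behind it, even if it says nothing about how alternating SGD on the losses people actually train with behaves. A sketch of the Goodfellow et al 2014 result, stated for the original minimax objective:

```latex
% Goodfellow et al. 2014, Theorem 1 (sketch): for a fixed G, the optimal
% discriminator is D^*(x) = p_data(x) / (p_data(x) + p_g(x)); substituting it back,
\begin{aligned}
V(G, D^*) &= \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D^*(x)\right]
            + \mathbb{E}_{x \sim p_g}\!\left[\log\bigl(1 - D^*(x)\bigr)\right] \\
          &= -\log 4 + 2\,\mathrm{JSD}\!\left(p_{\mathrm{data}} \,\|\, p_g\right),
\end{aligned}
% which is minimized (at -log 4) exactly when p_g = p_data.
```

Everything between that equilibrium statement and the hinge losses, finite samples, and alternating optimizers people actually use is where the shrugging happens.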