Yeah, GANs for sequences are one of those ideas that people kept trying and that never worked. It wasn’t entirely clear why; I suspect that much of it was simply that, due to the inefficiency of RL and the extreme smolness of all the GAN sequence work back then*, it was all dead on arrival. (I never really bought the “it’s just equivalent to likelihood” argument. GANs always seemed to operate in images in a very qualitatively distinct way from all likelihood-based approaches; and if you look at things abstractly enough, you can make anything equivalent to anything like that.) It’s possible that retrying today with proper scale might work, the same way that image GANs now work at scale (despite being left for dead by contemporary researchers who had failed to note that BigGAN scaled just fine to JFT-300M).
But my real suspicion is that direct generative learning is too efficient, so the proper role for GANs would be as an additional phase of training, to sharpen a standard LLM.
AFAIK, this has not been done except inasmuch as you interpret the various preference-learning approaches as actor-critic RL (which means you can also further interpret them as GANs). Given how well diffusion models can be tuned by a simple adversarial loss into a GAN-like single-step generator, I suspect that some adversarial training of LLMs might be quite useful. I should poke around on arXiv and see if anyone’s tried that yet...
* LSTM RNNs, or heck, GPTs, wouldn’t look all that impressive either if they were trained with compute/data similar to what those sequence GAN papers used.