This would require you to sample from GPT during training. If you want a sentence with 500 words you need to evaluate GPT 500 times. As a result, it would slow down training 500 times. The clever thing with GPT (and other autoregressive models) is that they circumvent sampling during training!
This would require you to sample from GPT during training. If you want a sentence with 500 words you need to evaluate GPT 500 times. As a result, it would slow down training 500 times. The clever thing with GPT (and other autoregressive models) is that they circumvent sampling during training!