This is great! Maybe you’d get better results if you “distill” GPT2-LN into GPT2-noLN by fine-tuning it to match GPT2-LN’s full next-token probability distribution on OpenWebText.
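For concreteness, a minimal sketch of the distillation loss this would use: a KL divergence from the teacher’s (GPT2-LN’s) full next-token distribution to the student’s (GPT2-noLN’s), rather than cross-entropy on the one-hot next token. The function name and the `temperature` parameter here are just illustrative, not from the original post.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence from the teacher's full next-token distribution
    to the student's, averaged with 'batchmean' reduction.

    Logits have shape (batch, seq_len, vocab). `temperature` is a
    standard distillation knob (assumption, not from the original post).
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t
```

In the suggested setup, one would run frozen GPT2-LN as the teacher and GPT2-noLN as the student over OpenWebText batches and minimize this loss, so the student matches the whole distribution instead of only the sampled tokens.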