I am frankly skeptical that this (section 3.9 in the pretrained frozen transformer paper) will hold up to Grad Student Descent on training hyperparameters. But hey, maybe I’m wrong and there’s some nice property of the pretrained weights that can only be pushed into overfitting by finetuning.
Not according to this paper! They were able to get performance comparable to full-size networks, it seems. IDK.