I totally agree that you still have to do all the matrix multiplications of the original model etc. etc. I’m saying that you’ll need to do them fewer times, because you’ll be training on less data.
Each step costs, say, 6*N FLOPs, where N is the parameter count. Then you do D steps, where D is how many data points you train on, so the total cost is 6*N*D FLOPs. When you fine-tune, you still spend 6*N per data point, but you only need to train on about 0.001*D data points, at least according to the scaling laws, or at least to the orthodox interpretation of them around here.
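To make the arithmetic concrete, here's a minimal sketch of the comparison. The numbers plugged in (N, D, and the 0.001 fine-tuning fraction) are purely illustrative, not taken from any particular paper:

```python
# Rough FLOP comparison, assuming the ~6*N*D training-cost rule of thumb.
# N, D, and finetune_frac below are illustrative placeholders.

N = 175e9               # parameter count
D = 300e9               # data points seen in pretraining
finetune_frac = 0.001   # fraction of D needed for fine-tuning, per the argument above

pretrain_flop = 6 * N * D
finetune_flop = 6 * N * (finetune_frac * D)

print(f"pretraining: {pretrain_flop:.2e} FLOP")   # ~3.15e23
print(f"fine-tuning: {finetune_flop:.2e} FLOP")   # ~3.15e20, i.e. 1000x cheaper
```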
I’d recommend reading Ajeya’s report (found here) for more on the scaling laws. There’s also this comment thread.
Sure, but if you can get away with training on less data, it's because you're only training fewer parameters, and fewer parameters is worse :P
Not according to this paper! They were able to get performance comparable to full-size networks, it seems. IDK.
I am frankly skeptical that this (section 3.9 in the pretrained frozen transformer paper) will hold up to Grad Student Descent on training parameters. But hey, maybe I’m wrong and there’s some nice property of the pretrained weights that can only be pushed into overfitting by finetuning.
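For concreteness, here's roughly what the frozen-pretrained-transformer setup looks like in code. This is a sketch assuming a HuggingFace GPT-2 backbone, not the paper's exact configuration, and the particular subset unfrozen below (layer norms and positional embeddings) is just illustrative of the "train only a tiny fraction of the weights" idea:

```python
# Minimal sketch of the frozen-pretrained-transformer setup discussed above.
# The backbone ("gpt2") and the unfrozen parameter subset are assumptions,
# chosen for illustration rather than taken from the paper.
from transformers import GPT2Model

backbone = GPT2Model.from_pretrained("gpt2")

# Freeze everything...
for p in backbone.parameters():
    p.requires_grad_(False)

# ...then unfreeze only a small subset to fine-tune
# (layer norms + positional embeddings here).
trainable = 0
for name, p in backbone.named_parameters():
    if "ln_" in name or "wpe" in name:
        p.requires_grad_(True)
        trainable += p.numel()

total = sum(p.numel() for p in backbone.parameters())
print(f"training {trainable:,} of {total:,} parameters "
      f"({100 * trainable / total:.2f}%)")
```

The point of the comparison in the thread is that only the unfrozen parameters get gradient updates, so the question is whether that tiny trainable fraction really matches full fine-tuning, or whether it just hasn't been pushed hard enough yet.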