I think it’s plausible that the data dependence will act like it’s 3 OOM smaller. Compute dependence will be different, though, right? Even if you’re just finetuning part of the model you have to run the whole thing to do evaluation. In a sense this actually seems like the worst of both worlds (but you get the benefit from pretraining).
Edit: Actually, I’m confused why you say a smaller model needs that factor fewer steps. I thought the slope on that one was actually quite gentle. It’s just that smaller models are cheap—or am I getting it wrong?
I think compute cost equals data x parameters, so even if parameters are the same, if data is 3 OOM smaller, then compute cost will be 3 OOM smaller.
I’m not sure I understand your edit question. I’m referring to the scaling laws as discussed and interpreted by Ajeya. Perhaps part of what’s going on is that in the sizes of models we’ve explored so far, bigger models only need a little bit more data, because bigger models are more data-efficient. But very soon it is prophesied that this will stop and we will transition to a slower scaling law according to which we need to increase data by almost as much as we increase parameter count. So that’s the relevant one I’m thinking about when thinking about TAI/AGI/etc.
I’m not sure how your reply relates to my guess, so I’m a little worried.
If you’re intending the compute comment to be in opposition to my first paragraph, then no: when finetuning a subset of the parameters, compute is not simply proportional to the size of the subset you’re finetuning, because you still have to do all the matrix multiplications of the original model, both for the forward pass and for gradient propagation. I think the point of the paper only finetuning a subset was to make a scientific point, not to save compute.
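To put rough numbers on that, here’s a back-of-the-envelope sketch using the common ~2N-FLOP-forward / ~4N-FLOP-backward approximation (the parameter count and the trainable fraction below are made-up illustrative values, not figures from the paper):

```python
# Rough per-data-point FLOPs when only a fraction of the weights is trainable.
# Assumes the usual 2N-forward / 4N-backward split; N and the 0.1% fraction
# are placeholder numbers for illustration.

def flops_per_datapoint(n_params, trainable_fraction):
    forward = 2 * n_params                             # full forward pass through all weights
    activation_grads = 2 * n_params                    # gradients still propagate through frozen layers
    weight_grads = 2 * n_params * trainable_fraction   # weight gradients only for the trainable subset
    return forward + activation_grads + weight_grads

N = 1e11                                 # hypothetical parameter count
full = flops_per_datapoint(N, 1.0)       # ~6N per data point
frozen = flops_per_datapoint(N, 0.001)   # ~4N per data point, not 0.001 * 6N
print(full / frozen)                     # ~1.5x saving at most, nowhere near 1000x
```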
My edit question was just because you said something about expecting the # of steps to be 3 OOM smaller for a 3 OOM smaller model. But iirc it’s really more like the compute will be smaller, while the # of steps won’t change much (they’re just cheaper).
Do you have a reference for this picture of “need lots more data to get performance improvements?” I’ve also heard some things about a transition, but as a transition from compute-limited to data-limited, which means “need lots more compute to get performance improvements.”
I totally agree that you still have to do all the matrix multiplications of the original model etc. etc. I’m saying that you’ll need to do them fewer times, because you’ll be training on less data.
Each step costs, say, 6*N FLOP, where N is the parameter count. And then you do D steps, where D is how many data points you train on. So the total FLOP cost is 6*N*D. When you fine-tune, you still spend 6*N per data point, but you only need to train on 0.001*D data points, at least according to the scaling laws, at least according to the orthodox interpretation around here.
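In case the arithmetic is easier to follow in code, here is a minimal sketch of exactly that estimate (N and D are placeholder numbers for illustration, not figures from Ajeya’s report):

```python
# The 6*N*D estimate from above, with made-up N and D for illustration.

def training_flops(n_params, n_datapoints):
    return 6 * n_params * n_datapoints   # ~6 FLOP per parameter per data point

N = 1e11   # hypothetical parameter count (unchanged when fine-tuning)
D = 1e12   # hypothetical pretraining data points

pretrain = training_flops(N, D)           # 6*N*D
finetune = training_flops(N, 0.001 * D)   # same 6*N per data point, 3 OOM fewer data points

print(f"{pretrain:.1e} vs {finetune:.1e} FLOP ({pretrain / finetune:.0f}x less)")  # ~1000x
```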
I’d recommend reading Ajeya’s report (found here) for more on the scaling laws. There’s also this comment thread.
Sure, but if you’re training on less data it’s because fewer parameters is worse :P
Not according to this paper! They were able to get performance comparable to full-size networks, it seems. IDK.
I am frankly skeptical that this (section 3.9 in the pretrained frozen transformer paper) will hold up to Grad Student Descent on training parameters. But hey, maybe I’m wrong and there’s some nice property of the pretrained weights that can only be pushed into overfitting by finetuning.