Does this mean that this fine-tuning process can be thought of as training an NN that is 3 OOMs smaller, and thus needs 3 OOMs fewer training steps according to the scaling laws?
My guess is that the answer is mostly yes (maybe not the exact numbers predicted by existing scaling laws, but similar ballpark).
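As a rough illustration of why I expect “similar ballpark, not exact numbers”: under a Kaplan-style power law relating model size to the data needed to train it (roughly D ∝ N^0.74 for the overfitting threshold, going from memory; none of this comes from the new paper), a model with 1000x fewer trainable parameters needs something like 170x less data rather than 1000x less. A back-of-the-envelope sketch:

```python
# Back-of-the-envelope only: how much less data a model k times smaller
# might need under a power law D ~ N^alpha. The exponent 0.74 is my
# recollection of the Kaplan et al. overfitting-threshold fit, not a
# number from the paper under discussion.

def data_reduction(param_reduction: float, alpha: float = 0.74) -> float:
    """Factor by which data requirements shrink when parameter count shrinks by `param_reduction`."""
    return param_reduction ** alpha

smaller_by = 1e3  # 3 OOMs smaller, i.e. training ~0.1% of the weights
print(f"~{data_reduction(smaller_by):.0f}x less data")  # prints ~166x, i.e. ~2.2 OOMs rather than 3
```

Of course, fine-tuning 0.1% of a pretrained model’s weights is not literally the same thing as training a 1000x-smaller model from scratch, which is another reason to only trust the ballpark.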
I think this is mostly irrelevant to timelines / previous scaling laws for transfer:
1. You still have to pretrain the Transformer, which will take the usual amount of compute (my calculation that you linked takes this into account).
2. The models trained in the new paper are not particularly strong. They are probably equivalent in performance to models that are multiple orders of magnitude smaller and trained from scratch. (I think when comparing against training from scratch, the authors did use smaller models because that was more stable, though with a quick search I couldn’t find anything confirming that right now.) So if you think of the “default” as “train an X-parameter model from scratch”, then to get equivalent performance you’d probably want to do something like “pretrain a 100X-parameter model, then finetune 0.1% of its weights”; see the sketch after this list for what that looks like mechanically. (Numbers completely made up.)
3. I expect there are a bunch of differences in how exactly the models are trained. For example, the scaling law papers work almost exclusively with compute-optimal training, whereas this paper probably works with models trained to convergence.
4. You probably could come to a unified view that incorporates both this new paper and the previous scaling law papers, but I expect you’d need to spend a bunch of time getting into the minutiae of the two setups. (Probably high tens to low hundreds of hours.)
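To make point 2 above concrete: here is a minimal PyTorch sketch of what “finetune 0.1% of its weights” looks like mechanically, with everything frozen except a small parameter subset. I’m using Hugging Face’s GPT-2 and the layer norms purely as stand-ins; the paper may well unfreeze a different subset (e.g. input/output layers too), and the exact fraction will differ.

```python
from transformers import GPT2Model  # GPT-2 used purely as a stand-in backbone

model = GPT2Model.from_pretrained("gpt2")

# Freeze every pretrained weight.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze a tiny subset: here, the layer-norm parameters ("ln_1", "ln_2", "ln_f").
trainable = 0
for name, param in model.named_parameters():
    if "ln_" in name:
        param.requires_grad = True
        trainable += param.numel()

total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4%}")  # ~0.03% for GPT-2 small
```

An optimizer built from just the unfrozen parameters (e.g. `[p for p in model.parameters() if p.requires_grad]`) then updates only that sliver of the model; everything else stays at its pretrained values.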
Thanks! Your answer no. 2 is especially convincing to me; I didn’t realize the authors compared against smaller models, which seems unfair! I would like to see how well these 0.1%-tuned transformers do compared to similarly sized transformers trained from scratch.
I don’t think similarly sized transformers would do much better, and they might do worse. Section 3.4 shows that large models trained from scratch massively overfit to the data. I also vaguely recall the authors saying that similarly sized transformers tended to be harder to train.