Thanks! Your answer no. 2 is especially convincing to me; I didn't realize the authors used smaller models as the baseline, which seems like an unfair comparison! I would like to see how well these 0.1%-tuned transformers do compared to similarly-sized transformers trained from scratch.
I don't think similarly-sized transformers would do much better, and they might do worse. Section 3.4 shows that large models trained from scratch massively overfit the data. I also vaguely recall the authors saying that similarly-sized transformers tended to be harder to train.
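(For context on what "0.1%-tuned" means here: the setup freezes the pretrained backbone and trains only the layer norms plus small input/output layers, which is where the roughly 0.1%-of-parameters figure comes from. A minimal PyTorch sketch of that idea, assuming the Hugging Face transformers library and GPT-2 as the backbone; the class and dimension names are illustrative, not from the paper:)

```python
import torch
import torch.nn as nn
from transformers import GPT2Model  # assumes Hugging Face transformers is installed


class FrozenPretrainedTransformer(nn.Module):
    """Freeze the pretrained backbone; train only the layer norms plus
    small input/output projections (the ~0.1%-of-parameters regime)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        # Freeze everything, then unfreeze just the layer-norm parameters
        # (GPT-2 names them ln_1, ln_2, and ln_f).
        for name, param in self.backbone.named_parameters():
            param.requires_grad = "ln" in name
        hidden = self.backbone.config.n_embd
        self.input_proj = nn.Linear(in_dim, hidden)    # trained from scratch
        self.output_proj = nn.Linear(hidden, out_dim)  # trained from scratch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Bypass the token embedding by feeding projected inputs directly.
        h = self.backbone(inputs_embeds=self.input_proj(x)).last_hidden_state
        return self.output_proj(h)


model = FrozenPretrainedTransformer(in_dim=8, out_dim=2)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4%}")  # well under 0.1% of GPT-2
```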