I am frankly skeptical that this (section 3.9 in the pretrained frozen transformer paper) will hold up to Grad Student Descent on the training hyperparameters. But hey, maybe I’m wrong and there’s some nice property of the pretrained weights that can only be pushed into overfitting by finetuning.
Sure, but if you’re training on less data, it’s because having fewer parameters is worse :P
Not according to this paper! They were able to get performance comparable to full-size networks, it seems. IDK.