Overall I guess this should shorten timelines, because the effect you explain here is outweighed by the other first-order effect of “oh geez, it looks like our earlier scaling projections were inefficient; for any performance level, we now know how to reach that level for less compute cost than the earlier projections said.” What do you think?
It ought to shorten actual timelines, for the reason you say. (Except insofar as data sourcing could actually become a practical problem.)
However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed. (It’s the parameter count of a model that uses about as much inference compute as the brain.)
This is a weird thing about Bio Anchors: it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline. It’s always waiting for its “sufficiently expensive model” (and it does not care that this model keeps “getting better” in terms of loss, etc. as the efficiency improvements roll in).
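To make both effects concrete, here’s a rough back-of-the-envelope sketch in Python. It assumes the Hoffmann et al. (2022) fitted loss constants, the standard C ≈ 6ND FLOPs approximation, and the ~20-tokens-per-parameter compute-optimal rule of thumb that falls out of their fits; the 175B/300B GPT-3 allocation and the N* = 1e14 brain-scale parameter count are illustrative stand-ins, not Bio Anchors’ actual numbers.

```python
# Back-of-the-envelope sketch of both effects. Assumed inputs (mine, not
# from Bio Anchors): the Hoffmann et al. (2022) fitted loss
#   L(N, D) = E + A / N**alpha + B / D**beta
# with their published Approach-3 constants, the standard C ~= 6*N*D FLOPs
# approximation, and the ~20-tokens-per-parameter compute-optimal rule.

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n, d):
    """Estimated training loss for n parameters trained on d tokens."""
    return E + A / n**ALPHA + B / d**BETA

def flops(n, d):
    """~6 FLOPs per parameter per training token."""
    return 6 * n * d

# Effect 1 (shortens real timelines): at a fixed compute budget, the new
# allocation reaches a lower loss. Compare GPT-3's actual allocation
# (175B params, 300B tokens) against a compute-matched optimal model.
c = flops(175e9, 300e9)
n_opt = (c / 120) ** 0.5   # from C = 6 * N * (20 * N) = 120 * N**2
d_opt = 20 * n_opt
print(f"GPT-3 allocation:  loss ~ {loss(175e9, 300e9):.3f}")
print(f"compute-optimal:   loss ~ {loss(n_opt, d_opt):.3f}  "
      f"({n_opt:.2e} params, {d_opt:.2e} tokens)")

# Effect 2 (lengthens the Bio Anchors timeline): the anchor fixes a
# parameter count N* via brain inference compute. Training that fixed N*
# compute-optimally now means ~20 * N* tokens, i.e. C = 120 * N***2,
# more training compute than the old under-trained allocations implied.
n_star = 1e14  # hypothetical brain-scale parameter count, for illustration
print(f"compute to train N* = {n_star:.0e} optimally: "
      f"{flops(n_star, 20 * n_star):.2e} FLOPs")
```

The second print is only directional: as the optimal tokens-per-parameter ratio rises, the training compute needed to reach any fixed parameter threshold rises with it.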
Anyway, I’d forgotten the prior used for dataset scaling in Bio Anchors, but it’s pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling. So this news is less relevant than I had thought.
I suppose that depends on whether you think this constitutes several years of progress over and above what you would have expected. I don’t think this comes close to that, so I think the effect is much smaller.
OK, good to know. I look forward to seeing the performance trends updated with the new scaling paradigm/law.
(In terms of the neural network model, this means lowering our estimate for how many parameters will be needed.)