nostalgebraist comments on [Link] Training Compute-Optimal Large Language Models

nostalgebraist 1 Apr 2022 19:47 UTC
LW: 13 AF: 7
It ought to shorten actual timelines, for the reason you say. (Except insofar as data sourcing could actually become a practical problem.)
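However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed. (It’s the parameter count of a model that uses about as much inference compute as the brain.) With the parameter count held fixed, a result saying you need more training data per parameter means more training compute for that same model, so the date at which it becomes affordable moves later.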
This is a weird thing about Bio Anchors—it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline. It’s always waiting for its “sufficiently expensive model” (and it does not care that this model keeps “getting better” in terms of loss/etc as the efficiency improvements roll in).
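To make the fixed-parameter-count point concrete, here is a rough sketch and not the Bio Anchors model itself. It uses the common rule of thumb that training compute is about 6 × parameters × tokens. The parameter count and the "old" tokens-per-parameter ratio are made-up placeholders; the "new" ratio of about 20 is roughly the Chinchilla recommendation.

```python
def training_flops(n_params: float, tokens_per_param: float) -> float:
    """Approximate training compute via the rule of thumb C ~ 6 * N * D."""
    n_tokens = tokens_per_param * n_params
    return 6.0 * n_params * n_tokens


# Hypothetical placeholder for a fixed, brain-anchored parameter count.
N_FIXED = 1e14

# Tokens per parameter under two data-scaling assumptions.
OLD_RATIO = 2.0   # hypothetical "data-light" regime, for illustration only
NEW_RATIO = 20.0  # roughly the Chinchilla-style recommendation


def compute_ratio(n_params: float, old_ratio: float, new_ratio: float) -> float:
    """How much more training compute the same-size model needs."""
    return training_flops(n_params, new_ratio) / training_flops(n_params, old_ratio)


if __name__ == "__main__":
    ratio = compute_ratio(N_FIXED, OLD_RATIO, NEW_RATIO)
    print(f"Same parameter count, about {ratio:.0f}x more training compute")
```

Because the parameter count never changes, the extra data shows up purely as extra training compute for the threshold model, and a compute-threshold forecast reads that as a later date.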
Anyway, I’d forgotten the prior used for dataset scaling in Bio Anchors, but it’s pretty broad (page 39 of part 2), with substantial mass on linear/super-linear scaling. So this news is less relevant than I had thought.