Maybe I am being dumb, but why not do things on the basis of “actual FLOPs” instead of “effective FLOPs”? Seems like there is a relatively simple fact-of-the-matter about how many actual FLOPs were performed in the training of a model, and that seems like a reasonable basis on which to base regulation and evals.
Yeah, actual FLOPs are the baseline thing that’s used in the EO. But the OpenAI/GDM/Anthropic RSPs all reference effective FLOPs.
If there’s a large algorithmic improvement, you might have a large capability gap between two models trained with the same number of FLOPs, which is undesirable. Ideal thresholds in regulation / scaling policies are tied as tightly as possible to the risks.
Another downside that FLOPs and E-FLOPs share is that it’s unpredictable what capabilities a 1e26- or 1e28-FLOP model will have. And it’s unclear what capabilities will emerge from a small amount of further scaling: it’s possible that within a 4x FLOP scale-up you get strong capabilities that had not appeared at all in the smaller model.
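To make the distinction concrete, here is a minimal sketch of one common way "effective FLOPs" gets operationalized: raw training compute scaled by an algorithmic-efficiency multiplier relative to some fixed reference algorithm. This is an assumed formula for illustration; the RSPs referenced above do not pin down a single definition, and the function and parameter names here are hypothetical.

```python
def effective_flops(actual_flops: float, efficiency_multiplier: float) -> float:
    """Scale raw training compute by an assumed algorithmic-efficiency
    gain relative to a fixed reference algorithm (hypothetical definition)."""
    return actual_flops * efficiency_multiplier

# Two models trained with identical actual compute (1e26 FLOPs), but the
# second uses an algorithm assumed to be 4x more compute-efficient:
baseline = effective_flops(1e26, 1.0)   # 1e26 effective FLOPs
improved = effective_flops(1e26, 4.0)   # 4e26 effective FLOPs
```

Under an actual-FLOPs threshold set at 1e26, both models are treated identically, even though the second may be roughly as capable as a 4e26-FLOP model trained with the older algorithm. That is the gap an effective-FLOPs threshold is trying to close.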