the fact that data and compute need to scale proportionally seems… like a big point in favor of NNs as memorizers/interpolators.
Surely it’s the opposite? The more bang you get out of each parameter, the less it looks like ‘just’ (whatever that means) memorization/interpolation. If you need to increase parameters disproportionately to cope with a bit more data, that does not speak well of abstraction or understanding. (If I can train a 1T model to the same loss I thought would take a 100T model, why would I think the 100T model must be memorizing/interpolating less?) Let’s take your claim to its logical extreme: suppose we discovered tomorrow a scaling law that made parameters near-constant (logarithmic in data, say); wouldn’t that suggest those parameters are super useful and the model is doing an amazing job of learning the underlying algorithm rather than memorizing/interpolating?
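To make that concrete, here’s a toy sketch (made-up constants, not a fitted scaling law; the 20-tokens-per-parameter ratio is just the common Chinchilla rule of thumb, and the logarithmic law is the hypothetical one above):

```python
import math

# Toy illustration only: invented constants, not measured scaling laws.
# Compare how many parameters each regime "asks for" as the dataset grows.

def params_proportional(tokens, tokens_per_param=20):
    """Chinchilla-style rule of thumb: roughly 20 training tokens per parameter."""
    return tokens / tokens_per_param

def params_logarithmic(tokens, base_params=1e9, scale=1e8):
    """Hypothetical near-constant law: parameters grow only with log(data)."""
    return base_params + scale * math.log10(tokens)

for tokens in (1e11, 1e12, 1e13, 1e14):
    print(f"{tokens:.0e} tokens: "
          f"proportional ≈ {params_proportional(tokens):.1e} params, "
          f"logarithmic ≈ {params_logarithmic(tokens):.1e} params")
```

Under the proportional rule the parameter budget grows 1000× across that range; under the hypothetical logarithmic law it barely moves, which is the regime I’d read as “learned the underlying algorithm” rather than memorized/interpolated.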
They already learn weird stuff, though.
Sorry, you’re completely right about the first point. I’ll correct the original comment.
Re: learning weird stuff, they definitely do, but a lot of contemporary weirdness feels very data-dependent (e.g. I failed to realize my data was on a human-recognizably weird submanifold, like medical images from different hospitals with different patient populations) versus grokking-dependent (e.g. AlphaFold possibly figuring out new predictive principles underlying protein folding, or a hypothetical future model thinking about math textbooks for long enough that it solves a Millennium Prize problem).
EDIT: though actually AlphaFold might be a bad example, because it got to simulate a shit-ton of data, so maybe I’ll just stick to the “deep grokking of math” hypothetical.