the fact that data and compute need to scale proportionally seems… like a big point in favor of NNs as memorizers/interpolators.
Surely it’s the opposite? The more bang you get out of each parameter, the less it looks like ‘just’ (whatever that means) memorization/interpolation. If you need to increase parameters disproportionately to cope with a bit more data, that does not speak well of abstraction or understanding. (If I can train a 1T model to the same loss I thought would take a 100T model, why would I think the 100T model must be memorizing/interpolating less?) Let’s take your claim to its logical extreme: suppose we discovered tomorrow a scaling law that made parameters near-constant (logarithmic in data, say); wouldn’t that suggest those parameters are super useful and the model is doing an amazing job of learning the underlying algorithm rather than memorizing/interpolating?
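To make that concrete, here’s a toy sketch (made-up constants, not a fitted scaling law; the 20-tokens-per-parameter ratio is just the common Chinchilla rule of thumb, and the logarithmic law is the hypothetical one above):

```python
import math

# Toy illustration only: invented constants, not measured scaling laws.
# Compare how many parameters each regime "asks for" as the dataset grows.

def params_proportional(tokens, tokens_per_param=20):
    """Chinchilla-style rule of thumb: roughly 20 training tokens per parameter."""
    return tokens / tokens_per_param

def params_logarithmic(tokens, base_params=1e9, scale=1e8):
    """Hypothetical near-constant law: parameters grow only with log(data)."""
    return base_params + scale * math.log10(tokens)

for tokens in (1e11, 1e12, 1e13, 1e14):
    print(f"{tokens:.0e} tokens: "
          f"proportional ≈ {params_proportional(tokens):.1e} params, "
          f"logarithmic ≈ {params_logarithmic(tokens):.1e} params")
```

Under the proportional rule the parameter budget grows 1000× across that range; under the hypothetical logarithmic law it barely moves, which is the regime I’d read as “learned the underlying algorithm” rather than memorized/interpolated.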
They already learn weird stuff, though.
Sorry, you’re completely right about the first point. I’ll correct the original comment.
Re: learning weird stuff, they definitely do, but a lot of contemporary weirdness feels very data-dependent (e.g. I failed to realize my data was on a human-recognizably weird submanifold, like medical images from different hospitals with different patient populations) versus grokking-dependent (e.g. AlphaFold possibly figuring out new predictive principles underlying protein folding, or a hypothetical future model thinking about math textbooks for long enough that it solves a Millennium Prize problem).
EDIT: though actually AlphaFold might be a bad example, because it got to simulate a shit-ton of data, so maybe I’ll just stick to the “deep grokking of math” hypothetical.