I think the claim might be: models can’t compute more than O(number_of_parameters) useful and “different” things.
I think this will strongly depend on how we define “different”.
Or maybe the claim is about how the residual stream only has d dimensions, so it can only encode so many things? (Though we'd still need some notion of "feature" that doesn't count every one of the 2^(d * bits_per_float) distinct activation values as a different feature.)
[Tentative] A more precise version of this claim could perhaps be stated via heuristic arguments: "for an n-bit model, the heuristic argument that explains its performance won't be more than n bits". (Though it's unclear how this interacts with the input distribution being arbitrarily complex.)
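To see why counting distinct activation values is far too permissive, here is a back-of-envelope comparison (a minimal sketch; the widths, precision, and parameter count are hypothetical numbers I'm assuming for illustration, not values from the discussion above):

```python
# Compare the raw information capacity of one residual-stream vector
# against a model's parameter budget, to show why "every distinct
# activation value is a different feature" can't be the right notion.

d = 4096                   # assumed residual-stream width
bits_per_float = 16        # assumed activation/weight precision (e.g. bf16)
n_params = 7_000_000_000   # assumed parameter count

param_bits = n_params * bits_per_float   # total bits of parameters
stream_bits = d * bits_per_float         # bits in one residual vector
distinct_values = 2 ** stream_bits       # naive count: 2^(d * bits_per_float)

print(f"residual stream: {stream_bits} bits, i.e. 2^{stream_bits} distinct values")
print(f"parameter budget: {param_bits} bits")
# 2^65536 distinct vectors dwarfs any parameter budget, so a useful
# definition of "feature" has to be much coarser than "distinct value".
```

The point of the arithmetic is just that the naive value-count grows exponentially in d while the parameter budget grows linearly in n, so any capacity claim needs a coarser equivalence between activation states.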
Why is this true? Do you have a resource on this?