ryan_greenblatt comments on How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

ryan_greenblatt 20 Dec 2023 19:34 UTC
2 points
0
I think the claim might be: models can’t compute more than O(number_of_parameters) useful and “different” things.

I think this will strongly depend on how we define “different”.

Or maybe the claim is something about how the residual stream only has d dimensions, so it’s only possible to encode so many things? (But we still need some notion of feature that doesn’t just allow for all different values (all 2^(d * bits_per_float) of them) to be different features?)

[Tenative] A more precise version of this claim could perhaps be defined with heuristic arguments: “for an n bit sized model, the heuristic argument which explains its performance won’t be more than n bits”. (Roughly, it’s unclear how this interacts with the inputs distribution being arbitrarily complex.)