The number of truthlike features (or any kind of feature) cannot scale exponentially with the hidden dimension in an LLM, simply because the number of “features” scales at most linearly with the parameter count (for information-theoretic reasons). Rather, I claim that the number of features scales at most quasi-quadratically with dimension, i.e. O(d² log d).
With depth fixed, the number of parameters in a transformer scales as O(d²) because of the weight matrices. According to this paper, which was cited by the Chinchilla paper, the optimal depth scales logarithmically with width; hence the number of parameters, and therefore the number of “features”, for a given width is O(d² log d). QED.
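To make the O(d²) claim concrete, here is a minimal parameter-counting sketch for one transformer block. It ignores biases, layernorms, and embeddings, and the weight names (W_Q etc. in the comments) and the ffw_mult=4 convention are my own assumptions, not anything specified above:

```python
def block_params(d, ffw_mult=4):
    """Rough parameter count of one transformer block of width d."""
    attn = 4 * d * d            # W_Q, W_K, W_V, W_O projections
    ffw = 2 * ffw_mult * d * d  # up- and down-projection of the MLP
    return attn + ffw

# Every term is quadratic in d, so doubling the width quadruples the count:
print(block_params(1024), block_params(2048))
```

Multiplying by an O(log d) optimal depth then gives the O(d² log d) total claimed above.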
EDIT: It sounds like we are talking past each other, because you seem to think that “feature” means something like “total number of possible distinguishable states.” I don’t agree that this is a useful definition. I think in practice people use “feature” to mean something like “effective dimensionality” which scales as O(log(N)) in the number of distinguishable states. This is a useful definition IMO because we don’t actually have to enumerate all possible states of the neural net (at which level of granularity? machine precision?) to understand it; we just have to find the right basis in which to view it.
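The “effective dimensionality scales as O(log N)” point can be illustrated with a toy calculation (my own sketch, just restating the logarithm): d binary-valued directions already distinguish 2^d states, so the number of directions needed for N states is only log₂(N):

```python
import math

# N distinguishable states need only log2(N) binary "directions" to separate,
# since d such directions index 2**d distinct states.
N = 2 ** 20                     # about a million distinguishable states
effective_dims = math.ceil(math.log2(N))
print(effective_dims)           # 20 directions suffice, not 2**20
```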
Why is this true? Do you have a resource on this?
I think the claim might be: models can’t compute more than O(number_of_parameters) useful and “different” things.
I think this will strongly depend on how we define “different”.
Or maybe the claim is something about how the residual stream only has d dimensions, so it’s only possible to encode so many things? (But we still need some notion of feature that doesn’t just count every distinct activation value, all 2^(d · bits_per_float) of them, as a different feature?)
[Tentative] A more precise version of this claim could perhaps be stated in terms of heuristic arguments: “for an n-bit model, the heuristic argument which explains its performance won’t be more than n bits”. (Roughly; it’s unclear how this interacts with the input distribution being arbitrarily complex.)
Here I’m using “feature” only with its simplest meaning: a direction in activation space. A truth-like feature only means “a direction in activation space with low CCS loss”, which is exactly what CCS enables you to find. By the example above, I show that there can be exponentially many of them. Therefore, the theorems above do not apply.
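The “exponentially many directions” intuition is easy to sanity-check numerically: in high dimensions, random unit vectors are nearly orthogonal (pairwise overlaps concentrate around 1/√d), which is why one can pack exp(O(d)) nearly distinct directions into d dimensions. A quick sketch of this, using pure Python (the specific d and sample count are arbitrary choices of mine):

```python
import math
import random

def random_unit(d, rng):
    """Sample a uniformly random unit vector in R^d via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

rng = random.Random(0)
d = 512
vecs = [random_unit(d, rng) for _ in range(100)]

# Largest pairwise overlap among all 100*99/2 pairs; typical size is ~1/sqrt(d) ≈ 0.044
max_abs_dot = max(
    abs(sum(a * b for a, b in zip(u, v)))
    for i, u in enumerate(vecs)
    for v in vecs[i + 1:]
)
print(max_abs_dot)  # stays well below 0.3 for these parameters
```

This is only an illustration of the geometry, not of the CCS loss itself.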
Maybe requiring directions found by CCS to be “actual features” (satisfying the conditions of those theorems) might enable you to improve CCS. But I don’t know what those conditions are.