It’s exciting to see a new research direction which could have big implications if it works!
I think that Hypothesis 1 is overly optimistic:
Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations. [...] [...] 1024 remaining perspectives to distinguish between
A few thousand features is the optimistic number of truth-like features. I argue below that it is both possible and likely that there are 2^hundreds of truth-like features in LLMs.
Why it’s possible to have 2^hundreds of truth-like features
Let’s say that your dataset of activations is composed of the d-dimensional one-hot vectors and their element-wise opposites. Each of these represents a “fact”, and negating a fact gives you the opposite vector. Then any feature f in {−1,1}^d is truth-like (up to a scaling constant): for each “fact” x (a one-hot vector multiplied by −1 or 1), <f,x> ∈ {−1,1}, and for its opposite fact x̄ = −x, <f,x̄> = −<f,x>. This gives you 2^d features which are all truth-like.
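A minimal numpy sketch of this toy construction (my own illustration; the dataset and the check are exactly the ones described above, nothing more):

```python
# Toy check: in a dataset of one-hot "facts" and their element-wise opposites,
# every sign vector f in {-1, 1}^d is "truth-like": <f, x> is always +/-1 and
# flips sign when the fact is negated.
import numpy as np

d = 16
facts = np.eye(d)        # the d one-hot "facts"
opposites = -facts       # negating a fact gives the opposite vector

rng = np.random.default_rng(0)
for _ in range(5):       # sample a handful of the 2^d candidate features
    f = rng.choice([-1.0, 1.0], size=d)
    scores = facts @ f               # <f, x> for each fact
    scores_opp = opposites @ f       # <f, x_bar> for each opposite fact
    assert np.all(np.abs(scores) == 1.0)
    assert np.allclose(scores_opp, -scores)
print("every sampled sign vector separates each fact from its opposite")
```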
Why it’s likely that there are 2^hundreds of truth-like features in real LLMs
I think the exact encoding described above is unlikely. But in a real network, you might expect it to encode groups of facts such as “facts that Democrats believe but Republicans don’t”, “climate change is real vs. climate change is fake”, etc. When, late in training, it finds ways to use “the truth”, it doesn’t need to build a new “truth-circuit” from scratch; it can just select the right combination of groups of facts.
(An additional reason for concern is that in practice you find “approximate truth-like directions”, and there can be many more approximate truth-like directions than exact truth-like directions.)
Even if Hypothesis 1 is wrong, there might be ways to salvage the research direction: thousands of bits of information would be enough to distinguish between 2^thousands truth-like features.
The number of truth-like features (or any kind of feature) cannot scale exponentially with the hidden dimension in an LLM, simply because the number of “features” scales at most linearly with the parameter count (for information-theoretic reasons). Rather, I claim that the number of features scales at most quasi-quadratically with dimension, i.e. O(d^2 log(d)).
With depth fixed, the number of parameters in a transformer scales as O(d^2) because of the weight matrices. According to this paper, which was cited by the Chinchilla paper, the optimal depth scales logarithmically with width, so the number of parameters, and therefore the number of “features”, for a given width is O(d^2 log(d)). QED.
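A back-of-the-envelope sketch of this counting argument (mine, not the commenter’s; the 12·d^2 per-block weight count and the log-depth rule are assumptions standing in for the cited scaling result):

```python
# Rough parameter count for a width-d transformer under the assumptions above:
# ~12 * d^2 weights per block (Q, K, V, O projections plus a 4d-wide MLP), and
# depth growing like log(d), giving O(d^2 * log(d)) parameters overall.
import math

def approx_params(d: int) -> int:
    params_per_block = 12 * d ** 2        # assumed per-block weight count
    depth = max(1, round(math.log2(d)))   # assumed log-depth scaling (base is arbitrary)
    return params_per_block * depth

for d in (1024, 4096, 16384):
    print(f"d={d:>6}  ~{approx_params(d):.3e} params")
```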
EDIT: It sounds like we are talking past each other, because you seem to think that “feature” means something like “total number of possible distinguishable states.” I don’t agree that this is a useful definition. I think in practice people use “feature” to mean something like “effective dimensionality”, which scales as O(log(N)) in the number N of distinguishable states. This is a useful definition IMO because we don’t actually have to enumerate all possible states of the neural net (at what level of granularity? machine precision?) to understand it; we just have to find the right basis in which to view it.
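A toy illustration of the definitional gap (my framing, not the commenter’s): d binary features already give 2^d distinguishable states, so counting states and counting features differ exponentially, and “effective dimensionality” is roughly log2 of the state count.

```python
# d on/off features -> 2^d distinguishable activation patterns, so the
# "feature" count in this sense is log2(#states), not the state count itself.
import itertools
import math

d = 4
states = list(itertools.product([0, 1], repeat=d))  # every combination of d binary features
n_states = len(states)                              # 2^d = 16 distinguishable states
assert math.log2(n_states) == d
print(f"{d} features span {n_states} states; log2({n_states}) = {d}")
```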
Why is this true? Do you have a resource on this?
I think the claim might be: models can’t compute more than O(number_of_parameters) useful and “different” things.
I think this will strongly depend on how we define “different”.
Or maybe the claim is something about how the residual stream only has d dimensions, so it’s only possible to encode so many things? (But we would still need some notion of feature that doesn’t count every distinct activation value (all 2^(d * bits_per_float) of them) as a different feature.)
[Tentative] A more precise version of this claim could perhaps be stated with heuristic arguments: “for an n-bit model, the heuristic argument which explains its performance won’t be more than n bits.” (Roughly; it’s unclear how this interacts with the input distribution being arbitrarily complex.)
Here I’m using “feature” only with its simplest meaning: a direction in activation space. A truth-like feature then just means “a direction in activation space with low CCS loss”, which is exactly what CCS enables you to find. The example above shows that there can be exponentially many of them. Therefore, the theorems above do not apply.
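For concreteness, here is a sketch of the CCS objective as I understand it from Burns et al. (a consistency term plus a confidence term on a logistic probe); on the toy dataset above, any scaled sign vector already drives this loss to roughly zero:

```python
# Sketch of the CCS loss (my reading of Burns et al., not code from the post):
# a probe p(x) = sigmoid(<theta, x> + b) should satisfy p(x+) ~ 1 - p(x-)
# (consistency) while avoiding the degenerate p = 0.5 solution (confidence).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(theta, b, x_pos, x_neg):
    p_pos = sigmoid(x_pos @ theta + b)
    p_neg = sigmoid(x_neg @ theta + b)
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

d = 16
x_pos = np.eye(d)                                 # one-hot "facts"
x_neg = -x_pos                                    # their opposites
rng = np.random.default_rng(0)
theta = 10.0 * rng.choice([-1.0, 1.0], size=d)    # a scaled sign vector
print(ccs_loss(theta, 0.0, x_pos, x_neg))         # ~2e-9: a "truth-like" direction
```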
Requiring the directions found by CCS to be “actual features” (satisfying the conditions of those theorems) might enable you to improve CCS, but I don’t know what those conditions are.