It’s exciting to see a new research direction which could have big implications if it works!
I think that Hypothesis 1 is overly optimistic:
Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations. [...] [...] 1024 remaining perspectives to distinguish between
A few thousand features is the optimistic number of truth-like features. I argue below that it is both possible and likely that there are 2^hundreds of truth-like features in LLMs.
Why it’s possible to have 2^hundreds of truth-like features
Let’s say that your dataset of activations is composed of the d-dimensional one-hot vectors and their element-wise opposites. Each of these represents a “fact”, and negating a fact gives you the opposite vector. Then any feature f in {−1,1}^d is truth-like (up to a scaling constant): for each “fact” x (a one-hot vector multiplied by −1 or 1), <f,x> ∈ {−1,1}, and for its opposite fact x̄ = −x, <f,x̄> = −<f,x>. This gives you 2^d features which are all truth-like.
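A minimal numpy sketch of this toy construction (my own illustration; the dataset and the check are exactly the ones described above, nothing more):

```python
# Toy check: in a dataset of one-hot "facts" and their element-wise opposites,
# every sign vector f in {-1, 1}^d is "truth-like": <f, x> is always +/-1 and
# flips sign when the fact is negated.
import numpy as np

d = 16
facts = np.eye(d)        # the d one-hot "facts"
opposites = -facts       # negating a fact gives the opposite vector

rng = np.random.default_rng(0)
for _ in range(5):       # sample a handful of the 2^d candidate features
    f = rng.choice([-1.0, 1.0], size=d)
    scores = facts @ f               # <f, x> for each fact
    scores_opp = opposites @ f       # <f, x_bar> for each opposite fact
    assert np.all(np.abs(scores) == 1.0)
    assert np.allclose(scores_opp, -scores)
print("every sampled sign vector separates each fact from its opposite")
```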
Why it’s likely that there are 2^hundreds of truth-like features in real LLMs
I think the exact encoding described above is unlikely. But in a real network, you might expect it to encode groups of facts such as “facts that Democrats believe but Republicans don’t”, “climate change is real vs. climate change is fake”, etc. When, late in training, it finds ways to use “the truth”, it doesn’t need to build a new “truth-circuit” from scratch; it can just select the right combination of groups of facts.
(An additional reason for concern is that in practice you find “approximate truth-like directions”, and there can be many more approximate truth-like directions than exact truth-like directions.)
Even if Hypothesis 1 is wrong, there might be ways to salvage the research direction: thousands of bits of information would be enough to distinguish between 2^thousands truth-like features.
The number of truth-like features (or any kind of feature) cannot scale exponentially with the hidden dimension in an LLM, simply because the number of “features” scales at most linearly with the parameter count (for information-theoretic reasons). Rather, I claim that the number of features scales at most quasi-quadratically with dimension, i.e. O(d^2 log(d)).
With depth fixed, the number of parameters in a transformer scales as O(d^2) because of the weight matrices. According to this paper, which was cited by the Chinchilla paper, the optimal depth scales logarithmically with width, so the number of parameters, and therefore the number of “features”, for a given width is O(d^2 log(d)). QED.
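A back-of-the-envelope sketch of this counting argument (mine, not the commenter’s; the 12·d^2 per-block weight count and the log-depth rule are assumptions standing in for the cited scaling result):

```python
# Rough parameter count for a width-d transformer under the assumptions above:
# ~12 * d^2 weights per block (Q, K, V, O projections plus a 4d-wide MLP), and
# depth growing like log(d), giving O(d^2 * log(d)) parameters overall.
import math

def approx_params(d: int) -> int:
    params_per_block = 12 * d ** 2        # assumed per-block weight count
    depth = max(1, round(math.log2(d)))   # assumed log-depth scaling (base is arbitrary)
    return params_per_block * depth

for d in (1024, 4096, 16384):
    print(f"d={d:>6}  ~{approx_params(d):.3e} params")
```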
EDIT: It sounds like we are talking past each other, because you seem to think that “feature” means something like “total number of possible distinguishable states.” I don’t agree that this is a useful definition. I think in practice people use “feature” to mean something like “effective dimensionality”, which scales as O(log(N)) in the number N of distinguishable states. This is a useful definition IMO because we don’t actually have to enumerate all possible states of the neural net (at what level of granularity? machine precision?) to understand it; we just have to find the right basis in which to view it.
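A toy illustration of the definitional gap (my framing, not the commenter’s): d binary features already give 2^d distinguishable states, so counting states and counting features differ exponentially, and “effective dimensionality” is roughly log2 of the state count.

```python
# d on/off features -> 2^d distinguishable activation patterns, so the
# "feature" count in this sense is log2(#states), not the state count itself.
import itertools
import math

d = 4
states = list(itertools.product([0, 1], repeat=d))  # every combination of d binary features
n_states = len(states)                              # 2^d = 16 distinguishable states
assert math.log2(n_states) == d
print(f"{d} features span {n_states} states; log2({n_states}) = {d}")
```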
Why is this true? Do you have a resource on this?
I think the claim might be: models can’t compute more than O(number_of_parameters) useful and “different” things.
I think this will strongly depend on how we define “different”.
Or maybe the claim is something about how the residual stream only has d dimensions, so it’s only possible to encode so many things? (But we would still need some notion of feature that doesn’t count every distinct activation value (all 2^(d * bits_per_float) of them) as a different feature.)
[Tentative] A more precise version of this claim could perhaps be stated with heuristic arguments: “for an n-bit model, the heuristic argument which explains its performance won’t be more than n bits.” (Roughly; it’s unclear how this interacts with the input distribution being arbitrarily complex.)
Here I’m using “feature” only with its simplest meaning: a direction in activation space. A truth-like feature then just means “a direction in activation space with low CCS loss”, which is exactly what CCS enables you to find. The example above shows that there can be exponentially many of them. Therefore, the theorems above do not apply.
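For concreteness, here is a sketch of the CCS objective as I understand it from Burns et al. (a consistency term plus a confidence term on a logistic probe); on the toy dataset above, any scaled sign vector already drives this loss to roughly zero:

```python
# Sketch of the CCS loss (my reading of Burns et al., not code from the post):
# a probe p(x) = sigmoid(<theta, x> + b) should satisfy p(x+) ~ 1 - p(x-)
# (consistency) while avoiding the degenerate p = 0.5 solution (confidence).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(theta, b, x_pos, x_neg):
    p_pos = sigmoid(x_pos @ theta + b)
    p_neg = sigmoid(x_neg @ theta + b)
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

d = 16
x_pos = np.eye(d)                                 # one-hot "facts"
x_neg = -x_pos                                    # their opposites
rng = np.random.default_rng(0)
theta = 10.0 * rng.choice([-1.0, 1.0], size=d)    # a scaled sign vector
print(ccs_loss(theta, 0.0, x_pos, x_neg))         # ~2e-9: a "truth-like" direction
```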
Requiring the directions found by CCS to be “actual features” (satisfying the conditions of those theorems) might enable you to improve CCS, but I don’t know what those conditions are.