To me, this model predicts that sparse autoencoders should not find abstract features, because those are shards, and should not be localisable to a direction in activation space on a single token. Do you agree that this is implied?
If so, how do you square that with eg all the abstract features Anthropic found in Sonnet 3?
Sparse autoencoders find features that correspond to abstract features of words and text. That’s not the same as finding features that correspond to reality.
Furthermore, the goal is not abstraction. Sunlight is among the least abstract concepts you can think of, yet it is central to the point. History does not proceed by abstraction, but rather by erosion, bifurcation and accumulation. The only reason abstraction even comes up in these contexts is because rationalists have a propensity to start with a maximally-shattered world-model containing only e.g. atoms or whatever, and attempt to somehow aggregate this back into wholes.
> Sparse autoencoders find features that correspond to abstract features of words and text. That’s not the same as finding features that correspond to reality.
(Base-model) LLMs are trained to minimize prediction error, and SAEs do seem to find features that sparsely predict error, such as a gender feature that, when removed, affects the probability of pronouns. So pragmatically, for the goal of “finding features that explain next-word-prediction”, which LLMs are directly trained for, SAEs find good examples![1]
I’m unsure what goal you have in mind for “features that correspond to reality”, or what that’d mean.

[1] Not claiming that all SAE latents are good in this way though.
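To make that latent-ablation check concrete, here is a minimal sketch, assuming a small TransformerLens model and a tiny untrained stand-in SAE; the layer, the latent index, and the `TinySAE` class are illustrative assumptions, not the actual setup behind the pronoun result above.

```python
# Minimal sketch: ablate one SAE latent in the residual stream and see how the
# probabilities of " he" vs " she" shift. The SAE here is a tiny *untrained*
# stand-in so the script runs end to end; a real check would load trained
# SAE weights and pick a latent whose top activations look gender-related.
import torch
import torch.nn as nn
from transformer_lens import HookedTransformer

torch.manual_seed(0)
model = HookedTransformer.from_pretrained("gpt2")   # small stand-in model
D_MODEL, D_SAE = model.cfg.d_model, 4096
HOOK = "blocks.6.hook_resid_post"                   # illustrative layer choice
LATENT = 123                                        # hypothetical "gender" latent

class TinySAE(nn.Module):
    """Bare-bones SAE: linear encoder with ReLU, linear decoder."""
    def __init__(self):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(D_MODEL, D_SAE) / D_MODEL**0.5)
        self.b_enc = nn.Parameter(torch.zeros(D_SAE))
        self.W_dec = nn.Parameter(torch.randn(D_SAE, D_MODEL) / D_SAE**0.5)

    def encode(self, x):
        return torch.relu(x @ self.W_enc + self.b_enc)

sae = TinySAE()

def ablate_latent(resid, hook):
    # Subtract the chosen latent's decoded contribution from the residual stream.
    acts = sae.encode(resid)                                  # [batch, pos, D_SAE]
    return resid - acts[..., LATENT:LATENT + 1] * sae.W_dec[LATENT]

def pronoun_probs(logits):
    probs = torch.softmax(logits[0, -1], dim=-1)
    return {p: probs[model.to_single_token(p)].item() for p in (" he", " she")}

tokens = model.to_tokens("After the doctor finished the surgery,")
clean = model(tokens)
ablated = model.run_with_hooks(tokens, fwd_hooks=[(HOOK, ablate_latent)])
print("clean:  ", pronoun_probs(clean))
print("ablated:", pronoun_probs(ablated))
```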
> If you remove the gender feature, does the neural network lose its ability to talk about anisogamy?

The one we checked last year was just Pythia-70M, and I don’t expect that LLM itself to have a gender feature that generalizes to both pronouns and anisogamy.

But again, the task is next-token prediction. Do you expect e.g. GPT-4 to have learned a gender concept that affects both knowledge about anisogamy and pronouns while trained on next-token prediction?
I guess to add, if I ask GPT-4o “What is the relationship between gender and anisogamy?”, it answers:
> Anisogamy is the term used to describe a form of sexual reproduction involving the fusion of two dissimilar gametes—typically a large, immobile egg and a smaller, mobile sperm. This fundamental difference in gamete size and function is closely tied to the concept of gender in many organisms, including humans.
>
> Here’s the relationship between gender and anisogamy:
>
> 1. **Defining Gender Roles in Reproduction**: In species with anisogamy, two distinct roles often emerge based on the type of gametes produced. The organism producing the larger gametes (eggs) is generally defined as female, while the organism producing the smaller gametes (sperm) is defined as male. This division creates the primary biological distinction between the two genders.
> 2. **Evolutionary Basis for Gender Differences**: Anisogamy drives evolutionary strategies that are different for each gender due to the differing investments in gamete production. Females, investing more in larger gametes, may evolve strategies that favor selectivity in mating to maximize reproductive success. Males, producing many small gametes, often evolve strategies that emphasize quantity over quality, such as competing with other males for access to females or investing in traits that enhance their reproductive success (e.g., mobility, attractiveness).
> 3. **Impact on Sexual Selection**: The asymmetry created by anisogamy often leads to differing selective pressures on males and females. This can manifest in various ways, such as differences in size, behavior, or reproductive strategies. Sexual selection can lead to the evolution of secondary sexual characteristics that are not directly related to gamete production but are advantageous in the competition for mates.
> 4. **Gender Diversity Beyond Binary**: While anisogamy is foundational in defining traditional male and female roles, it does not account for the diversity of gender seen in some species, including those with complex reproductive strategies, hermaphroditism, or species where sex can change over time. However, anisogamy still provides the underlying biological framework for the differentiation of reproductive roles.
>
> In summary, anisogamy is a fundamental concept that underpins the biological basis of gender by distinguishing between the roles of gamete producers. It explains why there are generally two sexes in many species and sets the stage for the evolution of gender-specific traits and behaviors.
So clearly there is some kind of information about the relationship between gender and anisogamy within GPT-4o. The point of my post is that it is unlikely to be in the weight space or activation space.
Next-token prediction, and more generally autoregressive modelling, is precisely the problem. It assumes that the world is such that the past determines the future, whereas really the less-diminished shapes the more-diminished (“the greater determines the lesser”). As I admitted in the post, it’s plausible that future models will use different architectures where this is less of a problem.
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.
This paragraph sounded like you’re claiming LLMs do have concepts, just not in specific activations or weights but rather distributed across them.
But from your comment, you mean that LLMs themselves don’t learn the true simple-compressed features of reality, but a mere shadow of them.
This interpretation also matches the title better!
But are you saying the “true features” are in the dataset + network? Because SAEs are trained on a dataset! (ignoring the problem pointed out in footnote 1).
Possibly clustering the data points by their network gradients would be a way to put some order into this mess?
Eric Michaud did cluster datapoints by their gradients here. From the abstract:
> ...Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta).
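For a sense of what that looks like mechanically, here is a rough sketch of the general idea on a toy model (one loss-gradient vector per example, cosine similarity between them, spectral clustering on that affinity); it is not the paper’s exact pipeline, and the model, data, and cluster count are arbitrary stand-ins.

```python
# Sketch of gradient-based clustering in the spirit of the quanta paper (not
# its exact pipeline): one loss-gradient vector per example, cosine similarity
# between the normalized gradients, then spectral clustering on that affinity.
import torch
import torch.nn as nn
from sklearn.cluster import SpectralClustering

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # toy model
loss_fn = nn.CrossEntropyLoss()

# Toy dataset standing in for (context, next-token) examples.
X = torch.randn(64, 16)
y = torch.randint(0, 4, (64,))

def example_gradient(x, target):
    """Flattened gradient of the loss on a single example."""
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), target.unsqueeze(0)).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

grads = torch.stack([example_gradient(X[i], y[i]) for i in range(len(X))])
grads = grads / grads.norm(dim=1, keepdim=True)       # unit-normalize
affinity = (grads @ grads.T).clamp(min=0).numpy()     # non-negative cosine sims

labels = SpectralClustering(
    n_clusters=5, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(labels)  # cluster assignment per example ("which skill was exercised")
```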
> This paragraph sounded like you’re claiming LLMs do have concepts, just not in specific activations or weights but rather distributed across them.
>
> But from your comment, you mean that LLMs themselves don’t learn the true simple-compressed features of reality, but a mere shadow of them.
>
> This interpretation also matches the title better!
A true feature of reality gets diminished into many small fragments. These fragments bifurcate into multiple groups, of which we will consider two, A and B. Group A gets collected and analysed by humans into human knowledge, which then again gets diminished into many small fragments, which we will call group C.

Group B and group C both make impacts on the network. Each fragment in group B and group C produces a shadow in the network, leading to many shadows distributed across activation space and weight space. Together these shadows form a channel which is highly reflective of the true feature of reality.

That allows there to be simple, useful ways to connect the LLM to the true feature of reality. However, the simplicity of the feature and its connection is not reflected in a simple representation of the feature within the network; instead, the concept works as a result of the many independent shadows making way for it.
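One way to make the “many independent shadows” picture concrete is a toy calculation, assuming arbitrary dimensions and noise scales: a binary feature faintly imprints on a thousand coordinates, no single coordinate is very informative about it, yet pooling along the imprint direction still gives a reliable channel back to the feature.

```python
# Toy numerical illustration (not the post's model itself): a binary "true
# feature" leaves a faint imprint on each of a thousand coordinates. No single
# coordinate is very informative, but pooling all the shadows along the
# imprint direction (the "channel") recovers the feature quite reliably.
# All dimensions and noise scales are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 1000
feature = rng.integers(0, 2, n)            # the "true feature", 0 or 1
imprint = rng.normal(0, 0.1, d)            # tiny per-coordinate imprint
acts = rng.normal(0, 1.0, (n, d)) + feature[:, None] * imprint

# Best single coordinate: only weakly correlated with the feature.
corrs = [abs(np.corrcoef(acts[:, j], feature)[0, 1]) for j in range(d)]
print("best single-coordinate |corr|:", round(max(corrs), 3))

# Pooling every coordinate: project onto the imprint direction and threshold.
score = acts @ imprint
pred = score > np.median(score)
print("accuracy from pooling all coordinates:", round(float((pred == feature).mean()), 3))
```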
> But are you saying the “true features” are in the dataset + network? Because SAEs are trained on a dataset! (ignoring the problem pointed out in footnote 1).
The true features branch off from the sun (and the earth). Why would you ignore the problem pointed out in footnote 1? It’s a pretty important problem.