It would be perverse to try to understand a king in terms of his molecular configuration, rather than in the contact between the farmer and the bandit.
It sure would.
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves
Indeed.
these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space)
This follows for weight-space, but I think it doesn’t follow for activation space. We expect that the ecological role of king is driven by some specific pressures that apply in certain specific circumstances (e.g. in the times that farmers would come in contact with bandits), while not being very applicable at most other times (e.g. when the tide is coming in). As such, to understand the role of the king, it is useful to be able to distinguish times when the environmental pressure strongly applies from the times when it does not strongly apply. Other inferences may be downstream of this ability to distinguish, and there will be some pressure for these downstream inferences to all refer to the same upstream feature, rather than having a bunch of redundant and incomplete copies. So I argue that there is in fact a reason for these imprints to be concentrated into a specific spot of activation space.
Recent work on SAEs as they apply to transformer residuals seems to back this intuition up.
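For concreteness, here is a minimal sketch of the kind of sparse-autoencoder setup that line of work trains on residual-stream activations; the class name, the ReLU-plus-L1 recipe, and the sizes are illustrative assumptions rather than any particular paper's implementation.

```python
# Minimal sketch (illustrative assumptions, not any specific paper's recipe) of a sparse
# autoencoder over residual-stream activations: reconstruct the activation from a sparse,
# overcomplete feature vector, so that individual features become candidate "concentrated
# spots" in activation space.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, resid: torch.Tensor):
        features = torch.relu(self.encoder(resid))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)      # linear readout back into the residual stream
        return features, reconstruction

def sae_loss(resid, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features toward zero.
    return (reconstruction - resid).pow(2).mean() + l1_coeff * features.abs().mean()

# Usage on a batch of residual-stream vectors (hypothetical sizes):
sae = SparseAutoencoder(d_model=768, n_features=4096)
resid = torch.randn(32, 768)
feats, recon = sae(resid)
loss = sae_loss(resid, feats, recon)
```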
Also potentially relevant: “The Quantization Model of Neural Scaling” (Michaud et al. 2024):

We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with scale. We derive this model from what we call the Quantization Hypothesis, where network knowledge and skills are “quantized” into discrete chunks (quanta). We show that when quanta are learned in order of decreasing use frequency, then a power law in use frequencies explains observed power law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). We tentatively find that the frequency at which these quanta are used in the training distribution roughly follows a power law corresponding with the empirical scaling exponent for language models, a prediction of our theory.
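As a toy numerical illustration of the abstract's central claim (my own sketch, not code from the paper): if quanta are used with Zipf-like frequencies and a model has only learned the n most frequent ones, the leftover loss falls off as a power law in n.

```python
# Toy check (my own sketch) of the claim above: quanta used with frequency p_k ∝ k^-(alpha+1),
# learned in order of decreasing frequency, give residual loss that scales roughly as n^-alpha.
import numpy as np

alpha = 0.5                     # assumed frequency-decay exponent
K = 10_000                      # total number of quanta in this toy world
k = np.arange(1, K + 1)
p = k ** -(alpha + 1.0)
p /= p.sum()                    # normalized use frequencies

for n in (10, 100, 1000):
    residual = p[n:].sum()      # probability mass of situations whose quantum is not yet learned
    print(f"n={n:5d}  residual loss ~ {residual:.4f}")
# The residual drops by roughly a factor of 10**alpha per decade of n, i.e. loss ∝ n^-alpha.
```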
This follows for weight-space, but I think it doesn’t follow for activation space. We expect that the ecological role of king is driven by some specific pressures that apply in certain specific circumstances (e.g. in the times that farmers would come in contact with bandits), while not being very applicable at most other times (e.g. when the tide is coming in). As such, to understand the role of the king, it is useful to be able to distinguish times when the environmental pressure strongly applies from the times when it does not strongly apply. Other inferences may be downstream of this ability to distinguish, and there will be some pressure for these downstream inferences to all refer to the same upstream feature, rather than having a bunch of redundant and incomplete copies. So I argue that there is in fact a reason for these imprints to be concentrated into a specific spot of activation space.
Issue is there are four things to distinguish:
The root cause of kings (related to farmer/bandit conflicts)
The whole of kings (including the king’s effect on the kingdom)
The material manifestation of a king (including molecular decomposition)
Symbols representing kings (such as the word “king”)
Large language models will no doubt develop representations for thing 4, as well as corresponding representations for symbols of other objects; quite plausibly you can dig up some axis in an SAE that correlates to language talking about farmer/bandit conflicts.
Right now, large language models can sort of model dynamics and therefore in some sense must have some knowledge of material manifestations. However, since these dynamics are described in natural language, they are spread out over many tokens, and so it is dubious whether they have any representations for thing 3.
Large language models absolutely do not have a representation for thing 2, because the whole of kings has shattered into many different shards before they were trained, and they’ve only been fed scattered bits and pieces of it. (Now, they might have representations of symbols representing kings, as in an embedding of the text “the king’s effect on the kingdom”, but they will be superficial.)
And finally, while they may have representations of the material manifestation of farmer/bandit conflicts, this does not mean said manifestation is deep enough to derive the downstream stuff. Rather they probably only represent it as independent/disconnected events.
Large language models absolutely do not have a representation for thing 2, because the whole of kings has shattered into many different shards before they were trained, and they’ve only been fed scattered bits and pieces of it.
Do humans have a representation for thing 2?
Rationalist-empiricists don’t (see “rationalists are missing a core piece of agent-like structure” and “the causal backbone conjecture”), and maybe society is sufficiently unhealthy that most people don’t (I think religious people have it to a greater extent, with their talk about higher powers, but in a degenerate form where e.g. Christians worship love when really the sun is the objectively correct creator god), but it’s not really that hard to develop once you realize that you should.
Like, rationalists intellectually know that thermodynamics is a thing, but it doesn’t seem common for rationalists to think of everything important as being the result of emanations from the sun. Instead, rationalists are more likely to take a “Gnostic” picture, by which I mean a worldview where one believes the world was created by an evil entity and that inner enlightenment can give access to a good greater entity. The Goddess of Everything Else is a central example of rationalist Gnosticism.
Like, rationalists intellectually know that thermodynamics is a thing, but it doesn’t seem common for rationalists to think of everything important as being the result of emanations from the sun.
I expect if you took a room with 100 rationalists, and told them to consider something that is important to them, and then asked them how that thing came to be, and then had them repeat the process 25 more times, at least half of the rationalists in the room would include some variation of “because the sun shines” within their causal chains. At the same time, I don’t think rationalists tend to say things like “for dinner, I think I will make a tofu stir fry, and ultimately I’m able to make this decision because there’s a ball of fusing hydrogen about 1.5×10¹¹ m away”.
Put another way, I expect that large language models encode many salient learned aspects of their environments, and that those attributes are largely detectable in specific places in activation space. I do not expect that large language models encode all of the implications of those learned aspects of their environments anywhere, and I don’t particularly expect it to be possible to mechanistically determine all of those implications without actually running the language model. But I don’t think “don’t hold the whole of their world model, including all implications thereof, in mind at all times” is something particular to LLMs.
I expect if you took a room with 100 rationalists, and told them to consider something that is important to them, and then asked them how that thing came to be, and then had them repeat the process 25 more times, at least half of the rationalists in the room would include some variation of “because the sun shines” within their causal chains.
I dunno, maybe? At least if you carefully choose the question, then you can figure something out to guide them to it. But it’s not really central enough to guide their models of e.g. AI alignment.
At the same time, I don’t think rationalists tend to say things like “for dinner, I think I will make a tofu stir fry, and ultimately I’m able to make this decision because there’s a ball of fusing hydrogen about 1.5×10¹¹ m away”.
That would also be pretty pathological.
It’s more relevant to consider stuff like, imagine you invite someone out for dinner. Do you do this because:
The food is divine sunblessings that you share with them (correct reason)
Food is plentiful but people’s instincts don’t realize that yet (evil, manipulative reason)
That’s just what you’re supposed to do, culturally speaking (ungrounded, confused reason)
etc.
“don’t hold the whole of their world model, including all implications thereof, in mind at all times” is something particular to LLMs
But if you structure your world-model such that the nodes predominantly look like a branching structure emanating from the sun, it’s not a question of “including all implications thereof”. The issue is rationalist-empiricists have a habit of instead structuring their world-model as an autoregressive-ish thing, where nodes at one time determine the nodes at the immediately next time, and so the importance of the sun is an “implication” instead of a straightforward part of the structure.
Let’s say your causal model looks something like this:
[causal diagram not reproduced here]
What causes you to specifically call out “sunblessings” as the “correct” upstream node in the world model of why you take your friend to dinner, as opposed to “fossil fuels” or “the big bang” or “human civilization existing” or “the restaurant having tasty food”?
Or do you reject the premise that your causal model should look like a tangled mess, and instead assert that it is possible to have a useful tree-shaped causal model (i.e. one that does not contain joining branches or loops)?
Let’s say your causal model looks something like this:
What causes you to specifically call out “sunblessings” as the “correct” upstream node in the world model of why you take your friend to dinner, as opposed to “fossil fuels” or “the big bang” or “human civilization existing” or “the restaurant having tasty food”?
Nothing in this causal model centers the sun; that’s precisely what makes it so broken.
Fossil fuels, the big bang, and human civilization are not what you offered to your friend. Tastiness is a sensory quality, which is a superficial matter. If you offer your friend something that you think they superficially assume to be better than you really think it is, that is hardly a nice gesture.
Or do you reject the premise that your causal model should look like a tangled mess, and instead assert that it is possible to have a useful tree-shaped causal model (i.e. one that does not contain joining branches or loops)?
I wouldn’t rule out that you could sometimes have joining branches and loops, but mechanistic models tend to have far too many of them. (Admittedly your given model isn’t super mechanistic, but still, it’s directionally mechanistic compared to what I’m advocating.)
I don’t think I understand, concretely, what a non-mechanistic model looks like in your view. Can you give a concrete example of a useful non-mechanistic model?
Something that tracks resource flows rather than information flows. For example if you have a company, you can have nodes for the revenue from each of the products you are selling, aggregating into product category nodes, and finally into total revenue, which then branches off into profits and different clusters of expenses, with each cluster branching off into narrower expenses. This sort of thing is useful because it makes it practical to study phenomena by looking at their accounting.
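A minimal sketch of that kind of model, with purely hypothetical node names and amounts, might look like this:

```python
# Hypothetical resource-flow tree: product revenues aggregate upward, expenses branch
# downward, and every intermediate node can be audited by summing its children.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class FlowNode:
    name: str
    amount: float = 0.0
    children: list[FlowNode] = field(default_factory=list)

    def total(self) -> float:
        # A leaf carries its own amount; an internal node is the sum of its branches.
        return self.amount if not self.children else sum(c.total() for c in self.children)

revenue = FlowNode("revenue", children=[
    FlowNode("hardware", children=[FlowNode("widget A", 1.2e6), FlowNode("widget B", 0.8e6)]),
    FlowNode("services", children=[FlowNode("support contracts", 0.5e6)]),
])
expenses = FlowNode("expenses", children=[
    FlowNode("salaries", 1.4e6),
    FlowNode("materials", 0.6e6),
])
profit = revenue.total() - expenses.total()
print(f"revenue={revenue.total():,.0f}  expenses={expenses.total():,.0f}  profit={profit:,.0f}")
```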
Sure, that’s also a useful thing to do sometimes. Is your contention that simple concentrated representations of resources and how they flow do not exist in the activations of LLMs that are reasoning about resources and how they flow?
If not, I think I still don’t understand what sort of thing you think LLMs don’t have a concentrated representation of.
It’s clearer to me that the structure of the world is centered on emanations, erosions, bifurcations and accumulations branching out from the sun than that these phenomena can be modelled purely as resource-flows. Really, even “from the sun” is somewhat secondary; I originally came to this line of thought while statistically modelling software performance problems, leading to a model I call “linear diffusion of sparse lognormals”.
I could imagine you could set up a prompt that makes the network represent things in this format, at least in some fragments of it. However, that’s not what you need in order to interpret the network, because that’s not how people use the network in practice, so it wouldn’t be informative for how the network works.
Instead, an interpretation of the network would be constituted by a map which shows how different branches of the world impacted the network. In the simplest form, you could imagine slicing up the world into categories (e.g. plants, animals, fungi) and then decomposing the weight vector of the network into a sum of contributions due to plants, due to animals, and due to fungi (and presumably also interaction terms and such).
Of course in practice people use LLMs in a pretty narrow range of scenarios that don’t really match plants/animals/fungi, and the training data is probably heavily skewed towards the animals (and especially humans) branch of this tree, so realistically you’d need some more pragmatic model.
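As a rough operational sketch of that “simplest form” (my own illustrative assumption, not a worked-out method): under plain SGD the final weights are the initial weights plus a sum of updates, and each update can be booked to whichever branch of the world its training batch came from, with ordering and interaction effects left as the unmodelled interaction terms.

```python
# Sketch (illustrative; interaction/ordering effects deliberately ignored) of attributing a
# network's learned weights to categories of training data: run plain SGD and book each
# weight update to the category of the batch that produced it.
import torch

def train_with_attribution(model, batches, categories, lr=1e-3):
    """`batches` yields (inputs, targets); `categories` yields a label like "plants"."""
    params = [p for p in model.parameters() if p.requires_grad]
    contrib = {}  # category -> per-parameter accumulated weight deltas
    for (x, y), cat in zip(batches, categories):
        model.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()
        if cat not in contrib:
            contrib[cat] = [torch.zeros_like(p) for p in params]
        with torch.no_grad():
            for p, delta in zip(params, contrib[cat]):
                step = -lr * p.grad
                p += step      # the ordinary SGD update...
                delta += step  # ...also booked to the category that caused it
    # Final weights = initial weights + sum over categories of contrib[cat] (exact for plain SGD).
    return contrib
```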