AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
How much money would you guess was lost on this?
Yes.
Technically you didn’t specify that $c_i(x)$ can’t be an arbitrary function, so you’d be able to reconstruct activations combining different bases, but it’d be horribly convoluted in practice.
I wouldn’t even be too fussed about ‘horribly convoluted’ here. I’m saying it’s worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.
The problem here isn’t that we can’t make a dictionary that includes all the feature directions as dictionary elements. We can do that. For example, while we can’t write $\vec{a}(x)=\sum_{i=1}^{1000}c_{a_i}(x)\vec{f}_{a_i}+\sum_{j=1}^{50}c_{b_j}(x)\vec{f}_{b_j}$, because those sums each already equal $\vec{a}(x)$ on their own, we can write $\vec{a}(x)=\frac{1}{2}\sum_{i=1}^{1000}c_{a_i}(x)\vec{f}_{a_i}+\frac{1}{2}\sum_{j=1}^{50}c_{b_j}(x)\vec{f}_{b_j}$.
The problem is instead that we can’t make a dictionary that has the feature activations as the coefficients in the dictionary. This is bad because it means our dictionary activations cannot equal the scalar variables the model’s own circuits actually care about. They cannot equal the ‘features of the model’ in the sense defined at the start, the scalar features comprising its ontology. As a result, if we were to look at a causal graph of the model, using the half-size dictionary feature activations we picked as the graph nodes, a circuit taking in the elephant feature $c_{\text{elephant}}(x)$ through a linear read-off along the direction $\vec{f}_{\text{elephant}}$ would have edges in our graph connecting it to both the elephant direction, making up about 50% of the total contribution, and the fifty attribute directions, making up the remaining 50%. The same goes the other way around: any circuit reading in even a single attribute feature will have edges connecting to all of the animal features[1], making up 50% of the total contribution. It’s the worst of both worlds. Every circuit looks like a mess now.
Since the animals are sparse, in practice this usually means edges to a small set of different animals for every data point. Whichever ones happen to be active at the time.
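If it helps to see the 50/50 split concretely, here’s a quick numerical sketch under the toy animals-and-attributes setup from the original post. The NumPy setup, seed, and numbers are purely illustrative assumptions, not anything from a real model:

```python
import numpy as np

rng = np.random.default_rng(4)
n_attr = 50
attr_dirs = np.eye(n_attr)                    # fifty orthonormal attribute directions
eleph_coeffs = rng.normal(size=n_attr)
eleph_coeffs /= np.linalg.norm(eleph_coeffs)
elephant = eleph_coeffs @ attr_dirs           # 'elephant' is a particular setting of the attributes

a = elephant                                  # data point: just an elephant, activation 1.0
size_dir = attr_dirs[0]                       # a circuit reads off the 'size' attribute
total = size_dir @ a                          # what the circuit actually reads in

# Half-size dictionary: coefficient 0.5 on 'elephant', 0.5 * (attribute values) on the attributes.
via_elephant = 0.5 * 1.0 * (size_dir @ elephant)
via_attributes = 0.5 * (attr_dirs @ a) @ (attr_dirs @ size_dir)
print(via_elephant / total, via_attributes / total)   # 0.5 0.5 -- the read-off edge gets split
```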
E.g. it’s not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction.
You can’t represent elephants along with arbitrary combinations of attributes. You can’t do that in a scheme where feature directions are fully random with no geometry either, though. There, only a small number of features can have non-zero values at the same time, so you still only get about fifty non-zero attribute features at once, maximum.[1]
We would want the dictionary to learn the attributes, not arbitrary combinations of attributes, since these are the true “base units” that can vary freely.
You can call them the “base units” if you like. But that won’t change the fact that some directions in the space spanned by those “base units” are special, with associated circuits that care about those directions in particular, and understanding or even recognising those circuits in a causal graph made of the “base units” will be pretty darned hard. For the same reason trying to understand the network in the neuron basis is hard.
Put another way, there’s no way to represent an “elephant” in this scheme without also attaching attributes to it.
Yes.
Likewise, it’s not possible to differentiate between an elephant with the set of attributes x y and z and a rabbit with identical attributes x y and z, since the sum of attributes are what you’re calling an elephant or rabbit.
Not quite. You cannot specify a rabbit and simultaneously specify the rabbit having arbitrary numerical attribute values for attributes differing from normal rabbits. You can have a rabbit, and some attributes treated as sparse boolean-ish features at the same time. E.g. $\vec{a}(x)=\vec{f}_{\text{rabbit}}+\vec{f}_{\text{cute}}$ works. Circuits downstream that store facts about rabbits will still be triggered by this $\vec{a}(x)$. Circuits downstream that do something with the ‘cute’ attribute will be reading in a ‘cute’-attribute value of $1$ plus the ‘cute’-coefficient of rabbits.
A consequence of this is that ‘cute rabbit’ is a bit cuter than either ‘cute’ or ‘rabbit’ on their own. But that doesn’t seem particularly strange to me. Associations in my own mind sure seem to work like that.
Less, if you want to be able to perform computation in superposition.
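A minimal numerical sketch of the cute-rabbit point, under the same toy assumptions as above (the names, seed, and numbers are mine, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_attr = 50
attr_dirs = np.eye(n_attr)
cute = attr_dirs[0]                              # the 'cute' attribute direction

# 'rabbit' is a unit-norm combination of the fifty attributes.
rabbit_coeffs = rng.normal(size=n_attr)
rabbit_coeffs /= np.linalg.norm(rabbit_coeffs)
rabbit = rabbit_coeffs @ attr_dirs

# Activation: a rabbit, plus the 'cute' attribute switched on as a boolean-ish extra.
a = rabbit + cute

print(rabbit @ a)   # = 1 + rabbit's cute-coefficient: rabbit-fact circuits still fire
print(cute @ a)     # = 1 + rabbit's cute-coefficient: a cute rabbit reads as a bit cuter than either alone
```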
Similarly, for people wanting to argue from the other direction, who might think a low current valuation is case-closed evidence against their success chances
To be clear: I think the investors would be wrong to think that AGI/ASI soon-ish isn’t pretty likely.
OpenAI’s valuation is very much reliant on being on a path to AGI in the not-too-distant future.
Really? I’m mostly ignorant on such matters, but I’d thought that their valuation seemed comically low compared to what I’d expect if their investors thought that OpenAI was likely to create anything close to a general superhuman AI system in the near future.[1] I considered this evidence that they think all the AGI/ASI talk is just marketing.
Well ok, if they actually thought OpenAI would create superintelligence as I think of it, their valuation would plummet because giving people money to kill you with is dumb. But there’s this space in between total obliviousness and alarm, occupied by a few actually earnest AI optimists. And, it seems to me, not occupied by the big OpenAI investors.
If I understand correctly, it sounds like you’re saying there is a “label” direction for each animal that’s separate from each of the attributes.
No, the animal vectors are all fully spanned by the fifty attribute features.
I’m confused why a dictionary that consists of a feature direction for each attribute and each animal label can’t explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal.
The animal features are sparse. The attribute features are not sparse.[1]
In this sense the activations are just the sum of the various elements of the dictionary multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning.
The magnitudes in a dictionary seeking to decompose the activation vector into these 1050 features will not be able to match the actual magnitudes of the features as seen by linear probes and the network’s own circuits.
Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, therefore you can come up with infinite ways to span the space using these basis components?
No, that is not the idea.
Relative to the animal features at least. They could still be sparse relative to the rest of the network if this 50-dimensional animal subspace is rarely used.
‘elephant’ would be a sum of fifty attribute feature vectors, all with scalar coefficients that match elephants in particular. The coefficients would tend to have sizes on the order of $\frac{1}{\sqrt{50}}$, because the subspace is fifty-dimensional. So, if you wanted to have a pure tiny feature and an elephant feature active at the same time to encode a tiny elephant, ‘elephant’ and ‘tiny’ would be expected to have read-off interference on the order of $\frac{1}{\sqrt{50}}$. Alternatively, you could instead encode a new animal ‘tiny elephant’ as its own point in the fifty-dimensional space. Those are actually distinct things here. If this is confusing, maybe it helps to imagine that the name for ‘tiny elephant’ is ‘exampledon’, and exampledons just happen to look like tiny elephants.
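A quick numeric check of the $\frac{1}{\sqrt{50}}$ estimate, under the toy setup assumed here (random unit-norm animal vectors over fifty orthonormal attributes; everything in the snippet is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
n_attr, n_animal = 50, 1000
coeffs = rng.normal(size=(n_animal, n_attr))
coeffs /= np.linalg.norm(coeffs, axis=1, keepdims=True)      # animals as unit vectors over the attributes

tiny = np.eye(n_attr)[0]                                     # the 'tiny' attribute direction
interference = coeffs @ tiny                                 # <animal, tiny> for each animal
print(np.sqrt((interference ** 2).mean()), 1 / np.sqrt(50))  # both come out around 0.14
```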
E.g. the concept of a “furry elephant” or a “tiny elephant” would be unrepresentable in this scheme
It’s representable. E.g. the model can learn a circuit reading in a direction that is equal to the sum of the furry attribute direction and the elephant direction, or the tiny direction and the elephant direction respectively. This circuit can then store facts about furry elephants or tiny elephants.
I feel like in this scheme, it’s not really the case that there’s 1000 animal directions, since the base unit is the attributes
In what sense? If you represent the network computations in terms of the attribute features, you will get a very complicated computational graph with lots of interaction lines going all over the place. So clearly, the attributes on their own are not a very good basis for understanding the network.
Similarly, you can always represent any neural network in the standard basis of the network architecture. Trivially, all features can be seen as mere combinations of these architectural ‘base units’. But if you try to understand what the network is doing in terms of interactions in the standard basis, you won’t get very far.
For there to be a true “elephant” direction, then it should be possible to have any set of arbitrary attributes attached to an elephant (small, furry, pink, etc...), and this would require that there is a “label” direction that indicates “elephant” that’s mostly orthogonal to every other feature so it can be queried uniquely via projection.
The ‘elephant’ feature in this setting is mostly-orthogonal to every other feature in the ontology, including the features that are attributes. So it can be read out with a linear projection. ‘elephant’ and ‘pink’ shouldn’t have substantially higher cosine similarity than ‘elephant’ and ‘parrot’.
you mean it does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Yes.
I don’t think I am very good at explaining my thoughts on this in text. Some prior writings that have informed my models here are the MIRI dialogues, and the beginning parts of Steven Byrnes’ sequence on brain-like AGI, which sketch how the loss functions human minds train on might look and gave me an example apart from evolution to think about.
Some scattered points that may or may not be of use:
There is something here about path dependence. Late in training at high capability levels, very many things the system might want are compatible with scoring very well on the loss, because the system realises that doing things that score well on the loss is instrumentally useful. Thus, while many aspects of how the system thinks are maybe nailed down quite definitively and robustly by the environment, what it wants does not seem nailed down in this same robust way. Desires thus seem like they can be very chaotically dependent on dynamics in early training, what the system reflected on when, which heuristics it learned in what order, and other low level details like this that are very hard to precisely control.
I feel like there is something here about our imaginations, or at least mine, privileging the hypothesis. When I imagine an AI trained to say things a human observer would rate as ‘nice’, and to not say things a human observer rates as ‘not nice’, my imagination finds it natural to suppose that this AI will generalise to wanting to be a nice person. But when I imagine an AI trained to respond in English, rather than French or some other language, I do not jump to supposing that this AI will generalise to terminally valuing the English language.
Every training signal we expose the AI to reinforces very many behaviours at the same time. The human raters that may think they are training the AI to be nice are also training it to respond in English (because the raters speak English), to respond to queries at all instead of ignoring them, to respond in English that is grammatically correct enough to be understandable, and a bunch of other things. The AI is learning things related to ‘niceness’, ‘English grammar’ and ‘responsiveness’ all at the same time. Why would it generalise in a way that entangles its values with one of these concepts, but not the others?
What makes us single out the circuits responsible for giving nice answers to queries as special, as likely to be part of the circuit ensemble that will cohere into the AI’s desires when it is smarter? Why not circuits for grammar or circuits for writing in the style of 1840s poets or circuits for research taste in geology?
We may instinctively think of our constitution that specifies niceness as equivalent to some sort of monosemantic niceness-reinforcing training signal. But it really isn’t. The concept of niceness sticks out to us when we look at the text of the constitution, because the presence of that concept is a thing that makes this text different from a generic text. But the constitution, and even more so any training signal based on the constitution, will by necessity be entangled with many concepts besides just niceness, and the training will reinforce those concepts as well. Why then suppose that the AI’s nascent shards of value are latching on to niceness, but are not in the same way latching on to all the other stuff its many training signals are entangled with?
It seems to me that there is no good reason to suppose this. Niceness is part of my values, so when I see it in the training signal I find it natural to imagine that the AI’s values would latch on to it. But I do not as readily register all the other concepts in the training signal the AI’s values might latch on to, because to my brain that does not value these things, they do not seem value-related.
There is something here about phase changes under reflection. If the AI gets to the point of thinking about itself and its own desires, the many shards of value it may have accumulated up to this point are going to amalgamate into something that may be related to each of the shards, but not necessarily in a straightforwardly human-intuitive way. For example, sometimes humans that have value shards related to empathy reflect on themselves, and emerge being negative utilitarians that want to kill everyone. For another example, sometimes humans reflect on themselves and seem to decide that they don’t like the goals they have been working towards, and they’d rather work towards different goals and be different people. There, the relationship between values pre-reflection and post-reflection can be so complicated that it can seem to an outside observer and the person themselves like they just switched values non-deterministically, by a magical act of free will. So it’s not enough to get some value shards that are kind of vaguely related to human values into the AI early in training. You may need to get many or all of the shards to be more than just vaguely right, and you need the reflection process to proceed in just the right way.
Nope. Try it out. If you attempt to split the activation vector into 1050 vectors for animals + attributes, you can’t get the dictionary activations to equal the feature activations $c_{a_i}(x)$, $c_{b_j}(x)$.
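For anyone who wants to try it out concretely, here’s a minimal NumPy sketch assuming the toy setup from the original post (orthonormal attributes, animals as unit-norm combinations of them; all specifics are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_attr, n_animal = 50, 50, 1000

attr_dirs = np.eye(d)                                    # fifty orthonormal attribute directions
animal_coeffs = rng.normal(size=(n_animal, n_attr))
animal_coeffs /= np.linalg.norm(animal_coeffs, axis=1, keepdims=True)
animal_dirs = animal_coeffs @ attr_dirs                  # each animal = a setting of the attributes

# Data point: only 'elephant' (index 0) is active, with feature activation 1.0.
c_animal = np.zeros(n_animal); c_animal[0] = 1.0
a = c_animal @ animal_dirs                               # the model's activation vector

# The attribute features are genuinely active too: their values are elephant's coefficients.
c_attr = attr_dirs @ a                                   # exact read-off, since the attributes are orthonormal

# A 1050-element dictionary whose coefficients equal the true feature activations
# reconstructs 2*a(x), not a(x):
recon = c_animal @ animal_dirs + c_attr @ attr_dirs
print(np.allclose(recon, 2 * a))                         # True

# The fix discussed above: halve every coefficient. The reconstruction is now right,
# but the dictionary activations no longer equal the feature activations.
recon_half = 0.5 * c_animal @ animal_dirs + 0.5 * c_attr @ attr_dirs
print(np.allclose(recon_half, a))                        # True
```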
I did not know about this already.
For the same reasons training an agent on a constitution that says to care about x does not, at arbitrary capability levels, produce an agent that cares about x.
If you think that doing this does produce an agent that cares about x even at arbitrary capability levels, then I guess in your world model it would indeed be consistent for that to work for inducing corrigibility as well.
The features a model thinks in do not need to form a basis or dictionary for its activations.
Three assumptions people in interpretability often make about the features that comprise a model’s ontology:
Features are one-dimensional variables.
Meaning, the value of feature $i$ on data point $x$ can be represented by some scalar number $c_i(x)$.
Features are ‘linearly represented’. Meaning, the value of feature $i$ can be recovered from the model’s activation vector $\vec{a}(x)$ with a linear projection[1]: $c_i(x)\approx\vec{f}_i\cdot\vec{a}(x)$, where $\vec{f}_i$ is the feature’s associated direction[2] in activation space.
Features form a ‘basis’ for activation space.[3]
Meaning, the model’s activations at a given layer can be decomposed into a sum over all the features of the model represented in that layer[4]: $\vec{a}(x)=\sum_i c_i(x)\,\vec{f}_i$.
It seems to me that a lot of people are not tracking that 3) is an extra assumption they are making. I think they think that assumption 3) is a natural consequence of assumptions 1) and 2), or even just of assumption 2) alone. It’s not.
Suppose we have a language model that has a thousand sparsely activating, scalar, linearly represented features for different animals. So, “elephant”, “giraffe”, “parrot”, and so on, all with their own associated feature directions $\vec{f}_{a_1},\dots,\vec{f}_{a_{1000}}$. The model embeds those one thousand animal features in a fifty-dimensional sub-space of the activations. This subspace has a meaningful geometry: It is spanned by a set of fifty directions $\vec{f}_{b_1},\dots,\vec{f}_{b_{50}}$, corresponding to different attributes animals have. Things like “furriness”, “size”, “length of tail” and such. So, each animal feature can equivalently be seen as either one of a thousand sparsely activating scalar features, or just as a particular setting of those fifty not-so-sparse scalar attributes.
Some circuits in the model act on the animal directions $\vec{f}_{a_i}$. E.g. they have query-key lookups for various facts about elephants and parrots. Other circuits in the model act on the attribute directions $\vec{f}_{b_j}$. They’re involved in implementing logic like ‘if there’s a furry animal in the room, people with allergies might have problems’. Sometimes they’re involved in circuits that have nothing to do with animals whatsoever. The model’s “size” attribute is the same one used for houses and economies for example, so that direction might be read-in to a circuit storing some fact about economic growth.
So, both the one thousand animal features and the fifty attribute features are elements of the model’s ontology, variables along which small parts of its cognition are structured. But we can’t make a basis for the model activations out of those one thousand and fifty features of the model. We can write either $\vec{a}(x)=\sum_{i=1}^{1000}c_{a_i}(x)\vec{f}_{a_i}$, or $\vec{a}(x)=\sum_{j=1}^{50}c_{b_j}(x)\vec{f}_{b_j}$. But $\sum_{i=1}^{1000}c_{a_i}(x)\vec{f}_{a_i}+\sum_{j=1}^{50}c_{b_j}(x)\vec{f}_{b_j}$ does not equal the model activation vector $\vec{a}(x)$, it’s too large.
Say we choose the one thousand animal features $\{\vec{f}_{a_i}\}$ as our basis for this subspace of the example model’s activations, and then go on to make a causal graph of the model’s computation, with each basis element being a node in the graph, and lines between nodes representing connections. Then the circuits dealing with query-key lookups for animal facts will look neat and understandable at a glance, with few connections and clear logic. But the circuits involving the attributes will look like a mess. A circuit reading in the size direction will have a thousand small but collectively significant connections to all of the animals.
If we choose the fifty attribute features $\{\vec{f}_{b_j}\}$ as our basis for the graph instead, circuits that act on some of the fifty attributes will look simple and sensible, but now the circuits storing animal facts will look like a mess. A circuit implementing “space” AND “cat” ⇒ [increase association with rainbows] is going to have fifty connections to features like “size” and “furriness”.
The model’s ontology does not correspond to either the animal basis $\{\vec{f}_{a_i}\}$ or the attribute basis $\{\vec{f}_{b_j}\}$. It just does not correspond to any basis of activation space at all, not even in a loose sense. Different circuits in the model can just process the activations in different bases, and they are under no obligation to agree with each other. Not even if they are situated right next to each other, in the same model layer.
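To make the “mess” claim a bit more concrete, here is a small sketch under the same toy assumptions, treating the potential edge weight from a dictionary element to a circuit as the dot product between that element and the circuit’s read-off direction (all names, thresholds, and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_attr, n_animal = 50, 1000
attr_dirs = np.eye(n_attr)
animal_coeffs = rng.normal(size=(n_animal, n_attr))
animal_coeffs /= np.linalg.norm(animal_coeffs, axis=1, keepdims=True)
animal_dirs = animal_coeffs @ attr_dirs

size_dir = attr_dirs[0]            # a circuit reading the 'size' attribute
elephant_dir = animal_dirs[0]      # a circuit doing a lookup on the 'elephant' feature

# Count how many dictionary elements have a non-negligible edge to each circuit.
print(np.sum(np.abs(animal_dirs @ size_dir) > 0.01))    # most of the 1000 animals: 'size' is a mess in the animal basis
print(np.sum(np.abs(attr_dirs @ size_dir) > 0.01))      # exactly 1: a single clean edge in the attribute basis
print(np.sum(np.abs(attr_dirs @ elephant_dir) > 0.01))  # most of the 50 attributes: the elephant lookup is a mess there
```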
Note that for all of this, we have not broken assumption 1) or assumption 2). The features this model makes use of are all linearly represented and scalar. We also haven’t broken the secret assumption 0) I left out at the start, that the model can be meaningfully said to have an ontology comprised of elementary features at all.
I’ve seen people call out assumptions 1) and 2), and at least think about how we can test whether they hold, and how we might need to adjust our interpretability techniques if and when they don’t hold. I have not seen people do this for assumption 3). Though I might just have missed it, of course.
My current dumb guess is that assumption 2) is mostly correct, but assumptions 1) and 3) are both incorrect.
The reason I think assumption 3) is incorrect is that the counterexample I sketched here seems to me like it’d be very common. LLMs seem to be made of lots of circuits. Why would these circuits all share a basis? They don’t seem to me to have much reason to.
I think a way we might find the model’s features without assumption 3) is to focus on the circuits and computations first. Try to directly decompose the model weights or layer transitions into separate, simple circuits, then infer the model’s features from looking at the directions those circuits read and write to. In the counterexample above, this would have shown us both the animal features and the attribute features.
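As a very rough sketch of what ‘circuits first’ could look like in the simplest imaginable case: if a layer transition were literally a sum of a few rank-1 circuits, then decomposing the weights (here just an SVD, purely for illustration; real networks would need something much better) recovers the read and write directions, i.e. candidate features, without ever assuming those directions form a dictionary for the activations. Everything in this snippet is an assumption of the toy, not a claim about any existing method:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 50

# Suppose one layer transition contains two simple circuits, each a rank-1 map:
# it reads one direction and writes to another.
read_1, write_1 = rng.normal(size=d), rng.normal(size=d)
read_2, write_2 = rng.normal(size=d), rng.normal(size=d)
W = 2.0 * np.outer(write_1, read_1) + 1.0 * np.outer(write_2, read_2)

# Decomposing the weights recovers the directions the circuits read and write to.
U, S, Vt = np.linalg.svd(W)
print(S[:3])                                            # two large singular values, the rest ~0
print(np.abs(Vt[0] @ read_1) / np.linalg.norm(read_1))  # ~1: top right-singular vector aligns with the first read direction
```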
Potentially up to some small noise. For a nice operationalisation, see definition 2 on page 3 of this paper.
It’s a vector because we’ve already assumed that features are all scalar. If a feature was two-dimensional instead, this would be a projection into an associated two-dimensional subspace.
I’m using the term basis loosely here, this also includes sparse overcomplete ‘bases’ like those in SAEs. The more accurate term would probably be ‘dictionary’, or ‘frame’.
Or if the computation isn’t layer aligned, the activations along some other causal cut through the network can be written as a sum of all the features represented on that cut.
I think the value proposition of AI 2027-style work lies largely in communication. Concreteness helps people understand things better. The details are mostly there to provide that concreteness, not to actually be correct.
If you imagine the set of possible futures that people like Daniel, you or I think plausible as big distributions, with high entropy and lots of unknown latent variables, the point is that the best way to start explaining those distributions to people outside the community is to draw a sample from them and write it up. This is a lot of work, but it really does seem to help. My experience matches habryka’s here. Most people really want to hear concrete end-to-end scenarios, not abstract discussion of the latent variables in my model and their relationships.
The bound is the same one you get for normal Solomonoff induction, except restricted to the set of programs the cut-off induction runs over. It’s a bound on the total expected error in terms of CE loss that the predictor will ever make, summed over all datapoints.
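For concreteness, the bound I have in mind is of roughly this form (stated from memory, so exact constants and conditions may differ; $\mu$ is the true computable environment and $M$ the inductor’s predictive distribution):

$$\sum_{t=1}^{\infty}\;\mathbb{E}_{x_{<t}\sim\mu}\Big[D_{\mathrm{KL}}\big(\mu(\cdot\mid x_{<t})\,\big\|\,M(\cdot\mid x_{<t})\big)\Big]\;\le\;K(\mu)\ln 2$$

i.e. the total expected excess CE loss over the true environment, summed over all datapoints, is at most the description length of the shortest program for $\mu$ within the program class the inductor runs over (the cut-off class, in the cut-off case).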
Look at the bound for cut-off induction in that post I linked, maybe? Hutter might also have something on it.
Can also discuss on a call if you like.
Note that this doesn’t work in real life, where the programs are not in fact restricted to outputting bit string predictions and can e.g. try to trick the hardware they’re running on.
You also want one that generalises well, and doesn’t do performative predictions, and doesn’t have goals of its own. If your hypotheses aren’t even intended to be reflections of reality, how do we know these properties hold?
Because we have the prediction error bounds.
When we compare theories, we don’t consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself.
E.g. the theory of evolution isn’t quite code for a costly simulation. But it can be viewed as a set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn’t involve comparing the complexity of the set of approximations we used to work out the consequences of each theory.
Yes.
That’s fine. I just want a computable predictor that works well. This one does.
Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is about finding more effective approximations for stuff.
Edit: Actually, I don’t think this would yield you a different general predictor as the program dominating the posterior. A general inductor program running program $p$ is pretty much never going to be the shortest implementation of $p$.
The kind of ‘alignment technique’ that successfully points a dumb model in the rough direction of doing the task you want in early training does not necessarily straightforwardly connect to the kind of ‘alignment technique’ that will keep a model pointed quite precisely in the direction you want after it gets smart and self-reflective.
For a maybe not-so-great example, human RL reward signals in the brain used to successfully train and aim human cognition from infancy to point at reproductive fitness. Before the distributional shift, our brains usually neither got completely stuck in reward-hack loops, nor used their cognitive labour for something completely unrelated to reproductive fitness. After the distributional shift, our brains still don’t get stuck in reward-hack loops that much and we successfully train to intelligent adulthood. But the alignment with reproductive fitness is gone, or at least far weaker.