Goodfire (London). Formerly cofounded Apollo Research.
My main research interests are mechanistic interpretability and inner alignment.
Lee Sharkey
‘Fundamental’ vs ‘applied’ mechanistic interpretability research
A technical note on bilinear layers for interpretability
No theoretical reason. The method we used in the Interim Report to combine the two losses into one metric was pretty cursed. It’s probably better to use the L1 loss alone and the reconstruction loss alone and then combine the findings. But having plots for both losses would have added more plots without much gain for the presentation. It also seemed like the method with which it was hardest to discern the difference between full recovery and partial recovery, because the differences were kind of subtle. In future work, some way to use the losses to measure feature recovery will probably be re-introduced. It probably just won’t be the way we used in the interim report.
I strongly suspect this is the case too!
In fact, we might be able to speed up the learning of common features even further:
Pierre Peigné at SERIMATS has done some interesting work that looks at initialization schemes that speed up learning. If you initialize the autoencoders with a sample of datapoints (e.g. initialize the weights with a sample from the MLP activations dataset), each of which we assume to contain a linear combination of only a few of the ground truth features, then the initial phases of feature recovery are much faster*. We haven’t had time to check, but it’s presumably biased to recover the most common features first since they’re the most likely to be in a given data point.
*The ground truth feature recovery metric (MMCS) starts higher at the beginning of autoencoder training, but converges to full recovery at about the same time.
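The initialization idea and the footnote can be sketched on toy data. This is a minimal NumPy sketch, not the code used in the report. It assumes MMCS stands for mean max cosine similarity (for each ground truth feature, the best cosine similarity with any dictionary row, averaged over features). The 256 dimensions match the toy data mentioned elsewhere in these comments; every other size, and all function and variable names, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# dim matches the toy data; the other sizes are assumed for illustration.
dim, n_true, n_learned, n_samples, k_active = 256, 512, 512, 4096, 3

def unit_rows(m):
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.clip(norms, 1e-8, None)

def mmcs(learned, ground_truth):
    """Mean over ground truth features of the max cosine similarity
    with any learned dictionary row (assumed definition of MMCS)."""
    cos = unit_rows(ground_truth) @ unit_rows(learned).T  # (n_true, n_learned)
    return float(cos.max(axis=1).mean())

# Ground truth features: random unit vectors.
true_features = unit_rows(rng.standard_normal((n_true, dim)))

# Each datapoint is a combination of exactly k_active features
# with coefficients drawn uniformly from [0, 1).
active = np.argsort(rng.random((n_samples, n_true)), axis=1)[:, :k_active]
coeffs = np.zeros((n_samples, n_true))
np.put_along_axis(coeffs, active, rng.random((n_samples, k_active)), axis=1)
data = coeffs @ true_features

# Data-sample initialization: use randomly chosen datapoints as dictionary rows.
idx = rng.choice(n_samples, size=n_learned, replace=False)
data_init = unit_rows(data[idx])

# Baseline: random Gaussian initialization.
random_init = unit_rows(rng.standard_normal((n_learned, dim)))

score_data_init = mmcs(data_init, true_features)
score_random_init = mmcs(random_init, true_features)
print(f"MMCS at init, data sample: {score_data_init:.3f}")
print(f"MMCS at init, random:      {score_random_init:.3f}")
```

The sketch only compares the metric before any training, which corresponds to the "starts higher" part of the footnote. It says nothing about convergence time.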
A small update to the Sparse Coding interim research report
And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven’t observed others to notice on their own.
I just want to point out that I’ve written a long list of such obstacles in this article: Circumventing interpretability: How to defeat mind-readers
I believe the example of deep deception that Nate describes in this post is actually a combination of several methods described in that post.
I’ll quote the parts of this post that correspond to particular interpretability circumvention methods in the other post.
But the highly-generic reaction “spin up a subthread to tackle this newly-discovered obstacle”, or “be concerned about patterns in how long things take”, is something that cannot be readily trained away, and so that much happens without issue.
^ This is an example of what I called ‘representational kludging’. It’s an ugly way to represent “Be concerned about the monitors detecting suspicious instructions”, but it turns out to function similarly to such a representation.
It also involves ‘passively externalised representations’, since the observation that there is a delay in the instruction is a representational proxy for the (banned) observation that the monitors detected something suspicious about the instructions.
A different cognitive subthread: Is it possible to look over the detailed history of instructions and their corresponding delay times, and figure out exactly which sorts of instructions cause delays, and rephrase the instructions to avoid those?
^ This is an example of the AI edging toward ‘thinking in the gaps in our tools’.
Simplify-and-critique didn’t do the trick, but perhaps a nearby strategy can help. One nearby strategy that has recently been proving itself is called simplify/translate/solve, and it involves translating the problem into a common abstract format, and then solving the problem there. (Metaphorically: if you’re having trouble telling whether your basket of apples will divide evenly between three friends, consider translating the question into a numerical representation and using long division.)
Simplify/translate/solve is close in the relevant embedding space to simplify-and-critique, which makes it a natural candidate to consider. (Perhaps some of the simplifications from simplify-and-critique can even be reused.)
Normally, simplify/translate/solve wouldn’t be deployed unless the AI had more familiarity with the original domain, and more of an idea about what translated domain might render the solution easier. But normally, threads don’t hiccup like this. And the overall system has recently been granted more computing power, and avoiding a 200-year wetlab lockup is quite important, so it’s OK to try approaches that might involve more brute force than usual.
^ This is an additional example of ‘representational kludging’, since the AI is using representational tools that are nearby in embedding space but that aren’t exactly appropriate for the job, but have a sufficiently similar function that they still do the job.
Overall comments:
I really liked this post!
I like it because previously there was no compelling story for the broad class of concepts to which it points. And I liked it for the name it gives to that broad class (‘deep deception’). I agree that it’s underappreciated that we’re still in trouble in the world where we (somehow) get good enough interpretability to monitor for and halt deceptive thoughts.
Thanks for your interest!
The autoencoder losses reported are the train losses. And you’re right to point at noise potentially being an issue. It’s my strong suspicion that some of the problems in these results are due to there being an insufficient number of data points to train the autoencoders on LM data.
> I would also be interested to test a bit more if this method works on toy models which clearly don’t have many features, such as a mixture of a dozen of gaussians, or random points in the unit square (where there is a lot of room “in the corners”), to see if this method produces strong false positives.
I’d be curious to see these results too!
> Layer 0 is also a baseline, since I expect embeddings to have fewer features than activations in later layers, though I’m not sure how many features you should expect in layer 0.
A rough estimate would be somewhere on the order of the vocabulary size (here 50k). A reason to think it might be more is that layer 0 MLP activations follow an attention layer, which means that features may represent combinations of token embeddings at different sequence positions and there are more potential combinations of tokens than in the vocabulary. A reason to think it may be fewer is that a lot of directions may get ‘compressed away’ in small networks.
My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.
Hm, I don’t think this quite captures what I view the post as saying.
Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the people at Future Company X will think to themselves ‘Ah, planning is necessary for wiping out humanity, so I don’t have to worry about the fact that it’s misaligned!’, but then humanity gets wiped out anyway because planning can emerge organically even when it’s not in the source code”. If that’s what you’re thinking, then, well, I am happy to join you in spreading the generic message that people shouldn’t make unjustified claims about the (lack of) competence of their ML models.
As far as there is a safety-related claim in the post, this captures it much better than the previous quote.
But I happen to have a hunch that the Future Company X people are probably right, and more specifically, that future AGIs will be model-based RL algorithms with a human-written affordance for planning, and that algorithms without such an affordance won’t be able to do treacherous turns and other such things that make them very dangerous to humanity, notwithstanding the nonzero amount of “planning” that arises organically in the trained model as discussed in OP. But I can’t prove that my hunch is correct, and indeed, I acknowledge that in principle it’s quite possible for e.g. model-free RL to make powerful treacherous-turn-capable models, cf. evolution inventing humans. More discussion here.
I think my hunch is in the other direction. One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system. But that’s a lightly held view. It feels plausible to me that your later points (1) and (2) turn out to be right, but again I think I lean in the other direction from you on (1).
I can also imagine a middle ground between our hunches that looks something like “We gave our agent a pretty strong inductive bias toward learning a planning algorithm, but still didn’t force it to learn one, yet it did.”
Why almost every RL agent does learned optimization
That’s correct. ‘Correlated features’ could ambiguously mean “Feature x tends to activate when feature y activates” OR “When we generate feature direction x, its distribution is correlated with feature y’s”. I don’t know if both happen in LMs. The former almost certainly does. The latter doesn’t really make sense in the context of LMs since features are learned, not sampled from a distribution.
There should be a neat theoretical reason for the clean power law where L1 loss becomes too big. But it doesn’t make intuitive sense to me—it seems like if you just add some useless entries in the dictionary, the effect of losing one of the dimensions you do use on reconstruction loss won’t change, so why should the point where L1 loss becomes too big change? So unless you have a bug (or some weird design choice that divides loss by number of dimensions), those extra dimensions would have to be changing something.
The L1 loss on the activations does indeed take the mean activation value. I think it’s probably a more practical choice than simply taking the sum because it creates independence between hyperparameters: We wouldn’t want the size of the sparsity loss to change wildly relative to the reconstruction loss when we change the dictionary size. In the methods section I forgot to include the averaging terms. I’ve updated the text in the article. Good spot, thanks!
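The difference between the two reductions can be made concrete with a small NumPy sketch. The function names and sizes are hypothetical, not taken from the report's code. It takes a sparse code matrix and pads it with unused dictionary entries, as in the thought experiment quoted above: the summed penalty is unchanged, while the averaged penalty shrinks by the ratio of dictionary sizes.

```python
import numpy as np

def l1_sum(codes):
    """L1 penalty summed over dictionary entries, averaged over the batch."""
    return float(np.abs(codes).sum(axis=1).mean())

def l1_mean(codes):
    """L1 penalty averaged over both dictionary entries and the batch."""
    return float(np.abs(codes).mean())

rng = np.random.default_rng(0)
batch, small, large = 64, 256, 1024  # assumed sizes for illustration

# Sparse non-negative codes: about 5% of entries are active.
mask = rng.random((batch, small)) < 0.05
codes_small = rng.random((batch, small)) * mask

# Same codes, padded with dictionary entries that never activate.
padding = np.zeros((batch, large - small))
codes_large = np.concatenate([codes_small, padding], axis=1)

print("sum :", l1_sum(codes_small), l1_sum(codes_large))
print("mean:", l1_mean(codes_small), l1_mean(codes_large))
```

With the mean reduction, a fixed L1 coefficient therefore applies a penalty per active entry that depends on dictionary size, which is one candidate explanation for a trend across dictionary sizes.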
I’d definitely be interested in you including this as a variable in the toy data, and seeing how it affects the hyperparameter search heuristics.
Yeah I think this is probably worth checking too. We probably wouldn’t need to have too many different values to get a rough sense of its effect.
Fig. 9 is cursed. Is there a problem with estimating from just one component of the loss?
Yeah it kind of is… It’s probably better to just look at each loss component separately. Very helpful feedback, thanks!
In the toy datasets, the features have the same scale (uniform from zero to one when active multiplied by a unit vector). However in the NN case, there’s no particular reason to think the feature scales are normalized very much (though maybe they’re normalized a bit due to weight decay and similar). Is there some reason this is ok?
Hm it’s a great point. There’s no principled reason for it. Equivalently, there’s no principled reason to expect the coefficients/activations for each feature to be on the same scale either. We should probably look into a ‘feature coefficient magnitude decay’ to create features that don’t all live on the same scale. Thanks!
E.g., learn a low rank autoencoder like in the toy models paper and then learn to extract features from this representation? I don’t see a particular reason why you used a hand derived superposition representation (which seems less realistic to me?).
One reason for this is that the polytopic features learned by the model in the Toy models of superposition paper can be thought of as approximately maximally distant points on a hypersphere (to my intuitions at least). When using high-ish numbers of dimensions as in our toy data (256), choosing points randomly on the hypersphere achieves approximately the same thing. By choosing points randomly as we did here, we don’t have to train another potentially very large matrix that puts the one-hot features into superposition. The data generation method seemed like it would approximate real features about as well as polytope-like encodings of one-hot features (which are unrealistic too), so the small benefits didn’t seem worth the moderate computational costs. But I could be convinced otherwise on this if I’ve missed some important benefits.
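A quick way to check the intuition about random points on the hypersphere is to sample them and look at pairwise cosine similarities. The sketch below is illustrative only and is not the report's data generator. The 256 dimensions and the "uniform from zero to one when active" coefficients come from the discussion above; the number of features, the activation probability, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256            # dimensionality mentioned for the toy data
n_features = 1024    # assumed number of ground truth features
n_samples = 2000     # assumed dataset size
p_active = 0.01      # assumed probability that a feature is active

# Random points on the unit hypersphere: normalised Gaussian samples.
features = rng.standard_normal((n_features, dim))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pairwise cosine similarities between distinct features.
cos = features @ features.T
off_diag = cos[~np.eye(n_features, dtype=bool)]
mean_abs_cos = float(np.abs(off_diag).mean())
max_abs_cos = float(np.abs(off_diag).max())
print(f"mean |cos| = {mean_abs_cos:.3f}, max |cos| = {max_abs_cos:.3f}")

# Toy dataset: each active feature gets a coefficient uniform in [0, 1),
# and a datapoint is the coefficient-weighted sum of feature directions.
active = rng.random((n_samples, n_features)) < p_active
coeffs = rng.random((n_samples, n_features)) * active
data = coeffs @ features
```

In this many dimensions the typical overlap between two random directions is small, though not zero, which is the sense in which random directions stand in for well-separated features.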
Beyond this, I imagine it would be nicer if you trained a model to do computation in superposition and then tried to decode the representations the model uses—you should still be able to know what the ‘real’ features are (I think).
Nice idea! This could potentially be a nice middle ground between toy data experiments and language model experiments. We’ll look into this, thanks again!
[Interim research report] Taking features out of superposition with sparse autoencoders
Current themes in mechanistic interpretability research
Interpreting Neural Networks through the Polytope Lens
I agree
This sounds really reasonable. I had only been thinking of a naive version of interpretability tools in the loss function that doesn’t attempt to interpret the gradient descent process. I’d be genuinely enthusiastic about the strong version you outlined. I expect to think a lot about it in the near future.
Thanks! Amended.
Bilinear layers—not confident at all! It might make structure more amenable to mathematical analysis so it might help? But as yet there aren’t any empirical interpretability wins that have come from bilinear layers.
Dictionary learning—This is one of my main bets for comprehensive interpretability.
Other areas—I’m also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709