Fantastic research! Any chance you’ll open-source the weights of the insecure qwen model? This would be useful for interp folks.
The Jacobians are much more sparse in pre-trained LLMs than in re-initialized transformers.
This would be very cool if true, but I think further experiments are needed to support it.
Imagine a dumb scenario where during training, all that happens to the MLP is that it “gets smaller”, so that MLP_trained(x) = c * MLP_init(x) for some small c. Then all the elements of the Jacobian also get smaller by a factor of c, and your current analysis—checking the number of elements above a threshold—would conclude that the Jacobian had gotten sparser. This feels wrong: merely rescaling a function shouldn’t affect the sparsity of the computation it implements.
To avoid this issue, you could report a scale-invariant quantity, such as the kurtosis of the Jacobian’s elements (their fourth moment divided by their variance squared), or the ratio of its L1 and L2 norms, or plenty of other options. But these quantities still aren’t perfect, since they aren’t invariant under linear transformations of the model’s activations:
E.g. suppose an mlp_out feature F depends linearly on some mlp_in feature G, which is roughly orthogonal to F. If we stretch all model activations along the F direction, and retrain our SAEs, then the new mlp_out SAE will contain (in an ideal world) a feature F’ which is the same as F but with activations larger by some factor. On the other hand, the mlp_in SAE should contain a feature G’ which is roughly the same as G. Hence the (F, G) element of the Jacobian has been made bigger, simply by applying a linear transformation to the model’s activations. Generally this will affect our sparsity measure, which feels wrong: merely applying a linear map to all model activations shouldn’t change the sparsity of the computation being done on those activations. In other words, our sparsity measure shouldn’t depend on a choice of basis for the residual stream.
I’ll try to think of a principled measure of the sparsity of the Jacobian. In the meantime, I think it would still be interesting to see a scale-invariant quantity reported, as suggested above (a rough sketch is below).
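For what it’s worth, here’s a minimal sketch of the two quantities I have in mind (the `jacobian` array is just a stand-in for whatever feature-to-feature Jacobian you compute):

```python
import numpy as np

def scale_invariant_sparsity(jacobian: np.ndarray) -> dict:
    """Two scale-invariant sparsity proxies for the Jacobian's elements."""
    x = jacobian.ravel()
    # Kurtosis-style statistic: fourth moment over squared second moment.
    # Invariant under x -> c * x; large when a few elements dominate.
    kurtosis_ratio = np.mean(x**4) / np.mean(x**2) ** 2
    # L1/L2 ratio: equals 1 for a single nonzero element and sqrt(n) for a
    # perfectly uniform vector, so smaller means sparser.
    l1_over_l2 = np.abs(x).sum() / np.linalg.norm(x)
    return {"kurtosis_ratio": kurtosis_ratio, "l1_over_l2": l1_over_l2}
```

Both numbers are unchanged if the whole Jacobian is multiplied by a constant, so the “MLP just gets smaller” scenario above no longer registers as increased sparsity.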
We have pretty robust measurements of complexity of algorithms from SLT
This seems overstated. What’s the best evidence so far that the LLC positively correlates with the complexity of the algorithm implemented by a model? In fact, do we even have any models whose circuitry we understand well enough to assign them a “complexity”?
… and it seems like similar methods can lead to pretty good ways of separating parallel circuits (Apollo also has some interesting work here that I think constitutes real progress)
Citation?
Same difference
I’d prefer “basis we just so happen to be measuring in”. Or “measurement basis” for short.
You could use “pointer variable”, but this would commit you to writing several more paragraphs to unpack what it means (which I encourage you to do, maybe in a later post).
Your use of “pure state” is totally different to the standard definition (namely rank(rho)=1). I suggest using a different term.
The QM state space has a preferred inner product, which we can use to e.g. dualize a (0,2) tensor (i.e. a thing that takes two vectors and gives a number) into a (1,1) tensor (i.e. an operator). So we can think of it either way.
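Explicitly: once we fix the inner product ⟨·,·⟩, a (0,2) tensor B and a (1,1) tensor (an operator A) carry the same information via

$$ B(u, v) \;=\; \langle u,\, A v \rangle \quad \text{for all vectors } u, v. $$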
Oops, good spot! I meant to write 1 minus that quantity. I’ve edited the OP.
This seems very interesting, but I think your post could do with a lot more detail. How were the correlations computed? How strongly do they support PRH? How was the OOD data generated? I’m sure the answers could be pieced together from the notebook, but most people won’t click through and read the code.
Ah, I think I understand. Let me write it out to double-check, and in case it helps others.
Say , for simplicity. Then . This sum has nonzero terms.
In your construction, . Focussing on a single neuron, labelled by , we have . This sum has nonzero terms.
So the preactivation of an MLP hidden neuron in the big network is . This sum has nonzero terms. We only “want” the terms where ; the rest (i.e. the majority) are noise. Each noise term in the sum is a random vector, so each of the different noise terms are roughly orthogonal, and so the norm of the noise is (times some other factors, but this captures the -dependence, which is what I was confused about).
I’m confused by the read-in bound:
Sure, each neuron reads from of the random subspaces. But in all but of those subspaces, the big network’s activations are smaller than , right? So I was expecting a tighter bound—something like:
Ah, so I think you’re saying “You’ve explained to me the precise reason why energy and momentum (i.e. time and space) are different at the fundamental level, but why does this lead to the differences we observe between energy and momentum (time and space) at the macro-level?”
This is a great question, and as with any question of the form “why does this property emerge from these basic rules”, there’s unlikely to be a short answer. E.g. if you said “given our understanding of the standard model, explain how a cell works”, I’d have to reply “uhh, get out a pen and paper and get ready to churn through equations for several decades”.
In this case, one might be able to point to a few key points that tell the rough story. You’d want to look at properties of solutions of PDEs on manifolds with a metric of signature (1,3) (which means “one direction on the manifold is different to the other three, in that it carries a minus sign in the metric compared to the others”). I imagine that, generically, these solutions behave differently with respect to the “1” direction and the “3” directions. These differences will lead to the rest of the emergent differences between space and time. Sorry I can’t be more specific!
> could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?
Momentum is to space what energy is to time. Precisely, energy generates (in the Lie group sense) time-translations, whereas momentum generates spatial translations. So any question about ways in which energy and momentum differ is really a question about how time and space differ.
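(To spell out “generates” in standard QM notation: the unitaries implementing time- and space-translations are

$$ U(t) = e^{-iHt/\hbar}, \qquad T(a) = e^{-i\,\hat{p}\,a/\hbar}, $$

with the Hamiltonian H and the momentum p̂ sitting in the exponents.)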
In ordinary quantum mechanics, time and space are treated very differently: t is a coordinate whereas the position x is a dynamical variable (which happens to be operator-valued). The equations of QM tell us how x evolves as a function of t.
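(In equations: in the Heisenberg picture the position operator obeys

$$ \frac{d\hat{x}}{dt} \;=\; \frac{i}{\hbar}\,[\hat{H}, \hat{x}], $$

i.e. the dynamical variable x̂ evolves as a function of the coordinate t.)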
But ordinary QM was long-ago replaced by quantum field theory, in which time and space are on a much more even footing: they are both coordinates, and the equations of QFT tell us how a third thing (the field) evolves as a function of t and x. Now, the only difference between time and space is that there is only one dimension of the former but three of the latter (there may be some other very subtle differences I’m glossing over here, but I wouldn’t be surprised if they ultimately stem from this one).
All of this is to say: our best theory of how nature works (QFT), is neither formulated as “energy-first” nor as “momentum-first”. Instead, energy and momentum are on fairly equal footing.
Sure, there are plenty of quantities that are globally conserved at the fundamental (QFT) level. But most of these quantities aren’t transferred between objects at the everyday, macro level we humans are used to.
E.g. 1: most everyday objects have neutral electrical charge (because there exist positive and negative charges, which tend to attract and roughly cancel out) so conservation of charge isn’t very useful in day-to-day life.
E.g. 2: conservation of color charge doesn’t really say anything useful about everyday processes, since it’s only changed by subatomic processes (this is again basically due to the screening effect of particles with negative color charge, though the story here is much more subtle, since the main screening effect is due to virtual particles rather than real ones).
The only other fundamental conserved quantity I can think of that is nontrivially exchanged between objects at the macro level is momentum. And… momentum seems roughly as important as energy?
I guess there is a question about why energy, rather than momentum, appears in thermodynamics. If you’re interested, I can answer in a separate comment.
I’ll just answer the physics question, since I don’t know anything about cellular automata.
When you say time-reversal symmetry, do you mean that t → T-t is a symmetry for any T?
If so, the composition of two such transformations is a time-translation, so we automatically get time-translation symmetry, which implies the 1st law.
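Explicitly, composing a reversal about T_1 with a reversal about T_2 gives

$$ t \;\mapsto\; T_1 - t \;\mapsto\; T_2 - (T_1 - t) \;=\; t + (T_2 - T_1), $$

i.e. a time-translation by T_2 − T_1.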
If not, then the 1st law needn’t hold. E.g. take any time-dependent Hamiltonian satisfying H(t) = H(-t). This has time-reversal symmetry about t=0, but H is generically not conserved.
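A concrete instance (my example, not anything from the OP): a harmonic oscillator with a time-dependent spring constant,

$$ H(t) \;=\; \frac{p^2}{2m} + \frac{1}{2}\,k\,(1+t^2)\,x^2 , $$

which satisfies H(t) = H(−t), yet ∂H/∂t = k t x² is nonzero away from t = 0, so H is not conserved.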
The theorem guarantees the existence of a -dimensional analytic manifold and a real analytic map
such that for each coordinate of one can write
I’m a bit confused here. First, I take it that labels coordinate patches? Second, consider the very simple case with and . What would put into the stated form?
Let V(ϵ) be the volume of a behavioral region at cutoff ϵ. Your behavioral LLC at finite noise scale is λ(ϵ) := dlogV/dlogϵ, which is invariant under rescaling V by a constant. This information about the overall scale of V seems important. What’s the reason for throwing it out in SLT?
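(Spelling out the invariance: for any constant c > 0,

$$ \frac{d\log(cV)}{d\log\epsilon} \;=\; \frac{d(\log c + \log V)}{d\log\epsilon} \;=\; \frac{d\log V}{d\log\epsilon}, $$

so the overall scale of V drops out of λ entirely.)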