Ah, I think I understand. Let me write it out to double-check, and in case it helps others.
Say , for simplicity. Then . This sum has nonzero terms.
In your construction, . Focussing on a single neuron, labelled by , we have . This sum has nonzero terms.
So the preactivation of an MLP hidden neuron in the big network is . This sum has nonzero terms.
We only “want” the terms where ; the rest (i.e. the majority) are noise. Each noise term in the sum is a random vector, so the different noise terms are roughly orthogonal to one another, and so the norm of the noise is (times some other factors, but this captures the -dependence, which is what I was confused about).
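That scaling claim — a sum of many roughly orthogonal random vectors has norm growing like the square root of the number of terms — is easy to sanity-check numerically. A quick sketch (my own, with an arbitrary ambient dimension, not taken from the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000  # ambient dimension (hypothetical choice; just needs d >> N)

for N in [10, 100, 1000]:
    # N random unit vectors in d dimensions are nearly orthogonal when d >> N
    vs = rng.standard_normal((N, d))
    vs /= np.linalg.norm(vs, axis=1, keepdims=True)
    noise = vs.sum(axis=0)
    # ratio of the noise norm to sqrt(N); should hover near 1
    print(N, np.linalg.norm(noise) / np.sqrt(N))
```

The intuition: the squared norm of the sum is N plus cross terms, and each cross term (a dot product of two random unit vectors in d dimensions) has mean 0 and standard deviation ~1/sqrt(d), so for d >> N the cross terms barely move the total.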
This seems very interesting, but I think your post could do with a lot more detail. How were the correlations computed? How strongly do they support PRH? How was the OOD data generated? I’m sure the answers could be pieced together from the notebook, but most people won’t click through and read the code.