Jason Gross

Karma: 258

[Replication] Crosscoder-based Stage-Wise Model Diffing

annas, Thomas Read, Oliver Clive-Griffin, dmanningcoe, Chun Hei Yip, rajashree and Jason Gross

Mar 22, 2025, 6:35 PM

21 points

0 comments7 min readLW link

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]

Jason Gross and rajashree

Jan 6, 2025, 4:22 AM

19 points

0 comments12 min readLW link

Jason Gross Nov 10, 2024, 4:04 PM
5 points
0
on: Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback

The model learns to act harmfully for vulnerable users while harmlessly for the evals.

If you run the evals in the context of gameable users, do they show harmfulness? (Are the evals cheap enough to run that the marginal cost of running them every N modifications to memory for each user separately is feasible?)

Jason Gross Oct 25, 2024, 3:23 AM
LW: 3 AF: 1
0
AF
on: Are we dropping the ball on Recommendation AIs?
I believe the closest research to this topic is under the heading “Performative Power” (cf, e.g., this arXiv paper). I think “The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power” by Shoshana Zuboff is also a pretty good book that seems related.

Jason Gross Aug 23, 2024, 1:50 AM
3 points
0
in reply to: leogao’s comment on: A simple model of math skill
The reason you can’t sample uniformly from the integers is more like “because they are not compact” or “because they are not bounded” than “because they are infinite and countable”. You also can’t sample uniformly at random from the reals. (If you could, then composing with floor would give you a uniformly random sample from the integers.)

If you want to build a uniform probability distribution over a countable set of numbers, aim for all the rationals in [0, 1].

Jason Gross Jul 22, 2024, 7:11 AM
1 point
0
in reply to: Lucius Bushnaq’s comment on: Lucius Bushnaq’s Shortform

I don’t want a description of every single plate and cable in a Toyota Corolla, I’m not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.

What I want right now is a basic understanding of combustion engines.

This is the wrong ‘length’. The right version of brute-force length is not “every weight and bias in the network” but “the program trace of running the network on every datapoint in pretrain”. Compressing the explanation (not just the source code) is the thing connected to understanding. This is what we found from getting formal proofs of model behavior in Compact Proofs of Model Performance via Mechanistic Interpretability.

Does the 17th-century scholar have the requisite background to understand the transcript of how bringing the metal plates in the spark plug close enough together results in the formation of a spark? And how gasoline will ignite and expand? I think given these two building blocks, a complete description of the frame-by-frame motion of the Toyota Corolla would eventually convince the 17th-century scholar that such motion is possible, and what remains would just be fitting the explanation into their head all at once. We already have the corresponding building blocks for neural nets: floating point operations.

Jason Gross Jul 22, 2024, 6:53 AM
LW: 1 AF: 1
0
AF
on: A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
[Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
- In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?
This is very interesting! What prior does log(1+|a|) correspond to? And what about using $\prod_{i} (1 + | a_{i} |)$ instead of $\sum_{i} log (1 + | a_{i} |)$ ? Does this only hold if we expect feature activations to be independent (rather than, say, mutually exclusive)?

Jason Gross Jul 22, 2024, 6:33 AM
LW: 1 AF: 1
0
AF
on: A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
[Nix] Toy model of feature splitting
- There are at least two explanations for feature splitting I find plausible:
  Activations exist in higher dimensional manifolds in feature space, feature splitting is a symptom of one higher dimensional mostly-continuous feature being chunked into discrete features at different resolutions.
  There is a finite number of highly-related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance, but is better summarized as a collection of “split” features.
These do not sound like different explanations to me. In particular, the distinction between “mostly-continuous but approximated as discrete” and “discrete but very similar” seems ill-formed. All features are in fact discrete (because floating point numbers are discrete) and approximately continuous (because we posit that replacing floats with reals won’t change the behavior of the network meaningfully).
As far as toy models go, I’m pretty confident that the max-of-K setup from Compact Proofs of Model Performance via Mechanistic Interpretability will be a decent toy model. If you train SAEs post-unembed (probably also pre-unembed) with width d_vocab, you should find one feature for each sequence maximum (roughly). If you train with SAE width ${d_vocab}^{3} n_ctx$ , I expect each feature to split into roughly ${d_vocab}^{2} n_ctx$ features corresponding to the choice of query token, largest non-max token, and the number of copies of the maximum token. (How the SAE training data is distributed will change what exact features (principal directions of variation) are important to learn.). I’m quite interested in chatting with anyone working on / interested in this, and I expect my MATS scholar will get to testing this within the next month or two.
Edit: I expect this toy model will also permit exploring:
[Lee] Is there structure in feature splitting?
- Suppose we have a trained SAE with N features. If we apply e.g. NMF or SAEs to these directions are there directions that explain the structure of the splitting? As in, suppose we have a feature for math and a feature for physics. And suppose these split into (among other things)
  ‘topology in a math context’
  ‘topology in a physics context’
  ‘high dimensions in a math context’
  ‘high dimensions in a physics context’
- Is the topology-ifying direction the same for both features? Is the high-dimensionifying direction the same for both features? And if so, why did/didn’t the original SAEs find these directions?
I predict that whether or not the SAE finds the splitting directions depends on details about how much non-sparsity is penalized and how wide the SAE is. Given enough capacity, the SAE benefits (sparsity-wise) from replacing the (topology, math, physics) features with (topology-in-math, topology-in-physics), because split features activate more sparsely. Conversely, if the sparsity penalty is strong enough and there is not enough capacity to split, the loss recovered from having a topology feature at all (on top of the math/physics feature) may not outweigh the cost in sparsity.

Jason Gross Jul 21, 2024, 1:35 AM
2 points
0
in reply to: Joseph Miller’s comment on: Transformer Circuit Faithfulness Metrics Are Not Robust

Resample ablation is not more expensive than mean (they both are just replacing activations with different values). But to answer the question, I think you would—resample ablation biases the model toward some particular corrupt output.

Ah, I guess I was incorrectly imagining a more expensive version of resample ablation where you looked at not just a single corrupted cache, but looking at the result across all corrupted inputs. That is, in the simple toy model where you’re computing $f (x, y)$ where $x$ is the values for the circuit you care about and $y$ is the cache of corrupted activations, mean ablation is computing $f (x, E_{y \sim D} y)$ , and we could imagine versions of resample ablation that are computing $f (x, y)$ for some $y$ drawn from $D$ , or we could compute $E_{y \sim D} f (x, y)$ . I would say that both mean ablation and resample ablation as I’m imagining you’re describing it are both attempts to cheaply approximate $E_{y \sim D} f (x, y)$ .

Jason Gross Jul 18, 2024, 12:41 AM
3 points
2
on: Transformer Circuit Faithfulness Metrics Are Not Robust
But in other aspects there often isn’t a clearly correct methodology. For example, it’s unclear whether mean ablations are better than resample ablations for a particular experiment—even though this choice can dramatically change the outcome.

Would you ever really want mean ablation except as a cheaper approximation to resample ablation?

It seems to me that if you ask the question clearly enough, there’s a correct kind of ablation. For example, if the question is “how do we reproduce this behavior from scratch”, you want zero ablation.

Your table can be reorganized into the kinds of answers you’re seeking, namely:
- direct effect vs indirect effect corresponds to whether you ablate the complement of the circuit (direct effect) vs restoring the circuit itself (indirect effect, mediated by the rest of the model)
- necessity vs sufficiency corresponds to whether you ablate the circuit (direct effect necessary) / restore the complement of the circuit (indirect effect necessary) vs restoring the circuit (indirect effect sufficient) / ablating the complement of the circuit (direct effect sufficient)
- typical case vs worst case, and over what data distribution:
  - “all tokens vs specific tokens” should be absorbed into the more general category of “what’s the reference dataset distribution under consideration” / “what’s the null hypothesis over”,
  - zero ablation answers “reproduce behavior from scratch”
  - mean ablation is an approximation to resample ablation which itself is an approximation to computing the expected/typical behavior over some distribution
  - pessimal ablation is for dealing with worst-case behaviors
- granularity and component are about the scope of the solution language, and can be generalized a bit
Edit: This seems related to Hypothesis Testing the Circuit Hypothesis in LLMs

Jason Gross Jul 18, 2024, 12:05 AM
1 point
0
on: Transformer Circuit Faithfulness Metrics Are Not Robust
Do you want your IOI circuit to include the mechanism that decides it needs to output a name? Then use zero ablations. Or do you want to find the circuit that, given the context of outputting a name, completes the IOI task? Then use mean ablations. The ablation determines the task.
Mean ablation over webtext rather than the IOI task set should work just as well as zero ablation, right? “Mean ablation” is underspecified in the absence of a dataset distribution.

Jason Gross Jul 14, 2024, 8:51 PM
1 point
0
on: An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
it’s substantially worth if we restrict
Typo: should be “substantially worse”

Jason Gross Jul 11, 2024, 12:14 AM
LW: 4 AF: 2
1
AF
on: An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
Progress Measures for Grokking via Mechanistic Interpretability (Neel Nanda et al) - nothing important in mech interp has properly built on this IMO, but there’s just a ton of gorgeous results in there. I think it’s the most (only?) truly rigorous reverse-engineering work out there
Totally agree that this has gorgeous results, and this is what got me into mech interp in the first place! Re “most (only?) truly rigorous reverse-engineering work out there”: I think the clock and pizza paper seems comparably rigorous, and there’s also my recent Compact Proofs of Model Performance via Mechanistic Interpretability (and Gabe’s heuristic analysis of the same Max-of-K model), and the work one of my MARS scholars did showing that some pizza models use a ReLU to compute numerical integration, which is the first nontrivial mechanistic explanation of a nonlinearity found in a trained model (nontrivial in the sense that it asymptotically compresses the brute-force input-output behavior with a (provably) non-vacuous bound).
What links here?
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 by Neel Nanda (Jul 7, 2024, 5:39 PM; 135 points)

Jason Gross Jun 26, 2024, 5:14 PM
4 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Formal verification, heuristic explanations and surprise accounting
Possibilities I see:
1. Maybe the cost can be amortized over the whole circuit? Use one bit per circuit to say “this is just and/or” vs “use all gates”.
2. This is an illustrative simplified example, in a more general scheme, you need to specify a coding scheme, which is equivalent to specifying a prior over possible things you might see.

Jason Gross Jun 25, 2024, 11:25 PM
LW: 12 AF: 6
3
AF
in reply to: RogerDearnaley’s comment on: Compact Proofs of Model Performance via Mechanistic Interpretability
I believe what you describe is effectively Casual Scrubbing. Edit: Note that it is not exactly the same as causal scrubbing, which picks looks at the activations for another input sampled at random.

On our particular model, doing this replacement shows us that the noise bound in our particular model is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)

On other toy models we’ve looked at (modular addition in particular, writeup forthcoming), we have (very) preliminary evidence suggesting that randomizing the noise has a steep drop-off in bound-tightness (as a function of how compact a proof the noise term comes from) in a very similar fashion to what we see with proofs. There seems to be a pretty narrow band of hypotheses for which the noise is structureless but we can’t prove it. This is supported by a handful of comments about how causal scrubbing indicates that many existing mech interp hypotheses in fact don’t capture enough of the behavior.

Jason Gross Jun 25, 2024, 7:33 PM
21 points
1
on: SAE feature geometry is outside the superposition hypothesis

I think it would be valuable to take a set of interesting examples of understood internal structure, and to ask what happens when we train SAEs to try to capture this structure. [...] In other cases, it may seem to us very unnatural to think of the structure we have uncovered in terms of a set of directions (sparse or otherwise) — what does the SAE do in this case?

I’m not sure how SAEs would capture the internal structure of the activations of the pizza model for modular addition, even in theory. In this case, ReLU is used to compute numerical integration, approximating $\int_{- π}^{π} ∣ ∣ cos (\frac{k}{2} + ϕ) ∣ ∣ cos (2 ϕ) d ϕ = \frac{4}{3} cos k$ (and/or similarly for sin). Each neuron is responsible for one small rectangle under the curve. Its input is the part of the integrand under the absolute value/ReLU, $cos (\frac{k}{2} + ϕ)$ (times a shared scaling coefficient), and the neuron’s coefficient in the fourier-transformed decoder matrix is the area element $cos (2 ϕ) d ϕ$ (again times a shared scaling coefficient).

Notably, in this scheme, the only fully free parameters are: the frequencies of interest, the ordering of neurons, and the two scaling coefficients. There are also constrained parameters for how evenly the space is divided up into boxes and where the function evaluation points are within each box. But the geometry of activation space here is effectively fully constrained up to permutation of the axes and global scaling factors.

What could SAEs even find in this case?

Compact Proofs of Model Performance via Mechanistic Interpretability

LawrenceC, rajashree, Adrià Garriga-alonso and Jason Gross

Jun 24, 2024, 7:27 PM

96 points

4 comments8 min readLW link

(arxiv.org)

Jason Gross Apr 27, 2024, 2:06 AM
LW: 2 AF: 1
0
AF
on: Sparsify: A mechanistic interpretability research agenda

We propose a simple fix: Use $L_{0 < p < 1}$ instead of $L_{1}$ , which seems to be a Pareto improvement over $L_{1}$ (at least in some real models, though results might be mixed) in terms of the number of features required to achieve a given reconstruction error.

When I was discussing better sparsity penalties with Lawrence, and the fact that I observed some instability in $L_{0 < p < 1}$ in toy models of super-position, he pointed out that the gradient of $L_{0 < p < 1}$ norm explodes near zero, meaning that features with “small errors” that cause them to have very small but non-zero overlap with some activations might be killed off entirely rather than merely having the overlap penalized.

See here for some brief write-up and animations.

Jason Gross Apr 9, 2024, 6:09 PM
1 point
0
AF
in reply to: Lee Sharkey’s comment on: Sparsify: A mechanistic interpretability research agenda

“explanation of (network, dataset)”: I’m afraid I don’t have a great formalish definition beyond just pointing at the intuitive notion.

What’s wrong with “proof” as a formal definition of explanation (of behavior of a network on a dataset)? I claim that description length works pretty well on “formal proof”, I’m in the process of producing a write-up on results exploring this.

Jason Gross Apr 5, 2024, 5:22 PM
2 points
0
on: Sparsify: A mechanistic interpretability research agenda
Choosing better sparsity penalties than L1 (Upcoming post - Ben Wright & Lee Sharkey): [...] We propose a simple fix: Use $L_{0 < p < 1}$ instead of $L_{1}$ , which seems to be a Pareto improvement over $L_{1}$
Is there any particular justification for using $L_{p}$ rather than, e.g., tanh (cf Anthropic’s Feb update), log1psum (acts.log1p().sum()), or prod1p (acts.log1p().sum().exp())? The agenda I’m pursuing (write-up in progress) gives theoretical justification for a sparsity penalty that explodes combinatorially in the number of active features, in any case where the downstream computation performed over the feature does not distribute linearly over features. The product-based sparsity penalty seems to perform a bit better than both $L_{0.5}$ and tanh on a toy example (sample size 1), see this colab.

Jason Gross

[Repli­ca­tion] Cross­coder-based Stage-Wise Model Diffing

Mea­sur­ing Non­lin­ear Fea­ture In­ter­ac­tions in Sparse Cross­coders [Pro­ject Pro­posal]

Com­pact Proofs of Model Perfor­mance via Mechanis­tic Interpretability

[Replication] Crosscoder-based Stage-Wise Model Diffing

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]

Compact Proofs of Model Performance via Mechanistic Interpretability