Kshitij Sachan

Karma: 343

Redwood Research

Kshitij Sachan Mar 16, 2023, 5:13 PM
LW: 5 AF: 4
2
AF
on: Towards understanding-based safety evaluations
> Causal Scrubbing: My main problem with causal scrubbing as a solution here is that it only guarantees the sufficiency, but not the necessity, of your explanation. As a result, my understanding is that a causal-scrubbing-based evaluation would admit a trivial explanation that simply asserts that the entire model is relevant for every behavior.

Redwood has been experimenting with learning causal scrubbing explanations via gradient descent, in a way that partly addresses your necessity point. Specifically:
1. “Larger” explanations are penalized more (size refers to the number of dimensions of the residual stream the explanation claims the model is using for a specific behavior).
2. Explanations must be adversarially robust: an adversary shouldn’t be able to include additional parts of the model we claimed are unimportant and have a sizable effect on the scrubbed model’s predictions.
This approach doesn’t address all the concerns one might have with using causal scrubbing to understand models, but I wanted to flag that this is something we’re thinking about as well.

Kshitij Sachan Dec 5, 2022, 8:51 PM
1 point
0
in reply to: Neel Nanda’s comment on: Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
We haven’t had to use a non-linear decomposition in our interp work so far at Redwood. Just wanted to point out that it’s possible. I’m not sure when you would want to use one, but I haven’t thought about it that much.

Kshitij Sachan Dec 4, 2022, 7:01 PM
1 point
0
on: The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
I enjoyed reading this a lot.
I would be interested in a quantitative experiment showing what percentage of the model’s performance is explained by this linear assumption. For example, identify all output weight directions that correspond to “fire”, project those out only for the direct path to the output (and not the path to later heads/MLPs), and see if it tanks accuracy on sentences where the next token is “fire”.
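A minimal sketch of what "project out only for the direct path" could look like, on a toy two-component residual model. Everything here is a hypothetical stand-in (the weights `W_A`, `W_B`, `W_U`, the random `fire_dirs`, and the `forward` function); it is not the setup from the post, only an illustration of ablating the direct path while leaving the path through a later component untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 50

# Hypothetical toy weights; none of these come from the post.
W_A = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)  # earlier component
W_B = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)  # later component
W_U = rng.normal(size=(vocab, d_model))                       # unembedding

def project_out(v, directions):
    """Remove from v its component in the span of `directions` (one per row)."""
    q, _ = np.linalg.qr(directions.T)  # orthonormal basis, shape (d_model, k)
    return v - q @ (q.T @ v)

def forward(x, fire_directions=None):
    a = W_A @ x                        # output of the earlier component
    b = np.tanh(W_B @ (x + a))         # later component always reads the unablated a
    a_direct = a if fire_directions is None else project_out(a, fire_directions)
    return W_U @ (x + a_direct + b)    # only the direct path to the logits is ablated

x = rng.normal(size=d_model)
fire_dirs = rng.normal(size=(2, d_model))  # stand-in for the "fire" directions
clean_logits = forward(x)
ablated_logits = forward(x, fire_dirs)
```

In a real experiment, the comparison would be between accuracy on "fire" sentences under `clean_logits` and under `ablated_logits`; here the inputs are random, so only the mechanics carry over.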
I’m confused about how to interpret this alongside Conjecture’s polytope framing. That work suggested that magnitude as well as direction in activation space is important. I know this analysis is looking at the weights, but the weights determine the activations, so it seems like the linearity assumption shouldn’t hold?

Kshitij Sachan Dec 4, 2022, 5:21 AM
LW: 2 AF: 2
0
AF
in reply to: Neel Nanda’s comment on: Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
Yes! The important part is decomposing activations (not necessarily linearly). I can rewrite my MLP as:
MLP(x) = f(x) + (MLP(x) - f(x))
and then claim that the MLP(x) - f(x) term is unimportant. The parentheses balancer writeup has an example of this.
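As a rough illustration of the rewrite above, here is a toy version in NumPy. The MLP weights and the choice of `f` (a non-linear function built from the first four hidden neurons) are assumptions made up for this sketch. Feeding the remainder term a different input `x_other` is one hedged way to mimic the claim that the remainder is unimportant, in the spirit of resampling; it is not Redwood's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
W_in = rng.normal(size=(d_hidden, d_model))   # hypothetical toy weights
W_out = rng.normal(size=(d_model, d_hidden))

def mlp(x):
    return W_out @ np.maximum(W_in @ x, 0.0)

def f(x):
    # Hypothetical "explained" part: a non-linear function of x that
    # uses only the first 4 hidden neurons.
    return W_out[:, :4] @ np.maximum(W_in[:4] @ x, 0.0)

def remainder(x):
    return mlp(x) - f(x)

def rewritten_mlp(x, x_other=None):
    # With x_other=None this is exactly mlp(x), so the rewrite is equivalent.
    # Passing another input computes the remainder on that input instead.
    source = x if x_other is None else x_other
    return f(x) + remainder(source)

x = rng.normal(size=d_model)
x_other = rng.normal(size=d_model)
exact = rewritten_mlp(x)
scrubbed = rewritten_mlp(x, x_other)
```

If the claim about the remainder held, `scrubbed` would give roughly the same downstream behavior as `exact`; with random weights as used here there is no reason to expect that.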

Kshitij Sachan Dec 3, 2022, 4:08 PM
LW: 4 AF: 2
1
AF
in reply to: Neel Nanda’s comment on: Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
Nice summary! One small nitpick:
> In the features framing, we don’t necessarily assume that features are aligned with circuit components (eg, they could be arbitrary directions in neuron space), while in the subgraph framing we focus on components and don’t need to show that the components correspond to features
This feels slightly misleading. In practice, we often do claim that sub-components correspond to features. We can “rewrite” our model into an equivalent form that better reflects the computation it’s performing. For example, if we claim that a certain direction in an MLP’s output is important, we could rewrite the single MLP node as the sum of the MLP output in the direction + the residual term. Then, we could make claims about the direction we pointed out and also claim that the residual term is unimportant.
The important point is that we are allowed to rewrite our model however we want as long as the rewrite is equivalent.
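A small sketch of the direction-plus-residual rewrite, assuming a toy ReLU MLP with random weights and a random unit `direction` (all hypothetical names, chosen only for illustration). It checks that the two new nodes sum back to the original MLP output, which is what makes the rewrite equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
W_in = rng.normal(size=(d_hidden, d_model))   # hypothetical toy weights
W_out = rng.normal(size=(d_model, d_hidden))

def mlp(x):
    return W_out @ np.maximum(W_in @ x, 0.0)

# Hypothetical direction in the MLP's output space that we claim matters.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def along_direction(x):
    # Component of the MLP output along `direction`.
    return (mlp(x) @ direction) * direction

def residual_term(x):
    # Everything else; the hypothesis would claim this part is unimportant.
    return mlp(x) - along_direction(x)

x = rng.normal(size=d_model)
recombined = along_direction(x) + residual_term(x)
```

Because `recombined` equals `mlp(x)` for every input, claims can then be made about each node separately without changing what the model computes.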

Kshitij Sachan Oct 10, 2022, 5:20 PM
1 point
0
AF
in reply to: Alex Flint’s comment on: Polysemanticity and Capacity in Neural Networks
Good question! As you suggest in your comment, increasing marginal returns to capacity induce monosemanticity, and decreasing marginal returns induce polysemanticity.
We observe this in our toy model. We didn’t clearly spell this out in the post, but the marginal benefit curves labelled from A to F correspond to points in the phase diagram. At the top of the phase diagram where features are dense, there is no polysemanticity because the marginal benefit curves are increasing (see curves A and B). In the feature sparse region (points D, E, F), we see polysemanticity because the marginal benefit curves are decreasing.
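To make the increasing versus decreasing returns intuition concrete, here is a toy calculation that is not the model from the post: two features share one unit of capacity, and each feature's benefit is a hypothetical power law in its capacity. With a convex benefit (increasing marginal returns) the best allocation gives everything to one feature; with a concave benefit (decreasing marginal returns) the best allocation splits capacity.

```python
import numpy as np

def best_split(benefit, n_grid=101):
    """Capacity c for feature 1 (feature 2 gets 1 - c) that maximizes total benefit."""
    c = np.linspace(0.0, 1.0, n_grid)
    total = benefit(c) + benefit(1.0 - c)
    return float(c[np.argmax(total)])

# Hypothetical benefit curves, chosen only for their shape.
def increasing_returns(c):
    return c ** 2      # convex: marginal benefit grows with capacity

def decreasing_returns(c):
    return c ** 0.5    # concave: marginal benefit shrinks with capacity

c_increasing = best_split(increasing_returns)   # lands on a corner of [0, 1]
c_decreasing = best_split(decreasing_returns)   # lands on the even split
```

The corner solution corresponds to one feature getting a dedicated dimension, and the even split corresponds to features sharing capacity, which is the qualitative pattern described above.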
The relationship between increasing/decreasing marginal returns and polysemanticity generalizes beyond our toy model. However, we don’t have a generic technique to define capacity across different architectures and loss functions. Without a general definition, it’s not immediately obvious how to regularize the loss for increasing returns to capacity.
You’re getting at a key question the research brings up: can we modify the loss function to make models more monosemantic? Empirically, increasing sparsity increases polysemanticity across all models we looked at (figure 7 from the arXiv paper)*. According to the capacity story, we only see polysemanticity when there are decreasing marginal returns to capacity. Therefore, we hypothesize that there is a fundamental connection between feature sparsity and decreasing marginal returns. That is, we suggest that if features are sparse and similar enough in importance, polysemanticity is optimal.
*Different models showed qualitatively different levels of polysemanticity as a function of sparsity. It seems possible that tweaking the architecture of an LLM could change the amount of polysemanticity, but we might take a performance hit for doing so.

Polysemanticity and Capacity in Neural Networks

Buck, Adam Jermyn and Kshitij Sachan

Oct 7, 2022, 5:51 PM

87 points

14 comments · 3 min read
