I think the relation between K-means and sparse dictionary learning (essentially, K-means is dictionary learning under the constraint that each code has exactly one nonzero entry equal to 1, i.e. an L_0 = 1 constraint) is already well-known in the sparse coding literature? For example, see this wiki article on K-SVD (a sparse dictionary learning algorithm), which first reviews this connection before getting into the nuances of K-SVD.
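(To spell the equivalence out in my own notation, not the post's: dictionary learning with this constraint solves

$$\min_{D,\,\{x_i\}} \sum_i \lVert y_i - D x_i \rVert_2^2 \quad \text{s.t.} \quad \lVert x_i \rVert_0 = 1,\ x_i \in \{0,1\}^K,$$

and once each code x_i is forced to be a one-hot indicator, D x_i just selects a single column of D, so this is exactly the K-means objective with the K columns of D as the centroids.)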
Were the SAEs for this comparison trained with multiple passes through the data, or just one pass/epoch? If K-means got multiple passes through the data but the SAEs only got one, the comparison seems unfair.
Nice post!
I agree that an important goal of MELBO is to elicit “complex many-token behaviors” (this is a current priority of mine).
You may want to check out my recent results on eliciting password-locked capabilities in DeepSeek-Math-7B. Using my new training algorithm for finding MELBO vectors, it's possible to find a vector that increases MATH performance from 3% to 23% on the password-locked model.
The new algorithm is much more efficient than the sequential training procedure from the original post, to the point that I'm currently bottlenecked by inference (even when using vLLM). It also comes with a decent heuristic procedure for choosing R. In my experience, given a dataset of n ≥ 10 prompts, it's pretty good at finding a bunch of vectors corresponding to the most "salient" concept in the dataset (e.g. "jailbreak vectors" when trained on harmful instructions, or "capability-increasing vectors" when trained on password-locked models).
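For anyone who hasn't read the original post, here's a rough sketch of the kind of objective involved (a toy illustration, not the new algorithm; the model, layer indices, value of R, and the projected-gradient step for enforcing the norm constraint are all placeholder choices):

```python
# Toy sketch of a MELBO-style steering-vector objective (not the new algorithm).
# Model, layer indices, R, and the projection trick are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # small stand-in; swap in the model of interest
SOURCE_LAYER = 4         # layer whose output the steering vector is added to
TARGET_LAYER = 10        # layer whose activations we want to push around
R = 8.0                  # norm constraint on the steering vector (the "R" above)
STEPS, LR = 200, 1e-2

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

prompts = ["Prompt 1 ...", "Prompt 2 ..."]   # in practice, n >= 10 prompts
batch = tok(prompts, return_tensors="pt", padding=True)

# Steering vector, initialized randomly on the sphere of radius R.
theta = torch.randn(model.config.hidden_size)
theta = (R * theta / theta.norm()).requires_grad_(True)

def steer_hook(module, inputs, output):
    # Add theta to the source layer's output hidden states (broadcast over batch/positions).
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + theta
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

def target_acts(use_steering: bool) -> torch.Tensor:
    # Module path below is GPT-2-specific; adjust for other architectures.
    handle = model.transformer.h[SOURCE_LAYER].register_forward_hook(steer_hook) if use_steering else None
    out = model(**batch, output_hidden_states=True)
    if handle is not None:
        handle.remove()
    return out.hidden_states[TARGET_LAYER]

with torch.no_grad():
    baseline = target_acts(use_steering=False)   # unsteered target-layer activations

opt = torch.optim.Adam([theta], lr=LR)
for step in range(STEPS):
    opt.zero_grad()
    steered = target_acts(use_steering=True)
    # Maximize how far the steered target-layer activations move from baseline.
    loss = -((steered - baseline) ** 2).sum(dim=-1).mean()
    loss.backward()
    opt.step()
    with torch.no_grad():
        theta.mul_(R / theta.norm())   # project back onto the sphere ||theta|| = R

# The sequential procedure from the original post would repeat this, penalizing
# similarity to previously found vectors, to collect a diverse set of vectors.
```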