Arthur Conmy

Karma: 1,650

Interpretability

Views my own

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah and Neel Nanda

Mar 26, 2025, 7:07 PM

108 points

15 comments · 29 min read · LW link

(deepmindsafetyresearch.medium.com)

Arthur Conmy Mar 5, 2025, 3:11 PM
5 points
2
on: Self-fulfilling misalignment data might be poisoning our AI models
Upweighting positive data
Data augmentation
...
It may also be worth up-weighting https://darioamodei.com/machines-of-loving-grace along with the AI optimism blog post in the training data. In general it is a bit sad that there isn’t more good writing that I know of on this topic.

The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research

Arthur Conmy and Neel Nanda

Feb 24, 2025, 2:17 AM

48 points

1 comment · 7 min read · LW link

Arthur Conmy Jan 12, 2025, 6:49 PM
7 points
0
in reply to: Nina Panickssery’s comment on: Activation space interpretability may be doomed
> the best vector for probing is not the best vector for steering

AKA the predict/control discrepancy, from Section 3.3.1 of Wattenberg and Viegas, 2024
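
A toy sketch of one way such a gap can arise, assuming two Gaussian classes with a shared covariance (an illustrative setup, not taken from the cited section). Under that assumption, the shift that moves one class onto the other is the difference of means, while the optimal linear probe direction is the inverse covariance applied to that difference. When features are correlated the two directions differ. All names and numbers below are made up for illustration.

```python
import math

def solve_2x2(a, b):
    """Solve a @ x = b for a 2x2 matrix a using Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    return [(a22 * b[0] - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]

def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical toy numbers: two correlated features, classes differ in the first.
cov = [[1.0, 0.9], [0.9, 1.0]]
mu0 = [0.0, 0.0]
mu1 = [1.0, 0.0]

# "Steering" direction: adding this maps the class-0 Gaussian onto class 1.
steer = [m1 - m0 for m0, m1 in zip(mu0, mu1)]

# "Probing" direction: normal of the optimal linear classifier, cov^{-1} (mu1 - mu0).
probe = solve_2x2(cov, steer)

similarity = cosine(probe, steer)
print(round(similarity, 3))  # 0.743
```

In this toy case the probe leans on the second feature to cancel correlated noise, even though the classes do not differ along it, so the two directions are far from parallel.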

Arthur Conmy Dec 18, 2024, 2:25 AM
4 points
0
in reply to: Buck’s comment on: Sam Marks’s Shortform
I suggested something similar, and this was the discussion (bolding is the important author pushback):

Arthur Conmy

11:33 1 Dec
Why can’t the YC company drop system prompts and instead:
1) Detect whether regex has been used in the last ~100 tokens (and run this check every ~100 tokens of model output)
2) If yes, rewind back ~100 tokens, insert a comment like # Don’t use regex here (in a valid way given what code has been written so far), and continue the generation
Dhruv Pai
10:50 2 Dec
This seems like a reasonable baseline with the caveat that it requires expensive resampling and inserting such a comment in a useful way is difficult.
When we ran baselines simply repeating the number of times we told the model not to use regex right before generation in the system prompt, we didn’t see the instruction following improve (very circumstantial evidence). I don’t see a principled reason why this would be much worse than the above, however, since we do one-shot generation with such a comment right before the actual generation.
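
The detect-and-rewind baseline proposed above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the setup either commenter ran: `generate_chunk`, `REGEX_USE`, `toy_model`, and the limits are all hypothetical names and values. The hint is appended naively rather than placed "in a valid way given what code has been written so far", which is the hard part noted in the reply.

```python
import re
from typing import Callable, List

# Hypothetical detector: flags `import re` or calls such as `re.sub(`.
REGEX_USE = re.compile(r"\bimport re\b|\bre\.\w+\(")
HINT = "# Don't use regex here\n"

def generate_with_rewind(
    generate_chunk: Callable[[List[str], int], List[str]],
    prompt: List[str],
    chunk_size: int = 100,
    max_chunks: int = 8,
    max_rewinds: int = 4,
) -> List[str]:
    """Generate in chunks; if a chunk uses regex, discard it, add a hint, resample."""
    tokens = list(prompt)
    rewinds = 0
    for _ in range(max_chunks):
        chunk = generate_chunk(tokens, chunk_size)
        if not chunk:
            break
        if REGEX_USE.search("".join(chunk)) and rewinds < max_rewinds:
            # Rewind: the offending chunk is never appended, so generation
            # resumes from the earlier position with the hint in context.
            rewinds += 1
            tokens.append(HINT)
            continue
        # Accepted (or the rewind budget is exhausted and we give up steering).
        tokens.extend(chunk)
    return tokens

def toy_model(context: List[str], n: int) -> List[str]:
    """Stand-in for a real model: avoids regex only once the hint is in context."""
    if len(context) > 6:
        return []
    if HINT in context:
        return ["x", " = ", "s.split(',')", "\n"]
    return ["import re", "\n"]

output = generate_with_rewind(toy_model, ["def f(s):\n"])
print("".join(output))
```

Note that checking each chunk in isolation would miss a regex call split across a chunk boundary, and every rewind costs an extra chunk of sampling, which is the resampling expense raised in the reply.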

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda

Dec 11, 2024, 6:30 AM

82 points

6 comments · 2 min read · LW link

(www.neuronpedia.org)

Evolutionary prompt optimization for SAE feature visualization

neverix, Daniel Tan, Dmitrii Kharlapenko, Neel Nanda and Arthur Conmy

Nov 14, 2024, 1:06 PM

21 points

0 comments · 9 min read · LW link

SAEs are highly dataset dependent: a case study on the refusal direction

Connor Kissane, robertzk, Neel Nanda and Arthur Conmy

Nov 7, 2024, 5:22 AM

66 points

4 comments · 14 min read · LW link

Arthur Conmy Nov 2, 2024, 11:32 AM
6 points
0
in reply to: Oscar’s comment on: IAPS: Mapping Technical Safety Research at AI Companies
- Here are the other GDM mech interp papers missed:
- We have some blog posts of comparable standard to the Anthropic circuit updates listed:
  - https://www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg/full-post-progress-update-1-from-the-gdm-mech-interp-team
  - https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
- You use a very wide scope for the “enhancing human feedback” (basically any post-training paper mentioning ‘align’-ing anything). So I will use a wide scope for what counts as mech interp and also include:
  - https://arxiv.org/abs/2401.06102
  - https://arxiv.org/abs/2304.14767
  - There are a few other papers from the PAIR group as well as Mor Geva and also Been Kim, but mostly with Google Research affiliations so it seems fine to not include these as IIRC you weren’t counting pre-GDM merger Google Research/Brain work

Arthur Conmy Oct 29, 2024, 1:48 PM
3 points
0
on: Bridging the VLM and mech interp communities for multimodal interpretability
> The [Sparse Feature Circuits] approach can be seen as analogous to LoRA (Hu et al., 2021), in that you are constraining your model’s behavior
FWIW I consider SFC and LoRA pretty different: LoRA is practical, but it can be reversed very easily and has poor worst-case performance, whereas Sparse Feature Circuits is very expensive and either requires far more nodes in bigger models (forthcoming, I think) or requires only studying a subset of layers, but if it worked it would likely have far better worst-case performance.
This makes LoRA a good baseline for some SFC-style tasks, but the research experience using both is pretty different.

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Oct 27, 2024, 6:46 PM

47 points

4 comments · 5 min read · LW link

Arthur Conmy Oct 26, 2024, 8:54 PM
7 points
0
on: IAPS: Mapping Technical Safety Research at AI Companies
I assume all the data is fairly noisy: scanning https://raw.githubusercontent.com/Oscar-Delaney/safe_AI_papers/refs/heads/main/Automated%20categorization/final_output.csv for the domain I know, I found that it misses ~half of the GDM Mech Interp output from the specified window and also mislabels https://arxiv.org/abs/2208.08345 and https://arxiv.org/abs/2407.13692 as Mech Interp (though two labels are applied to these papers and I didn’t dig to see which was used).

SAE features for refusal and sycophancy steering vectors

neverix, Dmitrii Kharlapenko, Arthur Conmy and Neel Nanda

Oct 12, 2024, 2:54 PM

29 points

4 comments · 7 min read · LW link

Arthur Conmy Oct 10, 2024, 4:49 PM
LW: 3 AF: 2
0
AF
in reply to: Mark Xu’s comment on: Mark Xu’s Shortform
> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”

Does this mean something like:

1. People who join scaling labs can have their values drift, and future safety employers will suspect by default that ex-scaling lab staff have had their values drift, or

2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon

or something else entirely?

Arthur Conmy Oct 6, 2024, 8:57 PM
6 points
3
on: The Geometry of Feelings and Nonsense in Large Language Models
This is a great write-up, thanks! Has there been any follow-up from the paper’s authors?

This seems a pretty compelling takedown to me, and one which is not addressed by the existing paper. My understanding of the two WordNet experiments not discussed in the post is: Figure 4 concerns whether, under whitening, a concept can be linearly separated (yes), and so the random baseline used there does not address the concerns in this post; Figure 5 shows that the whitening transformation preserves some of the WordNet cluster cosine sim, but moreover on the right basically everything is orthogonal, as found in this post.
This seems important to me since the point of mech interp is to not be another interpretability field dominated by pretty pictures (e.g. saliency maps) that fail basic sanity checks (e.g. this paper for saliency maps). (Workshops aren’t too important, but I’m still surprised about this)

Arthur Conmy Oct 1, 2024, 4:38 PM
LW: 2 AF: 1
0
AF
in reply to: cdt’s comment on: Base LLMs refuse too
My current best guess for why base models refuse so much is that “Sorry, I can’t help with that. I don’t know how to” is actually extremely common on the internet, based on discussion with Achyuta Rajaram on Twitter: https://x.com/ArthurConmy/status/1840514842098106527
This fits with our observations about how frequently LLaMA-1 performs incompetent refusal.

Arthur Conmy Sep 30, 2024, 4:01 PM
LW: 3 AF: 3
0
AF
in reply to: LawrenceC’s comment on: Base LLMs refuse too
> Qwen2 was explicitly trained on synthetic data from Qwen1.5

~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~
EDITED TO ADD: “these [Qwen] models are utilized to synthesize high-quality pre-training data” is clear evidence, I am being silly.

All other techniques mentioned here (e.g. filtering and adding more IT data at the end of training) still sound like models “trained to predict the next word on the internet” (I don’t think the training samples being IID early and late in training is an important detail).

Base LLMs refuse too

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Sep 29, 2024, 4:04 PM

60 points

20 comments · 10 min read · LW link

Arthur Conmy Sep 26, 2024, 10:54 PM
4 points
0
in reply to: peterbarnett’s comment on: peterbarnett’s Shortform
The Improbability Principle sounds close. The summary seems to suggest the law of large numbers is one part of the pop science book, but admittedly some of the other parts (“probability lever”) seem less relevant.

Arthur Conmy Aug 24, 2024, 10:11 AM
LW: 6 AF: 4
1
AF
in reply to: 4gate’s comment on: AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
> Is DM exploring this sort of stuff?
Yes. On the AGI safety and alignment team we are working on activation steering (e.g. Alex Turner, who invented the technique with collaborators, is working on this), and the first author of a few tokens deep is currently interning on the Gemini Safety team mentioned in this post. We don’t have hard and fast lines between what counts as Gemini Safety and what counts as AGI safety and alignment, but several projects on AGI safety and alignment, and most projects on Gemini Safety, would see “safety practices we can test right now” as a research goal.