Interpretability
Views my own
> The [Sparse Feature Circuits] approach can be seen as analogous to LoRA (Hu et al., 2021), in that you are constraining your model’s behavior
FWIW I consider SFC and LoRA pretty different. In practice LoRA is cheap and practical, but it can be reversed very easily and has poor worst-case performance. Sparse Feature Circuits, by contrast, is very expensive, and requires far more nodes in bigger models (forthcoming, I think) or restricting the study to a subset of layers, but if it worked it would likely have far better worst-case performance.
This makes LoRA a good baseline for some SFC-style tasks, but the research experience using both is pretty different.
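The "constraining your model's behavior" and "reversed very easily" points can be made concrete with a minimal numpy sketch of the LoRA parameterization (toy dimensions; the W + BA form and the zero-init of B follow Hu et al., 2021, but everything else here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 16, 2  # hidden size and LoRA rank (toy values)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def forward(x):
    # The effective weight is W + B @ A, a rank-r perturbation of W,
    # so the fine-tune is constrained to a low-rank subspace.
    return x @ (W + B @ A).T

# "Reversed very easily": subtracting B @ A exactly recovers the base
# model. With B still zero-initialized, forward matches the base model.
x = rng.normal(size=(d,))
assert np.allclose(forward(x), x @ W.T)
```

The rank-r constraint is what makes the analogy to SFC (restricting behavior to a small set of components) tempting, but nothing stops a later fine-tune from undoing the update, hence the poor worst-case story.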
I assume all the data is fairly noisy: scanning for the domain I know in https://raw.githubusercontent.com/Oscar-Delaney/safe_AI_papers/refs/heads/main/Automated%20categorization/final_output.csv, it misses ~half of the GDM Mech Interp output from the specified window, and it also mislabels https://arxiv.org/abs/2208.08345 and https://arxiv.org/abs/2407.13692 as Mech Interp (though two labels are applied to these papers and I didn’t dig to see which was used).
> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”
Does this mean something like:
1. People who join scaling labs can have their values drift, and future safety employers will suspect by-default that ex-scaling lab staff have had their values drift, or
2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon
or something else entirely?
This is a great write-up, thanks! Has there been any follow-up from the paper’s authors?
This seems like a pretty compelling takedown to me, and it is not addressed by the existing paper. (My understanding of the two WordNet experiments not discussed in the post: Figure 4 concerns whether, under whitening, a concept can be linearly separated (yes), so the random baseline used there does not address the concerns in this post; Figure 5 shows that the whitening transformation preserves some of the WordNet clusters’ cosine similarity, but moreover, on the right, basically everything is orthogonal, as found in this post.)
This seems important to me, since the point of mech interp is not to be another interpretability field dominated by pretty pictures (e.g. saliency maps) that fail basic sanity checks (e.g. this paper for saliency maps). (Workshops aren’t too important, but I’m still surprised about this.)
My current best guess for why base models refuse so much is that “Sorry, I can’t help with that. I don’t know how to” is actually extremely common on the internet, based on discussion with Achyuta Rajaram on twitter: https://x.com/ArthurConmy/status/1840514842098106527
This fits with our observations about how frequently LLaMA-1 performs incompetent refusal
> Qwen2 was explicitly trained on synthetic data from Qwen1.5
~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~
EDITED TO ADD: “these [Qwen] models are utilized to synthesize high-quality pre-training data” is clear evidence, I am being silly.
All other techniques mentioned here (e.g. filtering, and adding more IT data at the end of training) still sound like models “trained to predict the next word on the internet” (I don’t think the training samples being IID early and late in training is an important detail).
The Improbability Principle sounds close. The summary seems to suggest the law of large numbers is one part of the pop-science book, though admittedly some of the other parts (“the probability lever”) seem less relevant.
Is DM exploring this sort of stuff?
Yes. On the AGI safety and alignment team we are working on activation steering: e.g. Alex Turner, who invented the technique with collaborators, is working on this, and the first author of the few-tokens-deep paper is currently interning on the Gemini Safety team mentioned in this post. We don’t have hard-and-fast lines between what counts as Gemini Safety and what counts as AGI safety and alignment, but several projects on AGI safety and alignment, and most projects on Gemini Safety, would see “safety practices we can test right now” as a research goal.
I would say a better reference for the limitations of ROME is this paper: https://aclanthology.org/2023.findings-acl.733
Short explanation: per Neel’s short summary, editing in the Rome fact also causes slightly related prompts (e.g. “The Louvre is cool. Obama was born in”) to be completed with “ Rome” too.
I agree that twitter is a worse use of time.
Going to posters for works you already know, to talk to the authors, seems like a great idea and I do it. Re-reading your OP, you suggest things like checking whether papers are real or fake in poster sessions. Maybe you just meant papers that you already knew about? It sounded as if you were suggesting doing this for random papers, which I’m more skeptical about.
My opinion is that going to poster sessions, orals, pre-researching papers etc. at ICML/ICLR/NeurIPS is pretty valuable for new researchers and I wish I had done this before having any papers (you don’t need to have any papers to go to a conference). See also Thomas Kwa’s comment about random intuitions learnt from going to a conference.
After this, I agree with Leo: I think it would be a waste of my time to go to posters/orals or pre-research papers. Maybe there’s some value in this for conceptual research, but for most empirical work I’m very skeptical (most papers are not good, and it takes time to figure out whether a paper is good or not, etc.).
If there are some very common features in particular layers (e.g. an ‘attend to BOS’ feature), then restricting one expert to be active at a time will potentially force SAEs to learn common features in every expert.
+1 to similar concerns. I would probably have left one expert always on, which should remove some redundant features.
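The always-on-expert suggestion can be sketched in a few lines of numpy. This is a toy illustration, not the actual Switch SAE implementation: the dimensions, the ReLU encoder, and the top-1 router are all assumptions made for the sketch. Expert 0 is shared and always active, so a common feature (e.g. an ‘attend to BOS’ feature) need only be learned once rather than inside every routed expert:

```python
import numpy as np

rng = np.random.default_rng(0)

d, f, n_experts = 8, 32, 4  # model dim, features per expert, routed experts (toy)

# Index 0 is the shared, always-on expert; 1..n_experts are routed.
W_enc = rng.normal(size=(n_experts + 1, d, f)) * 0.1
W_dec = rng.normal(size=(n_experts + 1, f, d)) * 0.1
router = rng.normal(size=(d, n_experts))

def encode(x):
    shared = np.maximum(x @ W_enc[0], 0)   # always-on expert: active for every input
    e = 1 + int(np.argmax(x @ router))     # top-1 routed expert
    routed = np.maximum(x @ W_enc[e], 0)
    return shared, routed, e

def decode(shared, routed, e):
    # Reconstruction sums the shared expert's features and the routed expert's.
    return shared @ W_dec[0] + routed @ W_dec[e]

x = rng.normal(size=(d,))
s, r_, e = encode(x)
x_hat = decode(s, r_, e)
```

Without the shared expert, each routed expert would have to spend capacity re-learning the same common features, which is the redundancy concern above.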
Here are the other GDM mech interp papers that were missed:
https://arxiv.org/abs/2307.15771
https://arxiv.org/abs/2404.16014
https://arxiv.org/abs/2407.14435
We also have some blog posts of a comparable standard to the Anthropic circuit updates you listed:
https://www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg/full-post-progress-update-1-from-the-gdm-mech-interp-team
https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
You use a very wide scope for the “enhancing human feedback” (basically any post-training paper mentioning ‘align’-ing anything). So I will use a wide scope for what counts as mech interp and also include:
https://arxiv.org/abs/2401.06102
https://arxiv.org/abs/2304.14767
There are a few other papers from the PAIR group, as well as from Mor Geva and Been Kim, but mostly with Google Research affiliations, so it seems fine not to include these, as IIRC you weren’t counting pre-GDM-merger Google Research/Brain work.