CallumMcDougall

Karma: 2,000

New Cause Area Proposal

CallumMcDougallApr 1, 2025, 7:12 AM

107 points

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah and Neel Nanda

Mar 26, 2025, 7:07 PM

108 points

15 comments29 min readLW link

(deepmindsafetyresearch.medium.com)

CallumMcDougall Feb 22, 2025, 3:59 PM
3 points
0
in reply to: yihe’s comment on: Induction heads—illustrated
Sorry I didn’t get to this message earlier, glad you liked the post though! The answer is that attention heads can have multiple different functions—the simplest way is to store things entirely orthogonally so they lie in fully independent subspsaces, but even this isn’t necessary because it seems like transformers take advantage of superposition to represent multiple concepts at once, more so than they have dimensions.

ARENA 5.0 - Call for Applicants

JamesH, James Fox, CallumMcDougall, Chloe Li and David Quarel

Jan 30, 2025, 1:18 PM

35 points

2 comments6 min readLW link

Scaling Sparse Feature Circuit Finding to Gemma 9B

Diego Caples, Jatin Nainani, CallumMcDougall and rrenaud

Jan 10, 2025, 11:08 AM

86 points

11 comments17 min readLW link

CallumMcDougall Jan 1, 2025, 6:34 PM
5 points
0
in reply to: Fabien Roger’s comment on: How to replicate and extend our alignment faking demo
Oh, interesting, wasn’t aware of this bug. I guess this is probably fine since most people replicating it will be pulling it rather than copying and pasting it into their IDE. Also this comment thread is now here for anyone who might also get confused. Thanks for clarifying!

CallumMcDougall Dec 30, 2024, 2:40 PM
3 points
0
on: How to replicate and extend our alignment faking demo
+1, thanks for sharing! I think there’s a formatting error in the notebook, where the tags like <OUTPUT> were all removed and replaced with empty strings (e.g. see attached photo). We’ve recently made the ARENA evals material public, and we’ve got a working replication there which I think has the tags in the right place (section 2 of 3 on the page linked here)

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda

Dec 11, 2024, 6:30 AM

82 points

6 comments2 min readLW link

(www.neuronpedia.org)

CallumMcDougall Dec 6, 2024, 10:28 AM
4 points
1
on: [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
Amazing post! Forgot to do this for a while, but here’s a linked diagram explaining how I think about feature absorption, hopefully ppl find it helpful!

CallumMcDougall Oct 12, 2024, 9:05 PM
3 points
0
in reply to: chanind’s comment on: Toy Models of Feature Absorption in SAEs
I don’t know of specific examples, but this is the image I have in my head when thinking about why untied weights are more free than tied weights:
I think more generally this is why I think studying SAEs in the TMS setup can be a bit challenging, because there’s often too much symmetry and not enough complexity for untied weights to be useful, meaning just forcing your weights to be tied can fix a lot of problems! (We include it in ARENA mostly for illustration of key concepts, not because it gets you many super informative results). But I’m keen for more work like this trying to understand feature absorption better in more tractible cases

AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0

James Fox, Chloe Li, JamesH, Gracie Green and CallumMcDougall

Jul 6, 2024, 11:34 AM

57 points

7 comments6 min readLW link

CallumMcDougall Jul 4, 2024, 9:06 AM
2 points
0
in reply to: eggsyntax’s comment on: How ARENA course material gets made
Oh yeah this is great, thanks! For people reading this, I’ll highlight SLT + developmental interp + mamba as areas which I think are large enough to have specific exercise sections but currently don’t

How ARENA course material gets made

CallumMcDougallJul 2, 2024, 6:04 PM

41 points

2 comments7 min readLW link

CallumMcDougall Apr 5, 2024, 1:26 PM
2 points
0
in reply to: Johnny Lin’s comment on: SAE-VIS: Announcement Post
Thanks!! Really appreciate it

CallumMcDougall Apr 1, 2024, 11:09 AM
2 points
0
in reply to: Connor Kissane’s comment on: SAE-VIS: Announcement Post
Thanks so much! (-:

A Selection of Randomly Selected SAE Features

CallumMcDougall and Joseph Bloom

Apr 1, 2024, 9:09 AM

109 points

2 comments4 min readLW link

CallumMcDougall Mar 31, 2024, 3:44 PM
3 points
0
in reply to: Neel Nanda’s comment on: SAE-VIS: Announcement Post
Thanks so much, really glad to hear it’s been helpful!

SAE-VIS: Announcement Post

CallumMcDougall and Joseph Bloom

Mar 31, 2024, 3:30 PM

74 points

8 comments1 min readLW link

CallumMcDougall Jan 15, 2024, 8:17 AM
2 points
0
in reply to: Zack_M_Davis’s comment on: Six (and a half) intuitions for KL divergence
Thanks, really appreciate this (and the advice for later posts!)

Mech Interp Challenge: January—Deciphering the Caesar Cipher Model

CallumMcDougallJan 1, 2024, 6:03 PM

17 points

0 comments3 min readLW link