Arthur Conmy
Karma:
1,650
Interpretability
Views my own
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
lewis smith
,
Senthooran Rajamanoharan
,
Arthur Conmy
,
CallumMcDougall
,
Tom Lieberum
,
János Kramár
,
Rohin Shah
and
Neel Nanda
Mar 26, 2025, 7:07 PM
109
points
15
comments
29
min read
LW
link
(deepmindsafetyresearch.medium.com)
The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research
Arthur Conmy
and
Neel Nanda
Feb 24, 2025, 2:17 AM
48
points
1
comment
7
min read
LW
link
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Can
,
Adam Karvonen
,
Johnny Lin
,
Curt Tigges
,
Joseph Bloom
,
chanind
,
Yeu-Tong Lau
,
Eoin Farrell
,
Arthur Conmy
,
CallumMcDougall
,
Kola Ayonrinde
,
Matthew Wearden
,
Sam Marks
and
Neel Nanda
Dec 11, 2024, 6:30 AM
82
points
6
comments
2
min read
LW
link
(www.neuronpedia.org)
Evolutionary prompt optimization for SAE feature visualization
neverix
,
Daniel Tan
,
Dmitrii Kharlapenko
,
Neel Nanda
and
Arthur Conmy
Nov 14, 2024, 1:06 PM
21
points
0
comments
9
min read
LW
link
SAEs are highly dataset dependent: a case study on the refusal direction
Connor Kissane
,
robertzk
,
Neel Nanda
and
Arthur Conmy
Nov 7, 2024, 5:22 AM
66
points
4
comments
14
min read
LW
link
Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane
,
robertzk
,
Arthur Conmy
and
Neel Nanda
Oct 27, 2024, 6:46 PM
47
points
4
comments
5
min read
LW
link
SAE features for refusal and sycophancy steering vectors
neverix
,
Dmitrii Kharlapenko
,
Arthur Conmy
and
Neel Nanda
Oct 12, 2024, 2:54 PM
29
points
4
comments
7
min read
LW
link
Base LLMs refuse too
Connor Kissane
,
robertzk
,
Arthur Conmy
and
Neel Nanda
Sep 29, 2024, 4:04 PM
60
points
20
comments
10
min read
LW
link
Extracting SAE task features for in-context learning
Dmitrii Kharlapenko
,
neverix
,
Neel Nanda
and
Arthur Conmy
Aug 12, 2024, 8:34 PM
31
points
1
comment
9
min read
LW
link
Self-explaining SAE features
Dmitrii Kharlapenko
,
neverix
,
Neel Nanda
and
Arthur Conmy
Aug 5, 2024, 10:20 PM
60
points
13
comments
10
min read
LW
link
JumpReLU SAEs + Early Access to Gemma 2 SAEs
Senthooran Rajamanoharan
,
Tom Lieberum
,
nps29
,
Arthur Conmy
,
Vikrant Varma
,
János Kramár
and
Neel Nanda
Jul 19, 2024, 4:10 PM
48
points
10
comments
1
min read
LW
link
(storage.googleapis.com)
SAEs (usually) Transfer Between Base and Chat Models
Connor Kissane
,
robertzk
,
Arthur Conmy
and
Neel Nanda
Jul 18, 2024, 10:29 AM
66
points
0
comments
10
min read
LW
link
Attention Output SAEs Improve Circuit Analysis
Connor Kissane
,
robertzk
,
Arthur Conmy
and
Neel Nanda
Jun 21, 2024, 12:56 PM
33
points
3
comments
19
min read
LW
link
Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan
,
Arthur Conmy
,
lewis smith
,
Tom Lieberum
,
Vikrant Varma
,
János Kramár
,
Rohin Shah
and
Neel Nanda
Apr 25, 2024, 6:43 PM
63
points
38
comments
1
min read
LW
link
(arxiv.org)
[Full Post] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda
,
Arthur Conmy
,
lewis smith
,
Senthooran Rajamanoharan
,
Tom Lieberum
,
János Kramár
and
Vikrant Varma
Apr 19, 2024, 7:06 PM
79
points
10
comments
8
min read
LW
link
[Summary] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda
,
Arthur Conmy
,
lewis smith
,
Senthooran Rajamanoharan
,
Tom Lieberum
,
János Kramár
and
Vikrant Varma
Apr 19, 2024, 7:06 PM
72
points
0
comments
3
min read
LW
link
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To
robertzk
,
Connor Kissane
,
Arthur Conmy
and
Neel Nanda
Mar 6, 2024, 5:03 AM
63
points
0
comments
12
min read
LW
link
Attention SAEs Scale to GPT-2 Small
Connor Kissane
,
robertzk
,
Arthur Conmy
and
Neel Nanda
Feb 3, 2024, 6:50 AM
78
points
4
comments
8
min read
LW
link
Sparse Autoencoders Work on Attention Layer Outputs
Connor Kissane
,
robertzk
,
Arthur Conmy
and
Neel Nanda
Jan 16, 2024, 12:26 AM
83
points
9
comments
18
min read
LW
link
My best guess at the important tricks for training 1L SAEs
Arthur Conmy
Dec 21, 2023, 1:59 AM
37
points
4
comments
3
min read
LW
link