
Sparse Autoencoders (SAEs)

Last edit: 6 Apr 2024 9:14 UTC by Joseph Bloom

Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sparse sum of interpretable components (often referred to as features). Sparse Autoencoders may be useful for interpretability and related alignment agendas.
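
For a concrete picture of the technique, here is a minimal sketch of the standard SAE setup in PyTorch (all names, sizes, and hyperparameters are illustrative, not taken from any particular post below): a linear encoder with a ReLU produces sparse, non-negative feature activations, a linear decoder reconstructs the original activation as a sum of feature directions, and training minimizes reconstruction error plus an L1 sparsity penalty.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch; d_hidden is typically several times d_model."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction as a sum of decoder directions
        return x_hat, f

# Toy training step on random stand-in "activations" (illustrative only).
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # trades off sparsity against reconstruction fidelity

x = torch.randn(64, 512)  # in practice: activations cached from a language model
x_hat, f = sae(x)
loss = (x_hat - x).pow(2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
opt.zero_grad()
loss.backward()
opt.step()
```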

For more information on SAEs see:

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds, 5 Oct 2023 21:01 UTC
288 points
22 comments · 2 min read · LW link · 1 review
(transformer-circuits.pub)

[Interim research report] Taking features out of superposition with sparse autoencoders

13 Dec 2022 15:41 UTC
149 points
23 comments · 22 min read · LW link · 2 reviews

Interpretability with Sparse Autoencoders (Colab exercises)

CallumMcDougall, 29 Nov 2023 12:56 UTC
74 points
9 comments · 4 min read · LW link

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom, 2 Feb 2024 6:54 UTC
102 points
37 comments · 15 min read · LW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

21 Sep 2023 15:30 UTC
159 points
8 comments · 5 min read · LW link

Attention SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
77 points
4 comments · 8 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
72 points
0 comments · 3 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
83 points
9 comments · 18 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks, 18 Apr 2024 16:17 UTC
107 points
10 comments · 12 min read · LW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
61 points
0 comments · 12 min read · LW link

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
39 points
12 comments · 12 min read · LW link

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

25 Sep 2024 9:31 UTC
71 points
16 comments · 3 min read · LW link
(arxiv.org)

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey, 3 Apr 2024 12:34 UTC
94 points
22 comments · 22 min read · LW link

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
66 points
0 comments · 14 min read · LW link

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

29 Sep 2024 19:37 UTC
26 points
8 comments · 25 min read · LW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide, 22 Jul 2024 18:45 UTC
118 points
19 comments · 12 min read · LW link

Comments on Anthropic’s Scaling Monosemanticity

Robert_AIZI, 3 Jun 2024 12:15 UTC
97 points
8 comments · 7 min read · LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
77 points
10 comments · 8 min read · LW link

My best guess at the important tricks for training 1L SAEs

Arthur Conmy, 21 Dec 2023 1:59 UTC
37 points
4 comments · 3 min read · LW link

Showing SAE Latents Are Not Atomic Using Meta-SAEs

24 Aug 2024 0:56 UTC
61 points
9 comments · 20 min read · LW link

SAE reconstruction errors are (empirically) pathological

wesg, 29 Mar 2024 16:37 UTC
105 points
16 comments · 8 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
92 points
7 comments · 7 min read · LW link

Scaling and evaluating sparse autoencoders

leogao, 6 Jun 2024 22:50 UTC
106 points
6 comments · 1 min read · LW link

A Selection of Randomly Selected SAE Features

1 Apr 2024 9:09 UTC
109 points
2 comments · 4 min read · LW link

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda, 23 Oct 2023 22:38 UTC
93 points
12 comments · 9 min read · LW link

SAE-VIS: Announcement Post

31 Mar 2024 15:30 UTC
74 points
8 comments · 1 min read · LW link

Addressing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
86 points
4 comments · 10 min read · LW link

SAEs (usually) Transfer Between Base and Chat Models

18 Jul 2024 10:29 UTC
66 points
0 comments · 10 min read · LW link

An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation

7 Oct 2024 8:53 UTC
38 points
0 comments · 5 min read · LW link
(arxiv.org)

Self-explaining SAE features

5 Aug 2024 22:20 UTC
60 points
13 comments · 10 min read · LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
63 points
38 comments · 1 min read · LW link
(arxiv.org)

JumpReLU SAEs + Early Access to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
48 points
10 comments · 1 min read · LW link
(storage.googleapis.com)

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart, 23 Apr 2024 14:09 UTC
44 points
4 comments · 9 min read · LW link

[Linkpost] Play with SAEs on Llama 3

25 Sep 2024 22:35 UTC
40 points
2 comments · 1 min read · LW link

A gentle introduction to sparse autoencoders

Nick Jiang, 2 Sep 2024 18:11 UTC
9 points
0 comments · 6 min read · LW link

Interpreting Preference Models w/ Sparse Autoencoders

1 Jul 2024 21:35 UTC
74 points
12 comments · 9 min read · LW link

SAEs Discover Meaningful Features in the IOI Task

5 Jun 2024 23:48 UTC
15 points
2 comments · 10 min read · LW link

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry, 29 Apr 2024 20:57 UTC
92 points
8 comments · 11 min read · LW link

Exploring SAE features in LLMs with definition trees and token lists

mwatkins, 4 Oct 2024 22:15 UTC
37 points
5 comments · 6 min read · LW link

HDBSCAN is Surprisingly Effective at Finding Interpretable Clusters of the SAE Decoder Matrix

11 Oct 2024 23:06 UTC
8 points
2 comments · 10 min read · LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

19 Jul 2024 20:32 UTC
59 points
6 comments · 16 min read · LW link

How to Better Report Sparse Autoencoder Performance

J Bostock, 2 Jun 2024 19:34 UTC
20 points
4 comments · 3 min read · LW link

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill, 24 Mar 2024 20:05 UTC
28 points
4 comments · 24 min read · LW link

An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs

Adam Karvonen, 25 Jun 2024 15:57 UTC
25 points
0 comments · 9 min read · LW link
(adamkarvonen.github.io)

Interpreting and Steering Features in Images

Gytis Daujotas, 20 Jun 2024 18:33 UTC
65 points
6 comments · 5 min read · LW link

Causal Graphs of GPT-2-Small’s Residual Stream

David Udell, 9 Jul 2024 22:06 UTC
53 points
7 comments · 7 min read · LW link

On the Practical Applications of Interpretability

Nick Jiang, 15 Oct 2024 17:18 UTC
3 points
0 comments · 7 min read · LW link

[Replication] Conjecture’s Sparse Coding in Small Transformers

16 Jun 2023 18:02 UTC
52 points
0 comments · 5 min read · LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda, 7 Jul 2024 17:39 UTC
134 points
15 comments · 25 min read · LW link

SAEs you can See: Applying Sparse Autoencoders to Clustering

Robert_AIZI, 28 Oct 2024 14:48 UTC
27 points
0 comments · 10 min read · LW link

Attention Output SAEs Improve Circuit Analysis

21 Jun 2024 12:56 UTC
33 points
1 comment · 19 min read · LW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

23 Aug 2024 18:52 UTC
40 points
5 comments · 16 min read · LW link

Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders

Gytis Daujotas, 1 Aug 2024 21:08 UTC
44 points
6 comments · 7 min read · LW link

The ‘strong’ feature hypothesis could be wrong

lewis smith, 2 Aug 2024 14:33 UTC
221 points
17 comments · 17 min read · LW link

Tokenized SAEs: Infusing per-token biases.

4 Aug 2024 9:17 UTC
19 points
20 comments · 15 min read · LW link

Excursions into Sparse Autoencoders: What is monosemanticity?

Jakub Smékal, 5 Aug 2024 19:22 UTC
2 points
0 comments · 10 min read · LW link

Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning

Tom Angsten, 30 Jul 2024 16:36 UTC
6 points
0 comments · 9 min read · LW link

Evaluating Sparse Autoencoders with Board Game Models

2 Aug 2024 19:50 UTC
38 points
1 comment · 9 min read · LW link

Extracting SAE task features for in-context learning

12 Aug 2024 20:34 UTC
31 points
1 comment · 9 min read · LW link

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

25 Sep 2024 20:37 UTC
27 points
0 comments · 3 min read · LW link
(arxiv.org)

Toy Models of Superposition: Simplified by Hand

Axel Sorensen, 29 Sep 2024 21:19 UTC
9 points
3 comments · 8 min read · LW link

LLMs are likely not conscious

research_prime_space, 29 Sep 2024 20:57 UTC
6 points
9 comments · 1 min read · LW link

Toy Models of Feature Absorption in SAEs

7 Oct 2024 9:56 UTC
49 points
8 comments · 10 min read · LW link

Interpretability of SAE Features Representing Check in ChessGPT

Jonathan Kutasov, 5 Oct 2024 20:43 UTC
27 points
2 comments · 8 min read · LW link

Domain-specific SAEs

jacob_drori, 7 Oct 2024 20:15 UTC
27 points
0 comments · 5 min read · LW link

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution

Kola Ayonrinde, 30 Oct 2024 22:50 UTC
27 points
0 comments · 12 min read · LW link

SAE features for refusal and sycophancy steering vectors

12 Oct 2024 14:54 UTC
26 points
4 comments · 7 min read · LW link

It’s important to know when to stop: Mechanistic Exploration of Gemma 2 List Generation

Gerard Boxo, 14 Oct 2024 17:04 UTC
8 points
0 comments · 6 min read · LW link
(gboxo.github.io)

A suite of Vision Sparse Autoencoders

27 Oct 2024 4:05 UTC
25 points
0 comments · 1 min read · LW link

SAE Probing: What is it good for? Absolutely something!

1 Nov 2024 19:23 UTC
31 points
0 comments · 11 min read · LW link

Evolutionary prompt optimization for SAE feature visualization

14 Nov 2024 13:06 UTC
16 points
0 comments · 9 min read · LW link

SAEs are highly dataset dependent: a case study on the refusal direction

7 Nov 2024 5:22 UTC
63 points
4 comments · 14 min read · LW link

Analyzing how SAE features evolve across a forward pass

7 Nov 2024 22:07 UTC
47 points
0 comments · 1 min read · LW link
(arxiv.org)

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

17 Aug 2024 1:16 UTC
53 points
0 comments · 5 min read · LW link

Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders

PaulPauls, 24 Nov 2024 5:45 UTC
20 points
3 comments · 1 min read · LW link
(github.com)

[Question] SAE sparse feature graph using only residual layers

Jaehyuk Lim, 23 May 2024 13:32 UTC
0 points
3 comments · 1 min read · LW link

Quick Thoughts on Scaling Monosemanticity

Joel Burget, 23 May 2024 16:22 UTC
28 points
1 comment · 4 min read · LW link
(transformer-circuits.pub)

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen, 5 Dec 2024 19:24 UTC
4 points
0 comments · 10 min read · LW link

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache

Louka Ewington-Pitsos, 24 Aug 2024 7:39 UTC
17 points
0 comments · 5 min read · LW link

Do sparse autoencoders find “true features”?

Demian Till, 22 Feb 2024 18:06 UTC
73 points
33 comments · 11 min read · LW link

Sparse Autoencoders: Future Work

21 Sep 2023 15:30 UTC
35 points
5 comments · 6 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
71 points
1 comment · 2 min read · LW link
(www.neuronpedia.org)

Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features

15 Mar 2024 16:30 UTC
26 points
5 comments · 4 min read · LW link

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

27 Feb 2024 2:43 UTC
42 points
16 comments · 15 min read · LW link

Normalizing Sparse Autoencoders

Fengyuan Hu, 8 Apr 2024 6:17 UTC
21 points
18 comments · 13 min read · LW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

14 Jan 2024 2:06 UTC
23 points
0 comments · 42 min read · LW link

Some additional SAE thoughts

Hoagy, 13 Jan 2024 19:31 UTC
30 points
4 comments · 13 min read · LW link

[Replication] Conjecture’s Sparse Coding in Toy Models

2 Jun 2023 17:34 UTC
24 points
0 comments · 1 min read · LW link

Some open-source dictionaries and dictionary learning infrastructure

Sam Marks, 5 Dec 2023 6:05 UTC
45 points
7 comments · 5 min read · LW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy, 17 Jul 2023 1:41 UTC
56 points
1 comment · 7 min read · LW link

(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders

Logan Riggs, 5 Jul 2023 16:49 UTC
60 points
1 comment · 7 min read · LW link

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI, 5 Mar 2024 13:55 UTC
61 points
24 comments · 10 min read · LW link
(aizi.substack.com)

Sparse autoencoders find composed features in small toy models

14 Mar 2024 18:00 UTC
33 points
12 comments · 15 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah, 17 Nov 2023 13:54 UTC
15 points
6 comments · 2 min read · LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné, 23 Sep 2023 16:21 UTC
30 points
8 comments · 5 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

3 Oct 2023 7:45 UTC
17 points
0 comments · 5 min read · LW link

Explaining “Taking features out of superposition with sparse autoencoders”

Robert_AIZI, 16 Jun 2023 13:59 UTC
10 points
0 comments · 8 min read · LW link
(aizi.substack.com)

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI, 7 Oct 2023 23:30 UTC
137 points
8 comments · 4 min read · LW link

A small update to the Sparse Coding interim research report

30 Apr 2023 19:54 UTC
61 points
5 comments · 1 min read · LW link

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
69 points
5 comments · 10 min read · LW link

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell, 23 Sep 2023 19:16 UTC
42 points
7 comments · 34 min read · LW link

Transformer Debugger

Henk Tillman, 12 Mar 2024 19:08 UTC
25 points
0 comments · 1 min read · LW link
(github.com)

Past Tense Features

Can, 20 Apr 2024 14:34 UTC
12 points
0 comments · 4 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
71 points
14 comments · 17 min read · LW link

Massive Activations and why <bos> is important in Tokenized SAE Unigrams

Louka Ewington-Pitsos, 5 Sep 2024 2:19 UTC
1 point
0 comments · 3 min read · LW link

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

6 Sep 2024 2:28 UTC
28 points
0 comments · 12 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

17 May 2024 16:25 UTC
57 points
10 comments · 4 min read · LW link
(arxiv.org)

Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.

Andrew Quaisley, 14 Jun 2024 0:57 UTC
17 points
5 comments · 12 min read · LW link

[Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)

Fernando Avalos, 9 Sep 2024 3:33 UTC
6 points
1 comment · 1 min read · LW link
(forum.effectivealtruism.org)

Sparse Features Through Time

Rogan Inglis, 24 Jun 2024 18:06 UTC
12 points
1 comment · 1 min read · LW link
(roganinglis.io)

Activation Pattern SVD: A proposal for SAE Interpretability

Daniel Tan, 28 Jun 2024 22:12 UTC
15 points
2 comments · 2 min read · LW link

Matryoshka Sparse Autoencoders

Noa Nabeshima, 14 Dec 2024 2:52 UTC
74 points
7 comments · 11 min read · LW link

[Interim research report] Activation plateaus & sensitive directions in GPT2

5 Jul 2024 17:05 UTC
65 points
2 comments · 5 min read · LW link

Faithful vs Interpretable Sparse Autoencoder Evals

Louka Ewington-Pitsos, 12 Jul 2024 5:37 UTC
2 points
0 comments · 12 min read · LW link

Deceptive agents can collude to hide dangerous features in SAEs

15 Jul 2024 17:07 UTC
33 points
2 comments · 7 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
117 points
18 comments · 18 min read · LW link

BatchTopK: A Simple Improvement for TopK-SAEs

20 Jul 2024 2:20 UTC
52 points
0 comments · 4 min read · LW link

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

Matthew A. Clarke, 20 Dec 2024 15:16 UTC
1 point
0 comments · 37 min read · LW link

Initial Experiments Using SAEs to Help Detect AI Generated Text

Aaron_Scher, 22 Jul 2024 5:16 UTC
17 points
0 comments · 14 min read · LW link

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Louka Ewington-Pitsos, 17 Sep 2024 3:52 UTC
6 points
2 comments · 7 min read · LW link

Learning Multi-Level Features with Matryoshka SAEs

19 Dec 2024 15:59 UTC
25 points
1 comment · 11 min read · LW link

Understanding Positional Features in Layer 0 SAEs

29 Jul 2024 9:36 UTC
43 points
0 comments · 5 min read · LW link

Open Source Automated Interpretability for Sparse Autoencoder Features

30 Jul 2024 21:11 UTC
67 points
1 comment · 13 min read · LW link
(blog.eleuther.ai)