Sparse Autoencoders (SAEs)

Last edit: Apr 6, 2024, 9:14 AM by Joseph Bloom

Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sum of interpretable components (often referred to as features). Sparse Autoencoders may be useful for interpretability and related alignment agendas.
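Concretely, an SAE is trained to reconstruct cached model activations under a sparsity penalty, so that each activation is approximated as a sparse combination of learned dictionary directions. Below is a minimal sketch of a vanilla L1-penalized SAE in PyTorch; the sizes (a GPT-2-small-like d_model of 768, an 8x dictionary expansion) and the L1 coefficient are illustrative assumptions, not any particular post's recipe.

```python
# Minimal sketch of a vanilla L1-penalized sparse autoencoder.
# Assumed (illustrative) sizes: d_model=768 activations, 8x expansion.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode: ReLU keeps latents non-negative; the L1 term below makes them sparse.
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the activation as a sparse sum of decoder rows ("features").
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f

sae = SparseAutoencoder(d_model=768, d_sae=8 * 768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # reconstruction/sparsity trade-off (assumed value)

x = torch.randn(4096, 768)  # stand-in for a batch of cached model activations
x_hat, f = sae(x)
loss = ((x_hat - x) ** 2).mean() + l1_coeff * f.abs().sum(-1).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Each row of W_dec is then a candidate feature direction, and interpretability is typically judged by inspecting the inputs on which a given latent fires.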

For more information on SAEs, see:

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds · Oct 5, 2023, 9:01 PM
288 points
22 comments · 2 min read · LW link · 1 review
(transformer-circuits.pub)

[Interim research report] Taking features out of superposition with sparse autoencoders

Dec 13, 2022, 3:41 PM
150 points
23 comments · 22 min read · LW link · 2 reviews

Interpretability with Sparse Autoencoders (Colab exercises)

CallumMcDougall · Nov 29, 2023, 12:56 PM
74 points
9 comments · 4 min read · LW link

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom · Feb 2, 2024, 6:54 AM
103 points
37 comments · 15 min read · LW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Sep 21, 2023, 3:30 PM
159 points
8 comments · 5 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

Apr 19, 2024, 7:06 PM
72 points
0 comments · 3 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

Jan 16, 2024, 12:26 AM
83 points
9 comments · 18 min read · LW link

Attention SAEs Scale to GPT-2 Small

Feb 3, 2024, 6:50 AM
78 points
4 comments · 8 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks · Apr 18, 2024, 4:17 PM
110 points
10 comments · 12 min read · LW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

Mar 6, 2024, 5:03 AM
63 points
0 comments · 12 min read · LW link

Stitching SAEs of different sizes

Jul 13, 2024, 5:19 PM
39 points
12 comments · 12 min read · LW link

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Sep 25, 2024, 9:31 AM
73 points
16 comments · 3 min read · LW link
(arxiv.org)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Sep 29, 2024, 7:37 PM
26 points
8 comments · 25 min read · LW link

Understanding SAE Features with the Logit Lens

Mar 11, 2024, 12:16 AM
68 points
0 comments · 14 min read · LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey · Apr 3, 2024, 12:34 PM
96 points
23 comments · 22 min read · LW link

Comments on Anthropic’s Scaling Monosemanticity

Robert_AIZI · Jun 3, 2024, 12:15 PM
97 points
8 comments · 7 min read · LW link

My best guess at the important tricks for training 1L SAEs

Arthur Conmy · Dec 21, 2023, 1:59 AM
37 points
4 comments · 3 min read · LW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide · Jul 22, 2024, 6:45 PM
118 points
19 comments · 12 min read · LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Apr 19, 2024, 7:06 PM
79 points
10 comments · 8 min read · LW link

SAE regularization produces more interpretable models

Jan 28, 2025, 8:02 PM
21 points
7 comments · 4 min read · LW link

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Aug 24, 2024, 12:56 AM
68 points
10 comments · 20 min read · LW link

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda · Oct 23, 2023, 10:38 PM
93 points
12 comments · 9 min read · LW link

Scaling and evaluating sparse autoencoders

leogao · Jun 6, 2024, 10:50 PM
106 points
6 comments · 1 min read · LW link

A Selection of Randomly Selected SAE Features

Apr 1, 2024, 9:09 AM
109 points
2 comments · 4 min read · LW link

Addressing Feature Suppression in SAEs

Feb 16, 2024, 6:32 PM
86 points
4 comments · 10 min read · LW link

An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation

Oct 7, 2024, 8:53 AM
38 points
1 comment · 5 min read · LW link
(arxiv.org)

SAEs (usually) Transfer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
66 points
0 comments · 10 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

Mar 25, 2024, 9:17 PM
93 points
7 comments · 7 min read · LW link

SAE-VIS: Announcement Post

Mar 31, 2024, 3:30 PM
74 points
8 comments · 1 min read · LW link

Cross-Layer Feature Alignment and Steering in Large Language Model

dlaptev · Feb 8, 2025, 8:18 PM
5 points
0 comments · 6 min read · LW link

SAE reconstruction errors are (empirically) pathological

wesg · Mar 29, 2024, 4:37 PM
106 points
16 comments · 8 min read · LW link

Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts

Ana Kapros · Feb 12, 2025, 7:12 PM
7 points
0 comments · 5 min read · LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

Jul 19, 2024, 8:32 PM
59 points
6 comments · 16 min read · LW link

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Jul 19, 2024, 4:10 PM
48 points
10 comments · 1 min read · LW link
(storage.googleapis.com)

The ‘strong’ feature hypothesis could be wrong

lewis smith · Aug 2, 2024, 2:33 PM
223 points
19 comments · 17 min read · LW link

Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders

Gytis Daujotas · Aug 1, 2024, 9:08 PM
45 points
7 comments · 7 min read · LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

Apr 25, 2024, 6:43 PM
63 points
38 comments · 1 min read · LW link
(arxiv.org)

Tokenized SAEs: Infusing per-token biases.

Aug 4, 2024, 9:17 AM
20 points
20 comments · 15 min read · LW link

Excursions into Sparse Autoencoders: What is monosemanticity?

Jakub Smékal · Aug 5, 2024, 7:22 PM
2 points
0 comments · 10 min read · LW link

Self-explaining SAE features

Aug 5, 2024, 10:20 PM
60 points
13 comments · 10 min read · LW link

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart · Apr 23, 2024, 2:09 PM
44 points
4 comments · 9 min read · LW link

A gentle introduction to sparse autoencoders

Nick Jiang · Sep 2, 2024, 6:11 PM
9 points
0 comments · 6 min read · LW link

[Linkpost] Play with SAEs on Llama 3

Sep 25, 2024, 10:35 PM
40 points
2 comments · 1 min read · LW link

SAEs Discover Meaningful Features in the IOI Task

Jun 5, 2024, 11:48 PM
15 points
2 comments · 10 min read · LW link

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry · Apr 29, 2024, 8:57 PM
92 points
8 comments · 11 min read · LW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Aug 23, 2024, 6:52 PM
42 points
8 comments · 16 min read · LW link

How to Better Report Sparse Autoencoder Performance

J Bostock · Jun 2, 2024, 7:34 PM
20 points
4 comments · 3 min read · LW link

Exploring SAE features in LLMs with definition trees and token lists

mwatkins · Oct 4, 2024, 10:15 PM
37 points
5 comments · 6 min read · LW link

Attention Output SAEs Improve Circuit Analysis

Jun 21, 2024, 12:56 PM
33 points
3 comments · 19 min read · LW link

On the Practical Applications of Interpretability

Nick Jiang · Oct 15, 2024, 5:18 PM
3 points
1 comment · 7 min read · LW link

An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs

Adam Karvonen · Jun 25, 2024, 3:57 PM
27 points
0 comments · 9 min read · LW link
(adamkarvonen.github.io)

SAEs you can See: Applying Sparse Autoencoders to Clustering

Robert_AIZI · Oct 28, 2024, 2:48 PM
27 points
0 comments · 10 min read · LW link

Interpreting and Steering Features in Images

Gytis Daujotas · Jun 20, 2024, 6:33 PM
66 points
6 comments · 5 min read · LW link

HDBSCAN is Surprisingly Effective at Finding Interpretable Clusters of the SAE Decoder Matrix

Oct 11, 2024, 11:06 PM
8 points
2 comments · 10 min read · LW link

Causal Graphs of GPT-2-Small’s Residual Stream

David Udell · Jul 9, 2024, 10:06 PM
53 points
7 comments · 7 min read · LW link

Interpreting Preference Models w/ Sparse Autoencoders

Jul 1, 2024, 9:35 PM
74 points
12 comments · 9 min read · LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda · Jul 7, 2024, 5:39 PM
135 points
16 comments · 25 min read · LW link

Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models

Dec 30, 2024, 10:50 PM
22 points
3 comments · 15 min read · LW link

[Replication] Conjecture’s Sparse Coding in Small Transformers

Jun 16, 2023, 6:02 PM
52 points
0 comments · 5 min read · LW link

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill · Mar 24, 2024, 8:05 PM
28 points
4 comments · 24 min read · LW link

Proof-of-Concept Debugger for a Small LLM

Mar 17, 2025, 10:27 PM
20 points
0 comments · 11 min read · LW link

Extracting SAE task features for in-context learning

Aug 12, 2024, 8:34 PM
31 points
1 comment · 9 min read · LW link

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

Sep 25, 2024, 8:37 PM
29 points
0 comments · 3 min read · LW link
(arxiv.org)

Toy Models of Superposition: Simplified by Hand

Axel Sorensen · Sep 29, 2024, 9:19 PM
9 points
3 comments · 8 min read · LW link

LLMs are likely not conscious

research_prime_space · Sep 29, 2024, 8:57 PM
6 points
9 comments · 1 min read · LW link

[Question] Are Sparse Autoencoders a good idea for AI control?

Gerard Boxo · Dec 26, 2024, 5:34 PM
3 points
4 comments · 1 min read · LW link

Toy Models of Feature Absorption in SAEs

Oct 7, 2024, 9:56 AM
49 points
8 comments · 10 min read · LW link

Interpretability of SAE Features Representing Check in ChessGPT

Jonathan Kutasov · Oct 5, 2024, 8:43 PM
27 points
2 comments · 8 min read · LW link

Domain-specific SAEs

jacob_drori · Oct 7, 2024, 8:15 PM
27 points
2 comments · 5 min read · LW link

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution

Kola Ayonrinde · Oct 30, 2024, 10:50 PM
27 points
0 comments · 12 min read · LW link

SAE features for refusal and sycophancy steering vectors

Oct 12, 2024, 2:54 PM
29 points
4 comments · 7 min read · LW link

It’s important to know when to stop: Mechanistic Exploration of Gemma 2 List Generation

Gerard Boxo · Oct 14, 2024, 5:04 PM
8 points
0 comments · 6 min read · LW link
(gboxo.github.io)

[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Lucy Farnik · Feb 26, 2025, 12:50 PM
79 points
8 comments · 7 min read · LW link

SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features

Seonglae Cho · Feb 26, 2025, 5:05 PM
3 points
3 comments · 17 min read · LW link

A suite of Vision Sparse Autoencoders

Oct 27, 2024, 4:05 AM
24 points
0 comments · 1 min read · LW link

SAE Probing: What is it good for?

Nov 1, 2024, 7:23 PM
32 points
0 comments · 11 min read · LW link

Evolutionary prompt optimization for SAE feature visualization

Nov 14, 2024, 1:06 PM
21 points
0 comments · 9 min read · LW link

SAEs are highly dataset dependent: a case study on the refusal direction

Nov 7, 2024, 5:22 AM
66 points
4 comments · 14 min read · LW link

Analyzing how SAE features evolve across a forward pass

Nov 7, 2024, 10:07 PM
47 points
0 comments · 1 min read · LW link
(arxiv.org)

Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders

PaulPauls · Nov 24, 2024, 5:45 AM
19 points
3 comments · 1 min read · LW link
(github.com)

Takeaways From Our Recent Work on SAE Probing

Mar 3, 2025, 7:50 PM
30 points
0 comments · 5 min read · LW link

Topological Data Analysis and Mechanistic Interpretability

Gunnar Carlsson · Feb 24, 2025, 7:56 PM
14 points
4 comments · 7 min read · LW link

Scaling Sparse Feature Circuit Finding to Gemma 9B

Jan 10, 2025, 11:08 AM
86 points
11 comments · 17 min read · LW link

Finding Features Causally Upstream of Refusal

Jan 14, 2025, 2:30 AM
48 points
5 comments · 12 min read · LW link

Empirical Insights into Feature Geometry in Sparse Autoencoders

Jason Boxi Zhang · Jan 24, 2025, 7:02 PM
6 points
0 comments · 11 min read · LW link

[Replication] Crosscoder-based Stage-Wise Model Diffing

Mar 22, 2025, 6:35 PM
20 points
0 comments · 7 min read · LW link

Feature Hedging: Another way correlated features break SAEs

Mar 25, 2025, 2:33 PM
13 points
0 comments · 17 min read · LW link

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Louka Ewington-Pitsos · Sep 17, 2024, 3:52 AM
6 points
2 comments · 7 min read · LW link

[Question] SAE sparse feature graph using only residual layers

Jaehyuk Lim · May 23, 2024, 1:32 PM
0 points
3 comments · 1 min read · LW link

Quick Thoughts on Scaling Monosemanticity

Joel Burget · May 23, 2024, 4:22 PM
28 points
1 comment · 4 min read · LW link
(transformer-circuits.pub)

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen · Dec 5, 2024, 7:24 PM
5 points
2 comments · 10 min read · LW link

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen · Feb 18, 2025, 10:16 PM
8 points
2 comments · 10 min read · LW link
(www.lesswrong.com)

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache

Louka Ewington-Pitsos · Aug 24, 2024, 7:39 AM
17 points
0 comments · 5 min read · LW link

Sparse Autoencoder Features for Classifications and Transferability

Shan23Chen · Feb 18, 2025, 10:14 PM
5 points
0 comments · 1 min read · LW link
(arxiv.org)

Do sparse autoencoders find “true features”?

Demian Till · Feb 22, 2024, 6:06 PM
74 points
33 comments · 11 min read · LW link

Sparse Autoencoders: Future Work

Sep 21, 2023, 3:30 PM
35 points
5 comments · 6 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Dec 11, 2024, 6:30 AM
82 points
6 comments · 2 min read · LW link
(www.neuronpedia.org)

Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features

Mar 15, 2024, 4:30 PM
26 points
5 comments · 4 min read · LW link

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Feb 27, 2024, 2:43 AM
43 points
16 comments · 15 min read · LW link

Normalizing Sparse Autoencoders

Fengyuan Hu · Apr 8, 2024, 6:17 AM
21 points
18 comments · 13 min read · LW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

Jan 14, 2024, 2:06 AM
23 points
0 comments · 42 min read · LW link

Some additional SAE thoughts

Hoagy · Jan 13, 2024, 7:31 PM
31 points
4 comments · 13 min read · LW link

[Replication] Conjecture’s Sparse Coding in Toy Models

Jun 2, 2023, 5:34 PM
24 points
0 comments · 1 min read · LW link

Some open-source dictionaries and dictionary learning infrastructure

Sam Marks · Dec 5, 2023, 6:05 AM
46 points
7 comments · 5 min read · LW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy · Jul 17, 2023, 1:41 AM
57 points
1 comment · 7 min read · LW link

(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders

Logan Riggs · Jul 5, 2023, 4:49 PM
60 points
1 comment · 7 min read · LW link

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI · Mar 5, 2024, 1:55 PM
61 points
24 comments · 10 min read · LW link
(aizi.substack.com)

Sparse autoencoders find composed features in small toy models

Mar 14, 2024, 6:00 PM
33 points
12 comments · 15 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah · Nov 17, 2023, 1:54 PM
15 points
6 comments · 2 min read · LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné · Sep 23, 2023, 4:21 PM
30 points
8 comments · 5 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

Oct 3, 2023, 7:45 AM
17 points
0 comments · 5 min read · LW link

Explaining “Taking features out of superposition with sparse autoencoders”

Robert_AIZI · Jun 16, 2023, 1:59 PM
10 points
0 comments · 8 min read · LW link
(aizi.substack.com)

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI · Oct 7, 2023, 11:30 PM
137 points
8 comments · 4 min read · LW link

A small update to the Sparse Coding interim research report

Apr 30, 2023, 7:54 PM
61 points
5 comments · 1 min read · LW link

Finding Sparse Linear Connections between Features in LLMs

Dec 9, 2023, 2:27 AM
70 points
5 comments · 10 min read · LW link

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell · Sep 23, 2023, 7:16 PM
42 points
7 comments · 34 min read · LW link

Transformer Debugger

Henk Tillman · Mar 12, 2024, 7:08 PM
25 points
0 comments · 1 min read · LW link
(github.com)

Past Tense Features

Can · Apr 20, 2024, 2:34 PM
12 points
0 comments · 4 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

Apr 30, 2024, 5:58 PM
72 points
14 comments · 17 min read · LW link

Massive Activations and why <bos> is important in Tokenized SAE Unigrams

Louka Ewington-Pitsos · Sep 5, 2024, 2:19 AM
1 point
0 comments · 3 min read · LW link

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Sep 6, 2024, 2:28 AM
28 points
0 comments · 12 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

May 17, 2024, 4:25 PM
57 points
20 comments · 4 min read · LW link
(arxiv.org)

Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.

Andrew Quaisley · Jun 14, 2024, 12:57 AM
17 points
5 comments · 12 min read · LW link

[Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)

Fernando Avalos · Sep 9, 2024, 3:33 AM
6 points
1 comment · 1 min read · LW link
(forum.effectivealtruism.org)

Sparse Features Through Time

Rogan Inglis · Jun 24, 2024, 6:06 PM
12 points
1 comment · 1 min read · LW link
(roganinglis.io)

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]

Jan 6, 2025, 4:22 AM
19 points
0 comments · 12 min read · LW link

Activation Pattern SVD: A proposal for SAE Interpretability

Daniel Tan · Jun 28, 2024, 10:12 PM
15 points
2 comments · 2 min read · LW link

Matryoshka Sparse Autoencoders

Noa Nabeshima · Dec 14, 2024, 2:52 AM
91 points
15 comments · 11 min read · LW link

[Interim research report] Activation plateaus & sensitive directions in GPT2

Jul 5, 2024, 5:05 PM
65 points
2 comments · 5 min read · LW link

Faithful vs Interpretable Sparse Autoencoder Evals

Louka Ewington-Pitsos · Jul 12, 2024, 5:37 AM
2 points
0 comments · 12 min read · LW link

Deceptive agents can collude to hide dangerous features in SAEs

Jul 15, 2024, 5:07 PM
33 points
2 comments · 7 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Jul 18, 2024, 2:15 PM
119 points
18 comments · 18 min read · LW link

BatchTopK: A Simple Improvement for TopK-SAEs

Jul 20, 2024, 2:20 AM
53 points
0 comments · 4 min read · LW link

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

Dec 20, 2024, 3:16 PM
32 points
0 comments · 37 min read · LW link

Initial Experiments Using SAEs to Help Detect AI Generated Text

Aaron_Scher · Jul 22, 2024, 5:16 AM
17 points
0 comments · 14 min read · LW link

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

Aug 17, 2024, 1:16 AM
53 points
0 comments · 5 min read · LW link

Learning Multi-Level Features with Matryoshka SAEs

Dec 19, 2024, 3:59 PM
35 points
4 comments · 11 min read · LW link

Understanding Positional Features in Layer 0 SAEs

Jul 29, 2024, 9:36 AM
43 points
0 comments · 5 min read · LW link

Open Source Automated Interpretability for Sparse Autoencoder Features

Jul 30, 2024, 9:11 PM
67 points
1 comment · 13 min read · LW link
(blog.eleuther.ai)

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham · Feb 23, 2025, 5:46 AM
29 points
8 comments · 8 min read · LW link

Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning

Tom Angsten · Jul 30, 2024, 4:36 PM
6 points
0 comments · 9 min read · LW link

Evaluating Sparse Autoencoders with Board Game Models

Aug 2, 2024, 7:50 PM
38 points
1 comment · 9 min read · LW link