
Sparse Autoencoders (SAEs)

Last edit: 6 Apr 2024 9:14 UTC by Joseph Bloom

Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sparse sum of interpretable components (often referred to as features). Sparse Autoencoders may be useful for interpretability and related alignment agendas.
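
For a concrete picture of the technique, here is a minimal sketch of the standard SAE setup in PyTorch (all names, sizes, and hyperparameters are illustrative, not taken from any particular post below): a linear encoder with a ReLU produces sparse, non-negative feature activations, a linear decoder reconstructs the original activation as a sum of feature directions, and training minimizes reconstruction error plus an L1 sparsity penalty.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch; d_hidden is typically several times d_model."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction as a sum of decoder directions
        return x_hat, f

# Toy training step on random stand-in "activations" (illustrative only).
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # trades off sparsity against reconstruction fidelity

x = torch.randn(64, 512)  # in practice: activations cached from a language model
x_hat, f = sae(x)
loss = (x_hat - x).pow(2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
opt.zero_grad()
loss.backward()
opt.step()
```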

For more information on SAEs see:

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds, 5 Oct 2023 21:01 UTC
288 points
22 comments · 2 min read · LW link · 1 review
(transformer-circuits.pub)

[Interim research report] Taking features out of superposition with sparse autoencoders

13 Dec 2022 15:41 UTC
149 points
23 comments · 22 min read · LW link · 2 reviews

Interpretability with Sparse Autoencoders (Colab exercises)

CallumMcDougall, 29 Nov 2023 12:56 UTC
74 points
9 comments · 4 min read · LW link

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom, 2 Feb 2024 6:54 UTC
102 points
37 comments · 15 min read · LW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

21 Sep 2023 15:30 UTC
159 points
8 comments · 5 min read · LW link

Attention SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
77 points
4 comments · 8 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
72 points
0 comments · 3 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
83 points
9 comments · 18 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks, 18 Apr 2024 16:17 UTC
107 points
10 comments · 12 min read · LW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
61 points
0 comments · 12 min read · LW link

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
39 points
12 comments · 12 min read · LW link

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

25 Sep 2024 9:31 UTC
71 points
16 comments · 3 min read · LW link
(arxiv.org)

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey, 3 Apr 2024 12:34 UTC
94 points
22 comments · 22 min read · LW link

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
66 points
0 comments · 14 min read · LW link

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

29 Sep 2024 19:37 UTC
26 points
8 comments · 25 min read · LW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide, 22 Jul 2024 18:45 UTC
118 points
19 comments · 12 min read · LW link

Comments on Anthropic’s Scaling Monosemanticity

Robert_AIZI, 3 Jun 2024 12:15 UTC
97 points
8 comments · 7 min read · LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
77 points
10 comments · 8 min read · LW link

My best guess at the important tricks for training 1L SAEs

Arthur Conmy, 21 Dec 2023 1:59 UTC
37 points
4 comments · 3 min read · LW link

Showing SAE Latents Are Not Atomic Using Meta-SAEs

24 Aug 2024 0:56 UTC
61 points
9 comments · 20 min read · LW link

SAE reconstruction errors are (empirically) pathological

wesg, 29 Mar 2024 16:37 UTC
105 points
16 comments · 8 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
92 points
7 comments · 7 min read · LW link

Scaling and evaluating sparse autoencoders

leogao, 6 Jun 2024 22:50 UTC
106 points
6 comments · 1 min read · LW link

A Selection of Randomly Selected SAE Features

1 Apr 2024 9:09 UTC
109 points
2 comments · 4 min read · LW link

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda, 23 Oct 2023 22:38 UTC
93 points
12 comments · 9 min read · LW link

SAE-VIS: Announcement Post

31 Mar 2024 15:30 UTC
74 points
8 comments · 1 min read · LW link

Addressing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
86 points
4 comments · 10 min read · LW link

SAEs (usually) Transfer Between Base and Chat Models

18 Jul 2024 10:29 UTC
66 points
0 comments · 10 min read · LW link

An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation

7 Oct 2024 8:53 UTC
38 points
0 comments · 5 min read · LW link
(arxiv.org)

Self-explaining SAE features

5 Aug 2024 22:20 UTC
60 points
13 comments · 10 min read · LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
63 points
38 comments · 1 min read · LW link
(arxiv.org)

JumpReLU SAEs + Early Access to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
48 points
10 comments · 1 min read · LW link
(storage.googleapis.com)

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart, 23 Apr 2024 14:09 UTC
44 points
4 comments · 9 min read · LW link

[Linkpost] Play with SAEs on Llama 3

25 Sep 2024 22:35 UTC
40 points
2 comments · 1 min read · LW link

A gentle introduction to sparse autoencoders

Nick Jiang, 2 Sep 2024 18:11 UTC
9 points
0 comments · 6 min read · LW link

Interpreting Preference Models w/ Sparse Autoencoders

1 Jul 2024 21:35 UTC
74 points
12 comments · 9 min read · LW link

SAEs Discover Meaningful Features in the IOI Task

5 Jun 2024 23:48 UTC
15 points
2 comments · 10 min read · LW link

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry, 29 Apr 2024 20:57 UTC
92 points
8 comments · 11 min read · LW link

Exploring SAE features in LLMs with definition trees and token lists

mwatkins, 4 Oct 2024 22:15 UTC
37 points
5 comments · 6 min read · LW link

HDBSCAN is Surprisingly Effective at Finding Interpretable Clusters of the SAE Decoder Matrix

11 Oct 2024 23:06 UTC
8 points
2 comments · 10 min read · LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

19 Jul 2024 20:32 UTC
59 points
6 comments · 16 min read · LW link

How to Better Report Sparse Autoencoder Performance

J Bostock, 2 Jun 2024 19:34 UTC
20 points
4 comments · 3 min read · LW link

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill, 24 Mar 2024 20:05 UTC
28 points
4 comments · 24 min read · LW link

An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs

Adam Karvonen, 25 Jun 2024 15:57 UTC
25 points
0 comments · 9 min read · LW link
(adamkarvonen.github.io)

Interpreting and Steering Features in Images

Gytis Daujotas, 20 Jun 2024 18:33 UTC
65 points
6 comments · 5 min read · LW link

Causal Graphs of GPT-2-Small’s Residual Stream

David Udell, 9 Jul 2024 22:06 UTC
53 points
7 comments · 7 min read · LW link

On the Practical Applications of Interpretability

Nick Jiang, 15 Oct 2024 17:18 UTC
3 points
0 comments · 7 min read · LW link

[Replication] Conjecture’s Sparse Coding in Small Transformers

16 Jun 2023 18:02 UTC
52 points
0 comments · 5 min read · LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda, 7 Jul 2024 17:39 UTC
134 points
15 comments · 25 min read · LW link

SAEs you can See: Applying Sparse Autoencoders to Clustering

Robert_AIZI, 28 Oct 2024 14:48 UTC
27 points
0 comments · 10 min read · LW link

Attention Output SAEs Improve Circuit Analysis

21 Jun 2024 12:56 UTC
33 points
1 comment · 19 min read · LW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

23 Aug 2024 18:52 UTC
40 points
5 comments · 16 min read · LW link

Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders

Gytis Daujotas, 1 Aug 2024 21:08 UTC
44 points
6 comments · 7 min read · LW link

The ‘strong’ feature hypothesis could be wrong

lewis smith, 2 Aug 2024 14:33 UTC
221 points
17 comments · 17 min read · LW link

Tokenized SAEs: Infusing per-token biases.

4 Aug 2024 9:17 UTC
19 points
20 comments · 15 min read · LW link

Excursions into Sparse Autoencoders: What is monosemanticity?

Jakub Smékal, 5 Aug 2024 19:22 UTC
2 points
0 comments · 10 min read · LW link

Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning

Tom Angsten, 30 Jul 2024 16:36 UTC
6 points
0 comments · 9 min read · LW link

Evaluating Sparse Autoencoders with Board Game Models

2 Aug 2024 19:50 UTC
38 points
1 comment · 9 min read · LW link

Extracting SAE task features for in-context learning

12 Aug 2024 20:34 UTC
31 points
1 comment · 9 min read · LW link

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

25 Sep 2024 20:37 UTC
27 points
0 comments · 3 min read · LW link
(arxiv.org)

Toy Models of Superposition: Simplified by Hand

Axel Sorensen, 29 Sep 2024 21:19 UTC
9 points
3 comments · 8 min read · LW link

LLMs are likely not conscious

research_prime_space, 29 Sep 2024 20:57 UTC
6 points
9 comments · 1 min read · LW link

Toy Models of Feature Absorption in SAEs

7 Oct 2024 9:56 UTC
49 points
8 comments · 10 min read · LW link

Interpretability of SAE Features Representing Check in ChessGPT

Jonathan Kutasov, 5 Oct 2024 20:43 UTC
27 points
2 comments · 8 min read · LW link

Domain-specific SAEs

jacob_drori, 7 Oct 2024 20:15 UTC
27 points
0 comments · 5 min read · LW link

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution

Kola Ayonrinde, 30 Oct 2024 22:50 UTC
27 points
0 comments · 12 min read · LW link

SAE features for refusal and sycophancy steering vectors

12 Oct 2024 14:54 UTC
26 points
4 comments · 7 min read · LW link

It’s important to know when to stop: Mechanistic Exploration of Gemma 2 List Generation

Gerard Boxo, 14 Oct 2024 17:04 UTC
8 points
0 comments · 6 min read · LW link
(gboxo.github.io)

A suite of Vision Sparse Autoencoders

27 Oct 2024 4:05 UTC
25 points
0 comments · 1 min read · LW link

SAE Probing: What is it good for? Absolutely something!

1 Nov 2024 19:23 UTC
31 points
0 comments · 11 min read · LW link

Evolutionary prompt optimization for SAE feature visualization

14 Nov 2024 13:06 UTC
16 points
0 comments · 9 min read · LW link

SAEs are highly dataset dependent: a case study on the refusal direction

7 Nov 2024 5:22 UTC
63 points
4 comments · 14 min read · LW link

Analyzing how SAE features evolve across a forward pass

7 Nov 2024 22:07 UTC
47 points
0 comments · 1 min read · LW link
(arxiv.org)

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

17 Aug 2024 1:16 UTC
53 points
0 comments · 5 min read · LW link

Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders

PaulPauls, 24 Nov 2024 5:45 UTC
20 points
3 comments · 1 min read · LW link
(github.com)

[Question] SAE sparse feature graph using only residual layers

Jaehyuk Lim, 23 May 2024 13:32 UTC
0 points
3 comments · 1 min read · LW link

Quick Thoughts on Scaling Monosemanticity

Joel Burget, 23 May 2024 16:22 UTC
28 points
1 comment · 4 min read · LW link
(transformer-circuits.pub)

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen, 5 Dec 2024 19:24 UTC
4 points
0 comments · 10 min read · LW link

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache

Louka Ewington-Pitsos, 24 Aug 2024 7:39 UTC
17 points
0 comments · 5 min read · LW link

Do sparse autoencoders find “true features”?

Demian Till, 22 Feb 2024 18:06 UTC
73 points
33 comments · 11 min read · LW link

Sparse Autoencoders: Future Work

21 Sep 2023 15:30 UTC
35 points
5 comments · 6 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
71 points
1 comment · 2 min read · LW link
(www.neuronpedia.org)

Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features

15 Mar 2024 16:30 UTC
26 points
5 comments · 4 min read · LW link

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

27 Feb 2024 2:43 UTC
42 points
16 comments · 15 min read · LW link

Normalizing Sparse Autoencoders

Fengyuan Hu, 8 Apr 2024 6:17 UTC
21 points
18 comments · 13 min read · LW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

14 Jan 2024 2:06 UTC
23 points
0 comments · 42 min read · LW link

Some additional SAE thoughts

Hoagy, 13 Jan 2024 19:31 UTC
30 points
4 comments · 13 min read · LW link

[Replication] Conjecture’s Sparse Coding in Toy Models

2 Jun 2023 17:34 UTC
24 points
0 comments · 1 min read · LW link

Some open-source dictionaries and dictionary learning infrastructure

Sam Marks, 5 Dec 2023 6:05 UTC
45 points
7 comments · 5 min read · LW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy, 17 Jul 2023 1:41 UTC
56 points
1 comment · 7 min read · LW link

(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders

Logan Riggs, 5 Jul 2023 16:49 UTC
60 points
1 comment · 7 min read · LW link

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI, 5 Mar 2024 13:55 UTC
61 points
24 comments · 10 min read · LW link
(aizi.substack.com)

Sparse autoencoders find composed features in small toy models

14 Mar 2024 18:00 UTC
33 points
12 comments · 15 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah, 17 Nov 2023 13:54 UTC
15 points
6 comments · 2 min read · LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné, 23 Sep 2023 16:21 UTC
30 points
8 comments · 5 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

3 Oct 2023 7:45 UTC
17 points
0 comments · 5 min read · LW link

Explaining “Taking features out of superposition with sparse autoencoders”

Robert_AIZI, 16 Jun 2023 13:59 UTC
10 points
0 comments · 8 min read · LW link
(aizi.substack.com)

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI, 7 Oct 2023 23:30 UTC
137 points
8 comments · 4 min read · LW link

A small update to the Sparse Coding interim research report

30 Apr 2023 19:54 UTC
61 points
5 comments · 1 min read · LW link

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
69 points
5 comments · 10 min read · LW link

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell, 23 Sep 2023 19:16 UTC
42 points
7 comments · 34 min read · LW link

Transformer Debugger

Henk Tillman, 12 Mar 2024 19:08 UTC
25 points
0 comments · 1 min read · LW link
(github.com)

Past Tense Features

Can, 20 Apr 2024 14:34 UTC
12 points
0 comments · 4 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
71 points
14 comments · 17 min read · LW link

Massive Activations and why <bos> is important in Tokenized SAE Unigrams

Louka Ewington-Pitsos, 5 Sep 2024 2:19 UTC
1 point
0 comments · 3 min read · LW link

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

6 Sep 2024 2:28 UTC
28 points
0 comments · 12 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

17 May 2024 16:25 UTC
57 points
10 comments · 4 min read · LW link
(arxiv.org)

Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.

Andrew Quaisley, 14 Jun 2024 0:57 UTC
17 points
5 comments · 12 min read · LW link

[Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)

Fernando Avalos, 9 Sep 2024 3:33 UTC
6 points
1 comment · 1 min read · LW link
(forum.effectivealtruism.org)

Sparse Features Through Time

Rogan Inglis, 24 Jun 2024 18:06 UTC
12 points
1 comment · 1 min read · LW link
(roganinglis.io)

Activation Pattern SVD: A proposal for SAE Interpretability

Daniel Tan, 28 Jun 2024 22:12 UTC
15 points
2 comments · 2 min read · LW link

Matryoshka Sparse Autoencoders

Noa Nabeshima, 14 Dec 2024 2:52 UTC
74 points
7 comments · 11 min read · LW link

[Interim research report] Activation plateaus & sensitive directions in GPT2

5 Jul 2024 17:05 UTC
65 points
2 comments · 5 min read · LW link

Faithful vs Interpretable Sparse Autoencoder Evals

Louka Ewington-Pitsos, 12 Jul 2024 5:37 UTC
2 points
0 comments · 12 min read · LW link

Deceptive agents can collude to hide dangerous features in SAEs

15 Jul 2024 17:07 UTC
33 points
2 comments · 7 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
117 points
18 comments · 18 min read · LW link

BatchTopK: A Simple Improvement for TopK-SAEs

20 Jul 2024 2:20 UTC
52 points
0 comments · 4 min read · LW link

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

Matthew A. Clarke, 20 Dec 2024 15:16 UTC
1 point
0 comments · 37 min read · LW link

Initial Experiments Using SAEs to Help Detect AI Generated Text

Aaron_Scher, 22 Jul 2024 5:16 UTC
17 points
0 comments · 14 min read · LW link

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Louka Ewington-Pitsos, 17 Sep 2024 3:52 UTC
6 points
2 comments · 7 min read · LW link

Learning Multi-Level Features with Matryoshka SAEs

19 Dec 2024 15:59 UTC
25 points
1 comment · 11 min read · LW link

Understanding Positional Features in Layer 0 SAEs

29 Jul 2024 9:36 UTC
43 points
0 comments · 5 min read · LW link

Open Source Automated Interpretability for Sparse Autoencoder Features

30 Jul 2024 21:11 UTC
67 points
1 comment · 13 min read · LW link
(blog.eleuther.ai)