
Interpretability (ML & AI)

Last edit: Jan 22, 2025, 4:27 PM by Dakara

Interpretability is the extent to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable. You can use a model’s output, but the model can’t tell you why it produced that output. This makes it hard to determine the cause of biases in ML models.

A prominent subfield of neural network interpretability is mechanistic interpretability, which attempts to understand how neural networks carry out the computations behind their behavior, for example by finding circuits in transformer models. This can be contrasted with subfields of interpretability that seek to attribute a model’s output to parts of a specific input, such as identifying which pixels in an input image caused a computer vision model to output the classification “horse”.
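As a concrete illustration of the input-attribution flavor of interpretability described above, here is a minimal sketch of a gradient-based saliency map in PyTorch. It is not drawn from any of the posts listed below; the choice of ResNet-18, the file name "horse.jpg", and the use of raw input gradients are illustrative assumptions, and practical attribution methods (integrated gradients, SHAP, etc.) are more involved.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# A minimal sketch: any differentiable image classifier would do here.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

preprocess = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# "horse.jpg" is a hypothetical input image, not a file referenced by this page.
image = Image.open("horse.jpg").convert("RGB")
x = preprocess(image).unsqueeze(0)
x.requires_grad_(True)

# Forward pass, then back-propagate the logit of the predicted class
# down to the input pixels.
logits = model(x)
pred_class = logits.argmax(dim=1).item()
logits[0, pred_class].backward()

# The per-pixel gradient magnitude is a crude attribution map: large values
# mark pixels whose perturbation most changes the predicted-class logit.
saliency = x.grad.abs().max(dim=1)[0].squeeze(0)  # shape: (224, 224)
```

Visualizing `saliency` as a heatmap over the image gives a rough answer to "which pixels mattered for this prediction", which is the kind of question input-attribution methods address; mechanistic interpretability instead asks how the network's internal components compute the answer.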

See Also

Research

A small up­date to the Sparse Cod­ing in­terim re­search report

Apr 30, 2023, 7:54 PM
61 points
5 comments1 min readLW link

In­ter­pretabil­ity in ML: A Broad Overview

lifelonglearnerAug 4, 2020, 7:03 PM
53 points
5 comments15 min readLW link

Ti­maeus’s First Four Months

Feb 28, 2024, 5:01 PM
172 points
6 comments6 min readLW link

[In­terim re­search re­port] Tak­ing fea­tures out of su­per­po­si­tion with sparse autoencoders

Dec 13, 2022, 3:41 PM
149 points
23 comments22 min readLW link2 reviews

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of Grokking

Aug 15, 2022, 2:41 AM
373 points
47 comments36 min readLW link1 review
(colab.research.google.com)

Toward A Math­e­mat­i­cal Frame­work for Com­pu­ta­tion in Superposition

Jan 18, 2024, 9:06 PM
203 points
18 comments63 min readLW link

A Longlist of The­o­ries of Im­pact for Interpretability

Neel NandaMar 11, 2022, 2:55 PM
127 points
41 comments5 min readLW link2 reviews

Re-Ex­am­in­ing LayerNorm

Eric WinsorDec 1, 2022, 10:20 PM
127 points
12 comments5 min readLW link

Find­ing Neu­rons in a Haystack: Case Stud­ies with Sparse Probing

May 3, 2023, 1:30 PM
33 points
6 comments2 min readLW link1 review
(arxiv.org)

200 Con­crete Open Prob­lems in Mechanis­tic In­ter­pretabil­ity: Introduction

Neel NandaDec 28, 2022, 9:06 PM
106 points
0 comments10 min readLW link

Chris Olah’s views on AGI safety

evhubNov 1, 2019, 8:13 PM
207 points
38 comments12 min readLW link2 reviews

How To Go From In­ter­pretabil­ity To Align­ment: Just Re­tar­get The Search

johnswentworthAug 10, 2022, 4:08 PM
209 points
34 comments3 min readLW link1 review

The Sin­gu­lar Value De­com­po­si­tions of Trans­former Weight Ma­tri­ces are Highly Interpretable

Nov 28, 2022, 12:54 PM
199 points
33 comments31 min readLW link

[Question] Papers to start get­ting into NLP-fo­cused al­ign­ment research

FeraidoonSep 24, 2022, 11:53 PM
6 points
0 comments1 min readLW link

Search­ing for Search

Nov 28, 2022, 3:31 PM
94 points
9 comments14 min readLW link1 review

A Rocket–In­ter­pretabil­ity Analogy

plexOct 21, 2024, 1:55 PM
149 points
31 comments1 min readLW link

A Prob­lem to Solve Be­fore Build­ing a De­cep­tion Detector

Feb 7, 2025, 7:35 PM
62 points
8 comments14 min readLW link

In­ter­pret­ing Neu­ral Net­works through the Poly­tope Lens

Sep 23, 2022, 5:58 PM
144 points
29 comments33 min readLW link

Strik­ing Im­pli­ca­tions for Learn­ing The­ory, In­ter­pretabil­ity — and Safety?

RogerDearnaleyJan 5, 2024, 8:46 AM
37 points
4 comments2 min readLW link

Resi­d­ual stream norms grow ex­po­nen­tially over the for­ward pass

May 7, 2023, 12:46 AM
77 points
24 comments11 min readLW link

Effi­cient Dic­tionary Learn­ing with Switch Sparse Autoencoders

Anish MudideJul 22, 2024, 6:45 PM
118 points
19 comments12 min readLW link

ParaS­cope: Do Lan­guage Models Plan the Up­com­ing Para­graph?

NickyPFeb 21, 2025, 4:50 PM
33 points
0 comments20 min readLW link

Against Al­most Every The­ory of Im­pact of Interpretability

Charbel-RaphaëlAug 17, 2023, 6:44 PM
329 points
90 comments26 min readLW link2 reviews

Towards Monose­man­tic­ity: De­com­pos­ing Lan­guage Models With Dic­tionary Learning

Zac Hatfield-DoddsOct 5, 2023, 9:01 PM
288 points
22 comments2 min readLW link1 review
(transformer-circuits.pub)

An­nounc­ing Apollo Research

May 30, 2023, 4:17 PM
217 points
11 comments8 min readLW link

How to use and in­ter­pret ac­ti­va­tion patching

Apr 24, 2024, 8:35 AM
12 points
2 comments18 min readLW link

Do Sparse Au­toen­coders (SAEs) trans­fer across base and fine­tuned lan­guage mod­els?

Sep 29, 2024, 7:37 PM
26 points
8 comments25 min readLW link

How In­ter­pretabil­ity can be Impactful

Connall GarrodJul 18, 2022, 12:06 AM
18 points
0 comments37 min readLW link

The ‘strong’ fea­ture hy­poth­e­sis could be wrong

lewis smithAug 2, 2024, 2:33 PM
222 points
19 comments17 min readLW link

A trans­parency and in­ter­pretabil­ity tech tree

evhubJun 16, 2022, 11:44 PM
163 points
11 comments18 min readLW link1 review

Trans­parency and AGI safety

jylin04Jan 11, 2021, 6:51 PM
54 points
12 comments30 min readLW link

Spar­sify: A mechanis­tic in­ter­pretabil­ity re­search agenda

Lee SharkeyApr 3, 2024, 12:34 PM
95 points
22 comments22 min readLW link

SAE re­con­struc­tion er­rors are (em­piri­cally) pathological

wesgMar 29, 2024, 4:37 PM
105 points
16 comments8 min readLW link

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. MurphyMay 12, 2022, 8:01 PM
58 points
0 comments59 min readLW link

Take­aways From 3 Years Work­ing In Ma­chine Learning

George3d6Apr 8, 2022, 5:14 PM
35 points
10 comments11 min readLW link
(www.epistem.ink)

Com­ments on An­thropic’s Scal­ing Monosemanticity

Robert_AIZIJun 3, 2024, 12:15 PM
97 points
8 comments7 min readLW link

The Case for Rad­i­cal Op­ti­mism about Interpretability

Quintin PopeDec 16, 2021, 11:38 PM
66 points
16 comments8 min readLW link1 review

Opinions on In­ter­pretable Ma­chine Learn­ing and 70 Sum­maries of Re­cent Papers

Apr 9, 2021, 7:19 PM
141 points
17 comments102 min readLW link

Ideation and Tra­jec­tory Model­ling in Lan­guage Models

NickyPOct 5, 2023, 7:21 PM
16 points
2 comments10 min readLW link

What is In­ter­pretabil­ity?

Mar 17, 2020, 8:23 PM
35 points
0 comments11 min readLW link

Ma­chine Un­learn­ing Eval­u­a­tions as In­ter­pretabil­ity Benchmarks

Oct 23, 2023, 4:33 PM
33 points
2 comments11 min readLW link

Trans­former Circuits

evhubDec 22, 2021, 9:09 PM
144 points
4 comments3 min readLW link
(transformer-circuits.pub)

In­ter­pret­ing the Learn­ing of Deceit

RogerDearnaleyDec 18, 2023, 8:12 AM
30 points
14 comments9 min readLW link

Ac­tu­ally, Othello-GPT Has A Lin­ear Emer­gent World Representation

Neel NandaMar 29, 2023, 10:13 PM
211 points
26 comments19 min readLW link
(neelnanda.io)

The Lo­cal In­ter­ac­tion Ba­sis: Iden­ti­fy­ing Com­pu­ta­tion­ally-Rele­vant and Sparsely In­ter­act­ing Fea­tures in Neu­ral Networks

May 20, 2024, 5:53 PM
105 points
4 comments3 min readLW link

Fact Find­ing: At­tempt­ing to Re­v­erse-Eng­ineer Fac­tual Re­call on the Neu­ron Level (Post 1)

Dec 23, 2023, 2:44 AM
106 points
10 comments22 min readLW link2 reviews

SAE fea­ture ge­om­e­try is out­side the su­per­po­si­tion hypothesis

jake_mendelJun 24, 2024, 4:07 PM
227 points
17 comments11 min readLW link

Steer­ing GPT-2-XL by adding an ac­ti­va­tion vector

May 13, 2023, 6:42 PM
437 points
98 comments50 min readLW link1 review

My ten­ta­tive in­ter­pretabil­ity re­search agenda—topol­ogy match­ing.

Maxwell ClarkeOct 8, 2022, 10:14 PM
10 points
2 comments4 min readLW link

SAE reg­u­lariza­tion pro­duces more in­ter­pretable models

Jan 28, 2025, 8:02 PM
21 points
7 comments4 min readLW link

At­tri­bu­tion-based pa­ram­e­ter decomposition

Jan 25, 2025, 1:12 PM
105 points
17 comments4 min readLW link
(publications.apolloresearch.ai)

Open Source Sparse Au­toen­coders for all Resi­d­ual Stream Lay­ers of GPT2-Small

Joseph BloomFeb 2, 2024, 6:54 AM
103 points
37 comments15 min readLW link

Com­pact Proofs of Model Perfor­mance via Mechanis­tic Interpretability

Jun 24, 2024, 7:27 PM
96 points
4 comments8 min readLW link
(arxiv.org)

The Plan − 2022 Update

johnswentworthDec 1, 2022, 8:43 PM
239 points
37 comments8 min readLW link1 review

(ten­ta­tively) Found 600+ Monose­man­tic Fea­tures in a Small LM Us­ing Sparse Autoencoders

Logan RiggsJul 5, 2023, 4:49 PM
60 points
1 comment7 min readLW link

A Com­pre­hen­sive Mechanis­tic In­ter­pretabil­ity Ex­plainer & Glossary

Neel NandaDec 21, 2022, 12:35 PM
91 points
6 comments2 min readLW link
(neelnanda.io)

The­o­ries of im­pact for Science of Deep Learning

Marius HobbhahnDec 1, 2022, 2:39 PM
24 points
0 comments11 min readLW link

LLMs Univer­sally Learn a Fea­ture Rep­re­sent­ing To­ken Fre­quency /​ Rarity

Sean OsierJun 30, 2024, 2:48 AM
12 points
5 comments6 min readLW link
(github.com)

Us­ing GPT-N to Solve In­ter­pretabil­ity of Neu­ral Net­works: A Re­search Agenda

Sep 3, 2020, 6:27 PM
68 points
11 comments2 min readLW link

Ba­sic facts about lan­guage mod­els dur­ing training

berenFeb 21, 2023, 11:46 AM
98 points
15 comments18 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin PopeOct 13, 2021, 8:52 PM
9 points
0 comments2 min readLW link

In­tro­duc­tion to in­ac­cessible information

Ryan KiddDec 9, 2021, 1:28 AM
27 points
6 comments8 min readLW link

Mechanis­tic Ano­maly De­tec­tion Re­search Update

Aug 6, 2024, 10:33 AM
11 points
0 comments1 min readLW link
(blog.eleuther.ai)

Towards Mul­ti­modal In­ter­pretabil­ity: Learn­ing Sparse In­ter­pretable Fea­tures in Vi­sion Transformers

hugofryApr 29, 2024, 8:57 PM
92 points
8 comments11 min readLW link

SolidGoldMag­ikarp (plus, prompt gen­er­a­tion)

Feb 5, 2023, 10:02 PM
680 points
206 comments12 min readLW link1 review

LLM Mo­du­lar­ity: The Separa­bil­ity of Ca­pa­bil­ities in Large Lan­guage Models

NickyPMar 26, 2023, 9:57 PM
99 points
3 comments41 min readLW link

A List of 45+ Mech In­terp Pro­ject Ideas from Apollo Re­search’s In­ter­pretabil­ity Team

Jul 18, 2024, 2:15 PM
118 points
18 comments18 min readLW link

200 COP in MI: In­ter­pret­ing Al­gorith­mic Problems

Neel NandaDec 31, 2022, 7:55 PM
33 points
2 comments10 min readLW link

Is In­ter­pretabil­ity All We Need?

RogerDearnaleyNov 14, 2023, 5:31 AM
1 point
1 comment1 min readLW link

Re­fusal in LLMs is me­di­ated by a sin­gle direction

Apr 27, 2024, 11:13 AM
244 points
95 comments10 min readLW link

Cir­cum­vent­ing in­ter­pretabil­ity: How to defeat mind-readers

Lee SharkeyJul 14, 2022, 4:59 PM
114 points
15 comments33 min readLW link

EIS XIV: Is mechanis­tic in­ter­pretabil­ity about to be prac­ti­cally use­ful?

scasperOct 11, 2024, 10:13 PM
68 points
4 comments7 min readLW link

Deep learn­ing mod­els might be se­cretly (al­most) linear

berenApr 24, 2023, 6:43 PM
117 points
29 comments4 min readLW link

In­ter­pret­ing and Steer­ing Fea­tures in Images

Gytis DaujotasJun 20, 2024, 6:33 PM
65 points
6 comments5 min readLW link

MATS Ap­pli­ca­tions + Re­search Direc­tions I’m Cur­rently Ex­cited About

Neel NandaFeb 6, 2025, 11:03 AM
72 points
6 comments8 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

Dec 14, 2022, 2:33 PM
29 points
5 comments11 min readLW link

Lan­guage Models Use Tri­gonom­e­try to Do Addition

Subhash KantamneniFeb 5, 2025, 1:50 PM
70 points
1 comment10 min readLW link

Re­fusal mechanisms: ini­tial ex­per­i­ments with Llama-2-7b-chat

Dec 8, 2023, 5:08 PM
81 points
7 comments7 min readLW link

Find­ing Sparse Lin­ear Con­nec­tions be­tween Fea­tures in LLMs

Dec 9, 2023, 2:27 AM
69 points
5 comments10 min readLW link

Map­ping the se­man­tic void: Strange go­ings-on in GPT em­bed­ding spaces

mwatkinsDec 14, 2023, 1:10 PM
114 points
31 comments14 min readLW link

EIS XIII: Reflec­tions on An­thropic’s SAE Re­search Circa May 2024

scasperMay 21, 2024, 8:15 PM
157 points
16 comments3 min readLW link

An­thropic an­nounces in­ter­pretabil­ity ad­vances. How much does this ad­vance al­ign­ment?

Seth HerdMay 21, 2024, 10:30 PM
49 points
4 comments3 min readLW link
(www.anthropic.com)

An­nounc­ing Hu­man-al­igned AI Sum­mer School

May 22, 2024, 8:55 AM
50 points
0 comments1 min readLW link
(humanaligned.ai)

Assess­ment of AI safety agen­das: think about the down­side risk

Roman LeventovDec 19, 2023, 9:00 AM
13 points
1 comment1 min readLW link

Mech In­terp Challenge: Jan­uary—De­ci­pher­ing the Cae­sar Cipher Model

CallumMcDougallJan 1, 2024, 6:03 PM
17 points
0 comments3 min readLW link

Re­search Re­port: Sparse Au­toen­coders find only 9/​180 board state fea­tures in OthelloGPT

Robert_AIZIMar 5, 2024, 1:55 PM
61 points
24 comments10 min readLW link
(aizi.substack.com)

Gra­di­ent Rout­ing: Mask­ing Gra­di­ents to Lo­cal­ize Com­pu­ta­tion in Neu­ral Networks

Dec 6, 2024, 10:19 PM
161 points
12 comments11 min readLW link
(arxiv.org)

Re­sults from the Tur­ing Sem­i­nar hackathon

Dec 7, 2023, 2:50 PM
29 points
1 comment6 min readLW link

Mea­sur­ing Struc­ture Devel­op­ment in Al­gorith­mic Transformers

Aug 22, 2024, 8:38 AM
56 points
4 comments11 min readLW link

Stage­wise Devel­op­ment in Neu­ral Networks

Mar 20, 2024, 7:54 PM
96 points
1 comment11 min readLW link

Apollo Re­search 1-year update

May 29, 2024, 5:44 PM
93 points
0 comments7 min readLW link

AI al­ign­ment as a trans­la­tion problem

Roman LeventovFeb 5, 2024, 2:14 PM
22 points
2 comments3 min readLW link

AXRP Epi­sode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

DanielFilanAug 24, 2024, 10:30 PM
21 points
0 comments74 min readLW link

Causal Graphs of GPT-2-Small’s Resi­d­ual Stream

David UdellJul 9, 2024, 10:06 PM
53 points
7 comments7 min readLW link

Difficulty classes for al­ign­ment properties

JozdienFeb 20, 2024, 9:08 AM
34 points
5 comments2 min readLW link

Im­prov­ing SAE’s by Sqrt()-ing L1 & Re­mov­ing Low­est Ac­ti­vat­ing Fea­tures

Mar 15, 2024, 4:30 PM
26 points
5 comments4 min readLW link

Ev­i­dence of Learned Look-Ahead in a Chess-Play­ing Neu­ral Network

Erik JennerJun 4, 2024, 3:50 PM
120 points
14 comments13 min readLW link

AtP*: An effi­cient and scal­able method for lo­cal­iz­ing LLM be­havi­our to components

Mar 18, 2024, 5:28 PM
19 points
0 comments1 min readLW link
(arxiv.org)

Mechanism for fea­ture learn­ing in neu­ral net­works and back­prop­a­ga­tion-free ma­chine learn­ing models

Matt GoldenbergMar 19, 2024, 2:55 PM
8 points
1 comment1 min readLW link
(www.science.org)

A Selec­tion of Ran­domly Selected SAE Features

Apr 1, 2024, 9:09 AM
109 points
2 comments4 min readLW link

SAE-VIS: An­nounce­ment Post

Mar 31, 2024, 3:30 PM
74 points
8 comments1 min readLW link

Gated At­ten­tion Blocks: Pre­limi­nary Progress to­ward Re­mov­ing At­ten­tion Head Superposition

Apr 8, 2024, 11:14 AM
42 points
4 comments15 min readLW link

Discrim­i­nat­ing Be­hav­iorally Iden­ti­cal Clas­sifiers: a model prob­lem for ap­ply­ing in­ter­pretabil­ity to scal­able oversight

Sam MarksApr 18, 2024, 4:17 PM
109 points
10 comments12 min readLW link

[Sum­mary] Progress Up­date #1 from the GDM Mech In­terp Team

Apr 19, 2024, 7:06 PM
72 points
0 comments3 min readLW link

[Full Post] Progress Up­date #1 from the GDM Mech In­terp Team

Apr 19, 2024, 7:06 PM
78 points
10 comments8 min readLW link

ProLU: A Non­lin­ear­ity for Sparse Autoencoders

Glen TaggartApr 23, 2024, 2:09 PM
44 points
4 comments9 min readLW link

Why I stopped be­ing into basin broadness

tailcalledApr 25, 2024, 8:47 PM
16 points
3 comments2 min readLW link

Su­per­po­si­tion is not “just” neu­ron polysemanticity

LawrenceCApr 26, 2024, 11:22 PM
65 points
4 comments13 min readLW link

Im­prov­ing Dic­tionary Learn­ing with Gated Sparse Autoencoders

Apr 25, 2024, 6:43 PM
63 points
38 comments1 min readLW link
(arxiv.org)

SAEs Dis­cover Mean­ingful Fea­tures in the IOI Task

Jun 5, 2024, 11:48 PM
15 points
2 comments10 min readLW link

Mechanis­tic In­ter­pretabil­ity Work­shop Hap­pen­ing at ICML 2024!

May 3, 2024, 1:18 AM
48 points
6 comments1 min readLW link

Au­tomat­ing LLM Au­dit­ing with Devel­op­men­tal Interpretability

Sep 4, 2024, 3:50 PM
19 points
0 comments3 min readLW link

“What the hell is a rep­re­sen­ta­tion, any­way?” | Clar­ify­ing AI in­ter­pretabil­ity with tools from philos­o­phy of cog­ni­tive sci­ence | Part 1: Ve­hi­cles vs. contents

IwanWilliamsJun 9, 2024, 2:19 PM
9 points
1 comment4 min readLW link

Ra­tional An­i­ma­tions’ in­tro to mechanis­tic interpretability

WriterJun 14, 2024, 4:10 PM
45 points
1 comment11 min readLW link
(youtu.be)

At­ten­tion Out­put SAEs Im­prove Cir­cuit Analysis

Jun 21, 2024, 12:56 PM
33 points
3 comments19 min readLW link

In­ter­pret­ing Prefer­ence Models w/​ Sparse Autoencoders

Jul 1, 2024, 9:35 PM
74 points
12 comments9 min readLW link

How ARENA course ma­te­rial gets made

CallumMcDougallJul 2, 2024, 6:04 PM
41 points
2 comments7 min readLW link

An Ex­tremely Opinionated An­no­tated List of My Favourite Mechanis­tic In­ter­pretabil­ity Papers v2

Neel NandaJul 7, 2024, 5:39 PM
134 points
16 comments25 min readLW link

Why I’m bear­ish on mechanis­tic in­ter­pretabil­ity: the shards are not in the network

tailcalledSep 13, 2024, 5:09 PM
22 points
40 comments1 min readLW link

JumpReLU SAEs + Early Ac­cess to Gemma 2 SAEs

Jul 19, 2024, 4:10 PM
48 points
10 comments1 min readLW link
(storage.googleapis.com)

Trans­former Dy­nam­ics: a neuro-in­spired ap­proach to MechInterp

Feb 22, 2025, 9:33 PM
9 points
0 comments5 min readLW link

Learn­ing Multi-Level Fea­tures with Ma­tryoshka SAEs

Dec 19, 2024, 3:59 PM
33 points
4 comments11 min readLW link

Physics of Lan­guage mod­els (part 2.1)

Nathan Helm-BurgerSep 19, 2024, 4:48 PM
9 points
2 comments1 min readLW link
(youtu.be)

A New Class of Glitch To­kens—BPE Subto­ken Ar­ti­facts (BSA)

Lao MeinSep 20, 2024, 1:13 PM
37 points
7 comments5 min readLW link

Why did ChatGPT say that? Prompt en­g­ineer­ing and more, with PIZZA.

Jessica RumbelowAug 3, 2024, 12:07 PM
41 points
2 comments4 min readLW link

Glitch To­ken Cat­a­log - (Al­most) a Full Clear

Lao MeinSep 21, 2024, 12:22 PM
38 points
3 comments37 min readLW link

The GDM AGI Safety+Align­ment Team is Hiring for Ap­plied In­ter­pretabil­ity Research

Feb 24, 2025, 2:17 AM
46 points
1 comment7 min readLW link

Self-ex­plain­ing SAE features

Aug 5, 2024, 10:20 PM
60 points
13 comments10 min readLW link

You can re­move GPT2’s Lay­erNorm by fine-tun­ing for an hour

StefanHexAug 8, 2024, 6:33 PM
161 points
11 comments8 min readLW link

[Linkpost] Play with SAEs on Llama 3

Sep 25, 2024, 10:35 PM
40 points
2 comments1 min readLW link

AXRP Epi­sode 36 - Adam Shai and Paul Riech­ers on Com­pu­ta­tional Mechanics

DanielFilanSep 29, 2024, 5:50 AM
25 points
0 comments55 min readLW link

Ex­plor­ing SAE fea­tures in LLMs with defi­ni­tion trees and to­ken lists

mwatkinsOct 4, 2024, 10:15 PM
37 points
5 comments6 min readLW link

HDBSCAN is Sur­pris­ingly Effec­tive at Find­ing In­ter­pretable Clusters of the SAE De­coder Matrix

Oct 11, 2024, 11:06 PM
8 points
2 comments10 min readLW link

Cir­cuits in Su­per­po­si­tion: Com­press­ing many small neu­ral net­works into one

Oct 14, 2024, 1:06 PM
129 points
8 comments13 min readLW link

The Com­pu­ta­tional Com­plex­ity of Cir­cuit Dis­cov­ery for In­ner Interpretability

Bogdan Ionut CirsteaOct 17, 2024, 1:18 PM
11 points
2 comments1 min readLW link
(arxiv.org)

SAEs you can See: Ap­ply­ing Sparse Au­toen­coders to Clustering

Robert_AIZIOct 28, 2024, 2:48 PM
27 points
0 comments10 min readLW link

AXRP Epi­sode 38.2 - Jesse Hoogland on Sin­gu­lar Learn­ing Theory

DanielFilanNov 27, 2024, 6:30 AM
34 points
0 comments10 min readLW link

Deep Learn­ing is cheap Solomonoff in­duc­tion?

Dec 7, 2024, 11:00 AM
42 points
1 comment17 min readLW link

SAEBench: A Com­pre­hen­sive Bench­mark for Sparse Autoencoders

Dec 11, 2024, 6:30 AM
82 points
6 comments2 min readLW link
(www.neuronpedia.org)

Dmitry’s Koan

Dmitry VaintrobJan 10, 2025, 4:27 AM
43 points
8 comments22 min readLW link

Paper club: He et al. on mod­u­lar ar­ith­metic (part I)

Dmitry VaintrobJan 13, 2025, 11:18 AM
13 points
0 comments8 min readLW link

Du­pli­cate to­ken neu­rons in the first layer of GPT-2

Alex GibsonDec 27, 2024, 4:21 AM
2 points
0 comments5 min readLW link

Log­its, log-odds, and loss for par­allel circuits

Dmitry VaintrobJan 20, 2025, 9:56 AM
56 points
3 comments11 min readLW link

AXRP Epi­sode 38.5 - Adrià Gar­riga-Alonso on De­tect­ing AI Scheming

DanielFilanJan 20, 2025, 12:40 AM
9 points
0 comments16 min readLW link

Against blan­ket ar­gu­ments against interpretability

Dmitry VaintrobJan 22, 2025, 9:46 AM
50 points
4 comments7 min readLW link

QFT and neu­ral nets: the ba­sic idea

Dmitry VaintrobJan 24, 2025, 1:54 PM
19 points
0 comments8 min readLW link

On polytopes

Dmitry VaintrobJan 25, 2025, 1:56 PM
56 points
5 comments12 min readLW link

The gen­er­al­iza­tion phase diagram

Dmitry VaintrobJan 26, 2025, 8:30 PM
26 points
2 comments16 min readLW link

Im­ple­ment­ing ac­ti­va­tion steering

AnnahFeb 5, 2024, 5:51 PM
71 points
8 comments7 min readLW link

The mem­o­riza­tion-gen­er­al­iza­tion spec­trum and learn­ing coefficients

Dmitry VaintrobJan 28, 2025, 4:53 PM
17 points
0 comments10 min readLW link

In­fer­ence-Time In­ter­ven­tion: Elic­it­ing Truth­ful An­swers from a Lan­guage Model

likennethJun 11, 2023, 5:38 AM
195 points
4 comments1 min readLW link
(arxiv.org)

Con­di­tional Im­por­tance in Toy Models of Superposition

james__pFeb 2, 2025, 8:35 PM
7 points
2 comments10 min readLW link

Neu­ron Ac­ti­va­tions to CLIP Embed­dings: Geom­e­try of Lin­ear Com­bi­na­tions in La­tent Space

Roman MalovFeb 3, 2025, 10:30 AM
4 points
0 comments2 min readLW link

Thoughts on Toy Models of Superposition

james__pFeb 2, 2025, 1:52 PM
4 points
0 comments9 min readLW link

Cross-Layer Fea­ture Align­ment and Steer­ing in Large Lan­guage Model

dlaptevFeb 8, 2025, 8:18 PM
4 points
0 comments6 min readLW link

How Do In­duc­tion Heads Ac­tu­ally Work in Trans­form­ers With Finite Ca­pac­ity?

Fabien RogerMar 23, 2023, 9:09 AM
27 points
0 comments5 min readLW link

Wittgen­stein and ML — pa­ram­e­ters vs architecture

Cleo NardoMar 24, 2023, 4:54 AM
44 points
9 comments5 min readLW link

Othello-GPT: Fu­ture Work I Am Ex­cited About

Neel NandaMar 29, 2023, 10:13 PM
48 points
2 comments33 min readLW link
(neelnanda.io)

Othello-GPT: Reflec­tions on the Re­search Process

Neel NandaMar 29, 2023, 10:13 PM
36 points
0 comments15 min readLW link
(neelnanda.io)

Gi­ant (In)scrutable Ma­tri­ces: (Maybe) the Best of All Pos­si­ble Worlds

1a3ornApr 4, 2023, 5:39 PM
208 points
38 comments5 min readLW link1 review

Iden­ti­fy­ing se­man­tic neu­rons, mechanis­tic cir­cuits & in­ter­pretabil­ity web apps

Apr 13, 2023, 11:59 AM
18 points
0 comments8 min readLW link

Shap­ley Value At­tri­bu­tion in Chain of Thought

leogaoApr 14, 2023, 5:56 AM
106 points
7 comments4 min readLW link

Smar­tyHead­erCode: anoma­lous to­kens for GPT3.5 and GPT-4

AdamYedidiaApr 15, 2023, 10:35 PM
71 points
18 comments6 min readLW link

Lan­guage Models are a Po­ten­tially Safe Path to Hu­man-Level AGI

Nadav BrandesApr 20, 2023, 12:40 AM
28 points
7 comments8 min readLW link1 review

Be­havi­oural statis­tics for a maze-solv­ing agent

Apr 20, 2023, 10:26 PM
46 points
11 comments10 min readLW link

Should we pub­lish mechanis­tic in­ter­pretabil­ity re­search?

Apr 21, 2023, 4:19 PM
106 points
40 comments13 min readLW link

Neu­ral net­work poly­topes (Co­lab note­book)

Zach FurmanApr 21, 2023, 10:42 PM
11 points
0 comments1 min readLW link
(colab.research.google.com)

Ex­plain­ing the Trans­former Cir­cuits Frame­work by Example

Felix HofstätterApr 25, 2023, 1:45 PM
8 points
0 comments15 min readLW link

Mech In­terp Challenge: Novem­ber—De­ci­pher­ing the Cu­mu­la­tive Sum Model

CallumMcDougallNov 2, 2023, 5:10 PM
18 points
2 comments2 min readLW link

Dropout can cre­ate a priv­ileged ba­sis in the ReLU out­put model.

lewis smithApr 28, 2023, 1:59 AM
24 points
3 comments5 min readLW link

In­ter­pretabil­ity with Sparse Au­toen­coders (Co­lab ex­er­cises)

CallumMcDougallNov 29, 2023, 12:56 PM
74 points
9 comments4 min readLW link

How use­ful is mechanis­tic in­ter­pretabil­ity?

Dec 1, 2023, 2:54 AM
166 points
54 comments25 min readLW link

Deep For­get­ting & Un­learn­ing for Safely-Scoped LLMs

scasperDec 5, 2023, 4:48 PM
124 points
30 comments13 min readLW link

An An­a­lytic Per­spec­tive on AI Alignment

DanielFilanMar 1, 2020, 4:10 AM
54 points
45 comments8 min readLW link
(danielfilan.com)

Ver­ifi­ca­tion and Transparency

DanielFilanAug 8, 2019, 1:50 AM
35 points
6 comments2 min readLW link
(danielfilan.com)

Mechanis­tic Trans­parency for Ma­chine Learning

DanielFilanJul 11, 2018, 12:34 AM
54 points
9 comments4 min readLW link

How can In­ter­pretabil­ity help Align­ment?

May 23, 2020, 4:16 PM
37 points
3 comments9 min readLW link

One Way to Think About ML Transparency

Matthew BarnettSep 2, 2019, 11:27 PM
26 points
28 comments5 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhubSep 10, 2019, 11:03 PM
69 points
27 comments27 min readLW link

Spar­sity and in­ter­pretabil­ity?

Jun 1, 2020, 1:25 PM
41 points
3 comments7 min readLW link

AXRP Epi­sode 21 - In­ter­pretabil­ity for Eng­ineers with Stephen Casper

DanielFilanMay 2, 2023, 12:50 AM
12 points
1 comment66 min readLW link

[Linkpost]Trans­former-Based LM Sur­prisal Pre­dicts Hu­man Read­ing Times Best with About Two Billion Train­ing Tokens

Curtis HuebnerMay 4, 2023, 5:16 PM
10 points
1 comment1 min readLW link
(arxiv.org)

Ex­cit­ing New In­ter­pretabil­ity Paper!

research_prime_spaceMay 9, 2023, 4:39 PM
12 points
1 comment1 min readLW link

AGI-Au­to­mated In­ter­pretabil­ity is Suicide

__RicG__May 10, 2023, 2:20 PM
25 points
33 comments7 min readLW link

New OpenAI Paper—Lan­guage mod­els can ex­plain neu­rons in lan­guage models

MrThinkMay 10, 2023, 7:46 AM
47 points
14 comments1 min readLW link

Ac­ti­va­tion ad­di­tions in a small resi­d­ual network

Garrett BakerMay 22, 2023, 8:28 PM
22 points
4 comments3 min readLW link

[Linkpost] In­ter­pretabil­ity Dreams

DanielFilanMay 24, 2023, 9:08 PM
39 points
2 comments2 min readLW link
(transformer-circuits.pub)

Search ver­sus design

Alex FlintAug 16, 2020, 4:53 PM
109 points
40 comments36 min readLW link1 review

Why and When In­ter­pretabil­ity Work is Dangerous

Nicholas / Heather KrossMay 28, 2023, 12:27 AM
20 points
9 comments8 min readLW link
(www.thinkingmuchbetter.com)

Ex­plor­ing Con­cept-Spe­cific Slices in Weight Ma­tri­ces for Net­work Interpretability

DuncanFowlerJun 9, 2023, 4:39 PM
1 point
0 comments6 min readLW link

fMRI LIKE APPROACH TO AI ALIGNMENT /​ DECEPTIVE BEHAVIOUR

Escaque 66Jul 11, 2023, 5:17 PM
−1 points
3 comments2 min readLW link

Towards Devel­op­men­tal Interpretability

Jul 12, 2023, 7:33 PM
192 points
10 comments9 min readLW link1 review

Au­toIn­ter­pre­ta­tion Finds Sparse Cod­ing Beats Alternatives

HoagyJul 17, 2023, 1:41 AM
57 points
1 comment7 min readLW link

He­donic Loops and Tam­ing RL

berenJul 19, 2023, 3:12 PM
20 points
14 comments9 min readLW link

Tiny Mech In­terp Pro­jects: Emer­gent Po­si­tional Embed­dings of Words

Neel NandaJul 18, 2023, 9:24 PM
51 points
1 comment9 min readLW link

Desider­ata for an AI

Nathan Helm-BurgerJul 19, 2023, 4:18 PM
9 points
0 comments4 min readLW link

Open prob­lems in ac­ti­va­tion engineering

Jul 24, 2023, 7:46 PM
51 points
2 comments1 min readLW link
(coda.io)

Mech In­terp Puz­zle 1: Sus­pi­ciously Similar Embed­dings in GPT-Neo

Neel NandaJul 16, 2023, 10:02 PM
67 points
15 comments1 min readLW link

Does Cir­cuit Anal­y­sis In­ter­pretabil­ity Scale? Ev­i­dence from Mul­ti­ple Choice Ca­pa­bil­ities in Chinchilla

Jul 20, 2023, 10:50 AM
44 points
3 comments2 min readLW link
(arxiv.org)

Really Strong Fea­tures Found in Resi­d­ual Stream

Logan RiggsJul 8, 2023, 7:40 PM
69 points
6 comments2 min readLW link

Neuronpedia

Johnny LinJul 26, 2023, 4:29 PM
135 points
51 comments2 min readLW link
(neuronpedia.org)

AXRP Epi­sode 23 - Mechanis­tic Ano­maly De­tec­tion with Mark Xu

DanielFilanJul 27, 2023, 1:50 AM
22 points
0 comments72 min readLW link

Mech In­terp Puz­zle 2: Word2Vec Style Embeddings

Neel NandaJul 28, 2023, 12:50 AM
41 points
4 comments2 min readLW link

Apollo Re­search is hiring evals and in­ter­pretabil­ity en­g­ineers & scientists

Marius HobbhahnAug 4, 2023, 10:54 AM
25 points
0 comments2 min readLW link

Grow­ing Bon­sai Net­works with RNNs

ameoAug 7, 2023, 5:34 PM
21 points
5 comments1 min readLW link
(cprimozic.net)

Ap­ply for the 2023 Devel­op­men­tal In­ter­pretabil­ity Con­fer­ence!

Aug 25, 2023, 7:12 AM
33 points
0 comments2 min readLW link

Paper Walk­through: Au­to­mated Cir­cuit Dis­cov­ery with Arthur Conmy

Neel NandaAug 29, 2023, 10:07 PM
36 points
1 comment1 min readLW link
(www.youtube.com)

You’re Mea­sur­ing Model Com­plex­ity Wrong

Oct 11, 2023, 11:46 AM
92 points
17 comments13 min readLW link

Sparse Cod­ing, for Mechanis­tic In­ter­pretabil­ity and Ac­ti­va­tion Engineering

David UdellSep 23, 2023, 7:16 PM
42 points
7 comments34 min readLW link

High­lights: Went­worth, Shah, and Mur­phy on “Re­tar­get­ing the Search”

RobertMSep 14, 2023, 2:18 AM
85 points
4 comments8 min readLW link

Mech In­terp Challenge: Septem­ber—De­ci­pher­ing the Ad­di­tion Model

CallumMcDougallSep 13, 2023, 10:23 PM
35 points
0 comments4 min readLW link

Sparse Au­toen­coders Find Highly In­ter­pretable Direc­tions in Lan­guage Models

Sep 21, 2023, 3:30 PM
159 points
8 comments5 min readLW link

Three ways in­ter­pretabil­ity could be impactful

Arthur ConmySep 18, 2023, 1:02 AM
47 points
8 comments4 min readLW link

Neel Nanda on the Mechanis­tic In­ter­pretabil­ity Re­searcher Mindset

Michaël TrazziSep 21, 2023, 7:47 PM
37 points
1 comment3 min readLW link
(theinsideview.ai)

In­ter­pret­ing OpenAI’s Whisper

EllenaRSep 24, 2023, 5:53 PM
115 points
13 comments7 min readLW link

Mech In­terp Challenge: Oc­to­ber—De­ci­pher­ing the Sorted List Model

CallumMcDougallOct 3, 2023, 10:57 AM
23 points
0 comments3 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven ByrnesNov 19, 2020, 2:40 AM
137 points
41 comments11 min readLW link2 reviews

Paper: Un­der­stand­ing and Con­trol­ling a Maze-Solv­ing Policy Network

Oct 13, 2023, 1:38 AM
70 points
0 comments1 min readLW link
(arxiv.org)

[Paper] All’s Fair In Love And Love: Copy Sup­pres­sion in GPT-2 Small

Oct 13, 2023, 6:32 PM
82 points
4 comments8 min readLW link

Multi-di­men­sional re­wards for AGI in­ter­pretabil­ity and control

Steven ByrnesJan 4, 2021, 3:08 AM
19 points
8 comments10 min readLW link

MIRI com­ments on Co­tra’s “Case for Align­ing Nar­rowly Su­per­hu­man Models”

Rob BensingerMar 5, 2021, 11:43 PM
142 points
13 comments26 min readLW link

Trans­parency Trichotomy

Mark XuMar 28, 2021, 8:26 PM
25 points
2 comments7 min readLW link

Solv­ing the whole AGI con­trol prob­lem, ver­sion 0.0001

Steven ByrnesApr 8, 2021, 3:14 PM
63 points
7 comments26 min readLW link

Knowl­edge Neu­rons in Pre­trained Transformers

evhubMay 17, 2021, 10:54 PM
100 points
7 comments2 min readLW link
(arxiv.org)

Garrabrant and Shah on hu­man mod­el­ing in AGI

Rob BensingerAug 4, 2021, 4:35 AM
60 points
10 comments47 min readLW link

Neu­ral net /​ de­ci­sion tree hy­brids: a po­ten­tial path to­ward bridg­ing the in­ter­pretabil­ity gap

Nathan Helm-BurgerSep 23, 2021, 12:38 AM
21 points
2 comments12 min readLW link

Let’s buy out Cyc, for use in AGI in­ter­pretabil­ity sys­tems?

Steven ByrnesDec 7, 2021, 8:46 PM
49 points
10 comments2 min readLW link

Solv­ing In­ter­pretabil­ity Week

Logan RiggsDec 13, 2021, 5:09 PM
11 points
5 comments1 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel NandaDec 15, 2021, 11:44 PM
127 points
9 comments15 min readLW link

Ques­tion 3: Con­trol pro­pos­als for min­i­miz­ing bad outcomes

Cameron BergFeb 12, 2022, 7:13 PM
5 points
1 comment7 min readLW link

Progress Re­port 1: in­ter­pretabil­ity ex­per­i­ments & learn­ing, test­ing com­pres­sion hypotheses

Nathan Helm-BurgerMar 22, 2022, 8:12 PM
11 points
0 comments2 min readLW link

[In­tro to brain-like-AGI safety] 9. Take­aways from neuro 2/​2: On AGI motivation

Steven ByrnesMar 23, 2022, 12:48 PM
46 points
11 comments22 min readLW link

The case for be­com­ing a black-box in­ves­ti­ga­tor of lan­guage models

BuckMay 6, 2022, 2:35 PM
126 points
20 comments3 min readLW link

Deep Learn­ing Sys­tems Are Not Less In­ter­pretable Than Logic/​Prob­a­bil­ity/​Etc

johnswentworthJun 4, 2022, 5:41 AM
159 points
55 comments2 min readLW link1 review

How Do Selec­tion The­o­rems Re­late To In­ter­pretabil­ity?

johnswentworthJun 9, 2022, 7:39 PM
60 points
14 comments3 min readLW link

Progress Re­port 6: get the tool working

Nathan Helm-BurgerJun 10, 2022, 11:18 AM
4 points
0 comments2 min readLW link

[Question] Can you MRI a deep learn­ing model?

Yair HalberstadtJun 13, 2022, 1:43 PM
3 points
3 comments1 min readLW link

Vi­su­al­iz­ing Neu­ral net­works, how to blame the bias

Donald HobsonJul 9, 2022, 3:52 PM
7 points
1 comment6 min readLW link

[Question] How op­ti­mistic should we be about AI figur­ing out how to in­ter­pret it­self?

oh54321Jul 25, 2022, 10:09 PM
3 points
1 comment1 min readLW link

Pre­cur­sor check­ing for de­cep­tive alignment

evhubAug 3, 2022, 10:56 PM
24 points
0 comments14 min readLW link

In­ter­pretabil­ity/​Tool-ness/​Align­ment/​Cor­rigi­bil­ity are not Composable

johnswentworthAug 8, 2022, 6:05 PM
143 points
13 comments3 min readLW link

AI Trans­parency: Why it’s crit­i­cal and how to ob­tain it.

Zohar JacksonAug 14, 2022, 10:31 AM
6 points
1 comment5 min readLW link

What Makes an Idea Un­der­stand­able? On Ar­chi­tec­turally and Cul­turally Nat­u­ral Ideas.

Aug 16, 2022, 2:09 AM
21 points
2 comments16 min readLW link

What Makes A Good Mea­sure­ment De­vice?

johnswentworthAug 24, 2022, 10:45 PM
37 points
7 comments2 min readLW link

Tak­ing the pa­ram­e­ters which seem to mat­ter and ro­tat­ing them un­til they don’t

Garrett BakerAug 26, 2022, 6:26 PM
120 points
48 comments1 min readLW link

A rough idea for solv­ing ELK: An ap­proach for train­ing gen­er­al­ist agents like GATO to make plans and de­scribe them to hu­mans clearly and hon­estly.

Michael SoareverixSep 8, 2022, 3:20 PM
2 points
2 comments2 min readLW link

Swap and Scale

Stephen FowlerSep 9, 2022, 10:41 PM
17 points
3 comments1 min readLW link

[Linkpost] A sur­vey on over 300 works about in­ter­pretabil­ity in deep networks

scasperSep 12, 2022, 7:07 PM
97 points
7 comments2 min readLW link
(arxiv.org)

Sparse tri­nary weighted RNNs as a path to bet­ter lan­guage model interpretability

Am8ryllisSep 17, 2022, 7:48 PM
19 points
13 comments3 min readLW link

Toy Models of Superposition

evhubSep 21, 2022, 11:48 PM
69 points
4 comments5 min readLW link1 review
(transformer-circuits.pub)

QAPR 3: in­ter­pretabil­ity-guided train­ing of neu­ral nets

Quintin PopeSep 28, 2022, 4:02 PM
58 points
2 comments10 min readLW link

More Re­cent Progress in the The­ory of Neu­ral Networks

jylin04Oct 6, 2022, 4:57 PM
82 points
6 comments4 min readLW link

Poly­se­man­tic­ity and Ca­pac­ity in Neu­ral Networks

Oct 7, 2022, 5:51 PM
87 points
14 comments3 min readLW link

Ar­ti­cle Re­view: Google’s AlphaTensor

Robert_AIZIOct 12, 2022, 6:04 PM
8 points
4 comments10 min readLW link

[Question] Pre­vi­ous Work on Re­cre­at­ing Neu­ral Net­work In­put from In­ter­me­di­ate Layer Activations

bglassOct 12, 2022, 7:28 PM
1 point
3 comments1 min readLW link

(OLD) An Ex­tremely Opinionated An­no­tated List of My Favourite Mechanis­tic In­ter­pretabil­ity Papers

Neel NandaOct 18, 2022, 9:08 PM
72 points
5 comments12 min readLW link
(www.neelnanda.io)

A Bare­bones Guide to Mechanis­tic In­ter­pretabil­ity Prerequisites

Neel NandaOct 24, 2022, 8:45 PM
64 points
12 comments3 min readLW link
(neelnanda.io)

A Walk­through of A Math­e­mat­i­cal Frame­work for Trans­former Circuits

Neel NandaOct 25, 2022, 8:24 PM
52 points
7 comments1 min readLW link
(www.youtube.com)

[Book] In­ter­pretable Ma­chine Learn­ing: A Guide for Mak­ing Black Box Models Explainable

Esben KranOct 31, 2022, 11:38 AM
20 points
1 comment1 min readLW link
(christophm.github.io)

“Cars and Elephants”: a hand­wavy ar­gu­ment/​anal­ogy against mechanis­tic interpretability

David Scott Krueger (formerly: capybaralet)Oct 31, 2022, 9:26 PM
48 points
25 comments2 min readLW link

Real-Time Re­search Record­ing: Can a Trans­former Re-Derive Po­si­tional Info?

Neel NandaNov 1, 2022, 11:56 PM
69 points
16 comments1 min readLW link
(youtu.be)

A Mys­tery About High Di­men­sional Con­cept Encoding

Fabien RogerNov 3, 2022, 5:05 PM
46 points
13 comments7 min readLW link

A Walk­through of In­ter­pretabil­ity in the Wild (w/​ au­thors Kevin Wang, Arthur Conmy & Alexan­dre Variengien)

Neel NandaNov 7, 2022, 10:39 PM
30 points
15 comments3 min readLW link
(youtu.be)

A Walk­through of In-Con­text Learn­ing and In­duc­tion Heads (w/​ Charles Frye) Part 1 of 2

Neel NandaNov 22, 2022, 5:12 PM
20 points
0 comments1 min readLW link
(www.youtube.com)

Sub­sets and quo­tients in interpretability

Erik JennerDec 2, 2022, 11:13 PM
26 points
1 comment7 min readLW link

Find­ing gliders in the game of life

paulfchristianoDec 1, 2022, 8:40 PM
101 points
8 comments16 min readLW link
(ai-alignment.com)

[ASoT] Nat­u­ral ab­strac­tions and AlphaZero

Ulisse MiniDec 10, 2022, 5:53 PM
33 points
1 comment1 min readLW link
(arxiv.org)

Paper: Trans­form­ers learn in-con­text by gra­di­ent descent

LawrenceCDec 16, 2022, 11:10 AM
28 points
11 comments2 min readLW link
(arxiv.org)

Can we effi­ciently ex­plain model be­hav­iors?

paulfchristianoDec 16, 2022, 7:40 PM
64 points
3 comments9 min readLW link
(ai-alignment.com)

Durkon, an open-source tool for In­her­ently In­ter­pretable Modelling

abstractapplicDec 24, 2022, 1:49 AM
37 points
0 comments4 min readLW link

Con­crete Steps to Get Started in Trans­former Mechanis­tic Interpretability

Neel NandaDec 25, 2022, 10:21 PM
57 points
7 comments12 min readLW link
(www.neelnanda.io)

Analo­gies be­tween Soft­ware Re­v­erse Eng­ineer­ing and Mechanis­tic Interpretability

Dec 26, 2022, 12:26 PM
34 points
6 comments11 min readLW link
(www.neelnanda.io)

200 COP in MI: The Case for Analysing Toy Lan­guage Models

Neel NandaDec 28, 2022, 9:07 PM
40 points
3 comments7 min readLW link

200 COP in MI: Look­ing for Cir­cuits in the Wild

Neel NandaDec 29, 2022, 8:59 PM
16 points
5 comments13 min readLW link

200 COP in MI: Ex­plor­ing Poly­se­man­tic­ity and Superposition

Neel NandaJan 3, 2023, 1:52 AM
34 points
6 comments16 min readLW link

Com­ments on OpenPhil’s In­ter­pretabil­ity RFP

paulfchristianoNov 5, 2021, 10:36 PM
91 points
5 comments7 min readLW link

200 COP in MI: Analysing Train­ing Dynamics

Neel NandaJan 4, 2023, 4:08 PM
16 points
0 comments14 min readLW link

Paper: Su­per­po­si­tion, Me­moriza­tion, and Dou­ble Des­cent (An­thropic)

LawrenceCJan 5, 2023, 5:54 PM
53 points
11 comments1 min readLW link
(transformer-circuits.pub)

200 COP in MI: Tech­niques, Tool­ing and Automation

Neel NandaJan 6, 2023, 3:08 PM
13 points
0 comments15 min readLW link

200 COP in MI: Image Model Interpretability

Neel NandaJan 8, 2023, 2:53 PM
18 points
3 comments6 min readLW link

200 COP in MI: In­ter­pret­ing Re­in­force­ment Learning

Neel NandaJan 10, 2023, 5:37 PM
25 points
1 comment10 min readLW link

World-Model In­ter­pretabil­ity Is All We Need

Thane RuthenisJan 14, 2023, 7:37 PM
35 points
22 comments21 min readLW link

How does GPT-3 spend its 175B pa­ram­e­ters?

Robert_AIZIJan 13, 2023, 7:21 PM
41 points
14 comments6 min readLW link
(aizi.substack.com)

200 COP in MI: Study­ing Learned Fea­tures in Lan­guage Models

Neel NandaJan 19, 2023, 3:48 AM
24 points
2 comments30 min readLW link

[Question] Trans­former Mech In­terp: Any vi­su­al­iza­tions?

Joyee ChenJan 18, 2023, 4:32 AM
3 points
0 comments1 min readLW link

Mechanis­tic In­ter­pretabil­ity Quick­start Guide

Neel NandaJan 31, 2023, 4:35 PM
42 points
3 comments6 min readLW link
(www.neelnanda.io)

More find­ings on Me­moriza­tion and dou­ble descent

Marius HobbhahnFeb 1, 2023, 6:26 PM
53 points
2 comments19 min readLW link

More find­ings on max­i­mal data dimension

Marius HobbhahnFeb 2, 2023, 6:33 PM
27 points
1 comment11 min readLW link

AXRP Epi­sode 19 - Mechanis­tic In­ter­pretabil­ity with Neel Nanda

DanielFilanFeb 4, 2023, 3:00 AM
45 points
0 comments117 min readLW link

Mech In­terp Pro­ject Ad­vis­ing Call: Me­mori­sa­tion in GPT-2 Small

Neel NandaFeb 4, 2023, 2:17 PM
7 points
0 comments1 min readLW link

[ASoT] Policy Tra­jec­tory Visualization

Ulisse MiniFeb 7, 2023, 12:13 AM
9 points
2 comments1 min readLW link

Re­view of AI Align­ment Progress

PeterMcCluskeyFeb 7, 2023, 6:57 PM
72 points
32 comments7 min readLW link
(bayesianinvestor.com)

On Devel­op­ing a Math­e­mat­i­cal The­ory of In­ter­pretabil­ity

carboniferous_umbraculum Feb 9, 2023, 1:45 AM
64 points
8 comments6 min readLW link

The con­cep­tual Dop­pelgänger problem

TsviBTFeb 12, 2023, 5:23 PM
12 points
5 comments4 min readLW link

EIS V: Blind Spots In AI Safety In­ter­pretabil­ity Research

scasperFeb 16, 2023, 7:09 PM
54 points
24 comments10 min readLW link

In­ter­ven­ing in the Resi­d­ual Stream

MadHatterFeb 22, 2023, 6:29 AM
30 points
1 comment9 min readLW link

Video/​an­i­ma­tion: Neel Nanda ex­plains what mechanis­tic in­ter­pretabil­ity is

DanielFilanFeb 22, 2023, 10:42 PM
24 points
7 comments1 min readLW link
(youtu.be)

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

Mar 11, 2023, 6:59 PM
332 points
28 comments23 min readLW link

Ad­den­dum: ba­sic facts about lan­guage mod­els dur­ing training

berenMar 6, 2023, 7:24 PM
22 points
2 comments5 min readLW link

Maze-solv­ing agents: Add a top-right vec­tor, make the agent go to the top-right

Mar 31, 2023, 7:20 PM
101 points
17 comments11 min readLW link

The Translu­cent Thoughts Hy­pothe­ses and Their Implications

Fabien RogerMar 9, 2023, 4:30 PM
142 points
7 comments19 min readLW link

Paper Repli­ca­tion Walk­through: Re­v­erse-Eng­ineer­ing Mo­du­lar Addition

Neel NandaMar 12, 2023, 1:25 PM
18 points
0 comments1 min readLW link
(neelnanda.io)

At­tri­bu­tion Patch­ing: Ac­ti­va­tion Patch­ing At In­dus­trial Scale

Neel NandaMar 16, 2023, 9:44 PM
45 points
10 comments58 min readLW link
(www.neelnanda.io)

In­tro­duc­ing Leap Labs, an AI in­ter­pretabil­ity startup

Jessica RumbelowMar 6, 2023, 4:16 PM
103 points
12 comments1 min readLW link

A cir­cuit for Python doc­strings in a 4-layer at­ten­tion-only transformer

Feb 20, 2023, 7:35 PM
96 points
8 comments21 min readLW link

Interpretability

Oct 29, 2021, 7:28 AM
60 points
13 comments12 min readLW link

Em­piri­cal In­sights into Fea­ture Geom­e­try in Sparse Autoencoders

Jason Boxi ZhangJan 24, 2025, 7:02 PM
5 points
0 comments11 min readLW link

A Bite Sized In­tro­duc­tion to ELK

Luk27182Sep 17, 2022, 12:28 AM
5 points
0 comments6 min readLW link

Ano­ma­lous To­kens in Deep­Seek-V3 and r1

henryJan 25, 2025, 10:55 PM
135 points
2 comments7 min readLW link

Reflec­tions on Trust­ing Trust & AI

Itay YonaJan 16, 2023, 6:36 AM
10 points
1 comment3 min readLW link
(mentaleap.ai)

The Shard The­ory Align­ment Scheme

David UdellAug 25, 2022, 4:52 AM
47 points
32 comments2 min readLW link

EIS IX: In­ter­pretabil­ity and Adversaries

scasperFeb 20, 2023, 6:25 PM
30 points
8 comments8 min readLW link

Lay­ing the Foun­da­tions for Vi­sion and Mul­ti­modal Mechanis­tic In­ter­pretabil­ity & Open Problems

Mar 13, 2024, 5:09 PM
44 points
13 comments14 min readLW link

Use­ful start­ing code for interpretability

eggsyntaxFeb 13, 2024, 11:13 PM
26 points
2 comments1 min readLW link

A Chess-GPT Lin­ear Emer­gent World Representation

Adam KarvonenFeb 8, 2024, 4:25 AM
105 points
14 comments7 min readLW link
(adamkarvonen.github.io)

Un­der­stand­ing Hid­den Com­pu­ta­tions in Chain-of-Thought Reasoning

rokosbasiliskAug 24, 2024, 4:35 PM
6 points
1 comment1 min readLW link

Fluent dream­ing for lan­guage mod­els (AI in­ter­pretabil­ity method)

Feb 6, 2024, 6:02 AM
45 points
5 comments1 min readLW link
(arxiv.org)

Paper: Open Prob­lems in Mechanis­tic Interpretability

Jan 29, 2025, 10:25 AM
68 points
0 comments1 min readLW link
(arxiv.org)

At­ten­tion SAEs Scale to GPT-2 Small

Feb 3, 2024, 6:50 AM
78 points
4 comments8 min readLW link

Show­ing SAE La­tents Are Not Atomic Us­ing Meta-SAEs

Aug 24, 2024, 12:56 AM
68 points
10 comments20 min readLW link

Un­der­stand­ing SAE Fea­tures with the Logit Lens

Mar 11, 2024, 12:16 AM
68 points
0 comments14 min readLW link

Ex­plor­ing OpenAI’s La­tent Direc­tions: Tests, Ob­ser­va­tions, and Pok­ing Around

Johnny LinJan 31, 2024, 6:01 AM
26 points
4 comments14 min readLW link

De­cep­tion and Jailbreak Se­quence: 1. Iter­a­tive Refine­ment Stages of De­cep­tion in LLMs

Aug 22, 2024, 7:32 AM
23 points
1 comment21 min readLW link

Ques­tions I’d Want to Ask an AGI+ to Test Its Un­der­stand­ing of Ethics

sweenesmJan 26, 2024, 11:40 PM
14 points
6 comments4 min readLW link

In­for­mal se­man­tics and Orders

Q HomeAug 27, 2022, 4:17 AM
14 points
10 comments26 min readLW link

Find­ing Back­ward Chain­ing Cir­cuits in Trans­form­ers Trained on Tree Search

May 28, 2024, 5:29 AM
50 points
1 comment9 min readLW link
(arxiv.org)

Trac­ing Ty­pos in LLMs: My At­tempt at Un­der­stand­ing How Models Cor­rect Misspellings

Ivan DostalFeb 2, 2025, 7:56 PM
3 points
1 comment5 min readLW link

In­ter­pret­ing au­tonomous driv­ing agents with at­ten­tion based architecture

Manav DahraFeb 1, 2025, 11:20 PM
1 point
0 comments11 min readLW link

Ex­plor­ing how Othel­loGPT com­putes its world model

JMaarFeb 2, 2025, 9:29 PM
7 points
0 comments8 min readLW link

Search­ing for Mo­du­lar­ity in Large Lan­guage Models

Sep 8, 2022, 2:25 AM
44 points
3 comments14 min readLW link

Vi­su­al­iz­ing Interpretability

Darold DavisFeb 3, 2025, 7:36 PM
2 points
0 comments4 min readLW link

Craft­ing Poly­se­man­tic Trans­former Bench­marks with Known Circuits

Aug 23, 2024, 10:03 PM
10 points
0 comments25 min readLW link

De­tect­ing Strate­gic De­cep­tion Us­ing Lin­ear Probes

Feb 6, 2025, 3:46 PM
100 points
9 comments2 min readLW link
(arxiv.org)

In­ter­pretabil­ity as Com­pres­sion: Re­con­sid­er­ing SAE Ex­pla­na­tions of Neu­ral Ac­ti­va­tions with MDL-SAEs

Aug 23, 2024, 6:52 PM
41 points
5 comments16 min readLW link

Ex­plor­ing the Evolu­tion and Mi­gra­tion of Differ­ent Layer Embed­ding in LLMs

Ruixuan HuangMar 8, 2024, 3:01 PM
6 points
0 comments8 min readLW link

Large lan­guage mod­els learn to rep­re­sent the world

gjmJan 22, 2023, 1:10 PM
101 points
20 comments3 min readLW link1 review

From No Mind to a Mind – A Con­ver­sa­tion That Changed an AI

parthibanarjuna sFeb 7, 2025, 11:50 AM
1 point
0 comments3 min readLW link

Gra­di­ent Anatomy’s—Hal­lu­ci­na­tion Ro­bust­ness in Med­i­cal Q&A

DieSabFeb 12, 2025, 7:16 PM
1 point
0 comments10 min readLW link

Em­piri­cal risk min­i­miza­tion is fun­da­men­tally confused

Jesse HooglandMar 22, 2023, 4:58 PM
32 points
8 comments1 min readLW link

Try­ing to find the un­der­ly­ing struc­ture of com­pu­ta­tional systems

Matthias G. MayerSep 13, 2022, 9:16 PM
17 points
9 comments4 min readLW link

Sen­tience in Machines—How Do We Test for This Ob­jec­tively?

Mayowa OsiboduMar 26, 2023, 6:56 PM
−2 points
0 comments2 min readLW link
(www.researchgate.net)

Ap­prox­i­ma­tion is ex­pen­sive, but the lunch is cheap

Apr 19, 2023, 2:19 PM
70 points
3 comments16 min readLW link

Some com­mon con­fu­sion about in­duc­tion heads

Alexandre VariengienMar 28, 2023, 9:51 PM
64 points
4 comments5 min readLW link

Spread­sheet for 200 Con­crete Prob­lems In Interpretability

Jay BaileyMar 29, 2023, 6:51 AM
13 points
0 comments1 min readLW link

Back­doors have uni­ver­sal rep­re­sen­ta­tions across large lan­guage models

Dec 6, 2024, 10:56 PM
14 points
0 comments16 min readLW link

Co­or­di­nate-Free In­ter­pretabil­ity Theory

johnswentworthSep 14, 2022, 11:33 PM
52 points
16 comments5 min readLW link

De­con­fus­ing “Ca­pa­bil­ities vs. Align­ment”

RobertMJan 23, 2023, 4:46 AM
27 points
7 comments2 min readLW link

The Quan­ti­za­tion Model of Neu­ral Scaling

nzMar 31, 2023, 4:02 PM
17 points
0 comments1 min readLW link
(arxiv.org)

AISC 2023, Progress Re­port for March: Team In­ter­pretable Architectures

Apr 2, 2023, 4:19 PM
14 points
0 comments14 min readLW link

Ex­plo­ra­tory Anal­y­sis of RLHF Trans­form­ers with TransformerLens

Curt TiggesApr 3, 2023, 4:09 PM
21 points
2 comments11 min readLW link
(blog.eleuther.ai)

If in­ter­pretabil­ity re­search goes well, it may get dangerous

So8resApr 3, 2023, 9:48 PM
202 points
11 comments2 min readLW link

Math­e­mat­i­cal Cir­cuits in Neu­ral Networks

Sean OsierSep 22, 2022, 3:48 AM
34 points
4 comments1 min readLW link
(www.youtube.com)

Univer­sal­ity and Hid­den In­for­ma­tion in Con­cept Bot­tle­neck Models

HoagyApr 5, 2023, 2:00 PM
23 points
0 comments11 min readLW link

No con­vinc­ing ev­i­dence for gra­di­ent de­scent in ac­ti­va­tion space

BlaineApr 12, 2023, 4:48 AM
82 points
9 comments20 min readLW link

Bing AI Gen­er­at­ing Voyn­ich Manuscript Con­tinu­a­tions—It does not know how it knows

Matthew_OpitzApr 10, 2023, 8:22 PM
15 points
6 comments13 min readLW link

Towards a solu­tion to the al­ign­ment prob­lem via ob­jec­tive de­tec­tion and eval­u­a­tion

Paul CologneseApr 12, 2023, 3:39 PM
9 points
7 comments12 min readLW link

Mechanis­ti­cally in­ter­pret­ing time in GPT-2 small

Apr 16, 2023, 5:57 PM
68 points
6 comments21 min readLW link

EIS X: Con­tinual Learn­ing, Mo­du­lar­ity, Com­pres­sion, and Biolog­i­cal Brains

scasperFeb 21, 2023, 4:59 PM
14 points
4 comments3 min readLW link

Re­search Re­port: In­cor­rect­ness Cascades

Robert_AIZIApr 14, 2023, 12:49 PM
19 points
0 comments10 min readLW link
(aizi.substack.com)

What’s in the box?! – Towards in­ter­pretabil­ity by dis­t­in­guish­ing niches of value within neu­ral net­works.

Joshua ClancyFeb 29, 2024, 6:33 PM
3 points
4 comments128 min readLW link

Re­call and Re­gur­gi­ta­tion in GPT2

Megan KinnimentOct 3, 2022, 7:35 PM
43 points
1 comment26 min readLW link

Hard-Cod­ing Neu­ral Computation

MadHatterDec 13, 2021, 4:35 AM
34 points
8 comments27 min readLW link

How-to Trans­former Mechanis­tic In­ter­pretabil­ity—in 50 lines of code or less!

StefanHexJan 24, 2023, 6:45 PM
47 points
5 comments13 min readLW link

Vi­su­al­iz­ing Learned Rep­re­sen­ta­tions of Rice Disease

muhia_beeOct 3, 2022, 9:09 AM
7 points
0 comments4 min readLW link
(indecisive-sand-24a.notion.site)

An in­tro­duc­tion to lan­guage model interpretability

Alexandre VariengienApr 20, 2023, 10:22 PM
14 points
0 comments9 min readLW link

[RFC] Pos­si­ble ways to ex­pand on “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion”.

Jan 25, 2023, 7:03 PM
48 points
6 comments12 min readLW link

Spooky ac­tion at a dis­tance in the loss landscape

Jan 28, 2023, 12:22 AM
61 points
4 comments7 min readLW link
(www.jessehoogland.com)

A Univer­sal Emer­gent De­com­po­si­tion of Retrieval Tasks in Lan­guage Models

Dec 19, 2023, 11:52 AM
84 points
3 comments10 min readLW link
(arxiv.org)

I was Wrong, Si­mu­la­tor The­ory is Real

Robert_AIZIApr 26, 2023, 5:45 PM
75 points
7 comments3 min readLW link
(aizi.substack.com)

How poly­se­man­tic can one neu­ron be? In­ves­ti­gat­ing fea­tures in TinyS­to­ries.

Evan AndersJan 16, 2024, 7:10 PM
14 points
0 comments8 min readLW link
(evanhanders.blog)

z is not the cause of x

hrbigelowOct 23, 2023, 5:43 PM
6 points
2 comments9 min readLW link

Grokking Beyond Neu­ral Networks

Jack MillerOct 30, 2023, 5:28 PM
10 points
0 comments2 min readLW link
(arxiv.org)

Ro­bust­ness of Con­trast-Con­sis­tent Search to Ad­ver­sar­ial Prompting

Nov 1, 2023, 12:46 PM
18 points
1 comment7 min readLW link

Es­ti­mat­ing effec­tive di­men­sion­al­ity of MNIST models

Arjun PanicksseryNov 2, 2023, 2:13 PM
41 points
3 comments1 min readLW link

Nat­u­ral Cat­e­gories Update

Logan ZoellnerOct 10, 2022, 3:19 PM
33 points
6 comments2 min readLW link

Growth and Form in a Toy Model of Superposition

Nov 8, 2023, 11:08 AM
89 points
7 comments14 min readLW link

What’s go­ing on? LLMs and IS-A sen­tences

Bill BenzonNov 8, 2023, 4:58 PM
6 points
15 comments4 min readLW link

Poly­se­man­tic At­ten­tion Head in a 4-Layer Transformer

Nov 9, 2023, 4:16 PM
51 points
0 comments6 min readLW link

PhD Po­si­tion: AI In­ter­pretabil­ity in Ber­lin, Germany

TiberiusApr 28, 2023, 1:44 PM
3 points
0 comments1 min readLW link
(stephanw.net)

Re­search Adenda: Model­ling Tra­jec­to­ries of Lan­guage Models

NickyPNov 13, 2023, 2:33 PM
27 points
0 comments12 min readLW link

Elic­it­ing La­tent Knowl­edge in Com­pre­hen­sive AI Ser­vices Models

acabodiNov 17, 2023, 2:36 AM
6 points
0 comments5 min readLW link

In­ci­den­tal polysemanticity

Nov 15, 2023, 4:00 AM
43 points
7 comments11 min readLW link

Against LLM Reductionism

Erich_GrunewaldMar 8, 2023, 3:52 PM
140 points
17 comments18 min readLW link
(www.erichgrunewald.com)

What’s go­ing on with Per-Com­po­nent Weight Up­dates?

4gateAug 22, 2024, 9:22 PM
1 point
0 comments6 min readLW link

No Really, At­ten­tion is ALL You Need—At­ten­tion can do feed­for­ward networks

Robert_AIZIJan 31, 2023, 6:48 PM
29 points
7 comments6 min readLW link
(aizi.substack.com)

Help out Red­wood Re­search’s in­ter­pretabil­ity team by find­ing heuris­tics im­ple­mented by GPT-2 small

Oct 12, 2022, 9:25 PM
50 points
11 comments4 min readLW link

Am­bigu­ous out-of-dis­tri­bu­tion gen­er­al­iza­tion on an al­gorith­mic task

Feb 13, 2025, 6:24 PM
82 points
6 comments11 min readLW link

AISC pro­ject: TinyEvals

Jett JaniakNov 22, 2023, 8:47 PM
22 points
0 comments4 min readLW link

A day in the life of a mechanis­tic in­ter­pretabil­ity researcher

Bill BenzonNov 28, 2023, 2:45 PM
3 points
3 comments1 min readLW link

EIS XI: Mov­ing Forward

scasperFeb 22, 2023, 7:05 PM
19 points
2 comments9 min readLW link

Towards an Ethics Calcu­la­tor for Use by an AGI

sweenesmDec 12, 2023, 6:37 PM
3 points
2 comments11 min readLW link

Search­ing for a model’s con­cepts by their shape – a the­o­ret­i­cal framework

Feb 23, 2023, 8:14 PM
51 points
0 comments19 min readLW link

Mechanis­tic in­ter­pretabil­ity through clustering

Alistair FraserDec 4, 2023, 6:49 PM
1 point
0 comments1 min readLW link

Colour ver­sus Shape Goal Mis­gen­er­al­iza­tion in Re­in­force­ment Learn­ing: A Case Study

Karolis JucysDec 8, 2023, 1:18 PM
13 points
1 comment4 min readLW link
(arxiv.org)

ChatGPT: Tan­tal­iz­ing af­terthoughts in search of story tra­jec­to­ries [in­duc­tion heads]

Bill BenzonFeb 3, 2023, 10:35 AM
4 points
0 comments20 min readLW link

Lan­guage Model Me­moriza­tion, Copy­right Law, and Con­di­tional Pre­train­ing Alignment

RogerDearnaleyDec 7, 2023, 6:14 AM
9 points
0 comments11 min readLW link

Causal scrub­bing: Appendix

Dec 3, 2022, 12:58 AM
18 points
4 comments20 min readLW link

Causal Scrub­bing: a method for rigor­ously test­ing in­ter­pretabil­ity hy­pothe­ses [Red­wood Re­search]

Dec 3, 2022, 12:58 AM
205 points
35 comments20 min readLW link1 review

Some Les­sons Learned from Study­ing Indi­rect Ob­ject Iden­ti­fi­ca­tion in GPT-2 small

Oct 28, 2022, 11:55 PM
101 points
9 comments9 min readLW link2 reviews
(arxiv.org)

Au­dit­ing games for high-level interpretability

Paul CologneseNov 1, 2022, 10:44 AM
33 points
1 comment7 min readLW link

Hid­den Cog­ni­tion De­tec­tion Meth­ods and Bench­marks

Paul CologneseFeb 26, 2024, 5:31 AM
22 points
11 comments4 min readLW link

Sparse Au­toen­coders Work on At­ten­tion Layer Outputs

Jan 16, 2024, 12:26 AM
83 points
9 comments18 min readLW link

Find­ing De­cep­tion in Lan­guage Models

Aug 20, 2024, 9:42 AM
18 points
4 comments4 min readLW link

Some mis­cel­la­neous thoughts on ChatGPT, sto­ries, and me­chan­i­cal interpretability

Bill BenzonFeb 4, 2023, 7:35 PM
2 points
0 comments3 min readLW link

Gra­di­ent hacking

evhubOct 16, 2019, 12:53 AM
107 points
39 comments3 min readLW link2 reviews

Will trans­parency help catch de­cep­tion? Per­haps not

Matthew BarnettNov 4, 2019, 8:52 PM
43 points
5 comments7 min readLW link

Task vec­tors & anal­ogy mak­ing in LLMs

SergiiJan 8, 2024, 3:17 PM
9 points
1 comment4 min readLW link
(grgv.xyz)

EIS XII: Sum­mary

scasperFeb 23, 2023, 5:45 PM
18 points
0 comments6 min readLW link

Ro­hin Shah on rea­sons for AI optimism

abergalOct 31, 2019, 12:10 PM
40 points
58 comments1 min readLW link
(aiimpacts.org)

Gra­di­ent sur­fing: the hid­den role of regularization

Jesse HooglandFeb 6, 2023, 3:50 AM
37 points
9 comments14 min readLW link
(www.jessehoogland.com)

Un­der­stand­ing mesa-op­ti­miza­tion us­ing toy models

May 7, 2023, 5:00 PM
43 points
2 comments10 min readLW link

A Search for More ChatGPT /​ GPT-3.5 /​ GPT-4 “Un­speak­able” Glitch Tokens

Martin FellMay 9, 2023, 2:36 PM
26 points
9 comments6 min readLW link

A tech­ni­cal note on bil­in­ear lay­ers for interpretability

Lee SharkeyMay 8, 2023, 6:06 AM
59 points
0 comments1 min readLW link
(arxiv.org)

A com­par­i­son of causal scrub­bing, causal ab­strac­tions, and re­lated methods

Jun 8, 2023, 11:40 PM
73 points
3 comments22 min readLW link

Lan­guage mod­els can ex­plain neu­rons in lan­guage models

nzMay 9, 2023, 5:29 PM
23 points
0 comments1 min readLW link
(openai.com)

Mechanis­tic In­ter­pretabil­ity as Re­v­erse Eng­ineer­ing (fol­low-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet)Nov 3, 2022, 11:19 PM
28 points
3 comments1 min readLW link

Solv­ing the Mechanis­tic In­ter­pretabil­ity challenges: EIS VII Challenge 1

May 9, 2023, 7:41 PM
119 points
1 comment10 min readLW link

[Question] Have you heard about MIT’s “liquid neu­ral net­works”? What do you think about them?

PpauMay 9, 2023, 8:16 PM
35 points
14 comments1 min readLW link

‘Fun­da­men­tal’ vs ‘ap­plied’ mechanis­tic in­ter­pretabil­ity research

Lee SharkeyMay 23, 2023, 6:26 PM
65 points
6 comments3 min readLW link

Toy Models and Tegum Products

Adam JermynNov 4, 2022, 6:51 PM
28 points
7 comments5 min readLW link

De­ci­sion Trans­former Interpretability

Feb 6, 2023, 7:29 AM
84 points
13 comments24 min readLW link

[Question] AI in­ter­pretabil­ity could be harm­ful?

Roman LeventovMay 10, 2023, 8:43 PM
13 points
2 comments1 min readLW link

In­put Swap Graphs: Dis­cov­er­ing the role of neu­ral net­work com­po­nents at scale

Alexandre VariengienMay 12, 2023, 9:41 AM
92 points
0 comments33 min readLW link

Con­trast Pairs Drive the Em­piri­cal Perfor­mance of Con­trast Con­sis­tent Search (CCS)

Scott EmmonsMay 31, 2023, 5:09 PM
97 points
1 comment6 min readLW link1 review

Are SAE fea­tures from the Base Model still mean­ingful to LLaVA?

Shan23ChenFeb 18, 2025, 10:16 PM
8 points
2 comments10 min readLW link
(www.lesswrong.com)

Ad­den­dum: More Effi­cient FFNs via Attention

Robert_AIZIFeb 6, 2023, 6:55 PM
10 points
2 comments5 min readLW link
(aizi.substack.com)

My cur­rent work­flow to study the in­ter­nal mechanisms of LLM

Yulu PiMay 16, 2023, 3:27 PM
4 points
0 comments1 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of a GridWorld Agent-Si­mu­la­tor (Part 1 of N)

Joseph BloomMay 16, 2023, 10:59 PM
36 points
2 comments16 min readLW link

Gen­der Vec­tors in ROME’s La­tent Space

XodarapMay 21, 2023, 6:46 PM
14 points
2 comments3 min readLW link

Why I’m Work­ing On Model Ag­nos­tic Interpretability

Jessica RumbelowNov 11, 2022, 9:24 AM
27 points
9 comments2 min readLW link

The limited up­side of interpretability

Peter S. ParkNov 15, 2022, 6:46 PM
13 points
11 comments1 min readLW link

Solv­ing the Mechanis­tic In­ter­pretabil­ity challenges: EIS VII Challenge 2

May 25, 2023, 3:37 PM
71 points
1 comment13 min readLW link

Align­ing an H-JEPA agent via train­ing on the out­puts of an LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:08 AM
12 points
10 comments30 min readLW link

Cur­rent themes in mechanis­tic in­ter­pretabil­ity research

Nov 16, 2022, 2:14 PM
89 points
2 comments12 min readLW link

The king token

p.b.May 28, 2023, 7:18 PM
17 points
0 comments4 min readLW link

Short Re­mark on the (sub­jec­tive) math­e­mat­i­cal ‘nat­u­ral­ness’ of the Nanda—Lie­berum ad­di­tion mod­ulo 113 algorithm

carboniferous_umbraculum Jun 1, 2023, 11:31 AM
104 points
12 comments2 min readLW link

Eng­ineer­ing Monose­man­tic­ity in Toy Models

Nov 18, 2022, 1:43 AM
75 points
7 comments3 min readLW link
(arxiv.org)

[Linkpost] Rosetta Neu­rons: Min­ing the Com­mon Units in a Model Zoo

Bogdan Ionut CirsteaJun 17, 2023, 4:38 PM
12 points
0 comments1 min readLW link

[Re­search Up­date] Sparse Au­toen­coder fea­tures are bimodal

Robert_AIZIJun 22, 2023, 1:15 PM
24 points
1 comment5 min readLW link
(aizi.substack.com)

Un­der­stand­ing understanding

mthqAug 23, 2019, 6:10 PM
24 points
1 comment2 min readLW link

Ano­ma­lous Con­cept De­tec­tion for De­tect­ing Hid­den Cognition

Paul CologneseMar 4, 2024, 4:52 PM
24 points
3 comments10 min readLW link

The risk-re­ward trade­off of in­ter­pretabil­ity research

Jul 5, 2023, 5:05 PM
15 points
1 comment6 min readLW link

Lo­cal­iz­ing goal mis­gen­er­al­iza­tion in a maze-solv­ing policy network

Jan BetleyJul 6, 2023, 4:21 PM
37 points
2 comments7 min readLW link

In­ter­pret­ing Mo­du­lar Ad­di­tion in MLPs

Bart BussmannJul 7, 2023, 9:22 AM
20 points
0 comments6 min readLW link

The Ground Truth Prob­lem (Or, Why Eval­u­at­ing In­ter­pretabil­ity Meth­ods Is Hard)

Jessica RumbelowNov 17, 2022, 11:06 AM
27 points
2 comments2 min readLW link

LLM mis­al­ign­ment can prob­a­bly be found with­out man­ual prompt engineering

ProgramCrafterJul 8, 2023, 2:35 PM
1 point
0 comments1 min readLW link

Are SAE fea­tures from the Base Model still mean­ingful to LLaVA?

Shan23ChenDec 5, 2024, 7:24 PM
5 points
2 comments10 min readLW link

in­ter­pret­ing GPT: the logit lens

nostalgebraistAug 31, 2020, 2:47 AM
227 points
38 comments10 min readLW link

By De­fault, GPTs Think In Plain Sight

Fabien RogerNov 19, 2022, 7:15 PM
88 points
36 comments9 min readLW link

In­ter­pret­ing Embed­ding Spaces by Conceptualization

Adi SimhiFeb 28, 2023, 6:38 PM
3 points
0 comments1 min readLW link
(arxiv.org)

How does a toy 2 digit sub­trac­tion trans­former pre­dict the sign of the out­put?

Evan AndersDec 19, 2023, 6:56 PM
14 points
0 comments8 min readLW link
(evanhanders.blog)

Im­pact sto­ries for model in­ter­nals: an ex­er­cise for in­ter­pretabil­ity researchers

jennySep 25, 2023, 11:15 PM
29 points
3 comments7 min readLW link

Still no Lie De­tec­tor for LLMs

Jul 18, 2023, 7:56 PM
49 points
2 comments21 min readLW link

Multi-Com­po­nent Learn­ing and S-Curves

Nov 30, 2022, 1:37 AM
63 points
24 comments7 min readLW link

In­side the mind of a su­per­hu­man Go model: How does Leela Zero read lad­ders?

Haoxing DuMar 1, 2023, 1:47 AM
157 points
8 comments30 min readLW link

Ac­ti­va­tion adding ex­per­i­ments with llama-7b

Nina PanicksseryJul 16, 2023, 4:17 AM
51 points
1 comment3 min readLW link

LLM Ba­sics: Embed­ding Spaces—Trans­former To­ken Vec­tors Are Not Points in Space

NickyPFeb 13, 2023, 6:52 PM
81 points
11 comments15 min readLW link

GPT-2′s po­si­tional em­bed­ding ma­trix is a helix

AdamYedidiaJul 21, 2023, 4:16 AM
44 points
21 comments4 min readLW link

[Linkpost] In­ter­pret­ing Mul­ti­modal Video Trans­form­ers Us­ing Brain Recordings

Bogdan Ionut CirsteaJul 21, 2023, 11:26 AM
5 points
0 comments1 min readLW link

Train­ing Pro­cess Trans­parency through Gra­di­ent In­ter­pretabil­ity: Early ex­per­i­ments on toy lan­guage models

Jul 21, 2023, 2:52 PM
56 points
1 comment1 min readLW link

Causal scrub­bing: re­sults on a paren bal­ance checker

Dec 3, 2022, 12:59 AM
34 points
2 comments30 min readLW link

Be­cause of Lay­erNorm, Direc­tions in GPT-2 MLP Lay­ers are Monosemantic

ojorgensenJul 28, 2023, 7:43 PM
13 points
3 comments13 min readLW link

Causal scrub­bing: re­sults on in­duc­tion heads

Dec 3, 2022, 12:59 AM
34 points
1 comment17 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibsDec 5, 2022, 1:36 PM
20 points
11 comments2 min readLW link

Thoughts about the Mechanis­tic In­ter­pretabil­ity Challenge #2 (EIS VII #2)

RGRGRGJul 28, 2023, 8:44 PM
24 points
5 comments20 min readLW link

AI Safety 101 : In­tro­duc­tion to Vi­sion Interpretability

Jul 28, 2023, 5:32 PM
42 points
0 comments1 min readLW link
(github.com)

A Short Memo on AI In­ter­pretabil­ity Rain­bows

scasperJul 27, 2023, 11:05 PM
18 points
0 comments2 min readLW link

Gra­di­ent de­scent might see the di­rec­tion of the op­ti­mum from far away

Mikhail SaminJul 28, 2023, 4:19 PM
68 points
13 comments4 min readLW link

A multi-dis­ci­plinary view on AI safety research

Roman LeventovFeb 8, 2023, 4:50 PM
46 points
4 comments26 min readLW link

An ex­plo­ra­tion of GPT-2′s em­bed­ding weights

Adam ScherlisDec 13, 2022, 12:46 AM
44 points
4 comments10 min readLW link

[Linkpost] Mul­ti­modal Neu­rons in Pre­trained Text-Only Transformers

Bogdan Ionut CirsteaAug 4, 2023, 3:29 PM
11 points
0 comments1 min readLW link

Ground-Truth La­bel Im­bal­ance Im­pairs the Perfor­mance of Con­trast-Con­sis­tent Search (and Other Con­trast-Pair-Based Un­su­per­vised Meth­ods)

Aug 5, 2023, 5:55 PM
6 points
2 comments7 min readLW link
(drive.google.com)

Mech In­terp Challenge: Au­gust—De­ci­pher­ing the First Unique Char­ac­ter Model

CallumMcDougallAug 9, 2023, 7:14 PM
36 points
1 comment3 min readLW link

The po­si­tional em­bed­ding ma­trix and pre­vi­ous-to­ken heads: how do they ac­tu­ally work?

AdamYedidiaAug 10, 2023, 1:58 AM
26 points
4 comments13 min readLW link

Take­aways from a Mechanis­tic In­ter­pretabil­ity pro­ject on “For­bid­den Facts”

Dec 15, 2023, 11:05 AM
33 points
8 comments10 min readLW link

An in­ter­ac­tive in­tro­duc­tion to grokking and mechanis­tic interpretability

Aug 7, 2023, 7:09 PM
23 points
3 comments1 min readLW link
(pair.withgoogle.com)

De­com­pos­ing in­de­pen­dent gen­er­al­iza­tions in neu­ral net­works via Hes­sian analysis

Aug 14, 2023, 5:04 PM
83 points
4 comments1 min readLW link

Un­der­stand­ing the In­for­ma­tion Flow in­side Large Lan­guage Models

Aug 15, 2023, 9:13 PM
19 points
0 comments17 min readLW link

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

JustausernameAug 24, 2023, 3:53 AM
1 point
0 comments6 min readLW link

Causal­ity and a Cost Se­man­tics for Neu­ral Networks

scottviteriAug 21, 2023, 9:02 PM
22 points
1 comment1 min readLW link

[Question] Would it be use­ful to col­lect the con­texts, where var­i­ous LLMs think the same?

Martin VlachAug 24, 2023, 10:01 PM
6 points
1 comment1 min readLW link

Un­der­stand­ing Coun­ter­bal­anced Sub­trac­tions for Bet­ter Ac­ti­va­tion Additions

ojorgensenAug 17, 2023, 1:53 PM
21 points
0 comments14 min readLW link

Memetic Judo #3: The In­tel­li­gence of Stochas­tic Par­rots v.2

Max TKAug 20, 2023, 3:18 PM
8 points
33 comments6 min readLW link

How “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion” Fits Into a Broader Align­ment Scheme

CollinDec 15, 2022, 6:22 PM
244 points
39 comments16 min readLW link1 review

My cur­rent think­ing about ChatGPT @3QD [Gär­den­fors, Wolfram, and the value of spec­u­la­tion]

Bill BenzonMar 1, 2023, 10:50 AM
2 points
0 comments5 min readLW link

An OV-Co­her­ent Toy Model of At­ten­tion Head Superposition

Aug 29, 2023, 7:44 PM
26 points
2 comments6 min readLW link

An ad­ver­sar­ial ex­am­ple for Direct Logit At­tri­bu­tion: mem­ory man­age­ment in gelu-4l

Aug 30, 2023, 5:36 PM
17 points
0 comments8 min readLW link
(arxiv.org)

Bar­ri­ers to Mechanis­tic In­ter­pretabil­ity for AGI Safety

Connor LeahyAug 29, 2023, 10:56 AM
63 points
13 comments1 min readLW link
(www.youtube.com)

Open Call for Re­search As­sis­tants in Devel­op­men­tal Interpretability

Aug 30, 2023, 9:02 AM
55 points
11 comments4 min readLW link

An In­ter­pretabil­ity Illu­sion for Ac­ti­va­tion Patch­ing of Ar­bi­trary Subspaces

Aug 29, 2023, 1:04 AM
77 points
4 comments1 min readLW link

The Eng­ineer’s In­ter­pretabil­ity Se­quence (EIS) I: Intro

scasperFeb 9, 2023, 4:28 PM
46 points
24 comments3 min readLW link

In­ter­pret­ing a ma­trix-val­ued word em­bed­ding with a math­e­mat­i­cally proven char­ac­ter­i­za­tion of all optima

Joseph Van NameSep 4, 2023, 4:19 PM
3 points
4 comments12 min readLW link

Ex­plain­ing grokking through cir­cuit efficiency

Sep 8, 2023, 2:39 PM
101 points
11 comments3 min readLW link
(arxiv.org)

Au­to­mat­i­cally find­ing fea­ture vec­tors in the OV cir­cuits of Trans­form­ers with­out us­ing probing

Jacob DunefskySep 12, 2023, 5:38 PM
15 points
2 comments29 min readLW link

Why mechanis­tic in­ter­pretabil­ity does not and can­not con­tribute to long-term AGI safety (from mes­sages with a friend)

RemmeltDec 19, 2022, 12:02 PM
−3 points
9 comments31 min readLW link

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

Sep 14, 2023, 1:40 AM
32 points
7 comments8 min readLW link
(far.ai)

Ex­pand­ing the Scope of Superposition

Derek LarsonSep 13, 2023, 5:38 PM
10 points
0 comments4 min readLW link

Char­bel-Raphaël and Lu­cius dis­cuss interpretability

Oct 30, 2023, 5:50 AM
111 points
7 comments21 min readLW link

Cat­e­gor­i­cal Or­ga­ni­za­tion in Me­mory: ChatGPT Or­ga­nizes the 665 Topic Tags from My New Sa­vanna Blog

Bill BenzonDec 14, 2023, 1:02 PM
0 points
6 comments2 min readLW link

Seek­ing Feed­back on My Mechanis­tic In­ter­pretabil­ity Re­search Agenda

RGRGRGSep 12, 2023, 6:45 PM
3 points
1 comment3 min readLW link

Some Notes on the math­e­mat­ics of Toy Au­toen­cod­ing Problems

carboniferous_umbraculum Dec 22, 2022, 5:21 PM
18 points
1 comment12 min readLW link

EIS II: What is “In­ter­pretabil­ity”?

scasperFeb 9, 2023, 4:48 PM
28 points
6 comments4 min readLW link

Mechanis­tic In­ter­pretabil­ity Read­ing group

Sep 26, 2023, 4:26 PM
15 points
0 comments1 min readLW link

An­nounc­ing the CNN In­ter­pretabil­ity Competition

scasperSep 26, 2023, 4:21 PM
22 points
0 comments4 min readLW link

High-level in­ter­pretabil­ity: de­tect­ing an AI’s objectives

Sep 28, 2023, 7:30 PM
71 points
4 comments21 min readLW link

We Found An Neu­ron in GPT-2

Feb 11, 2023, 6:27 PM
143 points
23 comments7 min readLW link
(clementneo.com)

New Tool: the Resi­d­ual Stream Viewer

AdamYedidiaOct 1, 2023, 12:49 AM
32 points
7 comments4 min readLW link
(tinyurl.com)

In­ter­pretabil­ity Ex­ter­nal­ities Case Study—Hun­gry Hun­gry Hippos

Magdalena WacheSep 20, 2023, 2:42 PM
64 points
22 comments2 min readLW link

Graph­i­cal ten­sor no­ta­tion for interpretability

Jordan TaylorOct 4, 2023, 8:04 AM
141 points
11 comments19 min readLW link

Tak­ing fea­tures out of su­per­po­si­tion with sparse au­toen­coders more quickly with in­formed initialization

Pierre PeignéSep 23, 2023, 4:21 PM
30 points
8 comments5 min readLW link

Creat­ing a Dis­cord server for Mechanis­tic In­ter­pretabil­ity Projects

Victor LevosoMar 12, 2023, 6:00 PM
30 points
6 comments2 min readLW link

Has any­one ex­per­i­mented with Do­drio, a tool for ex­plor­ing trans­former mod­els through in­ter­ac­tive vi­su­al­iza­tion?

Bill BenzonDec 11, 2023, 8:34 PM
4 points
0 comments1 min readLW link

Early Ex­per­i­ments in Re­ward Model In­ter­pre­ta­tion Us­ing Sparse Autoencoders

Oct 3, 2023, 7:45 AM
17 points
0 comments5 min readLW link

What would it mean to un­der­stand how a large lan­guage model (LLM) works? Some quick notes.

Bill BenzonOct 3, 2023, 3:11 PM
20 points
4 comments8 min readLW link

Bi­ases in Bi­ases, or Cri­tique of the Critique

ThePathYouWillChooseAug 19, 2024, 5:11 PM
1 point
0 comments1 min readLW link

A per­sonal ex­pla­na­tion of ELK con­cept and task.

Zeyu QinOct 6, 2023, 3:55 AM
1 point
0 comments1 min readLW link

En­tan­gle­ment and in­tu­ition about words and mean­ing

Bill BenzonOct 4, 2023, 2:16 PM
4 points
0 comments2 min readLW link

At­tribut­ing to in­ter­ac­tions with GCPD and GWPD

jennyOct 11, 2023, 3:06 PM
20 points
0 comments6 min readLW link

Com­par­ing An­thropic’s Dic­tionary Learn­ing to Ours

Robert_AIZIOct 7, 2023, 11:30 PM
137 points
8 comments4 min readLW link

Bird-eye view vi­su­al­iza­tion of LLM activations

SergiiOct 8, 2023, 12:12 PM
11 points
2 comments1 min readLW link
(grgv.xyz)

Idea: Net­work mod­u­lar­ity and in­ter­pretabil­ity by sex­ual reproduction

qbolecFeb 12, 2023, 11:06 PM
3 points
3 comments1 min readLW link

In­ter­nal In­ter­faces Are a High-Pri­or­ity In­ter­pretabil­ity Target

Thane RuthenisDec 29, 2022, 5:49 PM
26 points
6 comments7 min readLW link

Un­der­stand­ing LLMs: Some ba­sic ob­ser­va­tions about words, syn­tax, and dis­course [w/​ a con­jec­ture about grokking]

Bill BenzonOct 11, 2023, 7:13 PM
6 points
0 comments5 min readLW link

An­nounc­ing Timaeus

Oct 22, 2023, 11:59 AM
188 points
15 comments4 min readLW link

Ex­plain­ing SolidGoldMag­ikarp by look­ing at it from ran­dom directions

Robert_AIZIFeb 14, 2023, 2:54 PM
8 points
0 comments8 min readLW link
(aizi.substack.com)

Mechanis­tic in­ter­pretabil­ity of LLM anal­ogy-making

SergiiOct 20, 2023, 12:53 PM
2 points
0 comments4 min readLW link
(grgv.xyz)

In­ter­nal Tar­get In­for­ma­tion for AI Oversight

Paul CologneseOct 20, 2023, 2:53 PM
15 points
0 comments5 min readLW link

Re­veal­ing In­ten­tion­al­ity In Lan­guage Models Through AdaVAE Guided Sampling

jdpOct 20, 2023, 7:32 AM
119 points
15 comments22 min readLW link

[Question] Does a broad overview of Mechanis­tic In­ter­pretabil­ity ex­ist?

kourabiOct 16, 2023, 1:16 AM
1 point
0 comments1 min readLW link

ChatGPT tells 20 ver­sions of its pro­to­typ­i­cal story, with a short note on method

Bill BenzonOct 14, 2023, 3:27 PM
6 points
0 comments5 min readLW link

Map­ping ChatGPT’s on­tolog­i­cal land­scape, gra­di­ents and choices [in­ter­pretabil­ity]

Bill BenzonOct 15, 2023, 8:12 PM
1 point
0 comments18 min readLW link

[Question] Can we iso­late neu­rons that rec­og­nize fea­tures vs. those which have some other role?

Joshua ClancyOct 21, 2023, 12:30 AM
4 points
2 comments3 min readLW link

In­ves­ti­gat­ing the learn­ing co­effi­cient of mod­u­lar ad­di­tion: hackathon project

Oct 17, 2023, 7:51 PM
94 points
5 comments12 min readLW link

Fea­tures and Ad­ver­saries in MemoryDT

Oct 20, 2023, 7:32 AM
31 points
6 comments25 min readLW link

Thoughts On (Solv­ing) Deep Deception

JozdienOct 21, 2023, 10:40 PM
71 points
6 comments6 min readLW link

But is it re­ally in Rome? An in­ves­ti­ga­tion of the ROME model edit­ing technique

jacquesthibsDec 30, 2022, 2:40 AM
104 points
2 comments18 min readLW link

Case Stud­ies in Re­v­erse-Eng­ineer­ing Sparse Au­toen­coder Fea­tures by Us­ing MLP Linearization

Jan 14, 2024, 2:06 AM
23 points
0 comments42 min readLW link

Cal­en­dar fea­ture ge­om­e­try in GPT-2 layer 8 resi­d­ual stream SAEs

Aug 17, 2024, 1:16 AM
53 points
0 comments5 min readLW link

[Question] Are Mix­ture-of-Ex­perts Trans­form­ers More In­ter­pretable Than Dense Trans­form­ers?

simeon_cDec 31, 2022, 11:34 AM
8 points
5 comments1 min readLW link

[Question] SAE sparse fea­ture graph us­ing only resi­d­ual layers

Jaehyuk LimMay 23, 2024, 1:32 PM
0 points
3 comments1 min readLW link

Challenge: know ev­ery­thing that the best go bot knows about go

DanielFilanMay 11, 2021, 5:10 AM
48 points
113 comments2 min readLW link
(danielfilan.com)

In­duc­tion heads—illustrated

CallumMcDougallJan 2, 2023, 3:35 PM
127 points
11 comments3 min readLW link

Spec­u­la­tions against GPT-n writ­ing al­ign­ment papers

Donald HobsonJun 7, 2021, 9:13 PM
31 points
6 comments2 min readLW link

Try­ing to ap­prox­i­mate Statis­ti­cal Models as Scor­ing Tables

JsevillamolJun 29, 2021, 5:20 PM
18 points
2 comments9 min readLW link

On the Im­por­tance of Open Sourc­ing Re­ward Models

elandgreJan 2, 2023, 7:01 PM
18 points
5 comments6 min readLW link

EIS III: Broad Cri­tiques of In­ter­pretabil­ity Research

scasperFeb 14, 2023, 6:24 PM
20 points
2 comments11 min readLW link

Ex­plor­ing the Resi­d­ual Stream of Trans­form­ers for Mechanis­tic In­ter­pretabil­ity — Explained

Zeping YuDec 26, 2023, 12:36 AM
7 points
1 comment11 min readLW link

Pos­si­ble re­search di­rec­tions to im­prove the mechanis­tic ex­pla­na­tion of neu­ral networks

delton137Nov 9, 2021, 2:36 AM
31 points
8 comments9 min readLW link

[linkpost] Ac­qui­si­tion of Chess Knowl­edge in AlphaZero

Quintin PopeNov 23, 2021, 7:55 AM
8 points
1 comment1 min readLW link

We In­spected Every Head In GPT-2 Small us­ing SAEs So You Don’t Have To

Mar 6, 2024, 5:03 AM
63 points
0 comments12 min readLW link

ChatGPT tells sto­ries, and a note about re­verse en­g­ineer­ing: A Work­ing Paper

Bill BenzonMar 3, 2023, 3:12 PM
3 points
0 comments3 min readLW link

Teaser: Hard-cod­ing Trans­former Models

MadHatterDec 12, 2021, 10:04 PM
74 points
19 comments1 min readLW link

De­cep­tion and Jailbreak Se­quence: 2. Iter­a­tive Refine­ment Stages of Jailbreaks in LLM

Winnie YangAug 28, 2024, 8:41 AM
7 points
2 comments31 min readLW link

Ophiol­ogy (or, how the Mamba ar­chi­tec­ture works)

Apr 9, 2024, 7:31 PM
67 points
8 comments10 min readLW link

DSLT 0. Distill­ing Sin­gu­lar Learn­ing Theory

Liam CarrollJun 16, 2023, 9:50 AM
77 points
7 comments5 min readLW link

Nor­mal­iz­ing Sparse Autoencoders

Fengyuan HuApr 8, 2024, 6:17 AM
21 points
18 comments13 min readLW link

EIS IV: A Spotlight on Fea­ture At­tri­bu­tion/​Saliency

scasperFeb 15, 2023, 6:46 PM
19 points
1 comment4 min readLW link

Can Large Lan­guage Models effec­tively iden­tify cy­ber­se­cu­rity risks?

emile delcourtAug 30, 2024, 8:20 PM
18 points
0 comments11 min readLW link

Scal­ing Laws and Superposition

Pavan KattaApr 10, 2024, 3:36 PM
9 points
4 comments5 min readLW link
(www.pavankatta.com)

[Question] Bar­cod­ing LLM Train­ing Data Sub­sets. Any­one try­ing this for in­ter­pretabil­ity?

right..enough?Apr 13, 2024, 3:09 AM
7 points
0 comments7 min readLW link

Is This Lie De­tec­tor Really Just a Lie De­tec­tor? An In­ves­ti­ga­tion of LLM Probe Speci­fic­ity.

Josh LevyJun 4, 2024, 3:45 PM
38 points
0 comments18 min readLW link

Ex­per­i­ments with an al­ter­na­tive method to pro­mote spar­sity in sparse autoencoders

Eoin FarrellApr 15, 2024, 6:21 PM
29 points
7 comments12 min readLW link

Trans­form­ers Rep­re­sent Belief State Geom­e­try in their Resi­d­ual Stream

Adam ShaiApr 16, 2024, 9:16 PM
413 points
100 comments12 min readLW link

Fact Find­ing: Sim­plify­ing the Cir­cuit (Post 2)

Dec 23, 2023, 2:45 AM
25 points
3 comments14 min readLW link

The Nat­u­ral Ab­strac­tion Hy­poth­e­sis: Im­pli­ca­tions and Evidence

CallumMcDougallDec 14, 2021, 11:14 PM
39 points
9 comments19 min readLW link

graph­patch: a Python Library for Ac­ti­va­tion Patching

Occam's LaserJun 5, 2024, 3:08 PM
13 points
2 comments1 min readLW link

Past Tense Features

CanApr 20, 2024, 2:34 PM
12 points
0 comments4 min readLW link

Mechanis­tic In­ter­pretabil­ity for the MLP Lay­ers (rough early thoughts)

MadHatterDec 24, 2021, 7:24 AM
12 points
3 comments1 min readLW link
(www.youtube.com)

Re­dun­dant At­ten­tion Heads in Large Lan­guage Models For In Con­text Learning

skunnavakkamSep 1, 2024, 8:08 PM
7 points
1 comment4 min readLW link
(skunnavakkam.github.io)

Ba­sic Facts about Lan­guage Model Internals

Jan 4, 2023, 1:01 PM
130 points
19 comments9 min readLW link

An Open Philan­thropy grant pro­posal: Causal rep­re­sen­ta­tion learn­ing of hu­man preferences

PabloAMCJan 11, 2022, 11:28 AM
19 points
6 comments8 min readLW link

Transcoders en­able fine-grained in­ter­pretable cir­cuit anal­y­sis for lan­guage models

Apr 30, 2024, 5:58 PM
72 points
14 comments17 min readLW link

How does a toy 2 digit sub­trac­tion trans­former pre­dict the differ­ence?

Evan AndersDec 22, 2023, 9:17 PM
12 points
0 comments10 min readLW link
(evanhanders.blog)

Re­la­tion­ships among words, met­al­in­gual defi­ni­tion, and interpretability

Bill BenzonJun 7, 2024, 7:18 PM
2 points
0 comments5 min readLW link

Prac­ti­cal Pit­falls of Causal Scrubbing

Mar 27, 2023, 7:47 AM
87 points
17 comments13 min readLW link

Vi­su­al­iz­ing neu­ral net­work planning

May 9, 2024, 6:40 AM
4 points
0 comments5 min readLW link

Align­ment Gaps

kcyrasJun 8, 2024, 3:23 PM
11 points
4 comments8 min readLW link

Closed-Source Evaluations

JonoJun 8, 2024, 2:18 PM
15 points
4 comments1 min readLW link

EIS VI: Cri­tiques of Mechanis­tic In­ter­pretabil­ity Work in AI Safety

scasperFeb 17, 2023, 8:48 PM
49 points
9 comments12 min readLW link

How To Do Patch­ing Fast

Joseph MillerMay 11, 2024, 8:13 PM
44 points
6 comments4 min readLW link

Gears-Level Men­tal Models of Trans­former Interpretability

KevinRoWangMar 29, 2022, 8:09 PM
72 points
4 comments6 min readLW link

De­com­piling Tracr Trans­form­ers—An in­ter­pretabil­ity experiment

Hannes ThurnherrMar 27, 2024, 9:49 AM
4 points
0 comments14 min readLW link

In­ves­ti­gat­ing Sen­si­tive Direc­tions in GPT-2: An Im­proved Baseline and Com­par­a­tive Anal­y­sis of SAEs

Sep 6, 2024, 2:28 AM
28 points
0 comments12 min readLW link

An­nounc­ing Neu­ron­pe­dia: Plat­form for ac­cel­er­at­ing re­search into Sparse Autoencoders

Mar 25, 2024, 9:17 PM
93 points
7 comments7 min readLW link

In­tro­duc­ing SARA: a new ac­ti­va­tion steer­ing technique

Alejandro TlaieJun 9, 2024, 3:33 PM
17 points
7 comments6 min readLW link

Ex­plor­ing Llama-3-8B MLP Neurons

ntt123Jun 9, 2024, 2:19 PM
10 points
0 comments4 min readLW link
(neuralblog.github.io)

Adam Op­ti­mizer Causes Priv­ileged Ba­sis in Trans­former LM Resi­d­ual Stream

Sep 6, 2024, 5:55 PM
70 points
7 comments4 min readLW link

Progress Re­port 2

Nathan Helm-BurgerMar 30, 2022, 2:29 AM
4 points
1 comment1 min readLW link

[Question] LLM/​AI hype

Student192837465Jun 15, 2024, 8:12 PM
1 point
0 comments1 min readLW link

Logit Prisms: De­com­pos­ing Trans­former Out­puts for Mechanis­tic Interpretability

ntt123Jun 17, 2024, 11:46 AM
5 points
4 comments6 min readLW link
(neuralblog.github.io)

Analysing Ad­ver­sar­ial At­tacks with Lin­ear Probing

Jun 17, 2024, 2:16 PM
9 points
0 comments8 min readLW link

Towards White Box Deep Learning

Maciej SatkiewiczMar 27, 2024, 6:20 PM
18 points
5 comments1 min readLW link
(arxiv.org)

Progress re­port 3: clus­ter­ing trans­former neurons

Nathan Helm-BurgerApr 5, 2022, 11:13 PM
5 points
0 comments2 min readLW link

What is a cir­cuit? [in in­ter­pretabil­ity]

Yudhister KumarFeb 14, 2025, 4:40 AM
23 points
1 comment1 min readLW link

Work­shop: In­ter­pretabil­ity in LLMs Us­ing Geo­met­ric and Statis­ti­cal Methods

Karthik ViswanathanFeb 22, 2025, 9:39 AM
2 points
0 comments2 min readLW link

Sparse Fea­tures Through Time

Rogan InglisJun 24, 2024, 6:06 PM
12 points
1 comment1 min readLW link
(roganinglis.io)

Sparse Au­toen­coder Fea­tures for Clas­sifi­ca­tions and Transferability

Shan23ChenFeb 18, 2025, 10:14 PM
5 points
0 comments1 min readLW link
(arxiv.org)

Rep­re­sen­ta­tion Tuning

Christopher AckermanJun 27, 2024, 5:44 PM
35 points
9 comments13 min readLW link

Ac­ti­va­tion Pat­tern SVD: A pro­posal for SAE Interpretability

Daniel TanJun 28, 2024, 10:12 PM
15 points
2 comments2 min readLW link

Mea­sur­ing Non­lin­ear Fea­ture In­ter­ac­tions in Sparse Cross­coders [Pro­ject Pro­posal]

Jan 6, 2025, 4:22 AM
19 points
0 comments12 min readLW link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

simeon_cApr 7, 2022, 1:46 PM
11 points
0 comments7 min readLW link

Fact Find­ing: How to Think About In­ter­pret­ing Me­mori­sa­tion (Post 4)

Dec 23, 2023, 2:46 AM
22 points
0 comments9 min readLW link

De­com­pos­ing the QK cir­cuit with Bilin­ear Sparse Dic­tionary Learning

Jul 2, 2024, 1:17 PM
86 points
7 comments12 min readLW link

The role of philo­soph­i­cal think­ing in un­der­stand­ing large lan­guage mod­els: Cal­ibrat­ing and clos­ing the gap be­tween first-per­son ex­pe­rience and un­der­ly­ing mechanisms

Bill BenzonFeb 23, 2024, 12:19 PM
4 points
0 comments10 min readLW link

Test­ing which LLM ar­chi­tec­tures can do hid­den se­rial reasoning

Filip SondejDec 16, 2024, 1:48 PM
81 points
9 comments4 min readLW link

Othel­loGPT learned a bag of heuristics

Jul 2, 2024, 9:12 AM
109 points
10 comments9 min readLW link

Ma­tryoshka Sparse Autoencoders

Noa NabeshimaDec 14, 2024, 2:52 AM
90 points
15 comments11 min readLW link

[In­terim re­search re­port] Ac­ti­va­tion plateaus & sen­si­tive di­rec­tions in GPT2

Jul 5, 2024, 5:05 PM
65 points
2 comments5 min readLW link

Progress Re­port 4: logit lens redux

Nathan Helm-BurgerApr 8, 2022, 6:35 PM
4 points
0 comments2 min readLW link

Another list of the­o­ries of im­pact for interpretability

Beth BarnesApr 13, 2022, 1:29 PM
33 points
1 comment5 min readLW link

Trans­former Cir­cuit Faith­ful­ness Met­rics Are Not Robust

Jul 12, 2024, 3:47 AM
104 points
5 comments7 min readLW link
(arxiv.org)

La­tent Ad­ver­sar­ial Train­ing (LAT) Im­proves the Rep­re­sen­ta­tion of Refusal

Jan 6, 2025, 10:24 AM
20 points
6 comments10 min readLW link

Stitch­ing SAEs of differ­ent sizes

Jul 13, 2024, 5:19 PM
39 points
12 comments12 min readLW link

An In­tro­duc­tion to Rep­re­sen­ta­tion Eng­ineer­ing—an ac­ti­va­tion-based paradigm for con­trol­ling LLMs

Jan WehnerJul 14, 2024, 10:37 AM
36 points
6 comments17 min readLW link

De­cep­tive agents can col­lude to hide dan­ger­ous fea­tures in SAEs

Jul 15, 2024, 5:07 PM
33 points
2 comments7 min readLW link

Mech In­terp Lacks Good Paradigms

Daniel TanJul 16, 2024, 3:47 PM
38 points
0 comments14 min readLW link

Ar­rakis—A toolkit to con­duct, track and vi­su­al­ize mechanis­tic in­ter­pretabil­ity ex­per­i­ments.

Yash SrivastavaJul 17, 2024, 2:02 AM
3 points
2 comments5 min readLW link

Su­per­po­si­tion through Ac­tive Learn­ing Lens

akankshancSep 17, 2024, 5:32 PM
1 point
0 comments10 min readLW link

In­ter­pretabil­ity in Ac­tion: Ex­plo­ra­tory Anal­y­sis of VPT, a Minecraft Agent

Jul 18, 2024, 5:02 PM
9 points
0 comments1 min readLW link
(arxiv.org)

SAEs (usu­ally) Trans­fer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
66 points
0 comments10 min readLW link

Truth is Univer­sal: Ro­bust De­tec­tion of Lies in LLMs

Lennart BuergerJul 19, 2024, 2:07 PM
24 points
3 comments2 min readLW link
(arxiv.org)

Fea­ture Tar­geted LLC Es­ti­ma­tion Dist­in­guishes SAE Fea­tures from Ran­dom Directions

Jul 19, 2024, 8:32 PM
59 points
6 comments16 min readLW link

BatchTopK: A Sim­ple Im­prove­ment for TopK-SAEs

Jul 20, 2024, 2:20 AM
53 points
0 comments4 min readLW link

Com­po­si­tion­al­ity and Am­bi­guity: La­tent Co-oc­cur­rence and In­ter­pretable Subspaces

Dec 20, 2024, 3:16 PM
30 points
0 comments37 min readLW link

Fact Find­ing: Do Early Lay­ers Spe­cial­ise in Lo­cal Pro­cess­ing? (Post 5)

Dec 23, 2023, 2:46 AM
18 points
0 comments4 min readLW link

Ini­tial Ex­per­i­ments Us­ing SAEs to Help De­tect AI Gen­er­ated Text

Aaron_ScherJul 22, 2024, 5:16 AM
17 points
0 comments14 min readLW link

In­tro­duc­tion to the se­quence: In­ter­pretabil­ity Re­search for the Most Im­por­tant Century

Evan R. MurphyMay 12, 2022, 7:59 PM
16 points
0 comments8 min readLW link

Pac­ing Out­side the Box: RNNs Learn to Plan in Sokoban

Jul 25, 2024, 10:00 PM
59 points
8 comments2 min readLW link
(arxiv.org)

A short cri­tique of Omo­hun­dro’s “Ba­sic AI Drives”

Soumyadeep BoseDec 19, 2024, 7:19 PM
6 points
0 comments4 min readLW link

Open Source Au­to­mated In­ter­pretabil­ity for Sparse Au­toen­coder Features

Jul 30, 2024, 9:11 PM
67 points
1 comment13 min readLW link
(blog.eleuther.ai)

Un­der­stand­ing Po­si­tional Fea­tures in Layer 0 SAEs

Jul 29, 2024, 9:36 AM
43 points
0 comments5 min readLW link

An In­ter­pretabil­ity Illu­sion from Pop­u­la­tion Statis­tics in Causal Analysis

Daniel TanJul 29, 2024, 2:50 PM
9 points
3 comments1 min readLW link

Deep sparse au­toen­coders yield in­ter­pretable fea­tures too

Armaan A. AbrahamFeb 23, 2025, 5:46 AM
23 points
4 comments8 min readLW link

CNN fea­ture vi­su­al­iza­tion in 50 lines of code

StefanHexMay 26, 2022, 11:02 AM
17 points
4 comments5 min readLW link

Con­struct­ing Neu­ral Net­work Pa­ram­e­ters with Down­stream Trainability

ch271828nJul 31, 2024, 6:13 PM
1 point
0 comments1 min readLW link
(github.com)

Limi­ta­tions on the In­ter­pretabil­ity of Learned Fea­tures from Sparse Dic­tionary Learning

Tom AngstenJul 30, 2024, 4:36 PM
6 points
0 comments9 min readLW link

AI psy­chol­ogy should ground the the­o­ries of AI con­scious­ness and in­form hu­man-AI eth­i­cal in­ter­ac­tion design

Roman LeventovJan 8, 2023, 6:37 AM
20 points
8 comments2 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric DrexlerFeb 3, 2022, 3:20 PM
85 points
12 comments11 min readLW link1 review

Eval­u­at­ing Sparse Au­toen­coders with Board Game Models

Aug 2, 2024, 7:50 PM
38 points
1 comment9 min readLW link

La­bel­ling, Vari­ables, and In-Con­text Learn­ing in Llama2

Joshua PenmanAug 3, 2024, 7:36 PM
6 points
0 comments1 min readLW link
(colab.research.google.com)

Thoughts on For­mal­iz­ing Composition

Tom LieberumJun 7, 2022, 7:51 AM
13 points
0 comments7 min readLW link

Re­search Ques­tions from Stained Glass Windows

StefanHexJun 8, 2022, 12:38 PM
4 points
0 comments2 min readLW link

Try­ing to iso­late ob­jec­tives: ap­proaches to­ward high-level interpretability

JozdienJan 9, 2023, 6:33 PM
48 points
14 comments8 min readLW link

Notes on In­ter­nal Ob­jec­tives in Toy Models of Agents

Paul CologneseFeb 22, 2024, 8:02 AM
16 points
0 comments8 min readLW link

Toy Models of Su­per­po­si­tion: what about BitNets?

Alejandro TlaieAug 8, 2024, 4:29 PM
5 points
1 comment5 min readLW link

EIS VII: A Challenge for Mechanists

scasperFeb 18, 2023, 6:27 PM
36 points
4 comments3 min readLW link

Towards a Unified In­ter­pretabil­ity of Ar­tifi­cial and Biolog­i­cal Neu­ral Networks

jan_bauerDec 21, 2024, 11:10 PM
2 points
0 comments1 min readLW link

Emer­gence, The Blind Spot of GenAI In­ter­pretabil­ity?

Quentin FEUILLADE--MONTIXIAug 10, 2024, 10:07 AM
16 points
8 comments3 min readLW link

Ex­tract­ing SAE task fea­tures for in-con­text learning

Aug 12, 2024, 8:34 PM
31 points
1 comment9 min readLW link

[Paper] A is for Ab­sorp­tion: Study­ing Fea­ture Split­ting and Ab­sorp­tion in Sparse Autoencoders

Sep 25, 2024, 9:31 AM
73 points
16 comments3 min readLW link
(arxiv.org)

GPT-2 Some­times Fails at IOI

Ronak_MehtaAug 14, 2024, 11:24 PM
13 points
0 comments2 min readLW link
(ronakrm.github.io)

Eval­u­at­ing Syn­thetic Ac­ti­va­tions com­posed of SAE La­tents in GPT-2

Sep 25, 2024, 8:37 PM
29 points
0 comments3 min readLW link
(arxiv.org)

Char­ac­ter­iz­ing sta­ble re­gions in the resi­d­ual stream of LLMs

Sep 26, 2024, 1:44 PM
42 points
4 comments1 min readLW link
(arxiv.org)

The Geom­e­try of Feel­ings and Non­sense in Large Lan­guage Models

Sep 27, 2024, 5:49 PM
59 points
10 comments4 min readLW link

Avoid­ing jailbreaks by dis­cour­ag­ing their rep­re­sen­ta­tion in ac­ti­va­tion space

Guido BergmanSep 27, 2024, 5:49 PM
7 points
2 comments9 min readLW link

Knowl­edge Base 1: Could it in­crease in­tel­li­gence and make it safer?

iwisSep 30, 2024, 4:00 PM
−4 points
0 comments4 min readLW link

Steer­ing LLMs’ Be­hav­ior with Con­cept Ac­ti­va­tion Vectors

Ruixuan HuangSep 28, 2024, 9:53 AM
8 points
0 comments10 min readLW link

Base LLMs re­fuse too

Sep 29, 2024, 4:04 PM
60 points
20 comments10 min readLW link

Ex­plor­ing Shard-like Be­hav­ior: Em­piri­cal In­sights into Con­tex­tual De­ci­sion-Mak­ing in RL Agents

Alejandro AristizabalSep 29, 2024, 12:32 AM
6 points
0 comments15 min readLW link

Devel­op­men­tal Stages in Multi-Prob­lem Grokking

James SullivanSep 29, 2024, 6:58 PM
4 points
0 comments6 min readLW link

Ex­plor­ing De­com­pos­abil­ity of SAE Features

Vikram_NSep 30, 2024, 6:28 PM
1 point
0 comments3 min readLW link

LLMs are likely not conscious

research_prime_spaceSep 29, 2024, 8:57 PM
6 points
9 comments1 min readLW link

Toy Models of Su­per­po­si­tion: Sim­plified by Hand

Axel SorensenSep 29, 2024, 9:19 PM
9 points
3 comments8 min readLW link

Do sparse au­toen­coders find “true fea­tures”?

Demian TillFeb 22, 2024, 6:06 PM
73 points
33 comments11 min readLW link

Toy Models of Fea­ture Ab­sorp­tion in SAEs

Oct 7, 2024, 9:56 AM
49 points
8 comments10 min readLW link

In­ter­pretabil­ity of SAE Fea­tures Rep­re­sent­ing Check in ChessGPT

Jonathan KutasovOct 5, 2024, 8:43 PM
27 points
2 comments8 min readLW link

(Maybe) A Bag of Heuris­tics is All There Is & A Bag of Heuris­tics is All You Need

SodiumOct 3, 2024, 7:11 PM
34 points
17 comments17 min readLW link

Do­main-spe­cific SAEs

jacob_droriOct 7, 2024, 8:15 PM
27 points
2 comments5 min readLW link

There is a globe in your LLM

jacob_droriOct 8, 2024, 12:43 AM
88 points
4 comments1 min readLW link

Minor in­ter­pretabil­ity ex­plo­ra­tion #1: Grokking of mod­u­lar ad­di­tion, sub­trac­tion, mul­ti­pli­ca­tion, for differ­ent ac­ti­va­tion functions

Rareș BaronFeb 26, 2025, 11:35 AM
3 points
6 comments3 min readLW link

Hamil­to­nian Dy­nam­ics in AI: A Novel Ap­proach to Op­ti­miz­ing Rea­son­ing in Lan­guage Models

Javier Marin ValenzuelaOct 9, 2024, 7:14 PM
3 points
0 comments10 min readLW link

SAE fea­tures for re­fusal and syco­phancy steer­ing vectors

Oct 12, 2024, 2:54 PM
29 points
4 comments7 min readLW link

Stan­dard SAEs Might Be In­co­her­ent: A Choos­ing Prob­lem & A “Con­cise” Solution

Kola AyonrindeOct 30, 2024, 10:50 PM
27 points
0 comments12 min readLW link

It’s im­por­tant to know when to stop: Mechanis­tic Ex­plo­ra­tion of Gemma 2 List Generation

Gerard BoxoOct 14, 2024, 5:04 PM
8 points
0 comments6 min readLW link
(gboxo.github.io)

Sparse au­toen­coders find com­posed fea­tures in small toy mod­els

Mar 14, 2024, 6:00 PM
33 points
12 comments15 min readLW link

An­thropic’s SoLU (Soft­max Lin­ear Unit)

Joel BurgetJul 4, 2022, 6:38 PM
21 points
1 comment4 min readLW link
(transformer-circuits.pub)

Deep neu­ral net­works are not opaque.

jem-mosigJul 6, 2022, 6:03 PM
22 points
14 comments3 min readLW link

A short pro­ject on Mamba: grokking & interpretability

Alejandro TlaieOct 18, 2024, 4:59 PM
21 points
0 comments6 min readLW link

[PAPER] Ja­co­bian Sparse Au­toen­coders: Spar­sify Com­pu­ta­tions, Not Just Activations

Lucy FarnikFeb 26, 2025, 12:50 PM
59 points
7 comments7 min readLW link

Auto-match­ing hid­den lay­ers in Py­torch LLMs

chanindFeb 19, 2024, 12:40 PM
2 points
0 comments3 min readLW link

SAE Train­ing Dataset In­fluence in Fea­ture Match­ing and a Hy­poth­e­sis on Po­si­tion Features

Seonglae ChoFeb 26, 2025, 5:05 PM
2 points
0 comments17 min readLW link

Monose­man­tic­ity & Quantization

Rahul ChandOct 22, 2024, 10:57 PM
1 point
0 comments9 min readLW link

Race Along Rashomon Ridge

Jul 7, 2022, 3:20 AM
50 points
15 comments8 min readLW link

En­abling New Ap­pli­ca­tions with To­day’s Mechanis­tic In­ter­pretabil­ity Toolkit

ananya_joshiOct 25, 2024, 5:53 PM
3 points
0 comments3 min readLW link

Open Source Repli­ca­tion of An­thropic’s Cross­coder pa­per for model-diffing

Oct 27, 2024, 6:46 PM
40 points
4 comments5 min readLW link

The shal­low re­al­ity of ‘deep learn­ing the­ory’

Jesse HooglandFeb 22, 2023, 4:16 AM
34 points
11 comments3 min readLW link
(www.jessehoogland.com)

Bridg­ing the VLM and mech in­terp com­mu­ni­ties for mul­ti­modal in­ter­pretabil­ity

Sonia JosephOct 28, 2024, 2:41 PM
19 points
5 comments15 min readLW link

Can quan­tised au­toen­coders find and in­ter­pret cir­cuits in lan­guage mod­els?

charlieoneillMar 24, 2024, 8:05 PM
28 points
4 comments24 min readLW link

SAE Prob­ing: What is it good for?

Nov 1, 2024, 7:23 PM
32 points
0 comments11 min readLW link

Com­po­si­tion Cir­cuits in Vi­sion Trans­form­ers (Hy­poth­e­sis)

phenomanonNov 1, 2024, 10:16 PM
1 point
0 comments3 min readLW link

Test­ing “True” Lan­guage Un­der­stand­ing in LLMs: A Sim­ple Proposal

MtryaSamNov 2, 2024, 7:12 PM
9 points
2 comments2 min readLW link

Evolu­tion­ary prompt op­ti­miza­tion for SAE fea­ture visualization

Nov 14, 2024, 1:06 PM
20 points
0 comments9 min readLW link

Iden­ti­fy­ing Func­tion­ally Im­por­tant Fea­tures with End-to-End Sparse Dic­tionary Learning

May 17, 2024, 4:25 PM
57 points
20 comments4 min readLW link
(arxiv.org)

SAEs are highly dataset de­pen­dent: a case study on the re­fusal direction

Nov 7, 2024, 5:22 AM
66 points
4 comments14 min readLW link

An­a­lyz­ing how SAE fea­tures evolve across a for­ward pass

Nov 7, 2024, 10:07 PM
47 points
0 comments1 min readLW link
(arxiv.org)

Antonym Heads Pre­dict Se­man­tic Op­po­sites in Lan­guage Models

Jake WardNov 15, 2024, 3:32 PM
3 points
0 comments5 min readLW link

Effects of Non-Uniform Spar­sity on Su­per­po­si­tion in Toy Models

Shreyans JainNov 14, 2024, 4:59 PM
4 points
3 comments6 min readLW link

A Sober Look at Steer­ing Vec­tors for LLMs

Nov 23, 2024, 5:30 PM
37 points
0 comments5 min readLW link

Mechanis­tic In­ter­pretabil­ity of Llama 3.2 with Sparse Autoencoders

PaulPaulsNov 24, 2024, 5:45 AM
19 points
3 comments1 min readLW link
(github.com)

Sparse MLP Distillation

slavachalnevJan 15, 2024, 7:39 PM
30 points
3 comments6 min readLW link

Find­ing Skele­tons on Rashomon Ridge

Jul 24, 2022, 10:31 PM
30 points
2 comments7 min readLW link

Ad­dress­ing Fea­ture Sup­pres­sion in SAEs

Feb 16, 2024, 6:32 PM
86 points
4 comments10 min readLW link

Beyond Gaus­sian: Lan­guage Model Rep­re­sen­ta­tions and Distributions

Matt LevinsonNov 24, 2024, 1:53 AM
6 points
1 comment5 min readLW link

In­tri­ca­cies of Fea­ture Geom­e­try in Large Lan­guage Models

Dec 7, 2024, 6:10 PM
68 points
0 comments12 min readLW link

In­ter­pretabil­ity: In­te­grated Gra­di­ents is a de­cent at­tri­bu­tion method

May 20, 2024, 5:55 PM
23 points
7 comments6 min readLW link

My Jan­uary al­ign­ment the­ory Nanowrimo

Dmitry VaintrobJan 2, 2025, 12:07 AM
41 points
2 comments2 min readLW link

Gram­mars, sub­gram­mars, and com­bi­na­torics of gen­er­al­iza­tion in transformers

Dmitry VaintrobJan 2, 2025, 9:37 AM
36 points
0 comments17 min readLW link

The sub­set par­ity learn­ing prob­lem: much more than you wanted to know

Dmitry VaintrobJan 3, 2025, 9:13 AM
93 points
18 comments11 min readLW link

The Laws of Large Numbers

Dmitry VaintrobJan 4, 2025, 11:54 AM
38 points
11 comments12 min readLW link

The AI Con­trol Prob­lem in a wider in­tel­lec­tual context

philosophybearJan 13, 2023, 12:28 AM
11 points
3 comments12 min readLW link

What are poly­se­man­tic neu­rons?

Jan 8, 2025, 7:35 AM
8 points
0 comments4 min readLW link
(aisafety.info)

Ac­ti­va­tion space in­ter­pretabil­ity may be doomed

Jan 8, 2025, 12:49 PM
147 points
31 comments8 min readLW link

Ac­ti­va­tion Mag­ni­tudes Mat­ter On Their Own: In­sights from Lan­guage Model Distri­bu­tional Analysis

Matt LevinsonJan 10, 2025, 6:53 AM
4 points
0 comments4 min readLW link

Scal­ing Sparse Fea­ture Cir­cuit Find­ing to Gemma 9B

Jan 10, 2025, 11:08 AM
86 points
10 comments17 min readLW link

Can we effi­ciently dis­t­in­guish differ­ent mechanisms?

paulfchristianoDec 27, 2022, 12:20 AM
88 points
30 comments16 min readLW link
(ai-alignment.com)

A pro­posal for iter­ated in­ter­pretabil­ity with known-in­ter­pretable nar­row AIs

Peter BerggrenJan 11, 2025, 2:43 PM
6 points
0 comments2 min readLW link

In­ter­pretabil­ity isn’t Free

Joel BurgetAug 4, 2022, 3:02 PM
10 points
1 comment2 min readLW link

Find­ing Fea­tures Causally Up­stream of Refusal

Jan 14, 2025, 2:30 AM
48 points
5 comments12 min readLW link

EIS VIII: An Eng­ineer’s Un­der­stand­ing of De­cep­tive Alignment

scasperFeb 19, 2023, 3:25 PM
30 points
5 comments4 min readLW link

Dis­sected boxed AI

Nathan1123Aug 12, 2022, 2:37 AM
−8 points
2 comments1 min readLW link

Con­tex­tual at­ten­tion heads in the first layer of GPT-2

Alex GibsonJan 20, 2025, 1:24 PM
6 points
0 comments13 min readLW link

Ex­am­in­ing Lan­guage Model Perfor­mance with Re­con­structed Ac­ti­va­tions us­ing Sparse Au­toen­coders

Feb 27, 2024, 2:43 AM
43 points
16 comments15 min readLW link

Monet: Mix­ture of Monose­man­tic Ex­perts for Trans­form­ers Explained

CalebMarescaJan 25, 2025, 7:37 PM
20 points
2 comments11 min readLW link

Us­ing the prob­a­bil­is­tic method to bound the perfor­mance of toy transformers

Alex GibsonJan 21, 2025, 11:01 PM
1 point
0 comments3 min readLW link

Neu­ral net­works gen­er­al­ize be­cause of this one weird trick

Jesse HooglandJan 18, 2023, 12:10 AM
179 points
34 comments53 min readLW link1 review
(www.jessehoogland.com)

In­ter­pretabil­ity Tools Are an At­tack Channel

Thane RuthenisAug 17, 2022, 6:47 PM
42 points
14 comments1 min readLW link