Transformer Circuits

Finding Neurons in a Haystack: Case Studies with Sparse Probing

May 3, 2023, 1:30 PM
33 points
6 comments · 2 min read · LW link · 1 review
(arxiv.org)

Finding Sparse Linear Connections between Features in LLMs

Dec 9, 2023, 2:27 AM
70 points
5 comments · 10 min read · LW link

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter · Apr 25, 2023, 1:45 PM
8 points
0 comments · 15 min read · LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda · Jul 7, 2024, 5:39 PM
135 points
16 comments · 25 min read · LW link

Interpreting OpenAI’s Whisper

EllenaR · Sep 24, 2023, 5:53 PM
116 points
13 comments · 7 min read · LW link

Understanding the tensor product formulation in Transformer Circuits

Tom Lieberum · Dec 24, 2021, 6:05 PM
16 points
2 comments · 3 min read · LW link

How to Think About Activation Patching

Neel Nanda · Jun 4, 2023, 2:17 PM
50 points
5 comments · 20 min read · LW link
(www.neelnanda.io)

A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda · Nov 7, 2022, 10:39 PM
30 points
15 comments · 3 min read · LW link
(youtu.be)

A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel Nanda · Nov 22, 2022, 5:12 PM
20 points
0 comments · 1 min read · LW link
(www.youtube.com)

Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy

Neel Nanda · Aug 29, 2023, 10:07 PM
36 points
1 comment · 1 min read · LW link
(www.youtube.com)

200 COP in MI: Interpreting Algorithmic Problems

Neel Nanda · Dec 31, 2022, 7:55 PM
33 points
2 comments · 10 min read · LW link

200 COP in MI: Exploring Polysemanticity and Superposition

Neel Nanda · Jan 3, 2023, 1:52 AM
34 points
6 comments · 16 min read · LW link

200 COP in MI: Analysing Training Dynamics

Neel Nanda · Jan 4, 2023, 4:08 PM
16 points
0 comments · 14 min read · LW link

200 COP in MI: Techniques, Tooling and Automation

Neel Nanda · Jan 6, 2023, 3:08 PM
13 points
0 comments · 15 min read · LW link

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

Neel Nanda · Dec 28, 2022, 9:06 PM
106 points
0 comments · 10 min read · LW link

Sleep peacefully: no hidden reasoning detected in LLMs. Well, at least in small ones.

Apr 4, 2025, 8:49 PM
16 points
2 comments · 7 min read · LW link

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Jul 20, 2023, 10:50 AM
44 points
3 comments · 2 min read · LW link
(arxiv.org)

200 COP in MI: Looking for Circuits in the Wild

Neel Nanda · Dec 29, 2022, 8:59 PM
16 points
5 comments · 13 min read · LW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Aug 29, 2023, 1:04 AM
77 points
4 comments · 1 min read · LW link

Graphical tensor notation for interpretability

Jordan Taylor · Oct 4, 2023, 8:04 AM
141 points
11 comments · 19 min read · LW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

Aug 30, 2023, 5:36 PM
17 points
0 comments · 8 min read · LW link
(arxiv.org)

Anthropic’s SoLU (Softmax Linear Unit)

Joel Burget · Jul 4, 2022, 6:38 PM
21 points
1 comment · 4 min read · LW link
(transformer-circuits.pub)

No Really, Attention is ALL You Need—Attention can do feedforward networks

Robert_AIZI · Jan 31, 2023, 6:48 PM
29 points
7 comments · 6 min read · LW link
(aizi.substack.com)

Addendum: More Efficient FFNs via Attention

Robert_AIZI · Feb 6, 2023, 6:55 PM
10 points
2 comments · 5 min read · LW link
(aizi.substack.com)

Automatically finding feature vectors in the OV circuits of Transformers without using probing

Jacob Dunefsky · Sep 12, 2023, 5:38 PM
16 points
2 comments · 29 min read · LW link

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen · Dec 5, 2024, 7:24 PM
5 points
2 comments · 10 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

Jan 16, 2024, 12:26 AM
83 points
9 comments · 18 min read · LW link

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

May 28, 2024, 5:29 AM
50 points
1 comment · 9 min read · LW link
(arxiv.org)

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill · Mar 24, 2024, 8:05 PM
28 points
4 comments · 24 min read · LW link

“What the hell is a representation, anyway?” | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents

IwanWilliams · Jun 9, 2024, 2:19 PM
9 points
1 comment · 4 min read · LW link

Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability

ntt123 · Jun 17, 2024, 11:46 AM
5 points
4 comments · 6 min read · LW link
(neuralblog.github.io)

Arrakis—A toolkit to conduct, track and visualize mechanistic interpretability experiments.

Yash Srivastava · Jul 17, 2024, 2:02 AM
3 points
2 comments · 5 min read · LW link

SAEs (usually) Transfer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
66 points
0 comments · 10 min read · LW link

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Oct 27, 2024, 6:46 PM
47 points
4 comments · 5 min read · LW link

Concrete Methods for Heuristic Estimation on Neural Networks

Oliver Daniels · Nov 14, 2024, 5:07 AM
28 points
0 comments · 27 min read · LW link

Scaling Sparse Feature Circuit Finding to Gemma 9B

Jan 10, 2025, 11:08 AM
86 points
11 comments · 17 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

Nov 9, 2023, 4:16 PM
51 points
0 comments · 6 min read · LW link

AISC project: TinyEvals

Jett Janiak · Nov 22, 2023, 8:47 PM
22 points
0 comments · 4 min read · LW link

An Analogy for Understanding Transformers

CallumMcDougall · May 13, 2023, 12:20 PM
89 points
6 comments · 9 min read · LW link