Transformer Circuits

Finding Neurons in a Haystack: Case Studies with Sparse Probing

3 May 2023 13:30 UTC
33 points
5 comments, 2 min read, LW link
(arxiv.org)

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
69 points
5 comments, 10 min read, LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda, 7 Jul 2024 17:39 UTC
134 points
15 comments, 25 min read, LW link

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter, 25 Apr 2023 13:45 UTC
8 points
0 comments, 15 min read, LW link

How to Think About Activation Patching

Neel Nanda, 4 Jun 2023 14:17 UTC
50 points
5 comments, 20 min read, LW link
(www.neelnanda.io)

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

20 Jul 2023 10:50 UTC
44 points
3 comments, 2 min read, LW link
(arxiv.org)

Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy

Neel Nanda, 29 Aug 2023 22:07 UTC
36 points
1 comment, 1 min read, LW link
(www.youtube.com)

Interpreting OpenAI’s Whisper

EllenaR, 24 Sep 2023 17:53 UTC
114 points
13 comments, 7 min read, LW link

Understanding the tensor product formulation in Transformer Circuits

Tom Lieberum, 24 Dec 2021 18:05 UTC
16 points
2 comments, 3 min read, LW link

A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda, 7 Nov 2022 22:39 UTC
30 points
15 comments, 3 min read, LW link
(youtu.be)

A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel Nanda, 22 Nov 2022 17:12 UTC
20 points
0 comments, 1 min read, LW link
(www.youtube.com)

200 COP in MI: Looking for Circuits in the Wild

Neel Nanda, 29 Dec 2022 20:59 UTC
16 points
5 comments, 13 min read, LW link

200 COP in MI: Interpreting Algorithmic Problems

Neel Nanda, 31 Dec 2022 19:55 UTC
33 points
2 comments, 10 min read, LW link

200 COP in MI: Exploring Polysemanticity and Superposition

Neel Nanda, 3 Jan 2023 1:52 UTC
34 points
6 comments, 16 min read, LW link

200 COP in MI: Analysing Training Dynamics

Neel Nanda, 4 Jan 2023 16:08 UTC
16 points
0 comments, 14 min read, LW link

200 COP in MI: Techniques, Tooling and Automation

Neel Nanda, 6 Jan 2023 15:08 UTC
13 points
0 comments, 15 min read, LW link

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

Neel Nanda, 28 Dec 2022 21:06 UTC
106 points
0 comments, 10 min read, LW link

An Analogy for Understanding Transformers

CallumMcDougall, 13 May 2023 12:20 UTC
89 points
6 comments, 9 min read, LW link

Anthropic’s SoLU (Softmax Linear Unit)

Joel Burget, 4 Jul 2022 18:38 UTC
21 points
1 comment, 4 min read, LW link
(transformer-circuits.pub)

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

29 Aug 2023 1:04 UTC
77 points
4 comments, 1 min read, LW link

No Really, Attention is ALL You Need - Attention can do feedforward networks

Robert_AIZI, 31 Jan 2023 18:48 UTC
29 points
7 comments, 6 min read, LW link
(aizi.substack.com)

Automatically finding feature vectors in the OV circuits of Transformers without using probing

Jacob Dunefsky, 12 Sep 2023 17:38 UTC
13 points
0 comments, 29 min read, LW link

Graphical tensor notation for interpretability

Jordan Taylor, 4 Oct 2023 8:04 UTC
137 points
11 comments, 19 min read, LW link

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

28 May 2024 5:29 UTC
50 points
1 comment, 9 min read, LW link
(arxiv.org)

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill, 24 Mar 2024 20:05 UTC
28 points
4 comments, 24 min read, LW link

“What the hell is a representation, anyway?” | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents

IwanWilliams, 9 Jun 2024 14:19 UTC
9 points
1 comment, 4 min read, LW link

Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability

ntt123, 17 Jun 2024 11:46 UTC
5 points
4 comments, 6 min read, LW link
(neuralblog.github.io)

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

30 Aug 2023 17:36 UTC
17 points
0 comments, 8 min read, LW link
(arxiv.org)

Arrakis - A toolkit to conduct, track and visualize mechanistic interpretability experiments.

Yash Srivastava, 17 Jul 2024 2:02 UTC
2 points
2 comments, 5 min read, LW link

SAEs (usually) Transfer Between Base and Chat Models

18 Jul 2024 10:29 UTC
65 points
0 comments, 10 min read, LW link

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

27 Oct 2024 18:46 UTC
38 points
4 comments, 5 min read, LW link

Concrete Methods for Heuristic Estimation on Neural Networks

Oliver Daniels, 14 Nov 2024 5:07 UTC
27 points
0 comments, 27 min read, LW link

Addendum: More Efficient FFNs via Attention

Robert_AIZI, 6 Feb 2023 18:55 UTC
10 points
2 comments, 5 min read, LW link
(aizi.substack.com)

Polysemantic Attention Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
51 points
0 comments, 6 min read, LW link

AISC project: TinyEvals

Jett Janiak, 22 Nov 2023 20:47 UTC
22 points
0 comments, 4 min read, LW link

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
83 points
9 comments, 18 min read, LW link