Interpretability (ML & AI)

Last edit: 10 Nov 2023 13:11 UTC by niplav

Transparency and interpretability refer to the extent to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model’s output, but the model cannot tell you why it produced that output. This makes it hard to determine the cause of biases in ML models.

A prominent subfield of interpretability of neural networks is mechanistic interpretability, which attempts to reverse-engineer how neural networks perform their tasks, for example by finding circuits in transformer models. This can be contrasted with subfields of interpretability that seek to attribute a model’s output to parts of a specific input, such as identifying which pixels in an input image caused a computer vision model to output the classification “horse”. A minimal sketch of the latter, attribution-style approach is given below.
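
As an illustrative sketch only (not drawn from any particular post on this page), the snippet below computes a simple gradient-saliency map: the gradient of a classifier’s “horse” logit with respect to the input pixels, which is one common attribution method. The model, weights, and class index are placeholder assumptions, and the code assumes a recent version of PyTorch and torchvision.

```python
# Hypothetical gradient-saliency sketch for input attribution.
import torch
import torchvision.models as models

# Placeholder classifier (untrained here); a real analysis would load trained weights.
model = models.resnet18(weights=None).eval()

# Stand-in input image with gradients enabled so we can attribute to pixels.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
horse_class = 339  # assumed ImageNet-style class index for a horse category

logits = model(image)
logits[0, horse_class].backward()  # gradient of the "horse" logit w.r.t. the input

# Saliency map: per-pixel influence, taking the max over color channels.
saliency = image.grad.abs().max(dim=1).values
print(saliency.shape)  # torch.Size([1, 224, 224])
```

Each value in the resulting map indicates how strongly the corresponding pixel locally influences the chosen logit; mechanistic interpretability, by contrast, asks what internal circuits compute that logit in the first place.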

Research

A small up­date to the Sparse Cod­ing in­terim re­search report

30 Apr 2023 19:54 UTC
61 points
5 comments1 min readLW link

In­ter­pretabil­ity in ML: A Broad Overview

lifelonglearner4 Aug 2020 19:03 UTC
53 points
5 comments15 min readLW link

Re-Ex­am­in­ing LayerNorm

Eric Winsor1 Dec 2022 22:20 UTC
125 points
12 comments5 min readLW link

[In­terim re­search re­port] Tak­ing fea­tures out of su­per­po­si­tion with sparse autoencoders

13 Dec 2022 15:41 UTC
149 points
23 comments22 min readLW link2 reviews

Find­ing Neu­rons in a Haystack: Case Stud­ies with Sparse Probing

3 May 2023 13:30 UTC
33 points
5 comments2 min readLW link
(arxiv.org)

200 Con­crete Open Prob­lems in Mechanis­tic In­ter­pretabil­ity: Introduction

Neel Nanda28 Dec 2022 21:06 UTC
106 points
0 comments10 min readLW link

Chris Olah’s views on AGI safety

evhub1 Nov 2019 20:13 UTC
207 points
38 comments12 min readLW link2 reviews

A Longlist of The­o­ries of Im­pact for Interpretability

Neel Nanda11 Mar 2022 14:55 UTC
124 points
37 comments5 min readLW link2 reviews

How To Go From In­ter­pretabil­ity To Align­ment: Just Re­tar­get The Search

johnswentworth10 Aug 2022 16:08 UTC
202 points
34 comments3 min readLW link1 review

The Sin­gu­lar Value De­com­po­si­tions of Trans­former Weight Ma­tri­ces are Highly Interpretable

28 Nov 2022 12:54 UTC
198 points
33 comments31 min readLW link

[Question] Papers to start get­ting into NLP-fo­cused al­ign­ment research

Feraidoon24 Sep 2022 23:53 UTC
6 points
0 comments1 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of Grokking

15 Aug 2022 2:41 UTC
373 points
47 comments36 min readLW link1 review
(colab.research.google.com)

Search­ing for Search

28 Nov 2022 15:31 UTC
94 points
9 comments14 min readLW link1 review

A Rocket–In­ter­pretabil­ity Analogy

plex21 Oct 2024 13:55 UTC
149 points
31 comments1 min readLW link

Effi­cient Dic­tionary Learn­ing with Switch Sparse Autoencoders

Anish Mudide22 Jul 2024 18:45 UTC
118 points
19 comments12 min readLW link

SolidGoldMag­ikarp (plus, prompt gen­er­a­tion)

5 Feb 2023 22:02 UTC
676 points
205 comments12 min readLW link

Trans­parency and AGI safety

jylin0411 Jan 2021 18:51 UTC
54 points
12 comments30 min readLW link

Do Sparse Au­toen­coders (SAEs) trans­fer across base and fine­tuned lan­guage mod­els?

29 Sep 2024 19:37 UTC
26 points
8 comments25 min readLW link

Strik­ing Im­pli­ca­tions for Learn­ing The­ory, In­ter­pretabil­ity — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC
37 points
4 comments2 min readLW link

A trans­parency and in­ter­pretabil­ity tech tree

evhub16 Jun 2022 23:44 UTC
163 points
11 comments18 min readLW link1 review

Against Al­most Every The­ory of Im­pact of Interpretability

Charbel-Raphaël17 Aug 2023 18:44 UTC
322 points
86 comments26 min readLW link

Resi­d­ual stream norms grow ex­po­nen­tially over the for­ward pass

7 May 2023 0:46 UTC
76 points
24 comments11 min readLW link

How to use and in­ter­pret ac­ti­va­tion patching

24 Apr 2024 8:35 UTC
12 points
0 comments18 min readLW link

How In­ter­pretabil­ity can be Impactful

Connall Garrod18 Jul 2022 0:06 UTC
18 points
0 comments37 min readLW link

Take­aways From 3 Years Work­ing In Ma­chine Learning

George3d68 Apr 2022 17:14 UTC
35 points
10 comments11 min readLW link
(www.epistem.ink)

Com­ments on An­thropic’s Scal­ing Monosemanticity

Robert_AIZI3 Jun 2024 12:15 UTC
97 points
8 comments7 min readLW link

My ten­ta­tive in­ter­pretabil­ity re­search agenda—topol­ogy match­ing.

Maxwell Clarke8 Oct 2022 22:14 UTC
10 points
2 comments4 min readLW link

Trans­former Circuits

evhub22 Dec 2021 21:09 UTC
144 points
4 comments3 min readLW link
(transformer-circuits.pub)

In­ter­pret­ing the Learn­ing of Deceit

RogerDearnaley18 Dec 2023 8:12 UTC
30 points
14 comments9 min readLW link

Ac­tu­ally, Othello-GPT Has A Lin­ear Emer­gent World Representation

Neel Nanda29 Mar 2023 22:13 UTC
211 points
26 comments19 min readLW link
(neelnanda.io)

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
58 points
0 comments59 min readLW link

Ma­chine Un­learn­ing Eval­u­a­tions as In­ter­pretabil­ity Benchmarks

23 Oct 2023 16:33 UTC
33 points
2 comments11 min readLW link

The Case for Rad­i­cal Op­ti­mism about Interpretability

Quintin Pope16 Dec 2021 23:38 UTC
66 points
16 comments8 min readLW link1 review

Opinions on In­ter­pretable Ma­chine Learn­ing and 70 Sum­maries of Re­cent Papers

9 Apr 2021 19:19 UTC
141 points
17 comments102 min readLW link

What is In­ter­pretabil­ity?

17 Mar 2020 20:23 UTC
35 points
0 comments11 min readLW link

Ideation and Tra­jec­tory Model­ling in Lan­guage Models

NickyP5 Oct 2023 19:21 UTC
16 points
2 comments10 min readLW link

SAE re­con­struc­tion er­rors are (em­piri­cally) pathological

wesg29 Mar 2024 16:37 UTC
105 points
16 comments8 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

14 Dec 2022 14:33 UTC
29 points
5 comments11 min readLW link

Us­ing GPT-N to Solve In­ter­pretabil­ity of Neu­ral Net­works: A Re­search Agenda

3 Sep 2020 18:27 UTC
68 points
11 comments2 min readLW link

Is In­ter­pretabil­ity All We Need?

RogerDearnaley14 Nov 2023 5:31 UTC
1 point
1 comment1 min readLW link

Towards Monose­man­tic­ity: De­com­pos­ing Lan­guage Models With Dic­tionary Learning

Zac Hatfield-Dodds5 Oct 2023 21:01 UTC
287 points
21 comments2 min readLW link
(transformer-circuits.pub)

Com­pact Proofs of Model Perfor­mance via Mechanis­tic Interpretability

24 Jun 2024 19:27 UTC
95 points
3 comments8 min readLW link
(arxiv.org)

In­ter­pret­ing Neu­ral Net­works through the Poly­tope Lens

23 Sep 2022 17:58 UTC
144 points
29 comments33 min readLW link

An­nounc­ing Apollo Research

30 May 2023 16:17 UTC
215 points
11 comments8 min readLW link

A List of 45+ Mech In­terp Pro­ject Ideas from Apollo Re­search’s In­ter­pretabil­ity Team

18 Jul 2024 14:15 UTC
117 points
18 comments18 min readLW link

Re­fusal in LLMs is me­di­ated by a sin­gle direction

27 Apr 2024 11:13 UTC
228 points
93 comments10 min readLW link

The­o­ries of im­pact for Science of Deep Learning

Marius Hobbhahn1 Dec 2022 14:39 UTC
24 points
0 comments11 min readLW link

The Plan − 2022 Update

johnswentworth1 Dec 2022 20:43 UTC
239 points
37 comments8 min readLW link1 review

The ‘strong’ fea­ture hy­poth­e­sis could be wrong

lewis smith2 Aug 2024 14:33 UTC
218 points
17 comments17 min readLW link

Towards Mul­ti­modal In­ter­pretabil­ity: Learn­ing Sparse In­ter­pretable Fea­tures in Vi­sion Transformers

hugofry29 Apr 2024 20:57 UTC
89 points
8 comments11 min readLW link

Deep learn­ing mod­els might be se­cretly (al­most) linear

beren24 Apr 2023 18:43 UTC
117 points
29 comments4 min readLW link

Ba­sic facts about lan­guage mod­els dur­ing training

beren21 Feb 2023 11:46 UTC
97 points
15 comments18 min readLW link

LLMs Univer­sally Learn a Fea­ture Rep­re­sent­ing To­ken Fre­quency /​ Rarity

Sean Osier30 Jun 2024 2:48 UTC
12 points
5 comments6 min readLW link
(github.com)

LLM Mo­du­lar­ity: The Separa­bil­ity of Ca­pa­bil­ities in Large Lan­guage Models

NickyP26 Mar 2023 21:57 UTC
99 points
3 comments41 min readLW link

Mechanis­tic Ano­maly De­tec­tion Re­search Update

6 Aug 2024 10:33 UTC
11 points
0 comments1 min readLW link
(blog.eleuther.ai)

Spar­sify: A mechanis­tic in­ter­pretabil­ity re­search agenda

Lee Sharkey3 Apr 2024 12:34 UTC
94 points
22 comments22 min readLW link

In­ter­pret­ing and Steer­ing Fea­tures in Images

Gytis Daujotas20 Jun 2024 18:33 UTC
65 points
6 comments5 min readLW link

A Com­pre­hen­sive Mechanis­tic In­ter­pretabil­ity Ex­plainer & Glossary

Neel Nanda21 Dec 2022 12:35 UTC
85 points
6 comments2 min readLW link
(neelnanda.io)

(ten­ta­tively) Found 600+ Monose­man­tic Fea­tures in a Small LM Us­ing Sparse Autoencoders

Logan Riggs5 Jul 2023 16:49 UTC
60 points
1 comment7 min readLW link

In­tro­duc­tion to in­ac­cessible information

Ryan Kidd9 Dec 2021 1:28 UTC
27 points
6 comments8 min readLW link

EIS XIV: Is mechanis­tic in­ter­pretabil­ity about to be prac­ti­cally use­ful?

scasper11 Oct 2024 22:13 UTC
67 points
4 comments7 min readLW link

200 COP in MI: In­ter­pret­ing Al­gorith­mic Problems

Neel Nanda31 Dec 2022 19:55 UTC
33 points
2 comments10 min readLW link

Cir­cum­vent­ing in­ter­pretabil­ity: How to defeat mind-readers

Lee Sharkey14 Jul 2022 16:59 UTC
114 points
15 comments33 min readLW link

Ti­maeus’s First Four Months

28 Feb 2024 17:01 UTC
172 points
6 comments6 min readLW link

Deep For­get­ting & Un­learn­ing for Safely-Scoped LLMs

scasper5 Dec 2023 16:48 UTC
123 points
30 comments13 min readLW link

An An­a­lytic Per­spec­tive on AI Alignment

DanielFilan1 Mar 2020 4:10 UTC
54 points
45 comments8 min readLW link
(danielfilan.com)

Ver­ifi­ca­tion and Transparency

DanielFilan8 Aug 2019 1:50 UTC
35 points
6 comments2 min readLW link
(danielfilan.com)

Mechanis­tic Trans­parency for Ma­chine Learning

DanielFilan11 Jul 2018 0:34 UTC
54 points
9 comments4 min readLW link

How can In­ter­pretabil­ity help Align­ment?

23 May 2020 16:16 UTC
37 points
3 comments9 min readLW link

The case for be­com­ing a black-box in­ves­ti­ga­tor of lan­guage models

Buck6 May 2022 14:35 UTC
126 points
20 comments3 min readLW link

One Way to Think About ML Transparency

Matthew Barnett2 Sep 2019 23:27 UTC
26 points
28 comments5 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhub10 Sep 2019 23:03 UTC
69 points
27 comments27 min readLW link

Deep Learn­ing Sys­tems Are Not Less In­ter­pretable Than Logic/​Prob­a­bil­ity/​Etc

johnswentworth4 Jun 2022 5:41 UTC
148 points
55 comments2 min readLW link1 review

Spar­sity and in­ter­pretabil­ity?

1 Jun 2020 13:25 UTC
41 points
3 comments7 min readLW link

How Do Selec­tion The­o­rems Re­late To In­ter­pretabil­ity?

johnswentworth9 Jun 2022 19:39 UTC
60 points
14 comments3 min readLW link

Progress Re­port 6: get the tool working

Nathan Helm-Burger10 Jun 2022 11:18 UTC
4 points
0 comments2 min readLW link

[Question] Can you MRI a deep learn­ing model?

Yair Halberstadt13 Jun 2022 13:43 UTC
3 points
3 comments1 min readLW link

Mechanism for fea­ture learn­ing in neu­ral net­works and back­prop­a­ga­tion-free ma­chine learn­ing models

Matt Goldenberg19 Mar 2024 14:55 UTC
8 points
1 comment1 min readLW link
(www.science.org)

AXRP Epi­sode 21 - In­ter­pretabil­ity for Eng­ineers with Stephen Casper

DanielFilan2 May 2023 0:50 UTC
12 points
1 comment66 min readLW link

Vi­su­al­iz­ing Neu­ral net­works, how to blame the bias

Donald Hobson9 Jul 2022 15:52 UTC
7 points
1 comment6 min readLW link

[Question] How op­ti­mistic should we be about AI figur­ing out how to in­ter­pret it­self?

oh5432125 Jul 2022 22:09 UTC
3 points
1 comment1 min readLW link

Pre­cur­sor check­ing for de­cep­tive alignment

evhub3 Aug 2022 22:56 UTC
24 points
0 comments14 min readLW link

[Linkpost]Trans­former-Based LM Sur­prisal Pre­dicts Hu­man Read­ing Times Best with About Two Billion Train­ing Tokens

Curtis Huebner4 May 2023 17:16 UTC
10 points
1 comment1 min readLW link
(arxiv.org)

In­ter­pretabil­ity/​Tool-ness/​Align­ment/​Cor­rigi­bil­ity are not Composable

johnswentworth8 Aug 2022 18:05 UTC
130 points
12 comments3 min readLW link

AI Trans­parency: Why it’s crit­i­cal and how to ob­tain it.

Zohar Jackson14 Aug 2022 10:31 UTC
6 points
1 comment5 min readLW link

What Makes an Idea Un­der­stand­able? On Ar­chi­tec­turally and Cul­turally Nat­u­ral Ideas.

16 Aug 2022 2:09 UTC
21 points
2 comments16 min readLW link

Stage­wise Devel­op­ment in Neu­ral Networks

20 Mar 2024 19:54 UTC
89 points
1 comment11 min readLW link

Find­ing Sparse Lin­ear Con­nec­tions be­tween Fea­tures in LLMs

9 Dec 2023 2:27 UTC
69 points
5 comments10 min readLW link

What Makes A Good Mea­sure­ment De­vice?

johnswentworth24 Aug 2022 22:45 UTC
37 points
7 comments2 min readLW link

Ra­tional An­i­ma­tions’ in­tro to mechanis­tic interpretability

Writer14 Jun 2024 16:10 UTC
45 points
1 comment11 min readLW link
(youtu.be)

Tak­ing the pa­ram­e­ters which seem to mat­ter and ro­tat­ing them un­til they don’t

Garrett Baker26 Aug 2022 18:26 UTC
120 points
48 comments1 min readLW link

[Linkpost] Play with SAEs on Llama 3

25 Sep 2024 22:35 UTC
40 points
2 comments1 min readLW link

Re­search Re­port: Sparse Au­toen­coders find only 9/​180 board state fea­tures in OthelloGPT

Robert_AIZI5 Mar 2024 13:55 UTC
61 points
24 comments10 min readLW link
(aizi.substack.com)

A rough idea for solv­ing ELK: An ap­proach for train­ing gen­er­al­ist agents like GATO to make plans and de­scribe them to hu­mans clearly and hon­estly.

Michael Soareverix8 Sep 2022 15:20 UTC
2 points
2 comments2 min readLW link

Swap and Scale

Stephen Fowler9 Sep 2022 22:41 UTC
17 points
3 comments1 min readLW link

[Linkpost] A sur­vey on over 300 works about in­ter­pretabil­ity in deep networks

scasper12 Sep 2022 19:07 UTC
97 points
7 comments2 min readLW link
(arxiv.org)

Ex­cit­ing New In­ter­pretabil­ity Paper!

research_prime_space9 May 2023 16:39 UTC
12 points
1 comment1 min readLW link

EIS XIII: Reflec­tions on An­thropic’s SAE Re­search Circa May 2024

scasper21 May 2024 20:15 UTC
157 points
16 comments3 min readLW link

Sparse tri­nary weighted RNNs as a path to bet­ter lan­guage model interpretability

Am8ryllis17 Sep 2022 19:48 UTC
19 points
13 comments3 min readLW link

Toy Models of Superposition

evhub21 Sep 2022 23:48 UTC
69 points
4 comments5 min readLW link1 review
(transformer-circuits.pub)

Apollo Re­search 1-year update

29 May 2024 17:44 UTC
93 points
0 comments7 min readLW link

AGI-Au­to­mated In­ter­pretabil­ity is Suicide

__RicG__10 May 2023 14:20 UTC
23 points
33 comments7 min readLW link

QAPR 3: in­ter­pretabil­ity-guided train­ing of neu­ral nets

Quintin Pope28 Sep 2022 16:02 UTC
58 points
2 comments10 min readLW link

New OpenAI Paper—Lan­guage mod­els can ex­plain neu­rons in lan­guage models

MrThink10 May 2023 7:46 UTC
47 points
14 comments1 min readLW link

More Re­cent Progress in the The­ory of Neu­ral Networks

jylin046 Oct 2022 16:57 UTC
82 points
6 comments4 min readLW link

Poly­se­man­tic­ity and Ca­pac­ity in Neu­ral Networks

7 Oct 2022 17:51 UTC
87 points
14 comments3 min readLW link

Ar­ti­cle Re­view: Google’s AlphaTensor

Robert_AIZI12 Oct 2022 18:04 UTC
8 points
4 comments10 min readLW link

[Question] Pre­vi­ous Work on Re­cre­at­ing Neu­ral Net­work In­put from In­ter­me­di­ate Layer Activations

bglass12 Oct 2022 19:28 UTC
1 point
3 comments1 min readLW link

Re­fusal mechanisms: ini­tial ex­per­i­ments with Llama-2-7b-chat

8 Dec 2023 17:08 UTC
81 points
7 comments7 min readLW link

(OLD) An Ex­tremely Opinionated An­no­tated List of My Favourite Mechanis­tic In­ter­pretabil­ity Papers

Neel Nanda18 Oct 2022 21:08 UTC
72 points
5 comments12 min readLW link
(www.neelnanda.io)

A Bare­bones Guide to Mechanis­tic In­ter­pretabil­ity Prerequisites

Neel Nanda24 Oct 2022 20:45 UTC
63 points
12 comments3 min readLW link
(neelnanda.io)

A Walk­through of A Math­e­mat­i­cal Frame­work for Trans­former Circuits

Neel Nanda25 Oct 2022 20:24 UTC
52 points
7 comments1 min readLW link
(www.youtube.com)

At­ten­tion Out­put SAEs Im­prove Cir­cuit Analysis

21 Jun 2024 12:56 UTC
31 points
0 comments19 min readLW link

Ac­ti­va­tion ad­di­tions in a small resi­d­ual network

Garrett Baker22 May 2023 20:28 UTC
22 points
4 comments3 min readLW link

[Book] In­ter­pretable Ma­chine Learn­ing: A Guide for Mak­ing Black Box Models Explainable

Esben Kran31 Oct 2022 11:38 UTC
20 points
1 comment1 min readLW link
(christophm.github.io)

“Cars and Elephants”: a hand­wavy ar­gu­ment/​anal­ogy against mechanis­tic interpretability

David Scott Krueger (formerly: capybaralet)31 Oct 2022 21:26 UTC
48 points
25 comments2 min readLW link

Real-Time Re­search Record­ing: Can a Trans­former Re-Derive Po­si­tional Info?

Neel Nanda1 Nov 2022 23:56 UTC
69 points
16 comments1 min readLW link
(youtu.be)

A Mys­tery About High Di­men­sional Con­cept Encoding

Fabien Roger3 Nov 2022 17:05 UTC
46 points
13 comments7 min readLW link

AXRP Epi­sode 36 - Adam Shai and Paul Riech­ers on Com­pu­ta­tional Mechanics

DanielFilan29 Sep 2024 5:50 UTC
25 points
0 comments55 min readLW link

[Linkpost] In­ter­pretabil­ity Dreams

DanielFilan24 May 2023 21:08 UTC
39 points
2 comments2 min readLW link
(transformer-circuits.pub)

A Walk­through of In­ter­pretabil­ity in the Wild (w/​ au­thors Kevin Wang, Arthur Conmy & Alexan­dre Variengien)

Neel Nanda7 Nov 2022 22:39 UTC
30 points
15 comments3 min readLW link
(youtu.be)

Search ver­sus design

Alex Flint16 Aug 2020 16:53 UTC
108 points
40 comments36 min readLW link1 review

SAE fea­ture ge­om­e­try is out­side the su­per­po­si­tion hypothesis

jake_mendel24 Jun 2024 16:07 UTC
221 points
17 comments11 min readLW link

Why and When In­ter­pretabil­ity Work is Dangerous

Nicholas / Heather Kross28 May 2023 0:27 UTC
20 points
8 comments8 min readLW link
(www.thinkingmuchbetter.com)

Re­sults from the Tur­ing Sem­i­nar hackathon

7 Dec 2023 14:50 UTC
29 points
1 comment6 min readLW link

Ex­plor­ing SAE fea­tures in LLMs with defi­ni­tion trees and to­ken lists

mwatkins4 Oct 2024 22:15 UTC
37 points
5 comments6 min readLW link

A Walk­through of In-Con­text Learn­ing and In­duc­tion Heads (w/​ Charles Frye) Part 1 of 2

Neel Nanda22 Nov 2022 17:12 UTC
20 points
0 comments1 min readLW link
(www.youtube.com)

Sub­sets and quo­tients in interpretability

Erik Jenner2 Dec 2022 23:13 UTC
26 points
1 comment7 min readLW link

Find­ing gliders in the game of life

paulfchristiano1 Dec 2022 20:40 UTC
101 points
7 comments16 min readLW link
(ai-alignment.com)

Ex­plor­ing Con­cept-Spe­cific Slices in Weight Ma­tri­ces for Net­work Interpretability

DuncanFowler9 Jun 2023 16:39 UTC
1 point
0 comments6 min readLW link

In­fer­ence-Time In­ter­ven­tion: Elic­it­ing Truth­ful An­swers from a Lan­guage Model

likenneth11 Jun 2023 5:38 UTC
195 points
4 comments1 min readLW link
(arxiv.org)

A Selec­tion of Ran­domly Selected SAE Features

1 Apr 2024 9:09 UTC
109 points
2 comments4 min readLW link

[ASoT] Nat­u­ral ab­strac­tions and AlphaZero

Ulisse Mini10 Dec 2022 17:53 UTC
33 points
1 comment1 min readLW link
(arxiv.org)

Assess­ment of AI safety agen­das: think about the down­side risk

Roman Leventov19 Dec 2023 9:00 UTC
13 points
1 comment1 min readLW link

HDBSCAN is Sur­pris­ingly Effec­tive at Find­ing In­ter­pretable Clusters of the SAE De­coder Matrix

11 Oct 2024 23:06 UTC
8 points
2 comments10 min readLW link

Paper: Trans­form­ers learn in-con­text by gra­di­ent descent

LawrenceC16 Dec 2022 11:10 UTC
28 points
11 comments2 min readLW link
(arxiv.org)

Can we effi­ciently ex­plain model be­hav­iors?

paulfchristiano16 Dec 2022 19:40 UTC
64 points
3 comments9 min readLW link
(ai-alignment.com)

Durkon, an open-source tool for In­her­ently In­ter­pretable Modelling

abstractapplic24 Dec 2022 1:49 UTC
37 points
0 comments4 min readLW link

Con­crete Steps to Get Started in Trans­former Mechanis­tic Interpretability

Neel Nanda25 Dec 2022 22:21 UTC
56 points
7 comments12 min readLW link
(www.neelnanda.io)

SAE-VIS: An­nounce­ment Post

31 Mar 2024 15:30 UTC
74 points
8 comments1 min readLW link

Analo­gies be­tween Soft­ware Re­v­erse Eng­ineer­ing and Mechanis­tic Interpretability

26 Dec 2022 12:26 UTC
34 points
6 comments11 min readLW link
(www.neelnanda.io)

200 COP in MI: The Case for Analysing Toy Lan­guage Models

Neel Nanda28 Dec 2022 21:07 UTC
39 points
3 comments7 min readLW link

In­ter­pret­ing Prefer­ence Models w/​ Sparse Autoencoders

1 Jul 2024 21:35 UTC
74 points
12 comments9 min readLW link

200 COP in MI: Look­ing for Cir­cuits in the Wild

Neel Nanda29 Dec 2022 20:59 UTC
16 points
5 comments13 min readLW link

fMRI LIKE APPROACH TO AI ALIGNMENT /​ DECEPTIVE BEHAVIOUR

Escaque 6611 Jul 2023 17:17 UTC
−1 points
3 comments2 min readLW link

Cir­cuits in Su­per­po­si­tion: Com­press­ing many small neu­ral net­works into one

14 Oct 2024 13:06 UTC
126 points
8 comments13 min readLW link

The Com­pu­ta­tional Com­plex­ity of Cir­cuit Dis­cov­ery for In­ner Interpretability

Bogdan Ionut Cirstea17 Oct 2024 13:18 UTC
11 points
2 comments1 min readLW link
(arxiv.org)

200 COP in MI: Ex­plor­ing Poly­se­man­tic­ity and Superposition

Neel Nanda3 Jan 2023 1:52 UTC
34 points
6 comments16 min readLW link

Towards Devel­op­men­tal Interpretability

12 Jul 2023 19:33 UTC
180 points
9 comments9 min readLW link

Com­ments on OpenPhil’s In­ter­pretabil­ity RFP

paulfchristiano5 Nov 2021 22:36 UTC
91 points
5 comments7 min readLW link

200 COP in MI: Analysing Train­ing Dynamics

Neel Nanda4 Jan 2023 16:08 UTC
16 points
0 comments14 min readLW link

Au­toIn­ter­pre­ta­tion Finds Sparse Cod­ing Beats Alternatives

Hoagy17 Jul 2023 1:41 UTC
56 points
1 comment7 min readLW link

Paper: Su­per­po­si­tion, Me­moriza­tion, and Dou­ble Des­cent (An­thropic)

LawrenceC5 Jan 2023 17:54 UTC
53 points
11 comments1 min readLW link
(transformer-circuits.pub)

200 COP in MI: Tech­niques, Tool­ing and Automation

Neel Nanda6 Jan 2023 15:08 UTC
13 points
0 comments15 min readLW link

200 COP in MI: Image Model Interpretability

Neel Nanda8 Jan 2023 14:53 UTC
18 points
3 comments6 min readLW link

He­donic Loops and Tam­ing RL

beren19 Jul 2023 15:12 UTC
20 points
14 comments9 min readLW link

How ARENA course ma­te­rial gets made

CallumMcDougall2 Jul 2024 18:04 UTC
41 points
2 comments7 min readLW link

200 COP in MI: In­ter­pret­ing Re­in­force­ment Learning

Neel Nanda10 Jan 2023 17:37 UTC
25 points
1 comment10 min readLW link

Tiny Mech In­terp Pro­jects: Emer­gent Po­si­tional Embed­dings of Words

Neel Nanda18 Jul 2023 21:24 UTC
51 points
1 comment9 min readLW link

World-Model In­ter­pretabil­ity Is All We Need

Thane Ruthenis14 Jan 2023 19:37 UTC
35 points
22 comments21 min readLW link

Desider­ata for an AI

Nathan Helm-Burger19 Jul 2023 16:18 UTC
9 points
0 comments4 min readLW link

How does GPT-3 spend its 175B pa­ram­e­ters?

Robert_AIZI13 Jan 2023 19:21 UTC
41 points
14 comments6 min readLW link
(aizi.substack.com)

Open prob­lems in ac­ti­va­tion engineering

24 Jul 2023 19:46 UTC
51 points
2 comments1 min readLW link
(coda.io)

Mech In­terp Puz­zle 1: Sus­pi­ciously Similar Embed­dings in GPT-Neo

Neel Nanda16 Jul 2023 22:02 UTC
65 points
15 comments1 min readLW link

200 COP in MI: Study­ing Learned Fea­tures in Lan­guage Models

Neel Nanda19 Jan 2023 3:48 UTC
24 points
2 comments30 min readLW link

[Question] Trans­former Mech In­terp: Any vi­su­al­iza­tions?

Joyee Chen18 Jan 2023 4:32 UTC
3 points
0 comments1 min readLW link

SAEs you can See: Ap­ply­ing Sparse Au­toen­coders to Clustering

Robert_AIZI28 Oct 2024 14:48 UTC
26 points
0 comments9 min readLW link

Does Cir­cuit Anal­y­sis In­ter­pretabil­ity Scale? Ev­i­dence from Mul­ti­ple Choice Ca­pa­bil­ities in Chinchilla

20 Jul 2023 10:50 UTC
44 points
3 comments2 min readLW link
(arxiv.org)

An­thropic an­nounces in­ter­pretabil­ity ad­vances. How much does this ad­vance al­ign­ment?

Seth Herd21 May 2024 22:30 UTC
49 points
4 comments3 min readLW link
(www.anthropic.com)

Really Strong Fea­tures Found in Resi­d­ual Stream

Logan Riggs8 Jul 2023 19:40 UTC
69 points
6 comments2 min readLW link

Mechanis­tic In­ter­pretabil­ity Quick­start Guide

Neel Nanda31 Jan 2023 16:35 UTC
42 points
3 comments6 min readLW link
(www.neelnanda.io)

More find­ings on Me­moriza­tion and dou­ble descent

Marius Hobbhahn1 Feb 2023 18:26 UTC
53 points
2 comments19 min readLW link

More find­ings on max­i­mal data dimension

Marius Hobbhahn2 Feb 2023 18:33 UTC
27 points
1 comment11 min readLW link

Neuronpedia

Johnny Lin26 Jul 2023 16:29 UTC
135 points
51 comments2 min readLW link
(neuronpedia.org)

AXRP Epi­sode 19 - Mechanis­tic In­ter­pretabil­ity with Neel Nanda

DanielFilan4 Feb 2023 3:00 UTC
45 points
0 comments117 min readLW link

AXRP Epi­sode 23 - Mechanis­tic Ano­maly De­tec­tion with Mark Xu

DanielFilan27 Jul 2023 1:50 UTC
22 points
0 comments72 min readLW link

Mech In­terp Pro­ject Ad­vis­ing Call: Me­mori­sa­tion in GPT-2 Small

Neel Nanda4 Feb 2023 14:17 UTC
7 points
0 comments1 min readLW link

An Ex­tremely Opinionated An­no­tated List of My Favourite Mechanis­tic In­ter­pretabil­ity Papers v2

Neel Nanda7 Jul 2024 17:39 UTC
134 points
15 comments25 min readLW link

[ASoT] Policy Tra­jec­tory Visualization

Ulisse Mini7 Feb 2023 0:13 UTC
9 points
2 comments1 min readLW link

Re­view of AI Align­ment Progress

PeterMcCluskey7 Feb 2023 18:57 UTC
72 points
32 comments7 min readLW link
(bayesianinvestor.com)

Why I’m bear­ish on mechanis­tic in­ter­pretabil­ity: the shards are not in the network

tailcalled13 Sep 2024 17:09 UTC
19 points
40 comments1 min readLW link

Mech In­terp Puz­zle 2: Word2Vec Style Embeddings

Neel Nanda28 Jul 2023 0:50 UTC
40 points
4 comments2 min readLW link

On Devel­op­ing a Math­e­mat­i­cal The­ory of In­ter­pretabil­ity

carboniferous_umbraculum 9 Feb 2023 1:45 UTC
64 points
8 comments6 min readLW link

Apollo Re­search is hiring evals and in­ter­pretabil­ity en­g­ineers & scientists

Marius Hobbhahn4 Aug 2023 10:54 UTC
25 points
0 comments2 min readLW link

Toward A Math­e­mat­i­cal Frame­work for Com­pu­ta­tion in Superposition

18 Jan 2024 21:06 UTC
203 points
18 comments63 min readLW link

The con­cep­tual Dop­pelgänger problem

TsviBT12 Feb 2023 17:23 UTC
12 points
5 comments4 min readLW link

AI al­ign­ment as a trans­la­tion problem

Roman Leventov5 Feb 2024 14:14 UTC
22 points
2 comments3 min readLW link

AXRP Epi­sode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

DanielFilan24 Aug 2024 22:30 UTC
21 points
0 comments74 min readLW link

Gated At­ten­tion Blocks: Pre­limi­nary Progress to­ward Re­mov­ing At­ten­tion Head Superposition

8 Apr 2024 11:14 UTC
37 points
4 comments15 min readLW link

EIS V: Blind Spots In AI Safety In­ter­pretabil­ity Research

scasper16 Feb 2023 19:09 UTC
54 points
24 comments10 min readLW link

Grow­ing Bon­sai Net­works with RNNs

ameo7 Aug 2023 17:34 UTC
21 points
5 comments1 min readLW link
(cprimozic.net)

How Do In­duc­tion Heads Ac­tu­ally Work in Trans­form­ers With Finite Ca­pac­ity?

Fabien Roger23 Mar 2023 9:09 UTC
27 points
0 comments5 min readLW link

Wittgen­stein and ML — pa­ram­e­ters vs architecture

Cleo Nardo24 Mar 2023 4:54 UTC
44 points
9 comments5 min readLW link

Causal Graphs of GPT-2-Small’s Resi­d­ual Stream

David Udell9 Jul 2024 22:06 UTC
53 points
7 comments7 min readLW link

Mea­sur­ing Struc­ture Devel­op­ment in Al­gorith­mic Transformers

22 Aug 2024 8:38 UTC
56 points
4 comments11 min readLW link

In­ter­ven­ing in the Resi­d­ual Stream

MadHatter22 Feb 2023 6:29 UTC
30 points
1 comment9 min readLW link

Video/​an­i­ma­tion: Neel Nanda ex­plains what mechanis­tic in­ter­pretabil­ity is

DanielFilan22 Feb 2023 22:42 UTC
24 points
7 comments1 min readLW link
(youtu.be)

Othello-GPT: Fu­ture Work I Am Ex­cited About

Neel Nanda29 Mar 2023 22:13 UTC
48 points
2 comments33 min readLW link
(neelnanda.io)

Ap­ply for the 2023 Devel­op­men­tal In­ter­pretabil­ity Con­fer­ence!

25 Aug 2023 7:12 UTC
33 points
0 comments2 min readLW link

Paper Walk­through: Au­to­mated Cir­cuit Dis­cov­ery with Arthur Conmy

Neel Nanda29 Aug 2023 22:07 UTC
36 points
1 comment1 min readLW link
(www.youtube.com)

Othello-GPT: Reflec­tions on the Re­search Process

Neel Nanda29 Mar 2023 22:13 UTC
36 points
0 comments15 min readLW link
(neelnanda.io)

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

11 Mar 2023 18:59 UTC
328 points
27 comments23 min readLW link

Map­ping the se­man­tic void: Strange go­ings-on in GPT em­bed­ding spaces

mwatkins14 Dec 2023 13:10 UTC
114 points
31 comments14 min readLW link

Ad­den­dum: ba­sic facts about lan­guage mod­els dur­ing training

beren6 Mar 2023 19:24 UTC
22 points
2 comments5 min readLW link

The Lo­cal In­ter­ac­tion Ba­sis: Iden­ti­fy­ing Com­pu­ta­tion­ally-Rele­vant and Sparsely In­ter­act­ing Fea­tures in Neu­ral Networks

20 May 2024 17:53 UTC
105 points
4 comments3 min readLW link

Maze-solv­ing agents: Add a top-right vec­tor, make the agent go to the top-right

31 Mar 2023 19:20 UTC
101 points
17 comments11 min readLW link

The Translu­cent Thoughts Hy­pothe­ses and Their Implications

Fabien Roger9 Mar 2023 16:30 UTC
141 points
7 comments19 min readLW link

Discrim­i­nat­ing Be­hav­iorally Iden­ti­cal Clas­sifiers: a model prob­lem for ap­ply­ing in­ter­pretabil­ity to scal­able oversight

Sam Marks18 Apr 2024 16:17 UTC
107 points
10 comments12 min readLW link

Paper Repli­ca­tion Walk­through: Re­v­erse-Eng­ineer­ing Mo­du­lar Addition

Neel Nanda12 Mar 2023 13:25 UTC
18 points
0 comments1 min readLW link
(neelnanda.io)

[Sum­mary] Progress Up­date #1 from the GDM Mech In­terp Team

19 Apr 2024 19:06 UTC
68 points
0 comments3 min readLW link

At­tri­bu­tion Patch­ing: Ac­ti­va­tion Patch­ing At In­dus­trial Scale

Neel Nanda16 Mar 2023 21:44 UTC
45 points
10 comments58 min readLW link
(www.neelnanda.io)

In­tro­duc­ing Leap Labs, an AI in­ter­pretabil­ity startup

Jessica Rumbelow6 Mar 2023 16:16 UTC
103 points
12 comments1 min readLW link

A cir­cuit for Python doc­strings in a 4-layer at­ten­tion-only transformer

20 Feb 2023 19:35 UTC
95 points
8 comments21 min readLW link

You’re Mea­sur­ing Model Com­plex­ity Wrong

11 Oct 2023 11:46 UTC
87 points
15 comments13 min readLW link

Interpretability

29 Oct 2021 7:28 UTC
60 points
13 comments12 min readLW link

Gi­ant (In)scrutable Ma­tri­ces: (Maybe) the Best of All Pos­si­ble Worlds

1a3orn4 Apr 2023 17:39 UTC
196 points
37 comments5 min readLW link

JumpReLU SAEs + Early Ac­cess to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
48 points
10 comments1 min readLW link
(storage.googleapis.com)

Sparse Cod­ing, for Mechanis­tic In­ter­pretabil­ity and Ac­ti­va­tion Engineering

David Udell23 Sep 2023 19:16 UTC
42 points
7 comments34 min readLW link

[Full Post] Progress Up­date #1 from the GDM Mech In­terp Team

19 Apr 2024 19:06 UTC
73 points
10 comments8 min readLW link

Difficulty classes for al­ign­ment properties

Jozdien20 Feb 2024 9:08 UTC
34 points
5 comments2 min readLW link

ProLU: A Non­lin­ear­ity for Sparse Autoencoders

Glen Taggart23 Apr 2024 14:09 UTC
44 points
4 comments9 min readLW link

High­lights: Went­worth, Shah, and Mur­phy on “Re­tar­get­ing the Search”

RobertM14 Sep 2023 2:18 UTC
85 points
4 comments8 min readLW link

Mech In­terp Challenge: Septem­ber—De­ci­pher­ing the Ad­di­tion Model

CallumMcDougall13 Sep 2023 22:23 UTC
35 points
0 comments4 min readLW link

Sparse Au­toen­coders Find Highly In­ter­pretable Direc­tions in Lan­guage Models

21 Sep 2023 15:30 UTC
158 points
8 comments5 min readLW link

Mech In­terp Challenge: Jan­uary—De­ci­pher­ing the Cae­sar Cipher Model

CallumMcDougall1 Jan 2024 18:03 UTC
17 points
0 comments3 min readLW link

Iden­ti­fy­ing se­man­tic neu­rons, mechanis­tic cir­cuits & in­ter­pretabil­ity web apps

13 Apr 2023 11:59 UTC
18 points
0 comments8 min readLW link

Steer­ing GPT-2-XL by adding an ac­ti­va­tion vector

13 May 2023 18:42 UTC
436 points
97 comments50 min readLW link

Why I stopped be­ing into basin broadness

tailcalled25 Apr 2024 20:47 UTC
16 points
3 comments2 min readLW link

Shap­ley Value At­tri­bu­tion in Chain of Thought

leogao14 Apr 2023 5:56 UTC
103 points
7 comments4 min readLW link

Three ways in­ter­pretabil­ity could be impactful

Arthur Conmy18 Sep 2023 1:02 UTC
47 points
8 comments4 min readLW link

Smar­tyHead­erCode: anoma­lous to­kens for GPT3.5 and GPT-4

AdamYedidia15 Apr 2023 22:35 UTC
71 points
18 comments6 min readLW link

Neel Nanda on the Mechanis­tic In­ter­pretabil­ity Re­searcher Mindset

Michaël Trazzi21 Sep 2023 19:47 UTC
37 points
1 comment3 min readLW link
(theinsideview.ai)

Lan­guage Models are a Po­ten­tially Safe Path to Hu­man-Level AGI

Nadav Brandes20 Apr 2023 0:40 UTC
28 points
6 comments8 min readLW link

Be­havi­oural statis­tics for a maze-solv­ing agent

20 Apr 2023 22:26 UTC
46 points
11 comments10 min readLW link

In­ter­pret­ing OpenAI’s Whisper

EllenaR24 Sep 2023 17:53 UTC
114 points
13 comments7 min readLW link

Mech In­terp Challenge: Oc­to­ber—De­ci­pher­ing the Sorted List Model

CallumMcDougall3 Oct 2023 10:57 UTC
23 points
0 comments3 min readLW link

Su­per­po­si­tion is not “just” neu­ron polysemanticity

LawrenceC26 Apr 2024 23:22 UTC
64 points
4 comments13 min readLW link

Should we pub­lish mechanis­tic in­ter­pretabil­ity re­search?

21 Apr 2023 16:19 UTC
105 points
40 comments13 min readLW link

Neu­ral net­work poly­topes (Co­lab note­book)

Zach Furman21 Apr 2023 22:42 UTC
11 points
0 comments1 min readLW link
(colab.research.google.com)

Ex­plain­ing the Trans­former Cir­cuits Frame­work by Example

Felix Hofstätter25 Apr 2023 13:45 UTC
8 points
0 comments15 min readLW link

Im­prov­ing Dic­tionary Learn­ing with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
63 points
38 comments1 min readLW link
(arxiv.org)

Fact Find­ing: At­tempt­ing to Re­v­erse-Eng­ineer Fac­tual Re­call on the Neu­ron Level (Post 1)

23 Dec 2023 2:44 UTC
108 points
8 comments22 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
137 points
41 comments11 min readLW link2 reviews

Paper: Un­der­stand­ing and Con­trol­ling a Maze-Solv­ing Policy Network

13 Oct 2023 1:38 UTC
70 points
0 comments1 min readLW link
(arxiv.org)

SAEs Dis­cover Mean­ingful Fea­tures in the IOI Task

5 Jun 2024 23:48 UTC
15 points
2 comments10 min readLW link

Physics of Lan­guage mod­els (part 2.1)

Nathan Helm-Burger19 Sep 2024 16:48 UTC
9 points
2 comments1 min readLW link
(youtu.be)

[Paper] All’s Fair In Love And Love: Copy Sup­pres­sion in GPT-2 Small

13 Oct 2023 18:32 UTC
82 points
4 comments8 min readLW link

Mech In­terp Challenge: Novem­ber—De­ci­pher­ing the Cu­mu­la­tive Sum Model

CallumMcDougall2 Nov 2023 17:10 UTC
18 points
2 comments2 min readLW link

A New Class of Glitch To­kens—BPE Subto­ken Ar­ti­facts (BSA)

Lao Mein20 Sep 2024 13:13 UTC
37 points
7 comments5 min readLW link

Why did ChatGPT say that? Prompt en­g­ineer­ing and more, with PIZZA.

Jessica Rumbelow3 Aug 2024 12:07 UTC
40 points
2 comments4 min readLW link

Mechanis­tic In­ter­pretabil­ity Work­shop Hap­pen­ing at ICML 2024!

3 May 2024 1:18 UTC
48 points
6 comments1 min readLW link

Im­prov­ing SAE’s by Sqrt()-ing L1 & Re­mov­ing Low­est Ac­ti­vat­ing Fea­tures

15 Mar 2024 16:30 UTC
26 points
5 comments4 min readLW link

Ev­i­dence of Learned Look-Ahead in a Chess-Play­ing Neu­ral Network

Erik Jenner4 Jun 2024 15:50 UTC
120 points
14 comments13 min readLW link

Multi-di­men­sional re­wards for AGI in­ter­pretabil­ity and control

Steven Byrnes4 Jan 2021 3:08 UTC
19 points
8 comments10 min readLW link

Glitch To­ken Cat­a­log - (Al­most) a Full Clear

Lao Mein21 Sep 2024 12:22 UTC
37 points
3 comments37 min readLW link

MIRI com­ments on Co­tra’s “Case for Align­ing Nar­rowly Su­per­hu­man Models”

Rob Bensinger5 Mar 2021 23:43 UTC
142 points
13 comments26 min readLW link

Trans­parency Trichotomy

Mark Xu28 Mar 2021 20:26 UTC
25 points
2 comments7 min readLW link

Solv­ing the whole AGI con­trol prob­lem, ver­sion 0.0001

Steven Byrnes8 Apr 2021 15:14 UTC
63 points
7 comments26 min readLW link

Dropout can cre­ate a priv­ileged ba­sis in the ReLU out­put model.

lewis smith28 Apr 2023 1:59 UTC
24 points
3 comments5 min readLW link

Knowl­edge Neu­rons in Pre­trained Transformers

evhub17 May 2021 22:54 UTC
100 points
7 comments2 min readLW link
(arxiv.org)

Self-ex­plain­ing SAE features

5 Aug 2024 22:20 UTC
60 points
13 comments10 min readLW link

Garrabrant and Shah on hu­man mod­el­ing in AGI

Rob Bensinger4 Aug 2021 4:35 UTC
60 points
10 comments47 min readLW link

Neu­ral net /​ de­ci­sion tree hy­brids: a po­ten­tial path to­ward bridg­ing the in­ter­pretabil­ity gap

Nathan Helm-Burger23 Sep 2021 0:38 UTC
21 points
2 comments12 min readLW link

AtP*: An effi­cient and scal­able method for lo­cal­iz­ing LLM be­havi­our to components

18 Mar 2024 17:28 UTC
19 points
0 comments1 min readLW link
(arxiv.org)

Let’s buy out Cyc, for use in AGI in­ter­pretabil­ity sys­tems?

Steven Byrnes7 Dec 2021 20:46 UTC
49 points
10 comments2 min readLW link

Solv­ing In­ter­pretabil­ity Week

Logan Riggs13 Dec 2021 17:09 UTC
11 points
5 comments1 min readLW link

In­ter­pretabil­ity with Sparse Au­toen­coders (Co­lab ex­er­cises)

CallumMcDougall29 Nov 2023 12:56 UTC
74 points
9 comments4 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
127 points
9 comments15 min readLW link

You can re­move GPT2’s Lay­erNorm by fine-tun­ing for an hour

StefanHex8 Aug 2024 18:33 UTC
161 points
11 comments8 min readLW link

How use­ful is mechanis­tic in­ter­pretabil­ity?

1 Dec 2023 2:54 UTC
163 points
54 comments25 min readLW link

Au­tomat­ing LLM Au­dit­ing with Devel­op­men­tal Interpretability

4 Sep 2024 15:50 UTC
17 points
0 comments3 min readLW link

An­nounc­ing Hu­man-al­igned AI Sum­mer School

22 May 2024 8:55 UTC
50 points
0 comments1 min readLW link
(humanaligned.ai)

Ques­tion 3: Con­trol pro­pos­als for min­i­miz­ing bad outcomes

Cameron Berg12 Feb 2022 19:13 UTC
5 points
1 comment7 min readLW link

“What the hell is a rep­re­sen­ta­tion, any­way?” | Clar­ify­ing AI in­ter­pretabil­ity with tools from philos­o­phy of cog­ni­tive sci­ence | Part 1: Ve­hi­cles vs. contents

IwanWilliams9 Jun 2024 14:19 UTC
9 points
1 comment4 min readLW link

Progress Re­port 1: in­ter­pretabil­ity ex­per­i­ments & learn­ing, test­ing com­pres­sion hypotheses

Nathan Helm-Burger22 Mar 2022 20:12 UTC
11 points
0 comments2 min readLW link

[In­tro to brain-like-AGI safety] 9. Take­aways from neuro 2/​2: On AGI motivation

Steven Byrnes23 Mar 2022 12:48 UTC
44 points
11 comments22 min readLW link

Ex­plain­ing grokking through cir­cuit efficiency

8 Sep 2023 14:39 UTC
101 points
11 comments3 min readLW link
(arxiv.org)

Au­to­mat­i­cally find­ing fea­ture vec­tors in the OV cir­cuits of Trans­form­ers with­out us­ing probing

Jacob Dunefsky12 Sep 2023 17:38 UTC
13 points
0 comments29 min readLW link

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

14 Sep 2023 1:40 UTC
32 points
7 comments8 min readLW link
(far.ai)

Ex­pand­ing the Scope of Superposition

Derek Larson13 Sep 2023 17:38 UTC
10 points
0 comments4 min readLW link

Char­bel-Raphaël and Lu­cius dis­cuss interpretability

30 Oct 2023 5:50 UTC
105 points
7 comments21 min readLW link

Seek­ing Feed­back on My Mechanis­tic In­ter­pretabil­ity Re­search Agenda

RGRGRG12 Sep 2023 18:45 UTC
3 points
1 comment3 min readLW link

Mechanis­tic In­ter­pretabil­ity Read­ing group

26 Sep 2023 16:26 UTC
15 points
0 comments1 min readLW link

An­nounc­ing the CNN In­ter­pretabil­ity Competition

scasper26 Sep 2023 16:21 UTC
22 points
0 comments4 min readLW link

High-level in­ter­pretabil­ity: de­tect­ing an AI’s objectives

28 Sep 2023 19:30 UTC
69 points
4 comments21 min readLW link

New Tool: the Resi­d­ual Stream Viewer

AdamYedidia1 Oct 2023 0:49 UTC
32 points
7 comments4 min readLW link
(tinyurl.com)

In­ter­pretabil­ity Ex­ter­nal­ities Case Study—Hun­gry Hun­gry Hippos

Magdalena Wache20 Sep 2023 14:42 UTC
64 points
22 comments2 min readLW link

Graph­i­cal ten­sor no­ta­tion for interpretability

Jordan Taylor4 Oct 2023 8:04 UTC
137 points
11 comments19 min readLW link

Tak­ing fea­tures out of su­per­po­si­tion with sparse au­toen­coders more quickly with in­formed initialization

Pierre Peigné23 Sep 2023 16:21 UTC
30 points
8 comments5 min readLW link

Early Ex­per­i­ments in Re­ward Model In­ter­pre­ta­tion Us­ing Sparse Autoencoders

3 Oct 2023 7:45 UTC
17 points
0 comments5 min readLW link

What would it mean to un­der­stand how a large lan­guage model (LLM) works? Some quick notes.

Bill Benzon3 Oct 2023 15:11 UTC
20 points
4 comments8 min readLW link

A per­sonal ex­pla­na­tion of ELK con­cept and task.

Zeyu Qin6 Oct 2023 3:55 UTC
1 point
0 comments1 min readLW link

En­tan­gle­ment and in­tu­ition about words and mean­ing

Bill Benzon4 Oct 2023 14:16 UTC
4 points
0 comments2 min readLW link

At­tribut­ing to in­ter­ac­tions with GCPD and GWPD

jenny11 Oct 2023 15:06 UTC
20 points
0 comments6 min readLW link

Com­par­ing An­thropic’s Dic­tionary Learn­ing to Ours

Robert_AIZI7 Oct 2023 23:30 UTC
137 points
8 comments4 min readLW link

Bird-eye view vi­su­al­iza­tion of LLM activations

Sergii8 Oct 2023 12:12 UTC
11 points
2 comments1 min readLW link
(grgv.xyz)

Un­der­stand­ing LLMs: Some ba­sic ob­ser­va­tions about words, syn­tax, and dis­course [w/​ a con­jec­ture about grokking]

Bill Benzon11 Oct 2023 19:13 UTC
6 points
0 comments5 min readLW link

An­nounc­ing Timaeus

22 Oct 2023 11:59 UTC
187 points
15 comments4 min readLW link

Mechanis­tic in­ter­pretabil­ity of LLM anal­ogy-making

Sergii20 Oct 2023 12:53 UTC
2 points
0 comments4 min readLW link
(grgv.xyz)

In­ter­nal Tar­get In­for­ma­tion for AI Oversight

Paul Colognese20 Oct 2023 14:53 UTC
15 points
0 comments5 min readLW link

Re­veal­ing In­ten­tion­al­ity In Lan­guage Models Through AdaVAE Guided Sampling

jdp20 Oct 2023 7:32 UTC
119 points
15 comments22 min readLW link

[Question] Does a broad overview of Mechanis­tic In­ter­pretabil­ity ex­ist?

kourabi16 Oct 2023 1:16 UTC
1 point
0 comments1 min readLW link

ChatGPT tells 20 ver­sions of its pro­to­typ­i­cal story, with a short note on method

Bill Benzon14 Oct 2023 15:27 UTC
6 points
0 comments5 min readLW link

Map­ping ChatGPT’s on­tolog­i­cal land­scape, gra­di­ents and choices [in­ter­pretabil­ity]

Bill Benzon15 Oct 2023 20:12 UTC
1 point
0 comments18 min readLW link

[Question] Can we iso­late neu­rons that rec­og­nize fea­tures vs. those which have some other role?

Joshua Clancy21 Oct 2023 0:30 UTC
4 points
2 comments3 min readLW link

In­ves­ti­gat­ing the learn­ing co­effi­cient of mod­u­lar ad­di­tion: hackathon project

17 Oct 2023 19:51 UTC
94 points
5 comments12 min readLW link

Fea­tures and Ad­ver­saries in MemoryDT

20 Oct 2023 7:32 UTC
31 points
6 comments25 min readLW link

Thoughts On (Solv­ing) Deep Deception

Jozdien21 Oct 2023 22:40 UTC
69 points
2 comments6 min readLW link

Challenge: know ev­ery­thing that the best go bot knows about go

DanielFilan11 May 2021 5:10 UTC
48 points
113 comments2 min readLW link
(danielfilan.com)

Spec­u­la­tions against GPT-n writ­ing al­ign­ment papers

Donald Hobson7 Jun 2021 21:13 UTC
31 points
6 comments2 min readLW link

Try­ing to ap­prox­i­mate Statis­ti­cal Models as Scor­ing Tables

Jsevillamol29 Jun 2021 17:20 UTC
18 points
2 comments9 min readLW link

Pos­si­ble re­search di­rec­tions to im­prove the mechanis­tic ex­pla­na­tion of neu­ral networks

delton1379 Nov 2021 2:36 UTC
31 points
8 comments9 min readLW link

[linkpost] Ac­qui­si­tion of Chess Knowl­edge in AlphaZero

Quintin Pope23 Nov 2021 7:55 UTC
8 points
1 comment1 min readLW link

Teaser: Hard-cod­ing Trans­former Models

MadHatter12 Dec 2021 22:04 UTC
74 points
19 comments1 min readLW link

The Nat­u­ral Ab­strac­tion Hy­poth­e­sis: Im­pli­ca­tions and Evidence

CallumMcDougall14 Dec 2021 23:14 UTC
39 points
9 comments19 min readLW link

Mechanis­tic In­ter­pretabil­ity for the MLP Lay­ers (rough early thoughts)

MadHatter24 Dec 2021 7:24 UTC
12 points
3 comments1 min readLW link
(www.youtube.com)

An Open Philan­thropy grant pro­posal: Causal rep­re­sen­ta­tion learn­ing of hu­man preferences

PabloAMC11 Jan 2022 11:28 UTC
19 points
6 comments8 min readLW link

Gears-Level Men­tal Models of Trans­former Interpretability

KevinRoWang29 Mar 2022 20:09 UTC
72 points
4 comments6 min readLW link

Progress Re­port 2

Nathan Helm-Burger30 Mar 2022 2:29 UTC
4 points
1 comment1 min readLW link

Progress re­port 3: clus­ter­ing trans­former neurons

Nathan Helm-Burger5 Apr 2022 23:13 UTC
5 points
0 comments2 min readLW link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

simeon_c7 Apr 2022 13:46 UTC
11 points
0 comments7 min readLW link

Progress Re­port 4: logit lens redux

Nathan Helm-Burger8 Apr 2022 18:35 UTC
4 points
0 comments2 min readLW link

Another list of the­o­ries of im­pact for interpretability

Beth Barnes13 Apr 2022 13:29 UTC
33 points
1 comment5 min readLW link

In­tro­duc­tion to the se­quence: In­ter­pretabil­ity Re­search for the Most Im­por­tant Century

Evan R. Murphy12 May 2022 19:59 UTC
16 points
0 comments8 min readLW link

CNN fea­ture vi­su­al­iza­tion in 50 lines of code

StefanHex26 May 2022 11:02 UTC
17 points
4 comments5 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric Drexler3 Feb 2022 15:20 UTC
85 points
12 comments11 min readLW link1 review

Thoughts on For­mal­iz­ing Composition

Tom Lieberum7 Jun 2022 7:51 UTC
13 points
0 comments7 min readLW link

Re­search Ques­tions from Stained Glass Windows

StefanHex8 Jun 2022 12:38 UTC
4 points
0 comments2 min readLW link

An­thropic’s SoLU (Soft­max Lin­ear Unit)

Joel Burget4 Jul 2022 18:38 UTC
21 points
1 comment4 min readLW link
(transformer-circuits.pub)

Deep neu­ral net­works are not opaque.

jem-mosig6 Jul 2022 18:03 UTC
22 points
14 comments3 min readLW link

Race Along Rashomon Ridge

7 Jul 2022 3:20 UTC
50 points
15 comments8 min readLW link

Find­ing Skele­tons on Rashomon Ridge

24 Jul 2022 22:31 UTC
30 points
2 comments7 min readLW link

In­ter­pretabil­ity isn’t Free

Joel Burget4 Aug 2022 15:02 UTC
10 points
1 comment2 min readLW link

Dis­sected boxed AI

Nathan112312 Aug 2022 2:37 UTC
−8 points
2 comments1 min readLW link

In­ter­pretabil­ity Tools Are an At­tack Channel

Thane Ruthenis17 Aug 2022 18:47 UTC
42 points
14 comments1 min readLW link

A Bite Sized In­tro­duc­tion to ELK

Luk2718217 Sep 2022 0:28 UTC
5 points
0 comments6 min readLW link

The Shard The­ory Align­ment Scheme

David Udell25 Aug 2022 4:52 UTC
47 points
32 comments2 min readLW link

In­for­mal se­man­tics and Orders

Q Home27 Aug 2022 4:17 UTC
14 points
10 comments26 min readLW link

Search­ing for Mo­du­lar­ity in Large Lan­guage Models

8 Sep 2022 2:25 UTC
44 points
3 comments14 min readLW link

Try­ing to find the un­der­ly­ing struc­ture of com­pu­ta­tional systems

Matthias G. Mayer13 Sep 2022 21:16 UTC
17 points
9 comments4 min readLW link

Co­or­di­nate-Free In­ter­pretabil­ity Theory

johnswentworth14 Sep 2022 23:33 UTC
52 points
16 comments5 min readLW link

Math­e­mat­i­cal Cir­cuits in Neu­ral Networks

Sean Osier22 Sep 2022 3:48 UTC
34 points
4 comments1 min readLW link
(www.youtube.com)

Re­call and Re­gur­gi­ta­tion in GPT2

Megan Kinniment3 Oct 2022 19:35 UTC
43 points
1 comment26 min readLW link

Hard-Cod­ing Neu­ral Computation

MadHatter13 Dec 2021 4:35 UTC
34 points
8 comments27 min readLW link

Vi­su­al­iz­ing Learned Rep­re­sen­ta­tions of Rice Disease

muhia_bee3 Oct 2022 9:09 UTC
7 points
0 comments4 min readLW link
(indecisive-sand-24a.notion.site)

Nat­u­ral Cat­e­gories Update

Logan Zoellner10 Oct 2022 15:19 UTC
33 points
6 comments2 min readLW link

Help out Red­wood Re­search’s in­ter­pretabil­ity team by find­ing heuris­tics im­ple­mented by GPT-2 small

12 Oct 2022 21:25 UTC
50 points
11 comments4 min readLW link

Causal scrub­bing: Appendix

3 Dec 2022 0:58 UTC
18 points
4 comments20 min readLW link

Causal Scrub­bing: a method for rigor­ously test­ing in­ter­pretabil­ity hy­pothe­ses [Red­wood Re­search]

3 Dec 2022 0:58 UTC
205 points
35 comments20 min readLW link1 review

Some Les­sons Learned from Study­ing Indi­rect Ob­ject Iden­ti­fi­ca­tion in GPT-2 small

28 Oct 2022 23:55 UTC
101 points
9 comments9 min readLW link2 reviews
(arxiv.org)

Au­dit­ing games for high-level interpretability

Paul Colognese1 Nov 2022 10:44 UTC
33 points
1 comment7 min readLW link

Mechanis­tic In­ter­pretabil­ity as Re­v­erse Eng­ineer­ing (fol­low-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet)3 Nov 2022 23:19 UTC
28 points
3 comments1 min readLW link

Toy Models and Tegum Products

Adam Jermyn4 Nov 2022 18:51 UTC
28 points
7 comments5 min readLW link

Why I’m Work­ing On Model Ag­nos­tic Interpretability

Jessica Rumbelow11 Nov 2022 9:24 UTC
27 points
9 comments2 min readLW link

The limited up­side of interpretability

Peter S. Park15 Nov 2022 18:46 UTC
13 points
11 comments1 min readLW link

Cur­rent themes in mechanis­tic in­ter­pretabil­ity research

16 Nov 2022 14:14 UTC
89 points
2 comments12 min readLW link

Eng­ineer­ing Monose­man­tic­ity in Toy Models

18 Nov 2022 1:43 UTC
75 points
7 comments3 min readLW link
(arxiv.org)

The Ground Truth Prob­lem (Or, Why Eval­u­at­ing In­ter­pretabil­ity Meth­ods Is Hard)

Jessica Rumbelow17 Nov 2022 11:06 UTC
27 points
2 comments2 min readLW link

By De­fault, GPTs Think In Plain Sight

Fabien Roger19 Nov 2022 19:15 UTC
86 points
36 comments9 min readLW link

Multi-Com­po­nent Learn­ing and S-Curves

30 Nov 2022 1:37 UTC
63 points
24 comments7 min readLW link

Causal scrub­bing: re­sults on a paren bal­ance checker

3 Dec 2022 0:59 UTC
34 points
2 comments30 min readLW link

Causal scrub­bing: re­sults on in­duc­tion heads

3 Dec 2022 0:59 UTC
34 points
1 comment17 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibs5 Dec 2022 13:36 UTC
19 points
11 comments2 min readLW link

An ex­plo­ra­tion of GPT-2′s em­bed­ding weights

Adam Scherlis13 Dec 2022 0:46 UTC
42 points
4 comments10 min readLW link

How “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion” Fits Into a Broader Align­ment Scheme

Collin15 Dec 2022 18:22 UTC
244 points
39 comments16 min readLW link1 review

Why mechanis­tic in­ter­pretabil­ity does not and can­not con­tribute to long-term AGI safety (from mes­sages with a friend)

Remmelt19 Dec 2022 12:02 UTC
−3 points
9 comments31 min readLW link

Some Notes on the math­e­mat­ics of Toy Au­toen­cod­ing Problems

carboniferous_umbraculum 22 Dec 2022 17:21 UTC
18 points
1 comment12 min readLW link

In­ter­nal In­ter­faces Are a High-Pri­or­ity In­ter­pretabil­ity Target

Thane Ruthenis29 Dec 2022 17:49 UTC
26 points
6 comments7 min readLW link

But is it re­ally in Rome? An in­ves­ti­ga­tion of the ROME model edit­ing technique

jacquesthibs30 Dec 2022 2:40 UTC
104 points
2 comments18 min readLW link

[Question] Are Mix­ture-of-Ex­perts Trans­form­ers More In­ter­pretable Than Dense Trans­form­ers?

simeon_c31 Dec 2022 11:34 UTC
7 points
5 comments1 min readLW link

In­duc­tion heads—illustrated

CallumMcDougall2 Jan 2023 15:35 UTC
111 points
9 comments3 min readLW link

On the Im­por­tance of Open Sourc­ing Re­ward Models

elandgre2 Jan 2023 19:01 UTC
18 points
5 comments6 min readLW link

Ba­sic Facts about Lan­guage Model Internals

4 Jan 2023 13:01 UTC
130 points
19 comments9 min readLW link

AI psy­chol­ogy should ground the the­o­ries of AI con­scious­ness and in­form hu­man-AI eth­i­cal in­ter­ac­tion design

Roman Leventov8 Jan 2023 6:37 UTC
19 points
8 comments2 min readLW link

Try­ing to iso­late ob­jec­tives: ap­proaches to­ward high-level interpretability

Jozdien9 Jan 2023 18:33 UTC
48 points
14 comments8 min readLW link

The AI Con­trol Prob­lem in a wider in­tel­lec­tual context

philosophybear13 Jan 2023 0:28 UTC
11 points
3 comments12 min readLW link

Can we effi­ciently dis­t­in­guish differ­ent mechanisms?

paulfchristiano27 Dec 2022 0:20 UTC
88 points
30 comments16 min readLW link
(ai-alignment.com)

Neu­ral net­works gen­er­al­ize be­cause of this one weird trick

Jesse Hoogland18 Jan 2023 0:10 UTC
171 points
28 comments53 min readLW link
(www.jessehoogland.com)

Reflec­tions on Trust­ing Trust & AI

Itay Yona16 Jan 2023 6:36 UTC
10 points
1 comment3 min readLW link
(mentaleap.ai)

Large lan­guage mod­els learn to rep­re­sent the world

gjm22 Jan 2023 13:10 UTC
102 points
19 comments3 min readLW link

De­con­fus­ing “Ca­pa­bil­ities vs. Align­ment”

RobertM23 Jan 2023 4:46 UTC
27 points
7 comments2 min readLW link

How-to Trans­former Mechanis­tic In­ter­pretabil­ity—in 50 lines of code or less!

StefanHex24 Jan 2023 18:45 UTC
47 points
5 comments13 min readLW link

[RFC] Pos­si­ble ways to ex­pand on “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion”.

25 Jan 2023 19:03 UTC
48 points
6 comments12 min readLW link

Spooky ac­tion at a dis­tance in the loss landscape

28 Jan 2023 0:22 UTC
61 points
4 comments7 min readLW link
(www.jessehoogland.com)

No Really, At­ten­tion is ALL You Need—At­ten­tion can do feed­for­ward networks

Robert_AIZI31 Jan 2023 18:48 UTC
29 points
7 comments6 min readLW link
(aizi.substack.com)

ChatGPT: Tan­tal­iz­ing af­terthoughts in search of story tra­jec­to­ries [in­duc­tion heads]

Bill Benzon3 Feb 2023 10:35 UTC
4 points
0 comments20 min readLW link

Some mis­cel­la­neous thoughts on ChatGPT, sto­ries, and me­chan­i­cal interpretability

Bill Benzon4 Feb 2023 19:35 UTC
2 points
0 comments3 min readLW link

Gra­di­ent sur­fing: the hid­den role of regularization

Jesse Hoogland6 Feb 2023 3:50 UTC
37 points
9 comments14 min readLW link
(www.jessehoogland.com)

De­ci­sion Trans­former Interpretability

6 Feb 2023 7:29 UTC
84 points
13 comments24 min readLW link

Ad­den­dum: More Effi­cient FFNs via Attention

Robert_AIZI6 Feb 2023 18:55 UTC
10 points
2 comments5 min readLW link
(aizi.substack.com)

LLM Ba­sics: Embed­ding Spaces—Trans­former To­ken Vec­tors Are Not Points in Space

NickyP13 Feb 2023 18:52 UTC
79 points
11 comments15 min readLW link

A multi-dis­ci­plinary view on AI safety research

Roman Leventov8 Feb 2023 16:50 UTC
43 points
4 comments26 min readLW link

The Eng­ineer’s In­ter­pretabil­ity Se­quence (EIS) I: Intro

scasper9 Feb 2023 16:28 UTC
46 points
24 comments3 min readLW link

EIS II: What is “In­ter­pretabil­ity”?

scasper9 Feb 2023 16:48 UTC
28 points
6 comments4 min readLW link

We Found An Neu­ron in GPT-2

11 Feb 2023 18:27 UTC
143 points
23 comments7 min readLW link
(clementneo.com)

Idea: Net­work mod­u­lar­ity and in­ter­pretabil­ity by sex­ual reproduction

qbolec12 Feb 2023 23:06 UTC
3 points
3 comments1 min readLW link

Ex­plain­ing SolidGoldMag­ikarp by look­ing at it from ran­dom directions

Robert_AIZI14 Feb 2023 14:54 UTC
8 points
0 comments8 min readLW link
(aizi.substack.com)

EIS III: Broad Cri­tiques of In­ter­pretabil­ity Research

scasper14 Feb 2023 18:24 UTC
20 points
2 comments11 min readLW link

EIS IV: A Spotlight on Fea­ture At­tri­bu­tion/​Saliency

scasper15 Feb 2023 18:46 UTC
19 points
1 comment4 min readLW link

EIS VI: Cri­tiques of Mechanis­tic In­ter­pretabil­ity Work in AI Safety

scasper17 Feb 2023 20:48 UTC
49 points
9 comments12 min readLW link

EIS VII: A Challenge for Mechanists

scasper18 Feb 2023 18:27 UTC
36 points
4 comments3 min readLW link

The shal­low re­al­ity of ‘deep learn­ing the­ory’

Jesse Hoogland22 Feb 2023 4:16 UTC
34 points
11 comments3 min readLW link
(www.jessehoogland.com)

EIS VIII: An Eng­ineer’s Un­der­stand­ing of De­cep­tive Alignment

scasper19 Feb 2023 15:25 UTC
30 points
5 comments4 min readLW link

EIS IX: In­ter­pretabil­ity and Adversaries

scasper20 Feb 2023 18:25 UTC
30 points
7 comments8 min readLW link

EIS X: Con­tinual Learn­ing, Mo­du­lar­ity, Com­pres­sion, and Biolog­i­cal Brains

scasper21 Feb 2023 16:59 UTC
14 points
4 comments3 min readLW link

EIS XI: Mov­ing Forward

scasper22 Feb 2023 19:05 UTC
19 points
2 comments9 min readLW link

Search­ing for a model’s con­cepts by their shape – a the­o­ret­i­cal framework

23 Feb 2023 20:14 UTC
51 points
0 comments19 min readLW link

EIS XII: Sum­mary

scasper23 Feb 2023 17:45 UTC
18 points
0 comments6 min readLW link

In­ter­pret­ing Embed­ding Spaces by Conceptualization

Adi Simhi28 Feb 2023 18:38 UTC
3 points
0 comments1 min readLW link
(arxiv.org)

In­side the mind of a su­per­hu­man Go model: How does Leela Zero read lad­ders?

Haoxing Du1 Mar 2023 1:47 UTC
157 points
8 comments30 min readLW link

My cur­rent think­ing about ChatGPT @3QD [Gär­den­fors, Wolfram, and the value of spec­u­la­tion]

Bill Benzon1 Mar 2023 10:50 UTC
2 points
0 comments5 min readLW link

ChatGPT tells sto­ries, and a note about re­verse en­g­ineer­ing: A Work­ing Paper

Bill Benzon3 Mar 2023 15:12 UTC
3 points
0 comments3 min readLW link

Against LLM Reductionism

Erich_Grunewald8 Mar 2023 15:52 UTC
140 points
17 comments18 min readLW link
(www.erichgrunewald.com)

Prac­ti­cal Pit­falls of Causal Scrubbing

27 Mar 2023 7:47 UTC
87 points
17 comments13 min readLW link

Creat­ing a Dis­cord server for Mechanis­tic In­ter­pretabil­ity Projects

Victor Levoso12 Mar 2023 18:00 UTC
30 points
6 comments2 min readLW link

In­put Swap Graphs: Dis­cov­er­ing the role of neu­ral net­work com­po­nents at scale

Alexandre Variengien12 May 2023 9:41 UTC
92 points
0 comments33 min readLW link

Hid­den Cog­ni­tion De­tec­tion Meth­ods and Bench­marks

Paul Colognese26 Feb 2024 5:31 UTC
22 points
11 comments4 min readLW link

Ex­am­in­ing Lan­guage Model Perfor­mance with Re­con­structed Ac­ti­va­tions us­ing Sparse Au­toen­coders

27 Feb 2024 2:43 UTC
42 points
16 comments15 min readLW link

Cal­en­dar fea­ture ge­om­e­try in GPT-2 layer 8 resi­d­ual stream SAEs

17 Aug 2024 1:16 UTC
53 points
0 comments5 min readLW link

Has any­one ex­per­i­mented with Do­drio, a tool for ex­plor­ing trans­former mod­els through in­ter­ac­tive vi­su­al­iza­tion?

Bill Benzon11 Dec 2023 20:34 UTC
4 points
0 comments1 min readLW link

Cat­e­gor­i­cal Or­ga­ni­za­tion in Me­mory: ChatGPT Or­ga­nizes the 665 Topic Tags from My New Sa­vanna Blog

Bill Benzon14 Dec 2023 13:02 UTC
0 points
6 comments2 min readLW link

Take­aways from a Mechanis­tic In­ter­pretabil­ity pro­ject on “For­bid­den Facts”

15 Dec 2023 11:05 UTC
33 points
8 comments10 min readLW link

How does a toy 2 digit sub­trac­tion trans­former pre­dict the sign of the out­put?

Evan Anders19 Dec 2023 18:56 UTC
14 points
0 comments8 min readLW link
(evanhanders.blog)

A Univer­sal Emer­gent De­com­po­si­tion of Retrieval Tasks in Lan­guage Models

19 Dec 2023 11:52 UTC
84 points
3 comments10 min readLW link
(arxiv.org)

What’s in the box?! – Towards in­ter­pretabil­ity by dis­t­in­guish­ing niches of value within neu­ral net­works.

Joshua Clancy29 Feb 2024 18:33 UTC
3 points
4 comments128 min readLW link

In­ter­pretabil­ity: In­te­grated Gra­di­ents is a de­cent at­tri­bu­tion method

20 May 2024 17:55 UTC
22 points
7 comments6 min readLW link

Sparse MLP Distillation

slavachalnev15 Jan 2024 19:39 UTC
30 points
3 comments6 min readLW link

Fact Find­ing: Do Early Lay­ers Spe­cial­ise in Lo­cal Pro­cess­ing? (Post 5)

23 Dec 2023 2:46 UTC
18 points
0 comments4 min readLW link

Fact Find­ing: How to Think About In­ter­pret­ing Me­mori­sa­tion (Post 4)

23 Dec 2023 2:46 UTC
22 points
0 comments9 min readLW link

How does a toy 2 digit sub­trac­tion trans­former pre­dict the differ­ence?

Evan Anders22 Dec 2023 21:17 UTC
12 points
0 comments10 min readLW link
(evanhanders.blog)

Fact Find­ing: Sim­plify­ing the Cir­cuit (Post 2)

23 Dec 2023 2:45 UTC
25 points
3 comments14 min readLW link

We In­spected Every Head In GPT-2 Small us­ing SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
58 points
0 comments12 min readLW link

Ex­plor­ing the Resi­d­ual Stream of Trans­form­ers for Mechanis­tic In­ter­pretabil­ity — Explained

Zeping Yu26 Dec 2023 0:36 UTC
7 points
1 comment11 min readLW link

[Question] SAE sparse fea­ture graph us­ing only resi­d­ual layers

Jaehyuk Lim23 May 2024 13:32 UTC
0 points
3 comments1 min readLW link

Case Stud­ies in Re­v­erse-Eng­ineer­ing Sparse Au­toen­coder Fea­tures by Us­ing MLP Linearization

14 Jan 2024 2:06 UTC
23 points
0 comments42 min readLW link

Bi­ases in Bi­ases, or Cri­tique of the Critique

ThePathYouWillChoose19 Aug 2024 17:11 UTC
1 point
0 comments1 min readLW link

Ano­ma­lous Con­cept De­tec­tion for De­tect­ing Hid­den Cognition

Paul Colognese4 Mar 2024 16:52 UTC
24 points
3 comments10 min readLW link

Task vec­tors & anal­ogy mak­ing in LLMs

Sergii8 Jan 2024 15:17 UTC
9 points
1 comment4 min readLW link
(grgv.xyz)

Find­ing De­cep­tion in Lan­guage Models

20 Aug 2024 9:42 UTC
18 points
4 comments4 min readLW link

Sparse Au­toen­coders Work on At­ten­tion Layer Outputs

16 Jan 2024 0:26 UTC
83 points
9 comments18 min readLW link

What’s go­ing on with Per-Com­po­nent Weight Up­dates?

4gate22 Aug 2024 21:22 UTC
1 point
0 comments6 min readLW link

How poly­se­man­tic can one neu­ron be? In­ves­ti­gat­ing fea­tures in TinyS­to­ries.

Evan Anders16 Jan 2024 19:10 UTC
14 points
0 comments8 min readLW link
(evanhanders.blog)

Ex­plor­ing the Evolu­tion and Mi­gra­tion of Differ­ent Layer Embed­ding in LLMs

Ruixuan Huang8 Mar 2024 15:01 UTC
6 points
0 comments8 min readLW link

In­ter­pretabil­ity as Com­pres­sion: Re­con­sid­er­ing SAE Ex­pla­na­tions of Neu­ral Ac­ti­va­tions with MDL-SAEs

23 Aug 2024 18:52 UTC
39 points
5 comments16 min readLW link

Craft­ing Poly­se­man­tic Trans­former Bench­marks with Known Circuits

23 Aug 2024 22:03 UTC
10 points
0 comments25 min readLW link

Find­ing Back­ward Chain­ing Cir­cuits in Trans­form­ers Trained on Tree Search

28 May 2024 5:29 UTC
50 points
1 comment9 min readLW link
(arxiv.org)

Ques­tions I’d Want to Ask an AGI+ to Test Its Un­der­stand­ing of Ethics

sweenesm26 Jan 2024 23:40 UTC
14 points
6 comments4 min readLW link

De­cep­tion and Jailbreak Se­quence: 1. Iter­a­tive Refine­ment Stages of De­cep­tion in LLMs

22 Aug 2024 7:32 UTC
23 points
1 comment21 min readLW link

Ex­plor­ing OpenAI’s La­tent Direc­tions: Tests, Ob­ser­va­tions, and Pok­ing Around

Johnny Lin31 Jan 2024 6:01 UTC
26 points
4 comments14 min readLW link

Un­der­stand­ing SAE Fea­tures with the Logit Lens

11 Mar 2024 0:16 UTC
59 points
0 comments14 min readLW link

Show­ing SAE La­tents Are Not Atomic Us­ing Meta-SAEs

24 Aug 2024 0:56 UTC
60 points
9 comments20 min readLW link

Open Source Sparse Au­toen­coders for all Resi­d­ual Stream Lay­ers of GPT2-Small

Joseph Bloom2 Feb 2024 6:54 UTC
100 points
37 comments15 min readLW link

At­ten­tion SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
77 points
4 comments8 min readLW link

Fluent dream­ing for lan­guage mod­els (AI in­ter­pretabil­ity method)

6 Feb 2024 6:02 UTC
45 points
5 comments1 min readLW link
(arxiv.org)

Un­der­stand­ing Hid­den Com­pu­ta­tions in Chain-of-Thought Reasoning

rokosbasilisk24 Aug 2024 16:35 UTC
6 points
1 comment1 min readLW link

A Chess-GPT Lin­ear Emer­gent World Representation

Adam Karvonen8 Feb 2024 4:25 UTC
105 points
14 comments7 min readLW link
(adamkarvonen.github.io)

Use­ful start­ing code for interpretability

eggsyntax13 Feb 2024 23:13 UTC
25 points
2 comments1 min readLW link

Lay­ing the Foun­da­tions for Vi­sion and Mul­ti­modal Mechanis­tic In­ter­pretabil­ity & Open Problems

13 Mar 2024 17:09 UTC
44 points
13 comments14 min readLW link

Ad­dress­ing Fea­ture Sup­pres­sion in SAEs

16 Feb 2024 18:32 UTC
85 points
3 comments10 min readLW link

Can quan­tised au­toen­coders find and in­ter­pret cir­cuits in lan­guage mod­els?

charlieoneill24 Mar 2024 20:05 UTC
28 points
4 comments24 min readLW link

Auto-match­ing hid­den lay­ers in Py­torch LLMs

chanind19 Feb 2024 12:40 UTC
2 points
0 comments3 min readLW link

Sparse au­toen­coders find com­posed fea­tures in small toy mod­els

14 Mar 2024 18:00 UTC
33 points
12 comments15 min readLW link

Do sparse au­toen­coders find “true fea­tures”?

Demian Till22 Feb 2024 18:06 UTC
72 points
33 comments11 min readLW link

Notes on In­ter­nal Ob­jec­tives in Toy Models of Agents

Paul Colognese22 Feb 2024 8:02 UTC
16 points
0 comments8 min readLW link

The role of philo­soph­i­cal think­ing in un­der­stand­ing large lan­guage mod­els: Cal­ibrat­ing and clos­ing the gap be­tween first-per­son ex­pe­rience and un­der­ly­ing mechanisms

Bill Benzon23 Feb 2024 12:19 UTC
4 points
0 comments10 min readLW link

Towards White Box Deep Learning

Maciej Satkiewicz27 Mar 2024 18:20 UTC
17 points
5 comments1 min readLW link
(arxiv.org)

An­nounc­ing Neu­ron­pe­dia: Plat­form for ac­cel­er­at­ing re­search into Sparse Autoencoders

25 Mar 2024 21:17 UTC
91 points
7 comments7 min readLW link

De­com­piling Tracr Trans­form­ers—An in­ter­pretabil­ity experiment

Hannes Thurnherr27 Mar 2024 9:49 UTC
4 points
0 comments14 min readLW link

De­cep­tion and Jailbreak Se­quence: 2. Iter­a­tive Refine­ment Stages of Jailbreaks in LLM

Winnie Yang28 Aug 2024 8:41 UTC
7 points
2 comments31 min readLW link

Ophiol­ogy (or, how the Mamba ar­chi­tec­ture works)

9 Apr 2024 19:31 UTC
67 points
8 comments10 min readLW link

DSLT 0. Distill­ing Sin­gu­lar Learn­ing Theory

Liam Carroll16 Jun 2023 9:50 UTC
76 points
6 comments5 min readLW link

Nor­mal­iz­ing Sparse Autoencoders

Fengyuan Hu8 Apr 2024 6:17 UTC
21 points
18 comments13 min readLW link

Can Large Lan­guage Models effec­tively iden­tify cy­ber­se­cu­rity risks?

emile delcourt30 Aug 2024 20:20 UTC
18 points
0 comments11 min readLW link

Scal­ing Laws and Superposition

Pavan Katta10 Apr 2024 15:36 UTC
9 points
4 comments5 min readLW link
(www.pavankatta.com)

[Question] Bar­cod­ing LLM Train­ing Data Sub­sets. Any­one try­ing this for in­ter­pretabil­ity?

right..enough?13 Apr 2024 3:09 UTC
7 points
0 comments7 min readLW link

Is This Lie De­tec­tor Really Just a Lie De­tec­tor? An In­ves­ti­ga­tion of LLM Probe Speci­fic­ity.

Josh Levy4 Jun 2024 15:45 UTC
38 points
0 comments17 min readLW link

Ex­per­i­ments with an al­ter­na­tive method to pro­mote spar­sity in sparse autoencoders

Eoin Farrell15 Apr 2024 18:21 UTC
29 points
7 comments12 min readLW link

Trans­form­ers Rep­re­sent Belief State Geom­e­try in their Resi­d­ual Stream

Adam Shai16 Apr 2024 21:16 UTC
411 points
100 comments12 min readLW link

graph­patch: a Python Library for Ac­ti­va­tion Patching

Occam's Laser5 Jun 2024 15:08 UTC
13 points
2 comments1 min readLW link

Past Tense Features

Can20 Apr 2024 14:34 UTC
12 points
0 comments4 min readLW link

Re­dun­dant At­ten­tion Heads in Large Lan­guage Models For In Con­text Learning

skunnavakkam1 Sep 2024 20:08 UTC
7 points
1 comment4 min readLW link
(skunnavakkam.github.io)

Transcoders en­able fine-grained in­ter­pretable cir­cuit anal­y­sis for lan­guage models

30 Apr 2024 17:58 UTC
69 points
14 comments17 min readLW link

Re­la­tion­ships among words, met­al­in­gual defi­ni­tion, and interpretability

Bill Benzon7 Jun 2024 19:18 UTC
2 points
0 comments5 min readLW link

Vi­su­al­iz­ing neu­ral net­work planning

9 May 2024 6:40 UTC
4 points
0 comments5 min readLW link

Align­ment Gaps

kcyras8 Jun 2024 15:23 UTC
10 points
3 comments8 min readLW link

Closed-Source Evaluations

Jono8 Jun 2024 14:18 UTC
15 points
4 comments1 min readLW link

How To Do Patch­ing Fast

Joseph Miller11 May 2024 20:13 UTC
40 points
6 comments4 min readLW link

In­ves­ti­gat­ing Sen­si­tive Direc­tions in GPT-2: An Im­proved Baseline and Com­par­a­tive Anal­y­sis of SAEs

6 Sep 2024 2:28 UTC
27 points
0 comments12 min readLW link

In­tro­duc­ing SARA: a new ac­ti­va­tion steer­ing technique

Alejandro Tlaie9 Jun 2024 15:33 UTC
17 points
7 comments6 min readLW link

Ex­plor­ing Llama-3-8B MLP Neurons

ntt1239 Jun 2024 14:19 UTC
10 points
0 comments4 min readLW link
(neuralblog.github.io)

Adam Op­ti­mizer Causes Priv­ileged Ba­sis in Trans­former LM Resi­d­ual Stream

6 Sep 2024 17:55 UTC
70 points
7 comments4 min readLW link

[Question] LLM/​AI hype

Student19283746515 Jun 2024 20:12 UTC
1 point
0 comments1 min readLW link

Logit Prisms: De­com­pos­ing Trans­former Out­puts for Mechanis­tic Interpretability

ntt12317 Jun 2024 11:46 UTC
5 points
4 comments6 min readLW link
(neuralblog.github.io)

Analysing Ad­ver­sar­ial At­tacks with Lin­ear Probing

17 Jun 2024 14:16 UTC
9 points
0 comments8 min readLW link

Sparse Fea­tures Through Time

Rogan Inglis24 Jun 2024 18:06 UTC
12 points
1 comment1 min readLW link
(roganinglis.io)

Rep­re­sen­ta­tion Tuning

Christopher Ackerman27 Jun 2024 17:44 UTC
35 points
9 comments13 min readLW link

Ac­ti­va­tion Pat­tern SVD: A pro­posal for SAE Interpretability

Daniel Tan28 Jun 2024 22:12 UTC
15 points
2 comments2 min readLW link

De­com­pos­ing the QK cir­cuit with Bilin­ear Sparse Dic­tionary Learning

2 Jul 2024 13:17 UTC
81 points
7 comments12 min readLW link

Othel­loGPT learned a bag of heuristics

2 Jul 2024 9:12 UTC
108 points
10 comments9 min readLW link

[In­terim re­search re­port] Ac­ti­va­tion plateaus & sen­si­tive di­rec­tions in GPT2

5 Jul 2024 17:05 UTC
64 points
2 comments5 min readLW link

Trans­former Cir­cuit Faith­ful­ness Met­rics Are Not Robust

12 Jul 2024 3:47 UTC
104 points
5 comments7 min readLW link
(arxiv.org)

Stitch­ing SAEs of differ­ent sizes

13 Jul 2024 17:19 UTC
39 points
12 comments12 min readLW link

An In­tro­duc­tion to Rep­re­sen­ta­tion Eng­ineer­ing—an ac­ti­va­tion-based paradigm for con­trol­ling LLMs

Jan Wehner14 Jul 2024 10:37 UTC
35 points
5 comments17 min readLW link

De­cep­tive agents can col­lude to hide dan­ger­ous fea­tures in SAEs

15 Jul 2024 17:07 UTC
27 points
0 comments7 min readLW link

Mech In­terp Lacks Good Paradigms

Daniel Tan16 Jul 2024 15:47 UTC
33 points
0 comments14 min readLW link

Ar­rakis—A toolkit to con­duct, track and vi­su­al­ize mechanis­tic in­ter­pretabil­ity ex­per­i­ments.

Yash Srivastava17 Jul 2024 2:02 UTC
2 points
2 comments5 min readLW link

Su­per­po­si­tion through Ac­tive Learn­ing Lens

akankshanc17 Sep 2024 17:32 UTC
1 point
0 comments10 min readLW link

In­ter­pretabil­ity in Ac­tion: Ex­plo­ra­tory Anal­y­sis of VPT, a Minecraft Agent

18 Jul 2024 17:02 UTC
9 points
0 comments1 min readLW link
(arxiv.org)

SAEs (usu­ally) Trans­fer Between Base and Chat Models

18 Jul 2024 10:29 UTC
65 points
0 comments10 min readLW link

Truth is Univer­sal: Ro­bust De­tec­tion of Lies in LLMs

Lennart Buerger19 Jul 2024 14:07 UTC
24 points
3 comments2 min readLW link
(arxiv.org)

Fea­ture Tar­geted LLC Es­ti­ma­tion Dist­in­guishes SAE Fea­tures from Ran­dom Directions

19 Jul 2024 20:32 UTC
59 points
6 comments16 min readLW link

BatchTopK: A Sim­ple Im­prove­ment for TopK-SAEs

20 Jul 2024 2:20 UTC
52 points
0 comments4 min readLW link

Ini­tial Ex­per­i­ments Us­ing SAEs to Help De­tect AI Gen­er­ated Text

Aaron_Scher22 Jul 2024 5:16 UTC
17 points
0 comments14 min readLW link

Pac­ing Out­side the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
59 points
8 comments2 min readLW link
(arxiv.org)

Open Source Au­to­mated In­ter­pretabil­ity for Sparse Au­toen­coder Features

30 Jul 2024 21:11 UTC
67 points
1 comment13 min readLW link
(blog.eleuther.ai)

Un­der­stand­ing Po­si­tional Fea­tures in Layer 0 SAEs

29 Jul 2024 9:36 UTC
43 points
0 comments5 min readLW link

An In­ter­pretabil­ity Illu­sion from Pop­u­la­tion Statis­tics in Causal Analysis

Daniel Tan29 Jul 2024 14:50 UTC
9 points
3 comments1 min readLW link

Con­struct­ing Neu­ral Net­work Pa­ram­e­ters with Down­stream Trainability

ch271828n31 Jul 2024 18:13 UTC
1 point
0 comments1 min readLW link
(github.com)

Limi­ta­tions on the In­ter­pretabil­ity of Learned Fea­tures from Sparse Dic­tionary Learning

Tom Angsten30 Jul 2024 16:36 UTC
6 points
0 comments9 min readLW link

The Resi­d­ual Ex­pan­sion: A Frame­work for think­ing about Trans­former Circuits

Daniel Tan2 Aug 2024 11:04 UTC
16 points
13 comments3 min readLW link

Eval­u­at­ing Sparse Au­toen­coders with Board Game Models

2 Aug 2024 19:50 UTC
38 points
1 comment9 min readLW link

La­bel­ling, Vari­ables, and In-Con­text Learn­ing in Llama2

Joshua Penman3 Aug 2024 19:36 UTC
6 points
0 comments1 min readLW link
(colab.research.google.com)

Toy Models of Su­per­po­si­tion: what about BitNets?

Alejandro Tlaie8 Aug 2024 16:29 UTC
5 points
1 comment5 min readLW link

Emer­gence, The Blind Spot of GenAI In­ter­pretabil­ity?

Quentin FEUILLADE--MONTIXI10 Aug 2024 10:07 UTC
15 points
8 comments3 min readLW link

Ex­tract­ing SAE task fea­tures for in-con­text learning

12 Aug 2024 20:34 UTC
31 points
1 comment9 min readLW link

[Paper] A is for Ab­sorp­tion: Study­ing Fea­ture Split­ting and Ab­sorp­tion in Sparse Autoencoders

25 Sep 2024 9:31 UTC
69 points
15 comments3 min readLW link
(arxiv.org)

GPT-2 Some­times Fails at IOI

Ronak_Mehta14 Aug 2024 23:24 UTC
13 points
0 comments2 min readLW link
(ronakrm.github.io)

Eval­u­at­ing Syn­thetic Ac­ti­va­tions com­posed of SAE La­tents in GPT-2

25 Sep 2024 20:37 UTC
27 points
0 comments3 min readLW link
(arxiv.org)

Char­ac­ter­iz­ing sta­ble re­gions in the resi­d­ual stream of LLMs

26 Sep 2024 13:44 UTC
38 points
4 comments1 min readLW link
(arxiv.org)

The Geom­e­try of Feel­ings and Non­sense in Large Lan­guage Models

27 Sep 2024 17:49 UTC
58 points
10 comments4 min readLW link

Avoid­ing jailbreaks by dis­cour­ag­ing their rep­re­sen­ta­tion in ac­ti­va­tion space

Guido Bergman27 Sep 2024 17:49 UTC
6 points
2 comments9 min readLW link

Knowl­edge Base 1: Could it in­crease in­tel­li­gence and make it safer?

iwis30 Sep 2024 16:00 UTC
−4 points
0 comments4 min readLW link

Steer­ing LLMs’ Be­hav­ior with Con­cept Ac­ti­va­tion Vectors

Ruixuan Huang28 Sep 2024 9:53 UTC
8 points
0 comments10 min readLW link

Base LLMs re­fuse too

29 Sep 2024 16:04 UTC
60 points
20 comments10 min readLW link

Ex­plor­ing Shard-like Be­hav­ior: Em­piri­cal In­sights into Con­tex­tual De­ci­sion-Mak­ing in RL Agents

Alejandro Aristizabal29 Sep 2024 0:32 UTC
6 points
0 comments15 min readLW link

Devel­op­men­tal Stages in Multi-Prob­lem Grokking

James Sullivan29 Sep 2024 18:58 UTC
4 points
0 comments6 min readLW link

Ex­plor­ing De­com­pos­abil­ity of SAE Features

Vikram_N30 Sep 2024 18:28 UTC
1 point
0 comments3 min readLW link

LLMs are likely not conscious

research_prime_space29 Sep 2024 20:57 UTC
8 points
8 comments1 min readLW link

Toy Models of Su­per­po­si­tion: Sim­plified by Hand

Axel Sorensen29 Sep 2024 21:19 UTC
9 points
3 comments8 min readLW link

Toy Models of Fea­ture Ab­sorp­tion in SAEs

7 Oct 2024 9:56 UTC
46 points
8 comments10 min readLW link

In­ter­pretabil­ity of SAE Fea­tures Rep­re­sent­ing Check in ChessGPT

Jonathan Kutasov5 Oct 2024 20:43 UTC
27 points
2 comments8 min readLW link

(Maybe) A Bag of Heuris­tics is All There Is & A Bag of Heuris­tics is All You Need

Sodium3 Oct 2024 19:11 UTC
34 points
17 comments16 min readLW link

Do­main-spe­cific SAEs

jacob_drori7 Oct 2024 20:15 UTC
27 points
0 comments5 min readLW link

There is a globe in your LLM

jacob_drori8 Oct 2024 0:43 UTC
86 points
4 comments1 min readLW link

Hamil­to­nian Dy­nam­ics in AI: A Novel Ap­proach to Op­ti­miz­ing Rea­son­ing in Lan­guage Models

Javier Marin Valenzuela9 Oct 2024 19:14 UTC
3 points
0 comments10 min readLW link

SAE fea­tures for re­fusal and syco­phancy steer­ing vectors

12 Oct 2024 14:54 UTC
26 points
4 comments7 min readLW link

Stan­dard SAEs Might Be In­co­her­ent: A Choos­ing Prob­lem & A “Con­cise” Solution

Kola Ayonrinde30 Oct 2024 22:50 UTC
26 points
0 comments12 min readLW link

It’s im­por­tant to know when to stop: Mechanis­tic Ex­plo­ra­tion of Gemma 2 List Generation

Gerard Boxo14 Oct 2024 17:04 UTC
8 points
0 comments6 min readLW link
(gboxo.github.io)

A short pro­ject on Mamba: grokking & interpretability

Alejandro Tlaie18 Oct 2024 16:59 UTC
21 points
0 comments6 min readLW link

Monose­man­tic­ity & Quantization

Rahul Chand22 Oct 2024 22:57 UTC
1 point
0 comments9 min readLW link

En­abling New Ap­pli­ca­tions with To­day’s Mechanis­tic In­ter­pretabil­ity Toolkit

ananya_joshi25 Oct 2024 17:53 UTC
3 points
0 comments3 min readLW link

Open Source Repli­ca­tion of An­thropic’s Cross­coder pa­per for model-diffing

27 Oct 2024 18:46 UTC
38 points
4 comments5 min readLW link

Bridg­ing the VLM and mech in­terp com­mu­ni­ties for mul­ti­modal in­ter­pretabil­ity

Sonia Joseph28 Oct 2024 14:41 UTC
19 points
5 comments15 min readLW link

SAE Prob­ing: What is it good for? Ab­solutely some­thing!

1 Nov 2024 19:23 UTC
31 points
0 comments11 min readLW link

Com­po­si­tion Cir­cuits in Vi­sion Trans­form­ers (Hy­poth­e­sis)

phenomanon1 Nov 2024 22:16 UTC
1 point
0 comments3 min readLW link

Test­ing “True” Lan­guage Un­der­stand­ing in LLMs: A Sim­ple Proposal

MtryaSam2 Nov 2024 19:12 UTC
9 points
2 comments2 min readLW link

Evolu­tion­ary prompt op­ti­miza­tion for SAE fea­ture visualization

14 Nov 2024 13:06 UTC
16 points
0 comments9 min readLW link

Iden­ti­fy­ing Func­tion­ally Im­por­tant Fea­tures with End-to-End Sparse Dic­tionary Learning

17 May 2024 16:25 UTC
57 points
10 comments4 min readLW link
(arxiv.org)

SAEs are highly dataset de­pen­dent: a case study on the re­fusal direction

7 Nov 2024 5:22 UTC
62 points
4 comments14 min readLW link

An­a­lyz­ing how SAE fea­tures evolve across a for­ward pass

7 Nov 2024 22:07 UTC
44 points
0 comments1 min readLW link
(arxiv.org)

Antonym Heads Pre­dict Se­man­tic Op­po­sites in Lan­guage Models

Jake Ward15 Nov 2024 15:32 UTC
1 point
0 comments5 min readLW link

Effects of Non-Uniform Spar­sity on Su­per­po­si­tion in Toy Models

Shreyans Jain14 Nov 2024 16:59 UTC
4 points
3 comments6 min readLW link

Em­piri­cal risk min­i­miza­tion is fun­da­men­tally confused

Jesse Hoogland22 Mar 2023 16:58 UTC
32 points
5 comments1 min readLW link

Sen­tience in Machines—How Do We Test for This Ob­jec­tively?

Mayowa Osibodu26 Mar 2023 18:56 UTC
−2 points
0 comments2 min readLW link
(www.researchgate.net)

Ap­prox­i­ma­tion is ex­pen­sive, but the lunch is cheap

19 Apr 2023 14:19 UTC
70 points
3 comments16 min readLW link

Some com­mon con­fu­sion about in­duc­tion heads

Alexandre Variengien28 Mar 2023 21:51 UTC
64 points
4 comments5 min readLW link

Spread­sheet for 200 Con­crete Prob­lems In Interpretability

Jay Bailey29 Mar 2023 6:51 UTC
13 points
0 comments1 min readLW link

The Quan­ti­za­tion Model of Neu­ral Scaling

nz31 Mar 2023 16:02 UTC
17 points
0 comments1 min readLW link
(arxiv.org)

AISC 2023, Progress Re­port for March: Team In­ter­pretable Architectures

2 Apr 2023 16:19 UTC
14 points
0 comments14 min readLW link

Ex­plo­ra­tory Anal­y­sis of RLHF Trans­form­ers with TransformerLens

Curt Tigges3 Apr 2023 16:09 UTC
21 points
2 comments11 min readLW link
(blog.eleuther.ai)

If in­ter­pretabil­ity re­search goes well, it may get dangerous

So8res3 Apr 2023 21:48 UTC
200 points
11 comments2 min readLW link

Univer­sal­ity and Hid­den In­for­ma­tion in Con­cept Bot­tle­neck Models

Hoagy5 Apr 2023 14:00 UTC
23 points
0 comments11 min readLW link

No con­vinc­ing ev­i­dence for gra­di­ent de­scent in ac­ti­va­tion space

Blaine12 Apr 2023 4:48 UTC
82 points
9 comments20 min readLW link

Bing AI Gen­er­at­ing Voyn­ich Manuscript Con­tinu­a­tions—It does not know how it knows

Matthew_Opitz10 Apr 2023 20:22 UTC
15 points
6 comments13 min readLW link

Towards a solu­tion to the al­ign­ment prob­lem via ob­jec­tive de­tec­tion and eval­u­a­tion

Paul Colognese12 Apr 2023 15:39 UTC
9 points
7 comments12 min readLW link

Mechanis­ti­cally in­ter­pret­ing time in GPT-2 small

16 Apr 2023 17:57 UTC
68 points
6 comments21 min readLW link

Re­search Re­port: In­cor­rect­ness Cascades

Robert_AIZI14 Apr 2023 12:49 UTC
19 points
0 comments10 min readLW link
(aizi.substack.com)

An in­tro­duc­tion to lan­guage model interpretability

Alexandre Variengien20 Apr 2023 22:22 UTC
14 points
0 comments9 min readLW link

I was Wrong, Si­mu­la­tor The­ory is Real

Robert_AIZI26 Apr 2023 17:45 UTC
75 points
7 comments3 min readLW link
(aizi.substack.com)

z is not the cause of x

hrbigelow23 Oct 2023 17:43 UTC
6 points
2 comments9 min readLW link

Grokking Beyond Neu­ral Networks

Jack Miller30 Oct 2023 17:28 UTC
10 points
0 comments2 min readLW link
(arxiv.org)

Ro­bust­ness of Con­trast-Con­sis­tent Search to Ad­ver­sar­ial Prompting

1 Nov 2023 12:46 UTC
18 points
1 comment7 min readLW link

Es­ti­mat­ing effec­tive di­men­sion­al­ity of MNIST models

Arjun Panickssery2 Nov 2023 14:13 UTC
41 points
3 comments1 min readLW link

Growth and Form in a Toy Model of Superposition

8 Nov 2023 11:08 UTC
87 points
7 comments14 min readLW link

What’s go­ing on? LLMs and IS-A sen­tences

Bill Benzon8 Nov 2023 16:58 UTC
6 points
15 comments4 min readLW link

Poly­se­man­tic At­ten­tion Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
51 points
0 comments6 min readLW link

PhD Po­si­tion: AI In­ter­pretabil­ity in Ber­lin, Germany

Tiberius28 Apr 2023 13:44 UTC
3 points
0 comments1 min readLW link
(stephanw.net)

AISC Pro­ject: Model­ling Tra­jec­to­ries of Lan­guage Models

NickyP13 Nov 2023 14:33 UTC
27 points
0 comments12 min readLW link

Elic­it­ing La­tent Knowl­edge in Com­pre­hen­sive AI Ser­vices Models

acabodi17 Nov 2023 2:36 UTC
6 points
0 comments5 min readLW link

In­ci­den­tal polysemanticity

15 Nov 2023 4:00 UTC
43 points
7 comments11 min readLW link

AISC pro­ject: TinyEvals

Jett Janiak22 Nov 2023 20:47 UTC
22 points
0 comments4 min readLW link

A day in the life of a mechanis­tic in­ter­pretabil­ity researcher

Bill Benzon28 Nov 2023 14:45 UTC
3 points
3 comments1 min readLW link

Towards an Ethics Calcu­la­tor for Use by an AGI

sweenesm12 Dec 2023 18:37 UTC
3 points
2 comments11 min readLW link

Mechanis­tic in­ter­pretabil­ity through clustering

Alistair Fraser4 Dec 2023 18:49 UTC
1 point
0 comments1 min readLW link

Colour ver­sus Shape Goal Mis­gen­er­al­iza­tion in Re­in­force­ment Learn­ing: A Case Study

Karolis Jucys8 Dec 2023 13:18 UTC
13 points
1 comment4 min readLW link
(arxiv.org)

Lan­guage Model Me­moriza­tion, Copy­right Law, and Con­di­tional Pre­train­ing Alignment

RogerDearnaley7 Dec 2023 6:14 UTC
9 points
0 comments11 min readLW link

Gra­di­ent hacking

evhub16 Oct 2019 0:53 UTC
106 points
39 comments3 min readLW link2 reviews

Will trans­parency help catch de­cep­tion? Per­haps not

Matthew Barnett4 Nov 2019 20:52 UTC
43 points
5 comments7 min readLW link

Ro­hin Shah on rea­sons for AI optimism

abergal31 Oct 2019 12:10 UTC
40 points
58 comments1 min readLW link
(aiimpacts.org)

Un­der­stand­ing mesa-op­ti­miza­tion us­ing toy models

7 May 2023 17:00 UTC
43 points
2 comments10 min readLW link

A Search for More ChatGPT /​ GPT-3.5 /​ GPT-4 “Un­speak­able” Glitch Tokens

Martin Fell9 May 2023 14:36 UTC
26 points
9 comments6 min readLW link

A tech­ni­cal note on bil­in­ear lay­ers for interpretability

Lee Sharkey8 May 2023 6:06 UTC
58 points
0 comments1 min readLW link
(arxiv.org)

A com­par­i­son of causal scrub­bing, causal ab­strac­tions, and re­lated methods

8 Jun 2023 23:40 UTC
73 points
3 comments22 min readLW link

Lan­guage mod­els can ex­plain neu­rons in lan­guage models

nz9 May 2023 17:29 UTC
23 points
0 comments1 min readLW link
(openai.com)

Solv­ing the Mechanis­tic In­ter­pretabil­ity challenges: EIS VII Challenge 1

9 May 2023 19:41 UTC
119 points
1 comment10 min readLW link

[Question] Have you heard about MIT’s “liquid neu­ral net­works”? What do you think about them?

Ppau9 May 2023 20:16 UTC
35 points
14 comments1 min readLW link

‘Fun­da­men­tal’ vs ‘ap­plied’ mechanis­tic in­ter­pretabil­ity research

Lee Sharkey23 May 2023 18:26 UTC
65 points
6 comments3 min readLW link

[Question] AI in­ter­pretabil­ity could be harm­ful?

Roman Leventov10 May 2023 20:43 UTC
13 points
2 comments1 min readLW link

Con­trast Pairs Drive the Em­piri­cal Perfor­mance of Con­trast Con­sis­tent Search (CCS)

Scott Emmons31 May 2023 17:09 UTC
97 points
0 comments6 min readLW link

My cur­rent work­flow to study the in­ter­nal mechanisms of LLM

Yulu Pi16 May 2023 15:27 UTC
4 points
0 comments1 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of a GridWorld Agent-Si­mu­la­tor (Part 1 of N)

Joseph Bloom16 May 2023 22:59 UTC
36 points
2 comments16 min readLW link

Gen­der Vec­tors in ROME’s La­tent Space

Xodarap21 May 2023 18:46 UTC
14 points
2 comments3 min readLW link

Solv­ing the Mechanis­tic In­ter­pretabil­ity challenges: EIS VII Challenge 2

25 May 2023 15:37 UTC
71 points
1 comment13 min readLW link

Align­ing an H-JEPA agent via train­ing on the out­puts of an LLM-based “ex­em­plary ac­tor”

Roman Leventov29 May 2023 11:08 UTC
12 points
10 comments30 min readLW link

The king token

p.b.28 May 2023 19:18 UTC
17 points
0 comments4 min readLW link

Short Re­mark on the (sub­jec­tive) math­e­mat­i­cal ‘nat­u­ral­ness’ of the Nanda—Lie­berum ad­di­tion mod­ulo 113 algorithm

carboniferous_umbraculum 1 Jun 2023 11:31 UTC
104 points
12 comments2 min readLW link

[Linkpost] Rosetta Neu­rons: Min­ing the Com­mon Units in a Model Zoo

Bogdan Ionut Cirstea17 Jun 2023 16:38 UTC
12 points
0 comments1 min readLW link

[Re­search Up­date] Sparse Au­toen­coder fea­tures are bimodal

Robert_AIZI22 Jun 2023 13:15 UTC
24 points
1 comment5 min readLW link
(aizi.substack.com)

Un­der­stand­ing understanding

mthq23 Aug 2019 18:10 UTC
24 points
1 comment2 min readLW link

The risk-re­ward trade­off of in­ter­pretabil­ity research

5 Jul 2023 17:05 UTC
15 points
1 comment6 min readLW link

Lo­cal­iz­ing goal mis­gen­er­al­iza­tion in a maze-solv­ing policy network

jan betley6 Jul 2023 16:21 UTC
37 points
2 comments7 min readLW link

In­ter­pret­ing Mo­du­lar Ad­di­tion in MLPs

Bart Bussmann7 Jul 2023 9:22 UTC
19 points
0 comments6 min readLW link

LLM mis­al­ign­ment can prob­a­bly be found with­out man­ual prompt engineering

ProgramCrafter8 Jul 2023 14:35 UTC
1 point
0 comments1 min readLW link

in­ter­pret­ing GPT: the logit lens

nostalgebraist31 Aug 2020 2:47 UTC
223 points
37 comments11 min readLW link

Im­pact sto­ries for model in­ter­nals: an ex­er­cise for in­ter­pretabil­ity researchers

jenny25 Sep 2023 23:15 UTC
29 points
3 comments7 min readLW link

Still no Lie De­tec­tor for LLMs

18 Jul 2023 19:56 UTC
47 points
2 comments21 min readLW link

Ac­ti­va­tion adding ex­per­i­ments with llama-7b

Nina Panickssery16 Jul 2023 4:17 UTC
51 points
1 comment3 min readLW link

GPT-2′s po­si­tional em­bed­ding ma­trix is a helix

AdamYedidia21 Jul 2023 4:16 UTC
44 points
21 comments4 min readLW link

[Linkpost] In­ter­pret­ing Mul­ti­modal Video Trans­form­ers Us­ing Brain Recordings

Bogdan Ionut Cirstea21 Jul 2023 11:26 UTC
5 points
0 comments1 min readLW link

Train­ing Pro­cess Trans­parency through Gra­di­ent In­ter­pretabil­ity: Early ex­per­i­ments on toy lan­guage models

21 Jul 2023 14:52 UTC
56 points
1 comment1 min readLW link

Be­cause of Lay­erNorm, Direc­tions in GPT-2 MLP Lay­ers are Monosemantic

ojorgensen28 Jul 2023 19:43 UTC
13 points
3 comments13 min readLW link

Thoughts about the Mechanis­tic In­ter­pretabil­ity Challenge #2 (EIS VII #2)

RGRGRG28 Jul 2023 20:44 UTC
23 points
5 comments20 min readLW link

AI Safety 101 : In­tro­duc­tion to Vi­sion Interpretability

28 Jul 2023 17:32 UTC
41 points
0 comments1 min readLW link
(github.com)

A Short Memo on AI In­ter­pretabil­ity Rain­bows

scasper27 Jul 2023 23:05 UTC
18 points
0 comments2 min readLW link

Visi­ble loss land­scape bas­ins don’t cor­re­spond to dis­tinct algorithms

Mikhail Samin28 Jul 2023 16:19 UTC
68 points
13 comments4 min readLW link

[Linkpost] Mul­ti­modal Neu­rons in Pre­trained Text-Only Transformers

Bogdan Ionut Cirstea4 Aug 2023 15:29 UTC
11 points
0 comments1 min readLW link

Ground-Truth La­bel Im­bal­ance Im­pairs the Perfor­mance of Con­trast-Con­sis­tent Search (and Other Con­trast-Pair-Based Un­su­per­vised Meth­ods)

5 Aug 2023 17:55 UTC
6 points
2 comments7 min readLW link
(drive.google.com)

Mech In­terp Challenge: Au­gust—De­ci­pher­ing the First Unique Char­ac­ter Model

CallumMcDougall9 Aug 2023 19:14 UTC
36 points
1 comment3 min readLW link

The po­si­tional em­bed­ding ma­trix and pre­vi­ous-to­ken heads: how do they ac­tu­ally work?

AdamYedidia10 Aug 2023 1:58 UTC
26 points
4 comments13 min readLW link

An in­ter­ac­tive in­tro­duc­tion to grokking and mechanis­tic interpretability

7 Aug 2023 19:09 UTC
23 points
3 comments1 min readLW link
(pair.withgoogle.com)

De­com­pos­ing in­de­pen­dent gen­er­al­iza­tions in neu­ral net­works via Hes­sian analysis

14 Aug 2023 17:04 UTC
83 points
4 comments1 min readLW link

Un­der­stand­ing the In­for­ma­tion Flow in­side Large Lan­guage Models

15 Aug 2023 21:13 UTC
19 points
0 comments17 min readLW link

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

Justausername24 Aug 2023 3:53 UTC
1 point
0 comments6 min readLW link

Causal­ity and a Cost Se­man­tics for Neu­ral Networks

scottviteri21 Aug 2023 21:02 UTC
22 points
1 comment1 min readLW link

[Question] Would it be use­ful to col­lect the con­texts, where var­i­ous LLMs think the same?

Martin Vlach24 Aug 2023 22:01 UTC
6 points
1 comment1 min readLW link

Un­der­stand­ing Coun­ter­bal­anced Sub­trac­tions for Bet­ter Ac­ti­va­tion Additions

ojorgensen17 Aug 2023 13:53 UTC
21 points
0 comments14 min readLW link

Memetic Judo #3: The In­tel­li­gence of Stochas­tic Par­rots v.2

Max TK20 Aug 2023 15:18 UTC
8 points
33 comments6 min readLW link

An OV-Co­her­ent Toy Model of At­ten­tion Head Superposition

29 Aug 2023 19:44 UTC
26 points
2 comments6 min readLW link

An ad­ver­sar­ial ex­am­ple for Direct Logit At­tri­bu­tion: mem­ory man­age­ment in gelu-4l

30 Aug 2023 17:36 UTC
17 points
0 comments8 min readLW link
(arxiv.org)

Bar­ri­ers to Mechanis­tic In­ter­pretabil­ity for AGI Safety

Connor Leahy29 Aug 2023 10:56 UTC
63 points
13 comments1 min readLW link
(www.youtube.com)

Open Call for Re­search As­sis­tants in Devel­op­men­tal Interpretability

30 Aug 2023 9:02 UTC
55 points
11 comments4 min readLW link

An In­ter­pretabil­ity Illu­sion for Ac­ti­va­tion Patch­ing of Ar­bi­trary Subspaces

29 Aug 2023 1:04 UTC
77 points
4 comments1 min readLW link

In­ter­pret­ing a ma­trix-val­ued word em­bed­ding with a math­e­mat­i­cally proven char­ac­ter­i­za­tion of all optima

Joseph Van Name4 Sep 2023 16:19 UTC
3 points
4 comments12 min readLW link