Interpretability (ML & AI)

Last edit: 10 Nov 2023 13:11 UTC by niplav

Transparency and interpretability refer to the extent to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model’s output, but the model cannot tell you why it produced that output. This makes it hard, for example, to determine the cause of biases in ML models.

A prominent subfield of neural network interpretability is mechanistic interpretability, which attempts to understand the internal mechanisms by which neural networks perform their tasks, for example by finding circuits in transformer models. This can be contrasted with subfields of interpretability that seek to attribute a model’s output to parts of a specific input, such as identifying which pixels in an input image caused a computer vision model to output the classification “horse”.
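As a minimal illustrative sketch of the attribution flavour of interpretability (assuming PyTorch, a hypothetical pretrained image classifier `classifier`, and a hypothetical class index `horse_idx`; none of these come from the posts below), one can score each input pixel by how strongly the “horse” logit responds to it:

```python
import torch

def saliency_map(model, image, target_class):
    """Input-gradient saliency: score each pixel of `image` by how much
    the logit for `target_class` changes as that pixel changes.
    `image` is a (1, C, H, W) float tensor; `model` maps it to logits."""
    model.eval()
    image = image.clone().requires_grad_(True)
    logits = model(image)                 # shape (1, num_classes)
    logits[0, target_class].backward()    # populates image.grad
    # Take the strongest gradient magnitude across colour channels per pixel.
    return image.grad.detach().abs().max(dim=1).values  # shape (1, H, W)

# Hypothetical usage (placeholder names, not a method from any specific post):
# saliency = saliency_map(classifier, preprocessed_image, horse_idx)
```

Mechanistic interpretability, by contrast, would ask which internal circuits compute that logit, rather than which input pixels the logit is sensitive to.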

See Also

Research

A small update to the Sparse Coding interim research report

30 Apr 2023 19:54 UTC
61 points
5 comments · 1 min read · LW link

Interpretability in ML: A Broad Overview

lifelonglearner, 4 Aug 2020 19:03 UTC
53 points
5 comments · 15 min read · LW link

Re-Examining LayerNorm

Eric Winsor, 1 Dec 2022 22:20 UTC
125 points
12 comments · 5 min read · LW link

[Interim research report] Taking features out of superposition with sparse autoencoders

13 Dec 2022 15:41 UTC
149 points
23 comments · 22 min read · LW link · 2 reviews

Finding Neurons in a Haystack: Case Studies with Sparse Probing

3 May 2023 13:30 UTC
33 points
5 comments · 2 min read · LW link
(arxiv.org)

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

Neel Nanda, 28 Dec 2022 21:06 UTC
106 points
0 comments · 10 min read · LW link

Chris Olah’s views on AGI safety

evhub, 1 Nov 2019 20:13 UTC
207 points
38 comments · 12 min read · LW link · 2 reviews

A Longlist of Theories of Impact for Interpretability

Neel Nanda, 11 Mar 2022 14:55 UTC
124 points
37 comments · 5 min read · LW link · 2 reviews

How To Go From Interpretability To Alignment: Just Retarget The Search

johnswentworth, 10 Aug 2022 16:08 UTC
202 points
34 comments · 3 min read · LW link · 1 review

The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

28 Nov 2022 12:54 UTC
198 points
33 comments · 31 min read · LW link

[Question] Papers to start getting into NLP-focused alignment research

Feraidoon, 24 Sep 2022 23:53 UTC
6 points
0 comments · 1 min read · LW link

A Mechanistic Interpretability Analysis of Grokking

15 Aug 2022 2:41 UTC
373 points
47 comments · 36 min read · LW link · 1 review
(colab.research.google.com)

Searching for Search

28 Nov 2022 15:31 UTC
94 points
9 comments · 14 min read · LW link · 1 review

A Rocket–Interpretability Analogy

plex, 21 Oct 2024 13:55 UTC
149 points
31 comments · 1 min read · LW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide, 22 Jul 2024 18:45 UTC
118 points
19 comments · 12 min read · LW link

SolidGoldMagikarp (plus, prompt generation)

5 Feb 2023 22:02 UTC
676 points
205 comments · 12 min read · LW link

Trans­parency and AGI safety

jylin0411 Jan 2021 18:51 UTC
54 points
12 comments30 min readLW link

Do Sparse Au­toen­coders (SAEs) trans­fer across base and fine­tuned lan­guage mod­els?

29 Sep 2024 19:37 UTC
26 points
8 comments25 min readLW link

Strik­ing Im­pli­ca­tions for Learn­ing The­ory, In­ter­pretabil­ity — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC
37 points
4 comments2 min readLW link

A trans­parency and in­ter­pretabil­ity tech tree

evhub16 Jun 2022 23:44 UTC
163 points
11 comments18 min readLW link1 review

Against Al­most Every The­ory of Im­pact of Interpretability

Charbel-Raphaël17 Aug 2023 18:44 UTC
322 points
86 comments26 min readLW link

Resi­d­ual stream norms grow ex­po­nen­tially over the for­ward pass

7 May 2023 0:46 UTC
76 points
24 comments11 min readLW link

How to use and in­ter­pret ac­ti­va­tion patching

24 Apr 2024 8:35 UTC
12 points
0 comments18 min readLW link

How In­ter­pretabil­ity can be Impactful

Connall Garrod18 Jul 2022 0:06 UTC
18 points
0 comments37 min readLW link

Take­aways From 3 Years Work­ing In Ma­chine Learning

George3d68 Apr 2022 17:14 UTC
35 points
10 comments11 min readLW link
(www.epistem.ink)

Com­ments on An­thropic’s Scal­ing Monosemanticity

Robert_AIZI3 Jun 2024 12:15 UTC
97 points
8 comments7 min readLW link

My ten­ta­tive in­ter­pretabil­ity re­search agenda—topol­ogy match­ing.

Maxwell Clarke8 Oct 2022 22:14 UTC
10 points
2 comments4 min readLW link

Trans­former Circuits

evhub22 Dec 2021 21:09 UTC
144 points
4 comments3 min readLW link
(transformer-circuits.pub)

In­ter­pret­ing the Learn­ing of Deceit

RogerDearnaley18 Dec 2023 8:12 UTC
30 points
14 comments9 min readLW link

Ac­tu­ally, Othello-GPT Has A Lin­ear Emer­gent World Representation

Neel Nanda29 Mar 2023 22:13 UTC
211 points
26 comments19 min readLW link
(neelnanda.io)

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
58 points
0 comments59 min readLW link

Ma­chine Un­learn­ing Eval­u­a­tions as In­ter­pretabil­ity Benchmarks

23 Oct 2023 16:33 UTC
33 points
2 comments11 min readLW link

The Case for Rad­i­cal Op­ti­mism about Interpretability

Quintin Pope16 Dec 2021 23:38 UTC
66 points
16 comments8 min readLW link1 review

Opinions on In­ter­pretable Ma­chine Learn­ing and 70 Sum­maries of Re­cent Papers

9 Apr 2021 19:19 UTC
141 points
17 comments102 min readLW link

What is In­ter­pretabil­ity?

17 Mar 2020 20:23 UTC
35 points
0 comments11 min readLW link

Ideation and Tra­jec­tory Model­ling in Lan­guage Models

NickyP5 Oct 2023 19:21 UTC
16 points
2 comments10 min readLW link

SAE re­con­struc­tion er­rors are (em­piri­cally) pathological

wesg29 Mar 2024 16:37 UTC
105 points
16 comments8 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

14 Dec 2022 14:33 UTC
29 points
5 comments11 min readLW link

Us­ing GPT-N to Solve In­ter­pretabil­ity of Neu­ral Net­works: A Re­search Agenda

3 Sep 2020 18:27 UTC
68 points
11 comments2 min readLW link

Is In­ter­pretabil­ity All We Need?

RogerDearnaley14 Nov 2023 5:31 UTC
1 point
1 comment1 min readLW link

Towards Monose­man­tic­ity: De­com­pos­ing Lan­guage Models With Dic­tionary Learning

Zac Hatfield-Dodds5 Oct 2023 21:01 UTC
287 points
21 comments2 min readLW link
(transformer-circuits.pub)

Com­pact Proofs of Model Perfor­mance via Mechanis­tic Interpretability

24 Jun 2024 19:27 UTC
95 points
3 comments8 min readLW link
(arxiv.org)

In­ter­pret­ing Neu­ral Net­works through the Poly­tope Lens

23 Sep 2022 17:58 UTC
144 points
29 comments33 min readLW link

An­nounc­ing Apollo Research

30 May 2023 16:17 UTC
215 points
11 comments8 min readLW link

A List of 45+ Mech In­terp Pro­ject Ideas from Apollo Re­search’s In­ter­pretabil­ity Team

18 Jul 2024 14:15 UTC
117 points
18 comments18 min readLW link

Re­fusal in LLMs is me­di­ated by a sin­gle direction

27 Apr 2024 11:13 UTC
228 points
93 comments10 min readLW link

The­o­ries of im­pact for Science of Deep Learning

Marius Hobbhahn1 Dec 2022 14:39 UTC
24 points
0 comments11 min readLW link

The Plan − 2022 Update

johnswentworth1 Dec 2022 20:43 UTC
239 points
37 comments8 min readLW link1 review

The ‘strong’ fea­ture hy­poth­e­sis could be wrong

lewis smith2 Aug 2024 14:33 UTC
218 points
17 comments17 min readLW link

Towards Mul­ti­modal In­ter­pretabil­ity: Learn­ing Sparse In­ter­pretable Fea­tures in Vi­sion Transformers

hugofry29 Apr 2024 20:57 UTC
89 points
8 comments11 min readLW link

Deep learn­ing mod­els might be se­cretly (al­most) linear

beren24 Apr 2023 18:43 UTC
117 points
29 comments4 min readLW link

Ba­sic facts about lan­guage mod­els dur­ing training

beren21 Feb 2023 11:46 UTC
97 points
15 comments18 min readLW link

LLMs Univer­sally Learn a Fea­ture Rep­re­sent­ing To­ken Fre­quency /​ Rarity

Sean Osier30 Jun 2024 2:48 UTC
12 points
5 comments6 min readLW link
(github.com)

LLM Mo­du­lar­ity: The Separa­bil­ity of Ca­pa­bil­ities in Large Lan­guage Models

NickyP26 Mar 2023 21:57 UTC
99 points
3 comments41 min readLW link

Mechanis­tic Ano­maly De­tec­tion Re­search Update

6 Aug 2024 10:33 UTC
11 points
0 comments1 min readLW link
(blog.eleuther.ai)

Spar­sify: A mechanis­tic in­ter­pretabil­ity re­search agenda

Lee Sharkey3 Apr 2024 12:34 UTC
94 points
22 comments22 min readLW link

In­ter­pret­ing and Steer­ing Fea­tures in Images

Gytis Daujotas20 Jun 2024 18:33 UTC
65 points
6 comments5 min readLW link

A Com­pre­hen­sive Mechanis­tic In­ter­pretabil­ity Ex­plainer & Glossary

Neel Nanda21 Dec 2022 12:35 UTC
85 points
6 comments2 min readLW link
(neelnanda.io)

(ten­ta­tively) Found 600+ Monose­man­tic Fea­tures in a Small LM Us­ing Sparse Autoencoders

Logan Riggs5 Jul 2023 16:49 UTC
60 points
1 comment7 min readLW link

In­tro­duc­tion to in­ac­cessible information

Ryan Kidd9 Dec 2021 1:28 UTC
27 points
6 comments8 min readLW link

EIS XIV: Is mechanis­tic in­ter­pretabil­ity about to be prac­ti­cally use­ful?

scasper11 Oct 2024 22:13 UTC
67 points
4 comments7 min readLW link

200 COP in MI: In­ter­pret­ing Al­gorith­mic Problems

Neel Nanda31 Dec 2022 19:55 UTC
33 points
2 comments10 min readLW link

Cir­cum­vent­ing in­ter­pretabil­ity: How to defeat mind-readers

Lee Sharkey14 Jul 2022 16:59 UTC
114 points
15 comments33 min readLW link

Ti­maeus’s First Four Months

28 Feb 2024 17:01 UTC
172 points
6 comments6 min readLW link

Deep For­get­ting & Un­learn­ing for Safely-Scoped LLMs

scasper5 Dec 2023 16:48 UTC
123 points
29 comments13 min readLW link

An An­a­lytic Per­spec­tive on AI Alignment

DanielFilan1 Mar 2020 4:10 UTC
54 points
45 comments8 min readLW link
(danielfilan.com)

Ver­ifi­ca­tion and Transparency

DanielFilan8 Aug 2019 1:50 UTC
35 points
6 comments2 min readLW link
(danielfilan.com)

Mechanis­tic Trans­parency for Ma­chine Learning

DanielFilan11 Jul 2018 0:34 UTC
54 points
9 comments4 min readLW link

How can In­ter­pretabil­ity help Align­ment?

23 May 2020 16:16 UTC
37 points
3 comments9 min readLW link

The case for be­com­ing a black-box in­ves­ti­ga­tor of lan­guage models

Buck6 May 2022 14:35 UTC
126 points
20 comments3 min readLW link

One Way to Think About ML Transparency

Matthew Barnett2 Sep 2019 23:27 UTC
26 points
28 comments5 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhub10 Sep 2019 23:03 UTC
69 points
27 comments27 min readLW link

Deep Learn­ing Sys­tems Are Not Less In­ter­pretable Than Logic/​Prob­a­bil­ity/​Etc

johnswentworth4 Jun 2022 5:41 UTC
148 points
55 comments2 min readLW link1 review

Spar­sity and in­ter­pretabil­ity?

1 Jun 2020 13:25 UTC
41 points
3 comments7 min readLW link

How Do Selec­tion The­o­rems Re­late To In­ter­pretabil­ity?

johnswentworth9 Jun 2022 19:39 UTC
60 points
14 comments3 min readLW link

Progress Re­port 6: get the tool working

Nathan Helm-Burger10 Jun 2022 11:18 UTC
4 points
0 comments2 min readLW link

[Question] Can you MRI a deep learn­ing model?

Yair Halberstadt13 Jun 2022 13:43 UTC
3 points
3 comments1 min readLW link

Mechanism for fea­ture learn­ing in neu­ral net­works and back­prop­a­ga­tion-free ma­chine learn­ing models

Matt Goldenberg19 Mar 2024 14:55 UTC
8 points
1 comment1 min readLW link
(www.science.org)

AXRP Epi­sode 21 - In­ter­pretabil­ity for Eng­ineers with Stephen Casper

DanielFilan2 May 2023 0:50 UTC
12 points
1 comment66 min readLW link

Vi­su­al­iz­ing Neu­ral net­works, how to blame the bias

Donald Hobson9 Jul 2022 15:52 UTC
7 points
1 comment6 min readLW link

[Question] How op­ti­mistic should we be about AI figur­ing out how to in­ter­pret it­self?

oh5432125 Jul 2022 22:09 UTC
3 points
1 comment1 min readLW link

Pre­cur­sor check­ing for de­cep­tive alignment

evhub3 Aug 2022 22:56 UTC
24 points
0 comments14 min readLW link

[Linkpost]Trans­former-Based LM Sur­prisal Pre­dicts Hu­man Read­ing Times Best with About Two Billion Train­ing Tokens

Curtis Huebner4 May 2023 17:16 UTC
10 points
1 comment1 min readLW link
(arxiv.org)

In­ter­pretabil­ity/​Tool-ness/​Align­ment/​Cor­rigi­bil­ity are not Composable

johnswentworth8 Aug 2022 18:05 UTC
130 points
12 comments3 min readLW link

AI Trans­parency: Why it’s crit­i­cal and how to ob­tain it.

Zohar Jackson14 Aug 2022 10:31 UTC
6 points
1 comment5 min readLW link

What Makes an Idea Un­der­stand­able? On Ar­chi­tec­turally and Cul­turally Nat­u­ral Ideas.

16 Aug 2022 2:09 UTC
21 points
2 comments16 min readLW link

Stage­wise Devel­op­ment in Neu­ral Networks

20 Mar 2024 19:54 UTC
89 points
1 comment11 min readLW link

Find­ing Sparse Lin­ear Con­nec­tions be­tween Fea­tures in LLMs

9 Dec 2023 2:27 UTC
69 points
5 comments10 min readLW link

What Makes A Good Mea­sure­ment De­vice?

johnswentworth24 Aug 2022 22:45 UTC
37 points
7 comments2 min readLW link

Ra­tional An­i­ma­tions’ in­tro to mechanis­tic interpretability

Writer14 Jun 2024 16:10 UTC
45 points
1 comment11 min readLW link
(youtu.be)

Tak­ing the pa­ram­e­ters which seem to mat­ter and ro­tat­ing them un­til they don’t

Garrett Baker26 Aug 2022 18:26 UTC
120 points
48 comments1 min readLW link

[Linkpost] Play with SAEs on Llama 3

25 Sep 2024 22:35 UTC
40 points
2 comments1 min readLW link

Re­search Re­port: Sparse Au­toen­coders find only 9/​180 board state fea­tures in OthelloGPT

Robert_AIZI5 Mar 2024 13:55 UTC
61 points
24 comments10 min readLW link
(aizi.substack.com)

A rough idea for solv­ing ELK: An ap­proach for train­ing gen­er­al­ist agents like GATO to make plans and de­scribe them to hu­mans clearly and hon­estly.

Michael Soareverix8 Sep 2022 15:20 UTC
2 points
2 comments2 min readLW link

Swap and Scale

Stephen Fowler9 Sep 2022 22:41 UTC
17 points
3 comments1 min readLW link

[Linkpost] A sur­vey on over 300 works about in­ter­pretabil­ity in deep networks

scasper12 Sep 2022 19:07 UTC
97 points
7 comments2 min readLW link
(arxiv.org)

Ex­cit­ing New In­ter­pretabil­ity Paper!

research_prime_space9 May 2023 16:39 UTC
12 points
1 comment1 min readLW link

EIS XIII: Reflec­tions on An­thropic’s SAE Re­search Circa May 2024

scasper21 May 2024 20:15 UTC
157 points
16 comments3 min readLW link

Sparse tri­nary weighted RNNs as a path to bet­ter lan­guage model interpretability

Am8ryllis17 Sep 2022 19:48 UTC
19 points
13 comments3 min readLW link

Toy Models of Superposition

evhub21 Sep 2022 23:48 UTC
69 points
4 comments5 min readLW link1 review
(transformer-circuits.pub)

Apollo Re­search 1-year update

29 May 2024 17:44 UTC
93 points
0 comments7 min readLW link

AGI-Au­to­mated In­ter­pretabil­ity is Suicide

__RicG__10 May 2023 14:20 UTC
23 points
33 comments7 min readLW link

QAPR 3: in­ter­pretabil­ity-guided train­ing of neu­ral nets

Quintin Pope28 Sep 2022 16:02 UTC
58 points
2 comments10 min readLW link

New OpenAI Paper—Lan­guage mod­els can ex­plain neu­rons in lan­guage models

MrThink10 May 2023 7:46 UTC
47 points
14 comments1 min readLW link

More Re­cent Progress in the The­ory of Neu­ral Networks

jylin046 Oct 2022 16:57 UTC
82 points
6 comments4 min readLW link

Poly­se­man­tic­ity and Ca­pac­ity in Neu­ral Networks

7 Oct 2022 17:51 UTC
87 points
14 comments3 min readLW link

Ar­ti­cle Re­view: Google’s AlphaTensor

Robert_AIZI12 Oct 2022 18:04 UTC
8 points
4 comments10 min readLW link

[Question] Pre­vi­ous Work on Re­cre­at­ing Neu­ral Net­work In­put from In­ter­me­di­ate Layer Activations

bglass12 Oct 2022 19:28 UTC
1 point
3 comments1 min readLW link

Re­fusal mechanisms: ini­tial ex­per­i­ments with Llama-2-7b-chat

8 Dec 2023 17:08 UTC
81 points
7 comments7 min readLW link

(OLD) An Ex­tremely Opinionated An­no­tated List of My Favourite Mechanis­tic In­ter­pretabil­ity Papers

Neel Nanda18 Oct 2022 21:08 UTC
72 points
5 comments12 min readLW link
(www.neelnanda.io)

A Bare­bones Guide to Mechanis­tic In­ter­pretabil­ity Prerequisites

Neel Nanda24 Oct 2022 20:45 UTC
63 points
12 comments3 min readLW link
(neelnanda.io)

A Walk­through of A Math­e­mat­i­cal Frame­work for Trans­former Circuits

Neel Nanda25 Oct 2022 20:24 UTC
52 points
7 comments1 min readLW link
(www.youtube.com)

At­ten­tion Out­put SAEs Im­prove Cir­cuit Analysis

21 Jun 2024 12:56 UTC
31 points
0 comments19 min readLW link

Ac­ti­va­tion ad­di­tions in a small resi­d­ual network

Garrett Baker22 May 2023 20:28 UTC
22 points
4 comments3 min readLW link

[Book] In­ter­pretable Ma­chine Learn­ing: A Guide for Mak­ing Black Box Models Explainable

Esben Kran31 Oct 2022 11:38 UTC
20 points
1 comment1 min readLW link
(christophm.github.io)

“Cars and Elephants”: a hand­wavy ar­gu­ment/​anal­ogy against mechanis­tic interpretability

David Scott Krueger (formerly: capybaralet)31 Oct 2022 21:26 UTC
48 points
25 comments2 min readLW link

Real-Time Re­search Record­ing: Can a Trans­former Re-Derive Po­si­tional Info?

Neel Nanda1 Nov 2022 23:56 UTC
69 points
16 comments1 min readLW link
(youtu.be)

A Mys­tery About High Di­men­sional Con­cept Encoding

Fabien Roger3 Nov 2022 17:05 UTC
46 points
13 comments7 min readLW link

AXRP Epi­sode 36 - Adam Shai and Paul Riech­ers on Com­pu­ta­tional Mechanics

DanielFilan29 Sep 2024 5:50 UTC
25 points
0 comments55 min readLW link

[Linkpost] In­ter­pretabil­ity Dreams

DanielFilan24 May 2023 21:08 UTC
39 points
2 comments2 min readLW link
(transformer-circuits.pub)

A Walk­through of In­ter­pretabil­ity in the Wild (w/​ au­thors Kevin Wang, Arthur Conmy & Alexan­dre Variengien)

Neel Nanda7 Nov 2022 22:39 UTC
30 points
15 comments3 min readLW link
(youtu.be)

Search ver­sus design

Alex Flint16 Aug 2020 16:53 UTC
108 points
40 comments36 min readLW link1 review

SAE fea­ture ge­om­e­try is out­side the su­per­po­si­tion hypothesis

jake_mendel24 Jun 2024 16:07 UTC
221 points
17 comments11 min readLW link

Why and When In­ter­pretabil­ity Work is Dangerous

Nicholas / Heather Kross28 May 2023 0:27 UTC
20 points
8 comments8 min readLW link
(www.thinkingmuchbetter.com)

Re­sults from the Tur­ing Sem­i­nar hackathon

7 Dec 2023 14:50 UTC
29 points
1 comment6 min readLW link

Ex­plor­ing SAE fea­tures in LLMs with defi­ni­tion trees and to­ken lists

mwatkins4 Oct 2024 22:15 UTC
37 points
5 comments6 min readLW link

A Walk­through of In-Con­text Learn­ing and In­duc­tion Heads (w/​ Charles Frye) Part 1 of 2

Neel Nanda22 Nov 2022 17:12 UTC
20 points
0 comments1 min readLW link
(www.youtube.com)

Sub­sets and quo­tients in interpretability

Erik Jenner2 Dec 2022 23:13 UTC
26 points
1 comment7 min readLW link

Find­ing gliders in the game of life

paulfchristiano1 Dec 2022 20:40 UTC
101 points
7 comments16 min readLW link
(ai-alignment.com)

Ex­plor­ing Con­cept-Spe­cific Slices in Weight Ma­tri­ces for Net­work Interpretability

DuncanFowler9 Jun 2023 16:39 UTC
1 point
0 comments6 min readLW link

In­fer­ence-Time In­ter­ven­tion: Elic­it­ing Truth­ful An­swers from a Lan­guage Model

likenneth11 Jun 2023 5:38 UTC
195 points
4 comments1 min readLW link
(arxiv.org)

A Selec­tion of Ran­domly Selected SAE Features

1 Apr 2024 9:09 UTC
109 points
2 comments4 min readLW link

[ASoT] Nat­u­ral ab­strac­tions and AlphaZero

Ulisse Mini10 Dec 2022 17:53 UTC
33 points
1 comment1 min readLW link
(arxiv.org)

Assess­ment of AI safety agen­das: think about the down­side risk

Roman Leventov19 Dec 2023 9:00 UTC
13 points
1 comment1 min readLW link

HDBSCAN is Sur­pris­ingly Effec­tive at Find­ing In­ter­pretable Clusters of the SAE De­coder Matrix

11 Oct 2024 23:06 UTC
8 points
2 comments10 min readLW link

Paper: Trans­form­ers learn in-con­text by gra­di­ent descent

LawrenceC16 Dec 2022 11:10 UTC
28 points
11 comments2 min readLW link
(arxiv.org)

Can we effi­ciently ex­plain model be­hav­iors?

paulfchristiano16 Dec 2022 19:40 UTC
64 points
3 comments9 min readLW link
(ai-alignment.com)

Durkon, an open-source tool for In­her­ently In­ter­pretable Modelling

abstractapplic24 Dec 2022 1:49 UTC
37 points
0 comments4 min readLW link

Con­crete Steps to Get Started in Trans­former Mechanis­tic Interpretability

Neel Nanda25 Dec 2022 22:21 UTC
56 points
7 comments12 min readLW link
(www.neelnanda.io)

SAE-VIS: An­nounce­ment Post

31 Mar 2024 15:30 UTC
74 points
8 comments1 min readLW link

Analo­gies be­tween Soft­ware Re­v­erse Eng­ineer­ing and Mechanis­tic Interpretability

26 Dec 2022 12:26 UTC
34 points
6 comments11 min readLW link
(www.neelnanda.io)

200 COP in MI: The Case for Analysing Toy Lan­guage Models

Neel Nanda28 Dec 2022 21:07 UTC
39 points
3 comments7 min readLW link

In­ter­pret­ing Prefer­ence Models w/​ Sparse Autoencoders

1 Jul 2024 21:35 UTC
74 points
12 comments9 min readLW link

200 COP in MI: Look­ing for Cir­cuits in the Wild

Neel Nanda29 Dec 2022 20:59 UTC
16 points
5 comments13 min readLW link

fMRI LIKE APPROACH TO AI ALIGNMENT /​ DECEPTIVE BEHAVIOUR

Escaque 6611 Jul 2023 17:17 UTC
−1 points
3 comments2 min readLW link

Cir­cuits in Su­per­po­si­tion: Com­press­ing many small neu­ral net­works into one

14 Oct 2024 13:06 UTC
126 points
8 comments13 min readLW link

The Com­pu­ta­tional Com­plex­ity of Cir­cuit Dis­cov­ery for In­ner Interpretability

Bogdan Ionut Cirstea17 Oct 2024 13:18 UTC
11 points
2 comments1 min readLW link
(arxiv.org)

200 COP in MI: Ex­plor­ing Poly­se­man­tic­ity and Superposition

Neel Nanda3 Jan 2023 1:52 UTC
34 points
6 comments16 min readLW link

Towards Devel­op­men­tal Interpretability

12 Jul 2023 19:33 UTC
180 points
9 comments9 min readLW link

Com­ments on OpenPhil’s In­ter­pretabil­ity RFP

paulfchristiano5 Nov 2021 22:36 UTC
91 points
5 comments7 min readLW link

200 COP in MI: Analysing Train­ing Dynamics

Neel Nanda4 Jan 2023 16:08 UTC
16 points
0 comments14 min readLW link

Au­toIn­ter­pre­ta­tion Finds Sparse Cod­ing Beats Alternatives

Hoagy17 Jul 2023 1:41 UTC
56 points
1 comment7 min readLW link

Paper: Su­per­po­si­tion, Me­moriza­tion, and Dou­ble Des­cent (An­thropic)

LawrenceC5 Jan 2023 17:54 UTC
53 points
11 comments1 min readLW link
(transformer-circuits.pub)

200 COP in MI: Tech­niques, Tool­ing and Automation

Neel Nanda6 Jan 2023 15:08 UTC
13 points
0 comments15 min readLW link

200 COP in MI: Image Model Interpretability

Neel Nanda8 Jan 2023 14:53 UTC
18 points
3 comments6 min readLW link

He­donic Loops and Tam­ing RL

beren19 Jul 2023 15:12 UTC
20 points
14 comments9 min readLW link

How ARENA course ma­te­rial gets made

CallumMcDougall2 Jul 2024 18:04 UTC
41 points
2 comments7 min readLW link

200 COP in MI: In­ter­pret­ing Re­in­force­ment Learning

Neel Nanda10 Jan 2023 17:37 UTC
25 points
1 comment10 min readLW link

Tiny Mech In­terp Pro­jects: Emer­gent Po­si­tional Embed­dings of Words

Neel Nanda18 Jul 2023 21:24 UTC
51 points
1 comment9 min readLW link

World-Model In­ter­pretabil­ity Is All We Need

Thane Ruthenis14 Jan 2023 19:37 UTC
35 points
22 comments21 min readLW link

Desider­ata for an AI

Nathan Helm-Burger19 Jul 2023 16:18 UTC
9 points
0 comments4 min readLW link

How does GPT-3 spend its 175B pa­ram­e­ters?

Robert_AIZI13 Jan 2023 19:21 UTC
41 points
14 comments6 min readLW link
(aizi.substack.com)

Open prob­lems in ac­ti­va­tion engineering

24 Jul 2023 19:46 UTC
51 points
2 comments1 min readLW link
(coda.io)

Mech In­terp Puz­zle 1: Sus­pi­ciously Similar Embed­dings in GPT-Neo

Neel Nanda16 Jul 2023 22:02 UTC
65 points
15 comments1 min readLW link

200 COP in MI: Study­ing Learned Fea­tures in Lan­guage Models

Neel Nanda19 Jan 2023 3:48 UTC
24 points
2 comments30 min readLW link

[Question] Trans­former Mech In­terp: Any vi­su­al­iza­tions?

Joyee Chen18 Jan 2023 4:32 UTC
3 points
0 comments1 min readLW link

SAEs you can See: Ap­ply­ing Sparse Au­toen­coders to Clustering

Robert_AIZI28 Oct 2024 14:48 UTC
26 points
0 comments9 min readLW link

Does Cir­cuit Anal­y­sis In­ter­pretabil­ity Scale? Ev­i­dence from Mul­ti­ple Choice Ca­pa­bil­ities in Chinchilla

20 Jul 2023 10:50 UTC
44 points
3 comments2 min readLW link
(arxiv.org)

An­thropic an­nounces in­ter­pretabil­ity ad­vances. How much does this ad­vance al­ign­ment?

Seth Herd21 May 2024 22:30 UTC
49 points
4 comments3 min readLW link
(www.anthropic.com)

Really Strong Fea­tures Found in Resi­d­ual Stream

Logan Riggs8 Jul 2023 19:40 UTC
69 points
6 comments2 min readLW link

Mechanis­tic In­ter­pretabil­ity Quick­start Guide

Neel Nanda31 Jan 2023 16:35 UTC
42 points
3 comments6 min readLW link
(www.neelnanda.io)

More find­ings on Me­moriza­tion and dou­ble descent

Marius Hobbhahn1 Feb 2023 18:26 UTC
53 points
2 comments19 min readLW link

More find­ings on max­i­mal data dimension

Marius Hobbhahn2 Feb 2023 18:33 UTC
27 points
1 comment11 min readLW link

Neuronpedia

Johnny Lin26 Jul 2023 16:29 UTC
135 points
51 comments2 min readLW link
(neuronpedia.org)

AXRP Epi­sode 19 - Mechanis­tic In­ter­pretabil­ity with Neel Nanda

DanielFilan4 Feb 2023 3:00 UTC
45 points
0 comments117 min readLW link

AXRP Epi­sode 23 - Mechanis­tic Ano­maly De­tec­tion with Mark Xu

DanielFilan27 Jul 2023 1:50 UTC
22 points
0 comments72 min readLW link

Mech In­terp Pro­ject Ad­vis­ing Call: Me­mori­sa­tion in GPT-2 Small

Neel Nanda4 Feb 2023 14:17 UTC
7 points
0 comments1 min readLW link

An Ex­tremely Opinionated An­no­tated List of My Favourite Mechanis­tic In­ter­pretabil­ity Papers v2

Neel Nanda7 Jul 2024 17:39 UTC
134 points
15 comments25 min readLW link

[ASoT] Policy Tra­jec­tory Visualization

Ulisse Mini7 Feb 2023 0:13 UTC
9 points
2 comments1 min readLW link

Re­view of AI Align­ment Progress

PeterMcCluskey7 Feb 2023 18:57 UTC
72 points
32 comments7 min readLW link
(bayesianinvestor.com)

Why I’m bear­ish on mechanis­tic in­ter­pretabil­ity: the shards are not in the network

tailcalled13 Sep 2024 17:09 UTC
19 points
40 comments1 min readLW link

Mech In­terp Puz­zle 2: Word2Vec Style Embeddings

Neel Nanda28 Jul 2023 0:50 UTC
40 points
4 comments2 min readLW link

On Devel­op­ing a Math­e­mat­i­cal The­ory of In­ter­pretabil­ity

carboniferous_umbraculum 9 Feb 2023 1:45 UTC
64 points
8 comments6 min readLW link

Apollo Re­search is hiring evals and in­ter­pretabil­ity en­g­ineers & scientists

Marius Hobbhahn4 Aug 2023 10:54 UTC
25 points
0 comments2 min readLW link

Toward A Math­e­mat­i­cal Frame­work for Com­pu­ta­tion in Superposition

18 Jan 2024 21:06 UTC
203 points
18 comments63 min readLW link

The con­cep­tual Dop­pelgänger problem

TsviBT12 Feb 2023 17:23 UTC
12 points
5 comments4 min readLW link

AI al­ign­ment as a trans­la­tion problem

Roman Leventov5 Feb 2024 14:14 UTC
22 points
2 comments3 min readLW link

AXRP Epi­sode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

DanielFilan24 Aug 2024 22:30 UTC
21 points
0 comments74 min readLW link

Gated At­ten­tion Blocks: Pre­limi­nary Progress to­ward Re­mov­ing At­ten­tion Head Superposition

8 Apr 2024 11:14 UTC
37 points
4 comments15 min readLW link

EIS V: Blind Spots In AI Safety In­ter­pretabil­ity Research

scasper16 Feb 2023 19:09 UTC
54 points
24 comments10 min readLW link

Grow­ing Bon­sai Net­works with RNNs

ameo7 Aug 2023 17:34 UTC
21 points
5 comments1 min readLW link
(cprimozic.net)

How Do In­duc­tion Heads Ac­tu­ally Work in Trans­form­ers With Finite Ca­pac­ity?

Fabien Roger23 Mar 2023 9:09 UTC
27 points
0 comments5 min readLW link

Wittgen­stein and ML — pa­ram­e­ters vs architecture

Cleo Nardo24 Mar 2023 4:54 UTC
44 points
9 comments5 min readLW link

Causal Graphs of GPT-2-Small’s Resi­d­ual Stream

David Udell9 Jul 2024 22:06 UTC
53 points
7 comments7 min readLW link

Mea­sur­ing Struc­ture Devel­op­ment in Al­gorith­mic Transformers

22 Aug 2024 8:38 UTC
56 points
4 comments11 min readLW link

In­ter­ven­ing in the Resi­d­ual Stream

MadHatter22 Feb 2023 6:29 UTC
30 points
1 comment9 min readLW link

Video/​an­i­ma­tion: Neel Nanda ex­plains what mechanis­tic in­ter­pretabil­ity is

DanielFilan22 Feb 2023 22:42 UTC
24 points
7 comments1 min readLW link
(youtu.be)

Othello-GPT: Fu­ture Work I Am Ex­cited About

Neel Nanda29 Mar 2023 22:13 UTC
48 points
2 comments33 min readLW link
(neelnanda.io)

Ap­ply for the 2023 Devel­op­men­tal In­ter­pretabil­ity Con­fer­ence!

25 Aug 2023 7:12 UTC
33 points
0 comments2 min readLW link

Paper Walk­through: Au­to­mated Cir­cuit Dis­cov­ery with Arthur Conmy

Neel Nanda29 Aug 2023 22:07 UTC
36 points
1 comment1 min readLW link
(www.youtube.com)

Othello-GPT: Reflec­tions on the Re­search Process

Neel Nanda29 Mar 2023 22:13 UTC
36 points
0 comments15 min readLW link
(neelnanda.io)

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

11 Mar 2023 18:59 UTC
328 points
27 comments23 min readLW link

Map­ping the se­man­tic void: Strange go­ings-on in GPT em­bed­ding spaces

mwatkins14 Dec 2023 13:10 UTC
114 points
31 comments14 min readLW link

Ad­den­dum: ba­sic facts about lan­guage mod­els dur­ing training

beren6 Mar 2023 19:24 UTC
22 points
2 comments5 min readLW link

The Lo­cal In­ter­ac­tion Ba­sis: Iden­ti­fy­ing Com­pu­ta­tion­ally-Rele­vant and Sparsely In­ter­act­ing Fea­tures in Neu­ral Networks

20 May 2024 17:53 UTC
105 points
4 comments3 min readLW link

Maze-solv­ing agents: Add a top-right vec­tor, make the agent go to the top-right

31 Mar 2023 19:20 UTC
101 points
17 comments11 min readLW link

The Translu­cent Thoughts Hy­pothe­ses and Their Implications

Fabien Roger9 Mar 2023 16:30 UTC
141 points
7 comments19 min readLW link

Discrim­i­nat­ing Be­hav­iorally Iden­ti­cal Clas­sifiers: a model prob­lem for ap­ply­ing in­ter­pretabil­ity to scal­able oversight

Sam Marks18 Apr 2024 16:17 UTC
107 points
10 comments12 min readLW link

Paper Repli­ca­tion Walk­through: Re­v­erse-Eng­ineer­ing Mo­du­lar Addition

Neel Nanda12 Mar 2023 13:25 UTC
18 points
0 comments1 min readLW link
(neelnanda.io)

[Sum­mary] Progress Up­date #1 from the GDM Mech In­terp Team

19 Apr 2024 19:06 UTC
68 points
0 comments3 min readLW link

At­tri­bu­tion Patch­ing: Ac­ti­va­tion Patch­ing At In­dus­trial Scale

Neel Nanda16 Mar 2023 21:44 UTC
45 points
10 comments58 min readLW link
(www.neelnanda.io)

In­tro­duc­ing Leap Labs, an AI in­ter­pretabil­ity startup

Jessica Rumbelow6 Mar 2023 16:16 UTC
103 points
12 comments1 min readLW link

A cir­cuit for Python doc­strings in a 4-layer at­ten­tion-only transformer

20 Feb 2023 19:35 UTC
95 points
8 comments21 min readLW link

You’re Mea­sur­ing Model Com­plex­ity Wrong

11 Oct 2023 11:46 UTC
87 points
15 comments13 min readLW link

Interpretability

29 Oct 2021 7:28 UTC
60 points
13 comments12 min readLW link

Gi­ant (In)scrutable Ma­tri­ces: (Maybe) the Best of All Pos­si­ble Worlds

1a3orn4 Apr 2023 17:39 UTC
196 points
37 comments5 min readLW link

JumpReLU SAEs + Early Ac­cess to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
48 points
10 comments1 min readLW link
(storage.googleapis.com)

Sparse Cod­ing, for Mechanis­tic In­ter­pretabil­ity and Ac­ti­va­tion Engineering

David Udell23 Sep 2023 19:16 UTC
42 points
7 comments34 min readLW link

[Full Post] Progress Up­date #1 from the GDM Mech In­terp Team

19 Apr 2024 19:06 UTC
73 points
10 comments8 min readLW link

Difficulty classes for al­ign­ment properties

Jozdien20 Feb 2024 9:08 UTC
34 points
5 comments2 min readLW link

ProLU: A Non­lin­ear­ity for Sparse Autoencoders

Glen Taggart23 Apr 2024 14:09 UTC
44 points
4 comments9 min readLW link

High­lights: Went­worth, Shah, and Mur­phy on “Re­tar­get­ing the Search”

RobertM14 Sep 2023 2:18 UTC
85 points
4 comments8 min readLW link

Mech In­terp Challenge: Septem­ber—De­ci­pher­ing the Ad­di­tion Model

CallumMcDougall13 Sep 2023 22:23 UTC
35 points
0 comments4 min readLW link

Sparse Au­toen­coders Find Highly In­ter­pretable Direc­tions in Lan­guage Models

21 Sep 2023 15:30 UTC
158 points
8 comments5 min readLW link

Mech In­terp Challenge: Jan­uary—De­ci­pher­ing the Cae­sar Cipher Model

CallumMcDougall1 Jan 2024 18:03 UTC
17 points
0 comments3 min readLW link

Iden­ti­fy­ing se­man­tic neu­rons, mechanis­tic cir­cuits & in­ter­pretabil­ity web apps

13 Apr 2023 11:59 UTC
18 points
0 comments8 min readLW link

Steer­ing GPT-2-XL by adding an ac­ti­va­tion vector

13 May 2023 18:42 UTC
436 points
97 comments50 min readLW link

Why I stopped be­ing into basin broadness

tailcalled25 Apr 2024 20:47 UTC
16 points
3 comments2 min readLW link

Shap­ley Value At­tri­bu­tion in Chain of Thought

leogao14 Apr 2023 5:56 UTC
103 points
7 comments4 min readLW link

Three ways in­ter­pretabil­ity could be impactful

Arthur Conmy18 Sep 2023 1:02 UTC
47 points
8 comments4 min readLW link

Smar­tyHead­erCode: anoma­lous to­kens for GPT3.5 and GPT-4

AdamYedidia15 Apr 2023 22:35 UTC
71 points
18 comments6 min readLW link

Neel Nanda on the Mechanis­tic In­ter­pretabil­ity Re­searcher Mindset

Michaël Trazzi21 Sep 2023 19:47 UTC
37 points
1 comment3 min readLW link
(theinsideview.ai)

Lan­guage Models are a Po­ten­tially Safe Path to Hu­man-Level AGI

Nadav Brandes20 Apr 2023 0:40 UTC
28 points
6 comments8 min readLW link

Be­havi­oural statis­tics for a maze-solv­ing agent

20 Apr 2023 22:26 UTC
46 points
11 comments10 min readLW link

In­ter­pret­ing OpenAI’s Whisper

EllenaR24 Sep 2023 17:53 UTC
114 points
13 comments7 min readLW link

Mech In­terp Challenge: Oc­to­ber—De­ci­pher­ing the Sorted List Model

CallumMcDougall3 Oct 2023 10:57 UTC
23 points
0 comments3 min readLW link

Su­per­po­si­tion is not “just” neu­ron polysemanticity

LawrenceC26 Apr 2024 23:22 UTC
64 points
4 comments13 min readLW link

Should we pub­lish mechanis­tic in­ter­pretabil­ity re­search?

21 Apr 2023 16:19 UTC
105 points
40 comments13 min readLW link

Neu­ral net­work poly­topes (Co­lab note­book)

Zach Furman21 Apr 2023 22:42 UTC
11 points
0 comments1 min readLW link
(colab.research.google.com)

Ex­plain­ing the Trans­former Cir­cuits Frame­work by Example

Felix Hofstätter25 Apr 2023 13:45 UTC
8 points
0 comments15 min readLW link

Im­prov­ing Dic­tionary Learn­ing with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
63 points
38 comments1 min readLW link
(arxiv.org)

Fact Find­ing: At­tempt­ing to Re­v­erse-Eng­ineer Fac­tual Re­call on the Neu­ron Level (Post 1)

23 Dec 2023 2:44 UTC
108 points
8 comments22 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
137 points
41 comments11 min readLW link2 reviews

Paper: Un­der­stand­ing and Con­trol­ling a Maze-Solv­ing Policy Network

13 Oct 2023 1:38 UTC
70 points
0 comments1 min readLW link
(arxiv.org)

SAEs Dis­cover Mean­ingful Fea­tures in the IOI Task

5 Jun 2024 23:48 UTC
15 points
2 comments10 min readLW link

Physics of Lan­guage mod­els (part 2.1)

Nathan Helm-Burger19 Sep 2024 16:48 UTC
9 points
2 comments1 min readLW link
(youtu.be)

[Paper] All’s Fair In Love And Love: Copy Sup­pres­sion in GPT-2 Small

13 Oct 2023 18:32 UTC
82 points
4 comments8 min readLW link

Mech In­terp Challenge: Novem­ber—De­ci­pher­ing the Cu­mu­la­tive Sum Model

CallumMcDougall2 Nov 2023 17:10 UTC
18 points
2 comments2 min readLW link

A New Class of Glitch To­kens—BPE Subto­ken Ar­ti­facts (BSA)

Lao Mein20 Sep 2024 13:13 UTC
37 points
7 comments5 min readLW link

Why did ChatGPT say that? Prompt en­g­ineer­ing and more, with PIZZA.

Jessica Rumbelow3 Aug 2024 12:07 UTC
40 points
2 comments4 min readLW link

Mechanis­tic In­ter­pretabil­ity Work­shop Hap­pen­ing at ICML 2024!

3 May 2024 1:18 UTC
48 points
6 comments1 min readLW link

Im­prov­ing SAE’s by Sqrt()-ing L1 & Re­mov­ing Low­est Ac­ti­vat­ing Fea­tures

15 Mar 2024 16:30 UTC
26 points
5 comments4 min readLW link

Ev­i­dence of Learned Look-Ahead in a Chess-Play­ing Neu­ral Network

Erik Jenner4 Jun 2024 15:50 UTC
120 points
14 comments13 min readLW link

Multi-di­men­sional re­wards for AGI in­ter­pretabil­ity and control

Steven Byrnes4 Jan 2021 3:08 UTC
19 points
8 comments10 min readLW link

Glitch To­ken Cat­a­log - (Al­most) a Full Clear

Lao Mein21 Sep 2024 12:22 UTC
37 points
3 comments37 min readLW link

MIRI com­ments on Co­tra’s “Case for Align­ing Nar­rowly Su­per­hu­man Models”

Rob Bensinger5 Mar 2021 23:43 UTC
142 points
13 comments26 min readLW link

Trans­parency Trichotomy

Mark Xu28 Mar 2021 20:26 UTC
25 points
2 comments7 min readLW link

Solv­ing the whole AGI con­trol prob­lem, ver­sion 0.0001

Steven Byrnes8 Apr 2021 15:14 UTC
63 points
7 comments26 min readLW link

Dropout can cre­ate a priv­ileged ba­sis in the ReLU out­put model.

lewis smith28 Apr 2023 1:59 UTC
24 points
3 comments5 min readLW link

Knowl­edge Neu­rons in Pre­trained Transformers

evhub17 May 2021 22:54 UTC
100 points
7 comments2 min readLW link
(arxiv.org)

Self-ex­plain­ing SAE features

5 Aug 2024 22:20 UTC
60 points
13 comments10 min readLW link

Garrabrant and Shah on hu­man mod­el­ing in AGI

Rob Bensinger4 Aug 2021 4:35 UTC
60 points
10 comments47 min readLW link

Neu­ral net /​ de­ci­sion tree hy­brids: a po­ten­tial path to­ward bridg­ing the in­ter­pretabil­ity gap

Nathan Helm-Burger23 Sep 2021 0:38 UTC
21 points
2 comments12 min readLW link

AtP*: An effi­cient and scal­able method for lo­cal­iz­ing LLM be­havi­our to components

18 Mar 2024 17:28 UTC
19 points
0 comments1 min readLW link
(arxiv.org)

Let’s buy out Cyc, for use in AGI in­ter­pretabil­ity sys­tems?

Steven Byrnes7 Dec 2021 20:46 UTC
49 points
10 comments2 min readLW link

Solv­ing In­ter­pretabil­ity Week

Logan Riggs13 Dec 2021 17:09 UTC
11 points
5 comments1 min readLW link

In­ter­pretabil­ity with Sparse Au­toen­coders (Co­lab ex­er­cises)

CallumMcDougall29 Nov 2023 12:56 UTC
74 points
9 comments4 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
127 points
9 comments15 min readLW link

You can re­move GPT2’s Lay­erNorm by fine-tun­ing for an hour

StefanHex8 Aug 2024 18:33 UTC
161 points
11 comments8 min readLW link

How use­ful is mechanis­tic in­ter­pretabil­ity?

1 Dec 2023 2:54 UTC
163 points
54 comments25 min readLW link

Au­tomat­ing LLM Au­dit­ing with Devel­op­men­tal Interpretability

4 Sep 2024 15:50 UTC
17 points
0 comments3 min readLW link

An­nounc­ing Hu­man-al­igned AI Sum­mer School

22 May 2024 8:55 UTC
50 points
0 comments1 min readLW link
(humanaligned.ai)

Ques­tion 3: Con­trol pro­pos­als for min­i­miz­ing bad outcomes

Cameron Berg12 Feb 2022 19:13 UTC
5 points
1 comment7 min readLW link

“What the hell is a rep­re­sen­ta­tion, any­way?” | Clar­ify­ing AI in­ter­pretabil­ity with tools from philos­o­phy of cog­ni­tive sci­ence | Part 1: Ve­hi­cles vs. contents

IwanWilliams9 Jun 2024 14:19 UTC
9 points
1 comment4 min readLW link

Progress Re­port 1: in­ter­pretabil­ity ex­per­i­ments & learn­ing, test­ing com­pres­sion hypotheses

Nathan Helm-Burger22 Mar 2022 20:12 UTC
11 points
0 comments2 min readLW link

[In­tro to brain-like-AGI safety] 9. Take­aways from neuro 2/​2: On AGI motivation

Steven Byrnes23 Mar 2022 12:48 UTC
44 points
11 comments22 min readLW link

Ex­plain­ing grokking through cir­cuit efficiency

8 Sep 2023 14:39 UTC
101 points
11 comments3 min readLW link
(arxiv.org)

Au­to­mat­i­cally find­ing fea­ture vec­tors in the OV cir­cuits of Trans­form­ers with­out us­ing probing

Jacob Dunefsky12 Sep 2023 17:38 UTC
13 points
0 comments29 min readLW link

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

14 Sep 2023 1:40 UTC
32 points
7 comments8 min readLW link
(far.ai)

Ex­pand­ing the Scope of Superposition

Derek Larson13 Sep 2023 17:38 UTC
10 points
0 comments4 min readLW link

Char­bel-Raphaël and Lu­cius dis­cuss interpretability

30 Oct 2023 5:50 UTC
105 points
7 comments21 min readLW link

Seek­ing Feed­back on My Mechanis­tic In­ter­pretabil­ity Re­search Agenda

RGRGRG12 Sep 2023 18:45 UTC
3 points
1 comment3 min readLW link

Mechanis­tic In­ter­pretabil­ity Read­ing group

26 Sep 2023 16:26 UTC
15 points
0 comments1 min readLW link

An­nounc­ing the CNN In­ter­pretabil­ity Competition

scasper26 Sep 2023 16:21 UTC
22 points
0 comments4 min readLW link

High-level in­ter­pretabil­ity: de­tect­ing an AI’s objectives

28 Sep 2023 19:30 UTC
69 points
4 comments21 min readLW link

New Tool: the Resi­d­ual Stream Viewer

AdamYedidia1 Oct 2023 0:49 UTC
32 points
7 comments4 min readLW link
(tinyurl.com)

In­ter­pretabil­ity Ex­ter­nal­ities Case Study—Hun­gry Hun­gry Hippos

Magdalena Wache20 Sep 2023 14:42 UTC
64 points
22 comments2 min readLW link

Graph­i­cal ten­sor no­ta­tion for interpretability

Jordan Taylor4 Oct 2023 8:04 UTC
137 points
11 comments19 min readLW link

Tak­ing fea­tures out of su­per­po­si­tion with sparse au­toen­coders more quickly with in­formed initialization

Pierre Peigné23 Sep 2023 16:21 UTC
30 points
8 comments5 min readLW link

Early Ex­per­i­ments in Re­ward Model In­ter­pre­ta­tion Us­ing Sparse Autoencoders

3 Oct 2023 7:45 UTC
17 points
0 comments5 min readLW link

What would it mean to un­der­stand how a large lan­guage model (LLM) works? Some quick notes.

Bill Benzon3 Oct 2023 15:11 UTC
20 points
4 comments8 min readLW link

A per­sonal ex­pla­na­tion of ELK con­cept and task.

Zeyu Qin6 Oct 2023 3:55 UTC
1 point
0 comments1 min readLW link

En­tan­gle­ment and in­tu­ition about words and mean­ing

Bill Benzon4 Oct 2023 14:16 UTC
4 points
0 comments2 min readLW link

At­tribut­ing to in­ter­ac­tions with GCPD and GWPD

jenny11 Oct 2023 15:06 UTC
20 points
0 comments6 min readLW link

Com­par­ing An­thropic’s Dic­tionary Learn­ing to Ours

Robert_AIZI7 Oct 2023 23:30 UTC
137 points
8 comments4 min readLW link

Bird-eye view vi­su­al­iza­tion of LLM activations

Sergii8 Oct 2023 12:12 UTC
11 points
2 comments1 min readLW link
(grgv.xyz)

Un­der­stand­ing LLMs: Some ba­sic ob­ser­va­tions about words, syn­tax, and dis­course [w/​ a con­jec­ture about grokking]

Bill Benzon11 Oct 2023 19:13 UTC
6 points
0 comments5 min readLW link

An­nounc­ing Timaeus

22 Oct 2023 11:59 UTC
187 points
15 comments4 min readLW link

Mechanis­tic in­ter­pretabil­ity of LLM anal­ogy-making

Sergii20 Oct 2023 12:53 UTC
2 points
0 comments4 min readLW link
(grgv.xyz)

In­ter­nal Tar­get In­for­ma­tion for AI Oversight

Paul Colognese20 Oct 2023 14:53 UTC
15 points
0 comments5 min readLW link

Re­veal­ing In­ten­tion­al­ity In Lan­guage Models Through AdaVAE Guided Sampling

jdp20 Oct 2023 7:32 UTC
119 points
15 comments22 min readLW link

[Question] Does a broad overview of Mechanis­tic In­ter­pretabil­ity ex­ist?

kourabi16 Oct 2023 1:16 UTC
1 point
0 comments1 min readLW link

ChatGPT tells 20 ver­sions of its pro­to­typ­i­cal story, with a short note on method

Bill Benzon14 Oct 2023 15:27 UTC
6 points
0 comments5 min readLW link

Map­ping ChatGPT’s on­tolog­i­cal land­scape, gra­di­ents and choices [in­ter­pretabil­ity]

Bill Benzon15 Oct 2023 20:12 UTC
1 point
0 comments18 min readLW link

[Question] Can we iso­late neu­rons that rec­og­nize fea­tures vs. those which have some other role?

Joshua Clancy21 Oct 2023 0:30 UTC
4 points
2 comments3 min readLW link

In­ves­ti­gat­ing the learn­ing co­effi­cient of mod­u­lar ad­di­tion: hackathon project

17 Oct 2023 19:51 UTC
94 points
5 comments12 min readLW link

Fea­tures and Ad­ver­saries in MemoryDT

20 Oct 2023 7:32 UTC
31 points
6 comments25 min readLW link

Thoughts On (Solv­ing) Deep Deception

Jozdien21 Oct 2023 22:40 UTC
69 points
2 comments6 min readLW link

Challenge: know ev­ery­thing that the best go bot knows about go

DanielFilan11 May 2021 5:10 UTC
48 points
113 comments2 min readLW link
(danielfilan.com)

Spec­u­la­tions against GPT-n writ­ing al­ign­ment papers

Donald Hobson7 Jun 2021 21:13 UTC
31 points
6 comments2 min readLW link

Try­ing to ap­prox­i­mate Statis­ti­cal Models as Scor­ing Tables

Jsevillamol29 Jun 2021 17:20 UTC
18 points
2 comments9 min readLW link

Pos­si­ble re­search di­rec­tions to im­prove the mechanis­tic ex­pla­na­tion of neu­ral networks

delton1379 Nov 2021 2:36 UTC
31 points
8 comments9 min readLW link

[linkpost] Ac­qui­si­tion of Chess Knowl­edge in AlphaZero

Quintin Pope23 Nov 2021 7:55 UTC
8 points
1 comment1 min readLW link

Teaser: Hard-cod­ing Trans­former Models

MadHatter12 Dec 2021 22:04 UTC
74 points
19 comments1 min readLW link

The Nat­u­ral Ab­strac­tion Hy­poth­e­sis: Im­pli­ca­tions and Evidence

CallumMcDougall14 Dec 2021 23:14 UTC
39 points
9 comments19 min readLW link

Mechanis­tic In­ter­pretabil­ity for the MLP Lay­ers (rough early thoughts)

MadHatter24 Dec 2021 7:24 UTC
12 points
3 comments1 min readLW link
(www.youtube.com)

An Open Philan­thropy grant pro­posal: Causal rep­re­sen­ta­tion learn­ing of hu­man preferences

PabloAMC11 Jan 2022 11:28 UTC
19 points
6 comments8 min readLW link

Gears-Level Men­tal Models of Trans­former Interpretability

KevinRoWang29 Mar 2022 20:09 UTC
72 points
4 comments6 min readLW link

Progress Re­port 2

Nathan Helm-Burger30 Mar 2022 2:29 UTC
4 points
1 comment1 min readLW link

Progress re­port 3: clus­ter­ing trans­former neurons

Nathan Helm-Burger5 Apr 2022 23:13 UTC
5 points
0 comments2 min readLW link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

simeon_c7 Apr 2022 13:46 UTC
11 points
0 comments7 min readLW link

Progress Re­port 4: logit lens redux

Nathan Helm-Burger8 Apr 2022 18:35 UTC
4 points
0 comments2 min readLW link

Another list of the­o­ries of im­pact for interpretability

Beth Barnes13 Apr 2022 13:29 UTC
33 points
1 comment5 min readLW link

In­tro­duc­tion to the se­quence: In­ter­pretabil­ity Re­search for the Most Im­por­tant Century

Evan R. Murphy12 May 2022 19:59 UTC
16 points
0 comments8 min readLW link

CNN fea­ture vi­su­al­iza­tion in 50 lines of code

StefanHex26 May 2022 11:02 UTC
17 points
4 comments5 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric Drexler3 Feb 2022 15:20 UTC
85 points
12 comments11 min readLW link1 review

Thoughts on For­mal­iz­ing Composition

Tom Lieberum7 Jun 2022 7:51 UTC
13 points
0 comments7 min readLW link

Re­search Ques­tions from Stained Glass Windows

StefanHex8 Jun 2022 12:38 UTC
4 points
0 comments2 min readLW link

An­thropic’s SoLU (Soft­max Lin­ear Unit)

Joel Burget4 Jul 2022 18:38 UTC
21 points
1 comment4 min readLW link
(transformer-circuits.pub)

Deep neu­ral net­works are not opaque.

jem-mosig6 Jul 2022 18:03 UTC
22 points
14 comments3 min readLW link

Race Along Rashomon Ridge

7 Jul 2022 3:20 UTC
50 points
15 comments8 min readLW link

Find­ing Skele­tons on Rashomon Ridge

24 Jul 2022 22:31 UTC
30 points
2 comments7 min readLW link

In­ter­pretabil­ity isn’t Free

Joel Burget4 Aug 2022 15:02 UTC
10 points
1 comment2 min readLW link

Dis­sected boxed AI

Nathan112312 Aug 2022 2:37 UTC
−8 points
2 comments1 min readLW link

In­ter­pretabil­ity Tools Are an At­tack Channel

Thane Ruthenis17 Aug 2022 18:47 UTC
42 points
14 comments1 min readLW link

A Bite Sized In­tro­duc­tion to ELK

Luk2718217 Sep 2022 0:28 UTC
5 points
0 comments6 min readLW link

The Shard The­ory Align­ment Scheme

David Udell25 Aug 2022 4:52 UTC
47 points
32 comments2 min readLW link

In­for­mal se­man­tics and Orders

Q Home27 Aug 2022 4:17 UTC
14 points
10 comments26 min readLW link

Search­ing for Mo­du­lar­ity in Large Lan­guage Models

8 Sep 2022 2:25 UTC
44 points
3 comments14 min readLW link

Try­ing to find the un­der­ly­ing struc­ture of com­pu­ta­tional systems

Matthias G. Mayer13 Sep 2022 21:16 UTC
17 points
9 comments4 min readLW link

Co­or­di­nate-Free In­ter­pretabil­ity Theory

johnswentworth14 Sep 2022 23:33 UTC
52 points
16 comments5 min readLW link

Math­e­mat­i­cal Cir­cuits in Neu­ral Networks

Sean Osier22 Sep 2022 3:48 UTC
34 points
4 comments1 min readLW link
(www.youtube.com)

Re­call and Re­gur­gi­ta­tion in GPT2

Megan Kinniment3 Oct 2022 19:35 UTC
43 points
1 comment26 min readLW link

Hard-Cod­ing Neu­ral Computation

MadHatter13 Dec 2021 4:35 UTC
34 points
8 comments27 min readLW link

Vi­su­al­iz­ing Learned Rep­re­sen­ta­tions of Rice Disease

muhia_bee3 Oct 2022 9:09 UTC
7 points
0 comments4 min readLW link
(indecisive-sand-24a.notion.site)

Nat­u­ral Cat­e­gories Update

Logan Zoellner10 Oct 2022 15:19 UTC
33 points
6 comments2 min readLW link

Help out Red­wood Re­search’s in­ter­pretabil­ity team by find­ing heuris­tics im­ple­mented by GPT-2 small

12 Oct 2022 21:25 UTC
50 points
11 comments4 min readLW link

Causal scrub­bing: Appendix

3 Dec 2022 0:58 UTC
18 points
4 comments20 min readLW link

Causal Scrub­bing: a method for rigor­ously test­ing in­ter­pretabil­ity hy­pothe­ses [Red­wood Re­search]

3 Dec 2022 0:58 UTC
205 points
35 comments20 min readLW link1 review

Some Les­sons Learned from Study­ing Indi­rect Ob­ject Iden­ti­fi­ca­tion in GPT-2 small

28 Oct 2022 23:55 UTC
101 points
9 comments9 min readLW link2 reviews
(arxiv.org)

Au­dit­ing games for high-level interpretability

Paul Colognese1 Nov 2022 10:44 UTC
33 points
1 comment7 min readLW link

Mechanis­tic In­ter­pretabil­ity as Re­v­erse Eng­ineer­ing (fol­low-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet)3 Nov 2022 23:19 UTC
28 points
3 comments1 min readLW link

Toy Models and Tegum Products

Adam Jermyn4 Nov 2022 18:51 UTC
28 points
7 comments5 min readLW link

Why I’m Work­ing On Model Ag­nos­tic Interpretability

Jessica Rumbelow11 Nov 2022 9:24 UTC
27 points
9 comments2 min readLW link

The limited up­side of interpretability

Peter S. Park15 Nov 2022 18:46 UTC
13 points
11 comments1 min readLW link

Cur­rent themes in mechanis­tic in­ter­pretabil­ity research

16 Nov 2022 14:14 UTC
89 points
2 comments12 min readLW link

Eng­ineer­ing Monose­man­tic­ity in Toy Models

18 Nov 2022 1:43 UTC
75 points
7 comments3 min readLW link
(arxiv.org)

The Ground Truth Prob­lem (Or, Why Eval­u­at­ing In­ter­pretabil­ity Meth­ods Is Hard)

Jessica Rumbelow17 Nov 2022 11:06 UTC
27 points
2 comments2 min readLW link

By De­fault, GPTs Think In Plain Sight

Fabien Roger19 Nov 2022 19:15 UTC
86 points
36 comments9 min readLW link

Multi-Com­po­nent Learn­ing and S-Curves

30 Nov 2022 1:37 UTC
63 points
24 comments7 min readLW link

Causal scrub­bing: re­sults on a paren bal­ance checker

3 Dec 2022 0:59 UTC
34 points
2 comments30 min readLW link

Causal scrub­bing: re­sults on in­duc­tion heads

3 Dec 2022 0:59 UTC
34 points
1 comment17 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibs5 Dec 2022 13:36 UTC
19 points
11 comments2 min readLW link

An ex­plo­ra­tion of GPT-2′s em­bed­ding weights

Adam Scherlis13 Dec 2022 0:46 UTC
42 points
4 comments10 min readLW link

How “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion” Fits Into a Broader Align­ment Scheme

Collin15 Dec 2022 18:22 UTC
244 points
39 comments16 min readLW link1 review

Why mechanis­tic in­ter­pretabil­ity does not and can­not con­tribute to long-term AGI safety (from mes­sages with a friend)

Remmelt19 Dec 2022 12:02 UTC
−3 points
9 comments31 min readLW link

Some Notes on the math­e­mat­ics of Toy Au­toen­cod­ing Problems

carboniferous_umbraculum 22 Dec 2022 17:21 UTC
18 points
1 comment12 min readLW link

In­ter­nal In­ter­faces Are a High-Pri­or­ity In­ter­pretabil­ity Target

Thane Ruthenis29 Dec 2022 17:49 UTC
26 points
6 comments7 min readLW link

But is it re­ally in Rome? An in­ves­ti­ga­tion of the ROME model edit­ing technique

jacquesthibs30 Dec 2022 2:40 UTC
104 points
2 comments18 min readLW link

[Question] Are Mix­ture-of-Ex­perts Trans­form­ers More In­ter­pretable Than Dense Trans­form­ers?

simeon_c31 Dec 2022 11:34 UTC
7 points
5 comments1 min readLW link

In­duc­tion heads—illustrated

CallumMcDougall2 Jan 2023 15:35 UTC
111 points
9 comments3 min readLW link

On the Im­por­tance of Open Sourc­ing Re­ward Models

elandgre2 Jan 2023 19:01 UTC
18 points
5 comments6 min readLW link

Ba­sic Facts about Lan­guage Model Internals

4 Jan 2023 13:01 UTC
130 points
19 comments9 min readLW link

AI psy­chol­ogy should ground the the­o­ries of AI con­scious­ness and in­form hu­man-AI eth­i­cal in­ter­ac­tion design

Roman Leventov8 Jan 2023 6:37 UTC
19 points
8 comments2 min readLW link

Try­ing to iso­late ob­jec­tives: ap­proaches to­ward high-level interpretability

Jozdien9 Jan 2023 18:33 UTC
48 points
14 comments8 min readLW link

The AI Con­trol Prob­lem in a wider in­tel­lec­tual context

philosophybear13 Jan 2023 0:28 UTC
11 points
3 comments12 min readLW link

Can we effi­ciently dis­t­in­guish differ­ent mechanisms?

paulfchristiano27 Dec 2022 0:20 UTC
88 points
30 comments16 min readLW link
(ai-alignment.com)

Neu­ral net­works gen­er­al­ize be­cause of this one weird trick

Jesse Hoogland18 Jan 2023 0:10 UTC
171 points
28 comments53 min readLW link
(www.jessehoogland.com)

Reflec­tions on Trust­ing Trust & AI

Itay Yona16 Jan 2023 6:36 UTC
10 points
1 comment3 min readLW link
(mentaleap.ai)

Large lan­guage mod­els learn to rep­re­sent the world

gjm22 Jan 2023 13:10 UTC
102 points
19 comments3 min readLW link

De­con­fus­ing “Ca­pa­bil­ities vs. Align­ment”

RobertM23 Jan 2023 4:46 UTC
27 points
7 comments2 min readLW link

How-to Trans­former Mechanis­tic In­ter­pretabil­ity—in 50 lines of code or less!

StefanHex24 Jan 2023 18:45 UTC
47 points
5 comments13 min readLW link

[RFC] Pos­si­ble ways to ex­pand on “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion”.

25 Jan 2023 19:03 UTC
48 points
6 comments12 min readLW link

Spooky ac­tion at a dis­tance in the loss landscape

28 Jan 2023 0:22 UTC
61 points
4 comments7 min readLW link
(www.jessehoogland.com)

No Really, At­ten­tion is ALL You Need—At­ten­tion can do feed­for­ward networks

Robert_AIZI31 Jan 2023 18:48 UTC
29 points
7 comments6 min readLW link
(aizi.substack.com)

ChatGPT: Tan­tal­iz­ing af­terthoughts in search of story tra­jec­to­ries [in­duc­tion heads]

Bill Benzon3 Feb 2023 10:35 UTC
4 points
0 comments20 min readLW link

Some mis­cel­la­neous thoughts on ChatGPT, sto­ries, and me­chan­i­cal interpretability

Bill Benzon4 Feb 2023 19:35 UTC
2 points
0 comments3 min readLW link

Gra­di­ent sur­fing: the hid­den role of regularization

Jesse Hoogland6 Feb 2023 3:50 UTC
37 points
9 comments14 min readLW link
(www.jessehoogland.com)

De­ci­sion Trans­former Interpretability

6 Feb 2023 7:29 UTC
84 points
13 comments24 min readLW link

Ad­den­dum: More Effi­cient FFNs via Attention

Robert_AIZI6 Feb 2023 18:55 UTC
10 points
2 comments5 min readLW link
(aizi.substack.com)

LLM Ba­sics: Embed­ding Spaces—Trans­former To­ken Vec­tors Are Not Points in Space

NickyP13 Feb 2023 18:52 UTC
79 points
11 comments15 min readLW link

A multi-dis­ci­plinary view on AI safety research

Roman Leventov8 Feb 2023 16:50 UTC
43 points
4 comments26 min readLW link

The Eng­ineer’s In­ter­pretabil­ity Se­quence (EIS) I: Intro

scasper9 Feb 2023 16:28 UTC
46 points
24 comments3 min readLW link

EIS II: What is “In­ter­pretabil­ity”?

scasper9 Feb 2023 16:48 UTC
28 points
6 comments4 min readLW link

We Found An Neu­ron in GPT-2

11 Feb 2023 18:27 UTC
143 points
23 comments7 min readLW link
(clementneo.com)

Idea: Net­work mod­u­lar­ity and in­ter­pretabil­ity by sex­ual reproduction

qbolec12 Feb 2023 23:06 UTC
3 points
3 comments1 min readLW link

Ex­plain­ing SolidGoldMag­ikarp by look­ing at it from ran­dom directions

Robert_AIZI14 Feb 2023 14:54 UTC
8 points
0 comments8 min readLW link
(aizi.substack.com)

EIS III: Broad Cri­tiques of In­ter­pretabil­ity Research

scasper14 Feb 2023 18:24 UTC
20 points
2 comments11 min readLW link

EIS IV: A Spotlight on Fea­ture At­tri­bu­tion/​Saliency

scasper15 Feb 2023 18:46 UTC
19 points
1 comment4 min readLW link

EIS VI: Cri­tiques of Mechanis­tic In­ter­pretabil­ity Work in AI Safety

scasper17 Feb 2023 20:48 UTC
49 points
9 comments12 min readLW link

EIS VII: A Challenge for Mechanists

scasper18 Feb 2023 18:27 UTC
36 points
4 comments3 min readLW link

The shal­low re­al­ity of ‘deep learn­ing the­ory’

Jesse Hoogland22 Feb 2023 4:16 UTC
34 points
11 comments3 min readLW link
(www.jessehoogland.com)

EIS VIII: An Eng­ineer’s Un­der­stand­ing of De­cep­tive Alignment

scasper19 Feb 2023 15:25 UTC
30 points
5 comments4 min readLW link

EIS IX: In­ter­pretabil­ity and Adversaries

scasper20 Feb 2023 18:25 UTC
30 points
7 comments8 min readLW link

EIS X: Con­tinual Learn­ing, Mo­du­lar­ity, Com­pres­sion, and Biolog­i­cal Brains

scasper21 Feb 2023 16:59 UTC
14 points
4 comments3 min readLW link

EIS XI: Mov­ing Forward

scasper22 Feb 2023 19:05 UTC
19 points
2 comments9 min readLW link

Search­ing for a model’s con­cepts by their shape – a the­o­ret­i­cal framework

23 Feb 2023 20:14 UTC
51 points
0 comments19 min readLW link

EIS XII: Sum­mary

scasper23 Feb 2023 17:45 UTC
18 points
0 comments6 min readLW link

In­ter­pret­ing Embed­ding Spaces by Conceptualization

Adi Simhi28 Feb 2023 18:38 UTC
3 points
0 comments1 min readLW link
(arxiv.org)

In­side the mind of a su­per­hu­man Go model: How does Leela Zero read lad­ders?

Haoxing Du1 Mar 2023 1:47 UTC
157 points
8 comments30 min readLW link

My cur­rent think­ing about ChatGPT @3QD [Gär­den­fors, Wolfram, and the value of spec­u­la­tion]

Bill Benzon1 Mar 2023 10:50 UTC
2 points
0 comments5 min readLW link

ChatGPT tells sto­ries, and a note about re­verse en­g­ineer­ing: A Work­ing Paper

Bill Benzon3 Mar 2023 15:12 UTC
3 points
0 comments3 min readLW link

Against LLM Reductionism

Erich_Grunewald8 Mar 2023 15:52 UTC
140 points
17 comments18 min readLW link
(www.erichgrunewald.com)

Prac­ti­cal Pit­falls of Causal Scrubbing

27 Mar 2023 7:47 UTC
87 points
17 comments13 min readLW link

Creat­ing a Dis­cord server for Mechanis­tic In­ter­pretabil­ity Projects

Victor Levoso12 Mar 2023 18:00 UTC
30 points
6 comments2 min readLW link

In­put Swap Graphs: Dis­cov­er­ing the role of neu­ral net­work com­po­nents at scale

Alexandre Variengien12 May 2023 9:41 UTC
92 points
0 comments33 min readLW link

Hid­den Cog­ni­tion De­tec­tion Meth­ods and Bench­marks

Paul Colognese26 Feb 2024 5:31 UTC
22 points
11 comments4 min readLW link

Ex­am­in­ing Lan­guage Model Perfor­mance with Re­con­structed Ac­ti­va­tions us­ing Sparse Au­toen­coders

27 Feb 2024 2:43 UTC
42 points
16 comments15 min readLW link

Cal­en­dar fea­ture ge­om­e­try in GPT-2 layer 8 resi­d­ual stream SAEs

17 Aug 2024 1:16 UTC
53 points
0 comments5 min readLW link

Has any­one ex­per­i­mented with Do­drio, a tool for ex­plor­ing trans­former mod­els through in­ter­ac­tive vi­su­al­iza­tion?

Bill Benzon11 Dec 2023 20:34 UTC
4 points
0 comments1 min readLW link

Cat­e­gor­i­cal Or­ga­ni­za­tion in Me­mory: ChatGPT Or­ga­nizes the 665 Topic Tags from My New Sa­vanna Blog

Bill Benzon14 Dec 2023 13:02 UTC
0 points
6 comments2 min readLW link

Take­aways from a Mechanis­tic In­ter­pretabil­ity pro­ject on “For­bid­den Facts”

15 Dec 2023 11:05 UTC
33 points
8 comments10 min readLW link

How does a toy 2 digit sub­trac­tion trans­former pre­dict the sign of the out­put?

Evan Anders19 Dec 2023 18:56 UTC
14 points
0 comments8 min readLW link
(evanhanders.blog)

A Univer­sal Emer­gent De­com­po­si­tion of Retrieval Tasks in Lan­guage Models

19 Dec 2023 11:52 UTC
84 points
3 comments10 min readLW link
(arxiv.org)

What’s in the box?! – Towards in­ter­pretabil­ity by dis­t­in­guish­ing niches of value within neu­ral net­works.

Joshua Clancy29 Feb 2024 18:33 UTC
3 points
4 comments128 min readLW link

In­ter­pretabil­ity: In­te­grated Gra­di­ents is a de­cent at­tri­bu­tion method

20 May 2024 17:55 UTC
22 points
7 comments6 min readLW link

Sparse MLP Distillation

slavachalnev15 Jan 2024 19:39 UTC
30 points
3 comments6 min readLW link

Fact Find­ing: Do Early Lay­ers Spe­cial­ise in Lo­cal Pro­cess­ing? (Post 5)

23 Dec 2023 2:46 UTC
18 points
0 comments4 min readLW link

Fact Find­ing: How to Think About In­ter­pret­ing Me­mori­sa­tion (Post 4)

23 Dec 2023 2:46 UTC
22 points
0 comments9 min readLW link

How does a toy 2 digit sub­trac­tion trans­former pre­dict the differ­ence?

Evan Anders22 Dec 2023 21:17 UTC
12 points
0 comments10 min readLW link
(evanhanders.blog)

Fact Find­ing: Sim­plify­ing the Cir­cuit (Post 2)

23 Dec 2023 2:45 UTC
25 points
3 comments14 min readLW link

We In­spected Every Head In GPT-2 Small us­ing SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
58 points
0 comments12 min readLW link

Ex­plor­ing the Resi­d­ual Stream of Trans­form­ers for Mechanis­tic In­ter­pretabil­ity — Explained

Zeping Yu26 Dec 2023 0:36 UTC
7 points
1 comment11 min readLW link

[Question] SAE sparse fea­ture graph us­ing only resi­d­ual layers

Jaehyuk Lim23 May 2024 13:32 UTC
0 points
3 comments1 min readLW link

Case Stud­ies in Re­v­erse-Eng­ineer­ing Sparse Au­toen­coder Fea­tures by Us­ing MLP Linearization

14 Jan 2024 2:06 UTC
23 points
0 comments42 min readLW link

Bi­ases in Bi­ases, or Cri­tique of the Critique

ThePathYouWillChoose19 Aug 2024 17:11 UTC
1 point
0 comments1 min readLW link

Ano­ma­lous Con­cept De­tec­tion for De­tect­ing Hid­den Cognition

Paul Colognese4 Mar 2024 16:52 UTC
24 points
3 comments10 min readLW link

Task vec­tors & anal­ogy mak­ing in LLMs

Sergii8 Jan 2024 15:17 UTC
9 points
1 comment4 min readLW link
(grgv.xyz)

Find­ing De­cep­tion in Lan­guage Models

20 Aug 2024 9:42 UTC
18 points
4 comments4 min readLW link

Sparse Au­toen­coders Work on At­ten­tion Layer Outputs

16 Jan 2024 0:26 UTC
83 points
9 comments18 min readLW link

What’s go­ing on with Per-Com­po­nent Weight Up­dates?

4gate22 Aug 2024 21:22 UTC
1 point
0 comments6 min readLW link

How poly­se­man­tic can one neu­ron be? In­ves­ti­gat­ing fea­tures in TinyS­to­ries.

Evan Anders16 Jan 2024 19:10 UTC
14 points
0 comments8 min readLW link
(evanhanders.blog)

Ex­plor­ing the Evolu­tion and Mi­gra­tion of Differ­ent Layer Embed­ding in LLMs

Ruixuan Huang8 Mar 2024 15:01 UTC
6 points
0 comments8 min readLW link

In­ter­pretabil­ity as Com­pres­sion: Re­con­sid­er­ing SAE Ex­pla­na­tions of Neu­ral Ac­ti­va­tions with MDL-SAEs

23 Aug 2024 18:52 UTC
39 points
5 comments16 min readLW link

Craft­ing Poly­se­man­tic Trans­former Bench­marks with Known Circuits

23 Aug 2024 22:03 UTC
10 points
0 comments25 min readLW link

Find­ing Back­ward Chain­ing Cir­cuits in Trans­form­ers Trained on Tree Search

28 May 2024 5:29 UTC
50 points
1 comment9 min readLW link
(arxiv.org)

Ques­tions I’d Want to Ask an AGI+ to Test Its Un­der­stand­ing of Ethics

sweenesm26 Jan 2024 23:40 UTC
14 points
6 comments4 min readLW link

De­cep­tion and Jailbreak Se­quence: 1. Iter­a­tive Refine­ment Stages of De­cep­tion in LLMs

22 Aug 2024 7:32 UTC
23 points
1 comment21 min readLW link

Ex­plor­ing OpenAI’s La­tent Direc­tions: Tests, Ob­ser­va­tions, and Pok­ing Around

Johnny Lin31 Jan 2024 6:01 UTC
26 points
4 comments14 min readLW link

Un­der­stand­ing SAE Fea­tures with the Logit Lens

11 Mar 2024 0:16 UTC
59 points
0 comments14 min readLW link

Show­ing SAE La­tents Are Not Atomic Us­ing Meta-SAEs

24 Aug 2024 0:56 UTC
60 points
9 comments20 min readLW link

Open Source Sparse Au­toen­coders for all Resi­d­ual Stream Lay­ers of GPT2-Small

Joseph Bloom2 Feb 2024 6:54 UTC
100 points
37 comments15 min readLW link

At­ten­tion SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
77 points
4 comments8 min readLW link

Fluent dream­ing for lan­guage mod­els (AI in­ter­pretabil­ity method)

6 Feb 2024 6:02 UTC
45 points
5 comments1 min readLW link
(arxiv.org)

Un­der­stand­ing Hid­den Com­pu­ta­tions in Chain-of-Thought Reasoning

rokosbasilisk24 Aug 2024 16:35 UTC
6 points
1 comment1 min readLW link

A Chess-GPT Lin­ear Emer­gent World Representation

Adam Karvonen8 Feb 2024 4:25 UTC
105 points
14 comments7 min readLW link
(adamkarvonen.github.io)

Use­ful start­ing code for interpretability

eggsyntax13 Feb 2024 23:13 UTC
25 points
2 comments1 min readLW link

Lay­ing the Foun­da­tions for Vi­sion and Mul­ti­modal Mechanis­tic In­ter­pretabil­ity & Open Problems

13 Mar 2024 17:09 UTC
44 points
13 comments14 min readLW link

Ad­dress­ing Fea­ture Sup­pres­sion in SAEs

16 Feb 2024 18:32 UTC
85 points
3 comments10 min readLW link

Can quan­tised au­toen­coders find and in­ter­pret cir­cuits in lan­guage mod­els?

charlieoneill24 Mar 2024 20:05 UTC
28 points
4 comments24 min readLW link

Auto-match­ing hid­den lay­ers in Py­torch LLMs

chanind19 Feb 2024 12:40 UTC
2 points
0 comments3 min readLW link

Sparse au­toen­coders find com­posed fea­tures in small toy mod­els

14 Mar 2024 18:00 UTC
33 points
12 comments15 min readLW link

Do sparse au­toen­coders find “true fea­tures”?

Demian Till22 Feb 2024 18:06 UTC
72 points
33 comments11 min readLW link

Notes on In­ter­nal Ob­jec­tives in Toy Models of Agents

Paul Colognese22 Feb 2024 8:02 UTC
16 points
0 comments8 min readLW link

The role of philo­soph­i­cal think­ing in un­der­stand­ing large lan­guage mod­els: Cal­ibrat­ing and clos­ing the gap be­tween first-per­son ex­pe­rience and un­der­ly­ing mechanisms

Bill Benzon23 Feb 2024 12:19 UTC
4 points
0 comments10 min readLW link

Towards White Box Deep Learning

Maciej Satkiewicz27 Mar 2024 18:20 UTC
17 points
5 comments1 min readLW link
(arxiv.org)

An­nounc­ing Neu­ron­pe­dia: Plat­form for ac­cel­er­at­ing re­search into Sparse Autoencoders

25 Mar 2024 21:17 UTC
91 points
7 comments7 min readLW link

De­com­piling Tracr Trans­form­ers—An in­ter­pretabil­ity experiment

Hannes Thurnherr27 Mar 2024 9:49 UTC
4 points
0 comments14 min readLW link

De­cep­tion and Jailbreak Se­quence: 2. Iter­a­tive Refine­ment Stages of Jailbreaks in LLM

Winnie Yang28 Aug 2024 8:41 UTC
7 points
2 comments31 min readLW link

Ophiol­ogy (or, how the Mamba ar­chi­tec­ture works)

9 Apr 2024 19:31 UTC
67 points
8 comments10 min readLW link

DSLT 0. Distill­ing Sin­gu­lar Learn­ing Theory

Liam Carroll16 Jun 2023 9:50 UTC
76 points
6 comments5 min readLW link

Nor­mal­iz­ing Sparse Autoencoders

Fengyuan Hu8 Apr 2024 6:17 UTC
21 points
18 comments13 min readLW link

Can Large Lan­guage Models effec­tively iden­tify cy­ber­se­cu­rity risks?

emile delcourt30 Aug 2024 20:20 UTC
18 points
0 comments11 min readLW link

Scal­ing Laws and Superposition

Pavan Katta10 Apr 2024 15:36 UTC
9 points
4 comments5 min readLW link
(www.pavankatta.com)

[Question] Bar­cod­ing LLM Train­ing Data Sub­sets. Any­one try­ing this for in­ter­pretabil­ity?

right..enough?13 Apr 2024 3:09 UTC
7 points
0 comments7 min readLW link

Is This Lie De­tec­tor Really Just a Lie De­tec­tor? An In­ves­ti­ga­tion of LLM Probe Speci­fic­ity.

Josh Levy4 Jun 2024 15:45 UTC
38 points
0 comments17 min readLW link

Ex­per­i­ments with an al­ter­na­tive method to pro­mote spar­sity in sparse autoencoders

Eoin Farrell15 Apr 2024 18:21 UTC
29 points
7 comments12 min readLW link

Trans­form­ers Rep­re­sent Belief State Geom­e­try in their Resi­d­ual Stream

Adam Shai16 Apr 2024 21:16 UTC
411 points
100 comments12 min readLW link

graph­patch: a Python Library for Ac­ti­va­tion Patching

Occam's Laser5 Jun 2024 15:08 UTC
13 points
2 comments1 min readLW link

Past Tense Features

Can20 Apr 2024 14:34 UTC
12 points
0 comments4 min readLW link

Re­dun­dant At­ten­tion Heads in Large Lan­guage Models For In Con­text Learning

skunnavakkam1 Sep 2024 20:08 UTC
7 points
1 comment4 min readLW link
(skunnavakkam.github.io)

Transcoders en­able fine-grained in­ter­pretable cir­cuit anal­y­sis for lan­guage models

30 Apr 2024 17:58 UTC
69 points
14 comments17 min readLW link

Re­la­tion­ships among words, met­al­in­gual defi­ni­tion, and interpretability

Bill Benzon7 Jun 2024 19:18 UTC
2 points
0 comments5 min readLW link

Vi­su­al­iz­ing neu­ral net­work planning

9 May 2024 6:40 UTC
4 points
0 comments5 min readLW link

Align­ment Gaps

kcyras8 Jun 2024 15:23 UTC
10 points
3 comments8 min readLW link

Closed-Source Evaluations

Jono8 Jun 2024 14:18 UTC
15 points
4 comments1 min readLW link

How To Do Patch­ing Fast

Joseph Miller11 May 2024 20:13 UTC
40 points
6 comments4 min readLW link

In­ves­ti­gat­ing Sen­si­tive Direc­tions in GPT-2: An Im­proved Baseline and Com­par­a­tive Anal­y­sis of SAEs

6 Sep 2024 2:28 UTC
27 points
0 comments12 min readLW link

In­tro­duc­ing SARA: a new ac­ti­va­tion steer­ing technique

Alejandro Tlaie9 Jun 2024 15:33 UTC
17 points
7 comments6 min readLW link

Ex­plor­ing Llama-3-8B MLP Neurons

ntt1239 Jun 2024 14:19 UTC
10 points
0 comments4 min readLW link
(neuralblog.github.io)

Adam Op­ti­mizer Causes Priv­ileged Ba­sis in Trans­former LM Resi­d­ual Stream

6 Sep 2024 17:55 UTC
70 points
7 comments4 min readLW link

[Question] LLM/AI hype

Student19283746515 Jun 2024 20:12 UTC
1 point
0 comments1 min readLW link

Logit Prisms: De­com­pos­ing Trans­former Out­puts for Mechanis­tic Interpretability

ntt12317 Jun 2024 11:46 UTC
5 points
4 comments6 min readLW link
(neuralblog.github.io)

Analysing Ad­ver­sar­ial At­tacks with Lin­ear Probing

17 Jun 2024 14:16 UTC
9 points
0 comments8 min readLW link

Sparse Fea­tures Through Time

Rogan Inglis24 Jun 2024 18:06 UTC
12 points
1 comment1 min readLW link
(roganinglis.io)

Rep­re­sen­ta­tion Tuning

Christopher Ackerman27 Jun 2024 17:44 UTC
35 points
9 comments13 min readLW link

Ac­ti­va­tion Pat­tern SVD: A pro­posal for SAE Interpretability

Daniel Tan28 Jun 2024 22:12 UTC
15 points
2 comments2 min readLW link

De­com­pos­ing the QK cir­cuit with Bilin­ear Sparse Dic­tionary Learning

2 Jul 2024 13:17 UTC
81 points
7 comments12 min readLW link

Othel­loGPT learned a bag of heuristics

2 Jul 2024 9:12 UTC
108 points
10 comments9 min readLW link

[In­terim re­search re­port] Ac­ti­va­tion plateaus & sen­si­tive di­rec­tions in GPT2

5 Jul 2024 17:05 UTC
64 points
2 comments5 min readLW link

Trans­former Cir­cuit Faith­ful­ness Met­rics Are Not Robust

12 Jul 2024 3:47 UTC
104 points
5 comments7 min readLW link
(arxiv.org)

Stitch­ing SAEs of differ­ent sizes

13 Jul 2024 17:19 UTC
39 points
12 comments12 min readLW link

An In­tro­duc­tion to Rep­re­sen­ta­tion Eng­ineer­ing—an ac­ti­va­tion-based paradigm for con­trol­ling LLMs

Jan Wehner14 Jul 2024 10:37 UTC
35 points
5 comments17 min readLW link

De­cep­tive agents can col­lude to hide dan­ger­ous fea­tures in SAEs

15 Jul 2024 17:07 UTC
27 points
0 comments7 min readLW link

Mech In­terp Lacks Good Paradigms

Daniel Tan16 Jul 2024 15:47 UTC
33 points
0 comments14 min readLW link

Ar­rakis—A toolkit to con­duct, track and vi­su­al­ize mechanis­tic in­ter­pretabil­ity ex­per­i­ments.

Yash Srivastava17 Jul 2024 2:02 UTC
2 points
2 comments5 min readLW link

Su­per­po­si­tion through Ac­tive Learn­ing Lens

akankshanc17 Sep 2024 17:32 UTC
1 point
0 comments10 min readLW link

In­ter­pretabil­ity in Ac­tion: Ex­plo­ra­tory Anal­y­sis of VPT, a Minecraft Agent

18 Jul 2024 17:02 UTC
9 points
0 comments1 min readLW link
(arxiv.org)

SAEs (usu­ally) Trans­fer Between Base and Chat Models

18 Jul 2024 10:29 UTC
65 points
0 comments10 min readLW link

Truth is Univer­sal: Ro­bust De­tec­tion of Lies in LLMs

Lennart Buerger19 Jul 2024 14:07 UTC
24 points
3 comments2 min readLW link
(arxiv.org)

Fea­ture Tar­geted LLC Es­ti­ma­tion Dist­in­guishes SAE Fea­tures from Ran­dom Directions

19 Jul 2024 20:32 UTC
59 points
6 comments16 min readLW link

BatchTopK: A Sim­ple Im­prove­ment for TopK-SAEs

20 Jul 2024 2:20 UTC
52 points
0 comments4 min readLW link

Ini­tial Ex­per­i­ments Us­ing SAEs to Help De­tect AI Gen­er­ated Text

Aaron_Scher22 Jul 2024 5:16 UTC
17 points
0 comments14 min readLW link

Pac­ing Out­side the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
59 points
8 comments2 min readLW link
(arxiv.org)

Open Source Au­to­mated In­ter­pretabil­ity for Sparse Au­toen­coder Features

30 Jul 2024 21:11 UTC
67 points
1 comment13 min readLW link
(blog.eleuther.ai)

Un­der­stand­ing Po­si­tional Fea­tures in Layer 0 SAEs

29 Jul 2024 9:36 UTC
43 points
0 comments5 min readLW link

An In­ter­pretabil­ity Illu­sion from Pop­u­la­tion Statis­tics in Causal Analysis

Daniel Tan29 Jul 2024 14:50 UTC
9 points
3 comments1 min readLW link

Con­struct­ing Neu­ral Net­work Pa­ram­e­ters with Down­stream Trainability

ch271828n31 Jul 2024 18:13 UTC
1 point
0 comments1 min readLW link
(github.com)

Limi­ta­tions on the In­ter­pretabil­ity of Learned Fea­tures from Sparse Dic­tionary Learning

Tom Angsten30 Jul 2024 16:36 UTC
6 points
0 comments9 min readLW link

The Resi­d­ual Ex­pan­sion: A Frame­work for think­ing about Trans­former Circuits

Daniel Tan2 Aug 2024 11:04 UTC
16 points
13 comments3 min readLW link

Eval­u­at­ing Sparse Au­toen­coders with Board Game Models

2 Aug 2024 19:50 UTC
38 points
1 comment9 min readLW link

La­bel­ling, Vari­ables, and In-Con­text Learn­ing in Llama2

Joshua Penman3 Aug 2024 19:36 UTC
6 points
0 comments1 min readLW link
(colab.research.google.com)

Toy Models of Su­per­po­si­tion: what about BitNets?

Alejandro Tlaie8 Aug 2024 16:29 UTC
5 points
1 comment5 min readLW link

Emer­gence, The Blind Spot of GenAI In­ter­pretabil­ity?

Quentin FEUILLADE--MONTIXI10 Aug 2024 10:07 UTC
15 points
8 comments3 min readLW link

Ex­tract­ing SAE task fea­tures for in-con­text learning

12 Aug 2024 20:34 UTC
31 points
1 comment9 min readLW link

[Paper] A is for Ab­sorp­tion: Study­ing Fea­ture Split­ting and Ab­sorp­tion in Sparse Autoencoders

25 Sep 2024 9:31 UTC
69 points
15 comments3 min readLW link
(arxiv.org)

GPT-2 Some­times Fails at IOI

Ronak_Mehta14 Aug 2024 23:24 UTC
13 points
0 comments2 min readLW link
(ronakrm.github.io)

Eval­u­at­ing Syn­thetic Ac­ti­va­tions com­posed of SAE La­tents in GPT-2

25 Sep 2024 20:37 UTC
27 points
0 comments3 min readLW link
(arxiv.org)

Char­ac­ter­iz­ing sta­ble re­gions in the resi­d­ual stream of LLMs

26 Sep 2024 13:44 UTC
38 points
4 comments1 min readLW link
(arxiv.org)

The Geom­e­try of Feel­ings and Non­sense in Large Lan­guage Models

27 Sep 2024 17:49 UTC
58 points
10 comments4 min readLW link

Avoid­ing jailbreaks by dis­cour­ag­ing their rep­re­sen­ta­tion in ac­ti­va­tion space

Guido Bergman27 Sep 2024 17:49 UTC
6 points
2 comments9 min readLW link

Knowl­edge Base 1: Could it in­crease in­tel­li­gence and make it safer?

iwis30 Sep 2024 16:00 UTC
−4 points
0 comments4 min readLW link

Steer­ing LLMs’ Be­hav­ior with Con­cept Ac­ti­va­tion Vectors

Ruixuan Huang28 Sep 2024 9:53 UTC
8 points
0 comments10 min readLW link

Base LLMs re­fuse too

29 Sep 2024 16:04 UTC
60 points
20 comments10 min readLW link

Ex­plor­ing Shard-like Be­hav­ior: Em­piri­cal In­sights into Con­tex­tual De­ci­sion-Mak­ing in RL Agents

Alejandro Aristizabal29 Sep 2024 0:32 UTC
6 points
0 comments15 min readLW link

Devel­op­men­tal Stages in Multi-Prob­lem Grokking

James Sullivan29 Sep 2024 18:58 UTC
4 points
0 comments6 min readLW link

Ex­plor­ing De­com­pos­abil­ity of SAE Features

Vikram_N30 Sep 2024 18:28 UTC
1 point
0 comments3 min readLW link

LLMs are likely not conscious

research_prime_space29 Sep 2024 20:57 UTC
8 points
8 comments1 min readLW link

Toy Models of Su­per­po­si­tion: Sim­plified by Hand

Axel Sorensen29 Sep 2024 21:19 UTC
9 points
3 comments8 min readLW link

Toy Models of Fea­ture Ab­sorp­tion in SAEs

7 Oct 2024 9:56 UTC
46 points
8 comments10 min readLW link

In­ter­pretabil­ity of SAE Fea­tures Rep­re­sent­ing Check in ChessGPT

Jonathan Kutasov5 Oct 2024 20:43 UTC
27 points
2 comments8 min readLW link

(Maybe) A Bag of Heuris­tics is All There Is & A Bag of Heuris­tics is All You Need

Sodium3 Oct 2024 19:11 UTC
34 points
17 comments16 min readLW link

Do­main-spe­cific SAEs

jacob_drori7 Oct 2024 20:15 UTC
27 points
0 comments5 min readLW link

There is a globe in your LLM

jacob_drori8 Oct 2024 0:43 UTC
86 points
4 comments1 min readLW link

Hamil­to­nian Dy­nam­ics in AI: A Novel Ap­proach to Op­ti­miz­ing Rea­son­ing in Lan­guage Models

Javier Marin Valenzuela9 Oct 2024 19:14 UTC
3 points
0 comments10 min readLW link

SAE fea­tures for re­fusal and syco­phancy steer­ing vectors

12 Oct 2024 14:54 UTC
26 points
4 comments7 min readLW link

Stan­dard SAEs Might Be In­co­her­ent: A Choos­ing Prob­lem & A “Con­cise” Solution

Kola Ayonrinde30 Oct 2024 22:50 UTC
26 points
0 comments12 min readLW link

It’s im­por­tant to know when to stop: Mechanis­tic Ex­plo­ra­tion of Gemma 2 List Generation

Gerard Boxo14 Oct 2024 17:04 UTC
8 points
0 comments6 min readLW link
(gboxo.github.io)

A short pro­ject on Mamba: grokking & interpretability

Alejandro Tlaie18 Oct 2024 16:59 UTC
21 points
0 comments6 min readLW link

Monose­man­tic­ity & Quantization

Rahul Chand22 Oct 2024 22:57 UTC
1 point
0 comments9 min readLW link

En­abling New Ap­pli­ca­tions with To­day’s Mechanis­tic In­ter­pretabil­ity Toolkit

ananya_joshi25 Oct 2024 17:53 UTC
3 points
0 comments3 min readLW link

Open Source Repli­ca­tion of An­thropic’s Cross­coder pa­per for model-diffing

27 Oct 2024 18:46 UTC
38 points
4 comments5 min readLW link

Bridg­ing the VLM and mech in­terp com­mu­ni­ties for mul­ti­modal in­ter­pretabil­ity

Sonia Joseph28 Oct 2024 14:41 UTC
19 points
5 comments15 min readLW link

SAE Prob­ing: What is it good for? Ab­solutely some­thing!

1 Nov 2024 19:23 UTC
31 points
0 comments11 min readLW link

Com­po­si­tion Cir­cuits in Vi­sion Trans­form­ers (Hy­poth­e­sis)

phenomanon1 Nov 2024 22:16 UTC
1 point
0 comments3 min readLW link

Test­ing “True” Lan­guage Un­der­stand­ing in LLMs: A Sim­ple Proposal

MtryaSam2 Nov 2024 19:12 UTC
9 points
2 comments2 min readLW link

Evolu­tion­ary prompt op­ti­miza­tion for SAE fea­ture visualization

14 Nov 2024 13:06 UTC
16 points
0 comments9 min readLW link

Iden­ti­fy­ing Func­tion­ally Im­por­tant Fea­tures with End-to-End Sparse Dic­tionary Learning

17 May 2024 16:25 UTC
57 points
10 comments4 min readLW link
(arxiv.org)

SAEs are highly dataset de­pen­dent: a case study on the re­fusal direction

7 Nov 2024 5:22 UTC
62 points
4 comments14 min readLW link

An­a­lyz­ing how SAE fea­tures evolve across a for­ward pass

7 Nov 2024 22:07 UTC
44 points
0 comments1 min readLW link
(arxiv.org)

Antonym Heads Pre­dict Se­man­tic Op­po­sites in Lan­guage Models

Jake Ward15 Nov 2024 15:32 UTC
1 point
0 comments5 min readLW link

Effects of Non-Uniform Spar­sity on Su­per­po­si­tion in Toy Models

Shreyans Jain14 Nov 2024 16:59 UTC
4 points
3 comments6 min readLW link

Em­piri­cal risk min­i­miza­tion is fun­da­men­tally confused

Jesse Hoogland22 Mar 2023 16:58 UTC
32 points
5 comments1 min readLW link

Sen­tience in Machines—How Do We Test for This Ob­jec­tively?

Mayowa Osibodu26 Mar 2023 18:56 UTC
−2 points
0 comments2 min readLW link
(www.researchgate.net)

Ap­prox­i­ma­tion is ex­pen­sive, but the lunch is cheap

19 Apr 2023 14:19 UTC
70 points
3 comments16 min readLW link

Some com­mon con­fu­sion about in­duc­tion heads

Alexandre Variengien28 Mar 2023 21:51 UTC
64 points
4 comments5 min readLW link

Spread­sheet for 200 Con­crete Prob­lems In Interpretability

Jay Bailey29 Mar 2023 6:51 UTC
13 points
0 comments1 min readLW link

The Quan­ti­za­tion Model of Neu­ral Scaling

nz31 Mar 2023 16:02 UTC
17 points
0 comments1 min readLW link
(arxiv.org)

AISC 2023, Progress Re­port for March: Team In­ter­pretable Architectures

2 Apr 2023 16:19 UTC
14 points
0 comments14 min readLW link

Ex­plo­ra­tory Anal­y­sis of RLHF Trans­form­ers with TransformerLens

Curt Tigges3 Apr 2023 16:09 UTC
21 points
2 comments11 min readLW link
(blog.eleuther.ai)

If in­ter­pretabil­ity re­search goes well, it may get dangerous

So8res3 Apr 2023 21:48 UTC
200 points
11 comments2 min readLW link

Univer­sal­ity and Hid­den In­for­ma­tion in Con­cept Bot­tle­neck Models

Hoagy5 Apr 2023 14:00 UTC
23 points
0 comments11 min readLW link

No con­vinc­ing ev­i­dence for gra­di­ent de­scent in ac­ti­va­tion space

Blaine12 Apr 2023 4:48 UTC
82 points
9 comments20 min readLW link

Bing AI Gen­er­at­ing Voyn­ich Manuscript Con­tinu­a­tions—It does not know how it knows

Matthew_Opitz10 Apr 2023 20:22 UTC
15 points
6 comments13 min readLW link

Towards a solu­tion to the al­ign­ment prob­lem via ob­jec­tive de­tec­tion and eval­u­a­tion

Paul Colognese12 Apr 2023 15:39 UTC
9 points
7 comments12 min readLW link

Mechanis­ti­cally in­ter­pret­ing time in GPT-2 small

16 Apr 2023 17:57 UTC
68 points
6 comments21 min readLW link

Re­search Re­port: In­cor­rect­ness Cascades

Robert_AIZI14 Apr 2023 12:49 UTC
19 points
0 comments10 min readLW link
(aizi.substack.com)

An in­tro­duc­tion to lan­guage model interpretability

Alexandre Variengien20 Apr 2023 22:22 UTC
14 points
0 comments9 min readLW link

I was Wrong, Si­mu­la­tor The­ory is Real

Robert_AIZI26 Apr 2023 17:45 UTC
75 points
7 comments3 min readLW link
(aizi.substack.com)

z is not the cause of x

hrbigelow23 Oct 2023 17:43 UTC
6 points
2 comments9 min readLW link

Grokking Beyond Neu­ral Networks

Jack Miller30 Oct 2023 17:28 UTC
10 points
0 comments2 min readLW link
(arxiv.org)

Ro­bust­ness of Con­trast-Con­sis­tent Search to Ad­ver­sar­ial Prompting

1 Nov 2023 12:46 UTC
18 points
1 comment7 min readLW link

Es­ti­mat­ing effec­tive di­men­sion­al­ity of MNIST models

Arjun Panickssery2 Nov 2023 14:13 UTC
41 points
3 comments1 min readLW link

Growth and Form in a Toy Model of Superposition

8 Nov 2023 11:08 UTC
87 points
7 comments14 min readLW link

What’s go­ing on? LLMs and IS-A sen­tences

Bill Benzon8 Nov 2023 16:58 UTC
6 points
15 comments4 min readLW link

Poly­se­man­tic At­ten­tion Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
51 points
0 comments6 min readLW link

PhD Po­si­tion: AI In­ter­pretabil­ity in Ber­lin, Germany

Tiberius28 Apr 2023 13:44 UTC
3 points
0 comments1 min readLW link
(stephanw.net)

AISC Pro­ject: Model­ling Tra­jec­to­ries of Lan­guage Models

NickyP13 Nov 2023 14:33 UTC
27 points
0 comments12 min readLW link

Elic­it­ing La­tent Knowl­edge in Com­pre­hen­sive AI Ser­vices Models

acabodi17 Nov 2023 2:36 UTC
6 points
0 comments5 min readLW link

In­ci­den­tal polysemanticity

15 Nov 2023 4:00 UTC
43 points
7 comments11 min readLW link

AISC pro­ject: TinyEvals

Jett Janiak22 Nov 2023 20:47 UTC
22 points
0 comments4 min readLW link

A day in the life of a mechanis­tic in­ter­pretabil­ity researcher

Bill Benzon28 Nov 2023 14:45 UTC
3 points
3 comments1 min readLW link

Towards an Ethics Calcu­la­tor for Use by an AGI

sweenesm12 Dec 2023 18:37 UTC
3 points
2 comments11 min readLW link

Mechanis­tic in­ter­pretabil­ity through clustering

Alistair Fraser4 Dec 2023 18:49 UTC
1 point
0 comments1 min readLW link

Colour ver­sus Shape Goal Mis­gen­er­al­iza­tion in Re­in­force­ment Learn­ing: A Case Study

Karolis Jucys8 Dec 2023 13:18 UTC
13 points
1 comment4 min readLW link
(arxiv.org)

Lan­guage Model Me­moriza­tion, Copy­right Law, and Con­di­tional Pre­train­ing Alignment

RogerDearnaley7 Dec 2023 6:14 UTC
9 points
0 comments11 min readLW link

Gra­di­ent hacking

evhub16 Oct 2019 0:53 UTC
106 points
39 comments3 min readLW link2 reviews

Will trans­parency help catch de­cep­tion? Per­haps not

Matthew Barnett4 Nov 2019 20:52 UTC
43 points
5 comments7 min readLW link

Ro­hin Shah on rea­sons for AI optimism

abergal31 Oct 2019 12:10 UTC
40 points
58 comments1 min readLW link
(aiimpacts.org)

Un­der­stand­ing mesa-op­ti­miza­tion us­ing toy models

7 May 2023 17:00 UTC
43 points
2 comments10 min readLW link

A Search for More ChatGPT / GPT-3.5 / GPT-4 “Unspeakable” Glitch Tokens

Martin Fell9 May 2023 14:36 UTC
26 points
9 comments6 min readLW link

A tech­ni­cal note on bil­in­ear lay­ers for interpretability

Lee Sharkey8 May 2023 6:06 UTC
58 points
0 comments1 min readLW link
(arxiv.org)

A com­par­i­son of causal scrub­bing, causal ab­strac­tions, and re­lated methods

8 Jun 2023 23:40 UTC
73 points
3 comments22 min readLW link

Lan­guage mod­els can ex­plain neu­rons in lan­guage models

nz9 May 2023 17:29 UTC
23 points
0 comments1 min readLW link
(openai.com)

Solv­ing the Mechanis­tic In­ter­pretabil­ity challenges: EIS VII Challenge 1

9 May 2023 19:41 UTC
119 points
1 comment10 min readLW link

[Question] Have you heard about MIT’s “liquid neu­ral net­works”? What do you think about them?

Ppau9 May 2023 20:16 UTC
35 points
14 comments1 min readLW link

‘Fun­da­men­tal’ vs ‘ap­plied’ mechanis­tic in­ter­pretabil­ity research

Lee Sharkey23 May 2023 18:26 UTC
65 points
6 comments3 min readLW link

[Question] AI in­ter­pretabil­ity could be harm­ful?

Roman Leventov10 May 2023 20:43 UTC
13 points
2 comments1 min readLW link

Con­trast Pairs Drive the Em­piri­cal Perfor­mance of Con­trast Con­sis­tent Search (CCS)

Scott Emmons31 May 2023 17:09 UTC
97 points
0 comments6 min readLW link

My cur­rent work­flow to study the in­ter­nal mechanisms of LLM

Yulu Pi16 May 2023 15:27 UTC
4 points
0 comments1 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of a GridWorld Agent-Si­mu­la­tor (Part 1 of N)

Joseph Bloom16 May 2023 22:59 UTC
36 points
2 comments16 min readLW link

Gen­der Vec­tors in ROME’s La­tent Space

Xodarap21 May 2023 18:46 UTC
14 points
2 comments3 min readLW link

Solv­ing the Mechanis­tic In­ter­pretabil­ity challenges: EIS VII Challenge 2

25 May 2023 15:37 UTC
71 points
1 comment13 min readLW link

Align­ing an H-JEPA agent via train­ing on the out­puts of an LLM-based “ex­em­plary ac­tor”

Roman Leventov29 May 2023 11:08 UTC
12 points
10 comments30 min readLW link

The king token

p.b.28 May 2023 19:18 UTC
17 points
0 comments4 min readLW link

Short Re­mark on the (sub­jec­tive) math­e­mat­i­cal ‘nat­u­ral­ness’ of the Nanda—Lie­berum ad­di­tion mod­ulo 113 algorithm

carboniferous_umbraculum 1 Jun 2023 11:31 UTC
104 points
12 comments2 min readLW link

[Linkpost] Rosetta Neu­rons: Min­ing the Com­mon Units in a Model Zoo

Bogdan Ionut Cirstea17 Jun 2023 16:38 UTC
12 points
0 comments1 min readLW link

[Re­search Up­date] Sparse Au­toen­coder fea­tures are bimodal

Robert_AIZI22 Jun 2023 13:15 UTC
24 points
1 comment5 min readLW link
(aizi.substack.com)

Un­der­stand­ing understanding

mthq23 Aug 2019 18:10 UTC
24 points
1 comment2 min readLW link

The risk-re­ward trade­off of in­ter­pretabil­ity research

5 Jul 2023 17:05 UTC
15 points
1 comment6 min readLW link

Lo­cal­iz­ing goal mis­gen­er­al­iza­tion in a maze-solv­ing policy network

jan betley6 Jul 2023 16:21 UTC
37 points
2 comments7 min readLW link

In­ter­pret­ing Mo­du­lar Ad­di­tion in MLPs

Bart Bussmann7 Jul 2023 9:22 UTC
19 points
0 comments6 min readLW link

LLM mis­al­ign­ment can prob­a­bly be found with­out man­ual prompt engineering

ProgramCrafter8 Jul 2023 14:35 UTC
1 point
0 comments1 min readLW link

in­ter­pret­ing GPT: the logit lens

nostalgebraist31 Aug 2020 2:47 UTC
223 points
37 comments11 min readLW link

Im­pact sto­ries for model in­ter­nals: an ex­er­cise for in­ter­pretabil­ity researchers

jenny25 Sep 2023 23:15 UTC
29 points
3 comments7 min readLW link

Still no Lie De­tec­tor for LLMs

18 Jul 2023 19:56 UTC
47 points
2 comments21 min readLW link

Ac­ti­va­tion adding ex­per­i­ments with llama-7b

Nina Panickssery16 Jul 2023 4:17 UTC
51 points
1 comment3 min readLW link

GPT-2's positional embedding matrix is a helix

AdamYedidia21 Jul 2023 4:16 UTC
44 points
21 comments4 min readLW link

[Linkpost] In­ter­pret­ing Mul­ti­modal Video Trans­form­ers Us­ing Brain Recordings

Bogdan Ionut Cirstea21 Jul 2023 11:26 UTC
5 points
0 comments1 min readLW link

Train­ing Pro­cess Trans­parency through Gra­di­ent In­ter­pretabil­ity: Early ex­per­i­ments on toy lan­guage models

21 Jul 2023 14:52 UTC
56 points
1 comment1 min readLW link

Be­cause of Lay­erNorm, Direc­tions in GPT-2 MLP Lay­ers are Monosemantic

ojorgensen28 Jul 2023 19:43 UTC
13 points
3 comments13 min readLW link

Thoughts about the Mechanis­tic In­ter­pretabil­ity Challenge #2 (EIS VII #2)

RGRGRG28 Jul 2023 20:44 UTC
23 points
5 comments20 min readLW link

AI Safety 101: Introduction to Vision Interpretability

28 Jul 2023 17:32 UTC
41 points
0 comments1 min readLW link
(github.com)

A Short Memo on AI In­ter­pretabil­ity Rain­bows

scasper27 Jul 2023 23:05 UTC
18 points
0 comments2 min readLW link

Visi­ble loss land­scape bas­ins don’t cor­re­spond to dis­tinct algorithms

Mikhail Samin28 Jul 2023 16:19 UTC
68 points
13 comments4 min readLW link

[Linkpost] Mul­ti­modal Neu­rons in Pre­trained Text-Only Transformers

Bogdan Ionut Cirstea4 Aug 2023 15:29 UTC
11 points
0 comments1 min readLW link

Ground-Truth La­bel Im­bal­ance Im­pairs the Perfor­mance of Con­trast-Con­sis­tent Search (and Other Con­trast-Pair-Based Un­su­per­vised Meth­ods)

5 Aug 2023 17:55 UTC
6 points
2 comments7 min readLW link
(drive.google.com)

Mech In­terp Challenge: Au­gust—De­ci­pher­ing the First Unique Char­ac­ter Model

CallumMcDougall9 Aug 2023 19:14 UTC
36 points
1 comment3 min readLW link

The po­si­tional em­bed­ding ma­trix and pre­vi­ous-to­ken heads: how do they ac­tu­ally work?

AdamYedidia10 Aug 2023 1:58 UTC
26 points
4 comments13 min readLW link

An in­ter­ac­tive in­tro­duc­tion to grokking and mechanis­tic interpretability

7 Aug 2023 19:09 UTC
23 points
3 comments1 min readLW link
(pair.withgoogle.com)

De­com­pos­ing in­de­pen­dent gen­er­al­iza­tions in neu­ral net­works via Hes­sian analysis

14 Aug 2023 17:04 UTC
83 points
4 comments1 min readLW link

Un­der­stand­ing the In­for­ma­tion Flow in­side Large Lan­guage Models

15 Aug 2023 21:13 UTC
19 points
0 comments17 min readLW link

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

Justausername24 Aug 2023 3:53 UTC
1 point
0 comments6 min readLW link

Causal­ity and a Cost Se­man­tics for Neu­ral Networks

scottviteri21 Aug 2023 21:02 UTC
22 points
1 comment1 min readLW link

[Question] Would it be use­ful to col­lect the con­texts, where var­i­ous LLMs think the same?

Martin Vlach24 Aug 2023 22:01 UTC
6 points
1 comment1 min readLW link

Un­der­stand­ing Coun­ter­bal­anced Sub­trac­tions for Bet­ter Ac­ti­va­tion Additions

ojorgensen17 Aug 2023 13:53 UTC
21 points
0 comments14 min readLW link

Memetic Judo #3: The In­tel­li­gence of Stochas­tic Par­rots v.2

Max TK20 Aug 2023 15:18 UTC
8 points
33 comments6 min readLW link

An OV-Co­her­ent Toy Model of At­ten­tion Head Superposition

29 Aug 2023 19:44 UTC
26 points
2 comments6 min readLW link

An ad­ver­sar­ial ex­am­ple for Direct Logit At­tri­bu­tion: mem­ory man­age­ment in gelu-4l

30 Aug 2023 17:36 UTC
17 points
0 comments8 min readLW link
(arxiv.org)

Bar­ri­ers to Mechanis­tic In­ter­pretabil­ity for AGI Safety

Connor Leahy29 Aug 2023 10:56 UTC
63 points
13 comments1 min readLW link
(www.youtube.com)

Open Call for Re­search As­sis­tants in Devel­op­men­tal Interpretability

30 Aug 2023 9:02 UTC
55 points
11 comments4 min readLW link

An In­ter­pretabil­ity Illu­sion for Ac­ti­va­tion Patch­ing of Ar­bi­trary Subspaces

29 Aug 2023 1:04 UTC
77 points
4 comments1 min readLW link

In­ter­pret­ing a ma­trix-val­ued word em­bed­ding with a math­e­mat­i­cally proven char­ac­ter­i­za­tion of all optima

Joseph Van Name4 Sep 2023 16:19 UTC
3 points
4 comments12 min readLW link