Mechanistic Interpretability Workshop Happening at ICML 2024!

Neel Nanda, LawrenceC and Fazl

3 May 2024 1:18 UTC

47 points

3 comments · 1 min read · LW link

Take SCIFs, it’s dangerous to go alone

latterframe, Jeffrey Ladish and schroederdewitt

1 May 2024 8:02 UTC

33 points

1 comment · 3 min read · LW link

AXRP Episode 30 - AI Security with Jeffrey Ladish

DanielFilan

1 May 2024 2:50 UTC

25 points

0 comments · 79 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

30 Apr 2024 18:51 UTC

143 points

26 comments · 45 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

Jacob Dunefsky, Philippe Chlenski and Neel Nanda

30 Apr 2024 17:58 UTC

54 points

11 comments · 17 min read · LW link

Towards a formalization of the agent structure problem

Alex_Altair

29 Apr 2024 20:28 UTC

46 points

2 comments · 14 min read · LW link

AISC9 has ended and there will be an AISC10

Linda Linsefors

29 Apr 2024 10:53 UTC

61 points

2 comments · 2 min read · LW link

[Aspiration-based designs] Outlook: dealing with complexity

Jobst Heitzig, jossoliver and thomasfinn

28 Apr 2024 13:06 UTC

11 points

3 comments · 2 min read · LW link

[Aspiration-based designs] 3. Performance and safety criteria, and aspiration intervals

Jobst Heitzig

28 Apr 2024 13:04 UTC

10 points

0 comments · 12 min read · LW link

[Aspiration-based designs] 2. Formal framework, basic algorithm

Jobst Heitzig, Simon Dima and Simon Fischer

28 Apr 2024 13:02 UTC

16 points

2 comments · 16 min read · LW link

[Aspiration-based designs] 1. Informal introduction

B Jacobs, Jobst Heitzig, Simon Fischer and Simon Dima

28 Apr 2024 13:00 UTC

40 points

4 comments · 8 min read · LW link

Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

176 points

66 comments · 10 min read · LW link

Superposition is not “just” neuron polysemanticity

LawrenceC

26 Apr 2024 23:22 UTC

50 points

4 comments · 13 min read · LW link

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

26 Apr 2024 13:40 UTC

41 points

1 comment · 8 min read · LW link

AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

DanielFilan

25 Apr 2024 19:10 UTC

19 points

1 comment · 63 min read · LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda

25 Apr 2024 18:43 UTC

61 points

35 comments · 1 min read · LW link

(arxiv.org)

Simple probes can catch sleeper agents

Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub

23 Apr 2024 21:10 UTC

117 points

15 comments · 1 min read · LW link

(www.anthropic.com)

Dequantifying first-order theories

jessicata

23 Apr 2024 19:04 UTC

39 points

9 comments · 8 min read · LW link

(unstableontology.com)

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart

23 Apr 2024 14:09 UTC

36 points

2 comments · 8 min read · LW link

Time complexity for deterministic string machines

alcatal

21 Apr 2024 22:35 UTC

14 points

0 comments · 21 min read · LW link