AXRP Episode 31 - Singular Learning Theory with Daniel Murfet

DanielFilan

7 May 2024 3:50 UTC

65 points

4 comments

71 min read

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli Järviniemi and evhub

6 May 2024 7:07 UTC

79 points

4 comments

1 min read

(arxiv.org)

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

30 Apr 2024 18:51 UTC

152 points

36 comments

45 min read

Visualizing neural network planning

Nevan Wichers, Victor Tao, Fazl and Riccardo Volpato

9 May 2024 6:40 UTC

4 points

0 comments

5 min read

Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

184 points

75 comments

10 min read

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai

16 Apr 2024 21:16 UTC

366 points

83 comments

12 min read

Mechanistic Interpretability Workshop Happening at ICML 2024!

Neel Nanda, LawrenceC and Fazl

3 May 2024 1:18 UTC

47 points

4 comments

1 min read

Transcoders enable fine-grained interpretable circuit analysis for language models

Jacob Dunefsky, Philippe Chlenski and Neel Nanda

30 Apr 2024 17:58 UTC

56 points

12 comments

17 min read

Simple probes can catch sleeper agents

Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub

23 Apr 2024 21:10 UTC

117 points

15 comments

1 min read

(www.anthropic.com)

AISC9 has ended and there will be an AISC10

Linda Linsefors

29 Apr 2024 10:53 UTC

62 points

2 comments

2 min read

Towards a formalization of the agent structure problem

Alex_Altair

29 Apr 2024 20:28 UTC

48 points

2 comments

14 min read

Take SCIFs, it’s dangerous to go alone

latterframe, Jeffrey Ladish and schroederdewitt

1 May 2024 8:02 UTC

34 points

1 comment

3 min read

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda

25 Apr 2024 18:43 UTC

62 points

35 comments

1 min read

(arxiv.org)

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks

18 Apr 2024 16:17 UTC

101 points

7 comments

12 min read

Superposition is not “just” neuron polysemanticity

LawrenceC

26 Apr 2024 23:22 UTC

52 points

4 comments

13 min read

[Aspiration-based designs] 1. Informal introduction

B Jacobs, Jobst Heitzig, Simon Fischer and Simon Dima

28 Apr 2024 13:00 UTC

40 points

4 comments

8 min read

AXRP Episode 30 - AI Security with Jeffrey Ladish

DanielFilan

1 May 2024 2:50 UTC

25 points

0 comments

79 min read

Modern Transformers are AGI, and Human-Level

abramdemski

26 Mar 2024 17:46 UTC

205 points

89 comments

5 min read

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

26 Apr 2024 13:40 UTC

41 points

1 comment

8 min read

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lsgos, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

71 points

8 comments

8 min read