
EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper · 21 May 2024 20:15 UTC
91 points
6 comments · 3 min read · LW link

Announcing Human-aligned AI Summer School

22 May 2024 8:55 UTC
15 points
0 comments · 1 min read · LW link
(humanaligned.ai)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

20 May 2024 17:53 UTC
77 points
2 comments · 3 min read · LW link

The Problem With the Word ‘Alignment’

21 May 2024 3:48 UTC
53 points
4 comments · 6 min read · LW link

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Joar Skalse · 17 May 2024 19:13 UTC
63 points
6 comments · 2 min read · LW link

Infra-Bayesian haggling

hannagabor · 20 May 2024 12:23 UTC
17 points
0 comments · 20 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

17 May 2024 16:25 UTC
53 points
2 comments · 4 min read · LW link
(arxiv.org)

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
380 points
91 comments · 12 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
166 points
37 comments · 45 min read · LW link

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · 15 May 2024 19:38 UTC
35 points
21 comments · 12 min read · LW link

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
191 points
79 comments · 10 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

6 May 2024 7:07 UTC
82 points
4 comments · 1 min read · LW link
(arxiv.org)

AXRP Episode 31 - Singular Learning Theory with Daniel Murfet

DanielFilan · 7 May 2024 3:50 UTC
73 points
4 comments · 71 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
119 points
17 comments · 1 min read · LW link
(www.anthropic.com)

Linear infra-Bayesian Bandits

Vanessa Kosoy · 10 May 2024 6:41 UTC
38 points
5 comments · 1 min read · LW link
(arxiv.org)

AI Safety Strategies Landscape

Charbel-Raphaël · 9 May 2024 17:33 UTC
31 points
1 comment · 42 min read · LW link

Modern Transformers are AGI, and Human-Level

abramdemski · 26 Mar 2024 17:46 UTC
213 points
89 comments · 5 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
59 points
14 comments · 17 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks · 18 Apr 2024 16:17 UTC
101 points
7 comments · 12 min read · LW link

Constructability: Plainly-coded AGIs may be feasible in the near future

27 Apr 2024 16:04 UTC
66 points
12 comments · 13 min read · LW link