Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · 15 May 2024 19:38 UTC
33 points
14 comments · 12 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
59 points
14 comments · 17 min read · LW link

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Joar Skalse · 17 May 2024 19:13 UTC
37 points
1 comment · 2 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

17 May 2024 16:25 UTC
38 points
1 comment · 4 min read · LW link
(publications.apolloresearch.ai)

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
82 points
9 comments · 18 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

19 Apr 2023 16:09 UTC
154 points
31 comments · 21 min read · LW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
378 points
90 comments · 12 min read · LW link

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdp · 20 Oct 2023 7:32 UTC
119 points
15 comments · 22 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
119 points
17 comments · 1 min read · LW link
(www.anthropic.com)

Linear infra-Bayesian Bandits

Vanessa Kosoy · 10 May 2024 6:41 UTC
38 points
5 comments · 1 min read · LW link
(arxiv.org)

Towards a formalization of the agent structure problem

Alex_Altair · 29 Apr 2024 20:28 UTC
52 points
4 comments · 14 min read · LW link

The “no sandbagging on checkable tasks” hypothesis

Joe Carlsmith · 31 Jul 2023 23:06 UTC
51 points
13 comments · 9 min read · LW link

AI Safety Strategies Landscape

Charbel-Raphaël · 9 May 2024 17:33 UTC
29 points
1 comment · 42 min read · LW link

There are no coherence theorems

20 Feb 2023 21:25 UTC
121 points
115 comments · 19 min read · LW link

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
41 points
5 comments · 8 min read · LW link

How to train your own “Sleeper Agents”

evhub · 7 Feb 2024 0:31 UTC
91 points
10 comments · 2 min read · LW link

Towards Developmental Interpretability

12 Jul 2023 19:33 UTC
173 points
9 comments · 9 min read · LW link

AISC9 has ended and there will be an AISC10

Linda Linsefors · 29 Apr 2024 10:53 UTC
62 points
4 comments · 2 min read · LW link

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

stuhlmueller · 21 Jul 2020 20:06 UTC
82 points
41 comments · 3 min read · LW link

Fixing The Good Regulator Theorem

johnswentworth · 9 Feb 2021 20:30 UTC
136 points
38 comments · 8 min read · LW link · 1 review