Interpreting the Learning of Deceit

RogerDearnaley

18 Dec 2023 8:12 UTC

30 points

10 comments · 9 min read · LW link

Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

185 points

77 comments · 10 min read · LW link

Linear infra-Bayesian Bandits

Vanessa Kosoy

10 May 2024 6:41 UTC

27 points

0 comments · 1 min read · LW link

(arxiv.org)

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

26 Apr 2024 13:40 UTC

41 points

2 comments · 8 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

30 Apr 2024 18:51 UTC

156 points

37 comments · 45 min read · LW link

AI Safety Strategies Landscape

Charbel-Raphaël

9 May 2024 17:33 UTC

21 points

0 comments · 42 min read · LW link

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Neel Nanda, Senthooran Rajamanoharan, János Kramár and Rohin Shah

23 Dec 2023 2:44 UTC

106 points

6 comments · 22 min read · LW link

Summing up “Scheming AIs” (Section 5)

Joe Carlsmith

9 Dec 2023 15:48 UTC

2 points

1 comment · 11 min read · LW link

Visualizing neural network planning

Nevan Wichers, Victor Tao, Fazl and Riccardo Volpato

9 May 2024 6:40 UTC

4 points

0 comments · 5 min read · LW link

CLR’s recent work on multi-agent systems

JesseClifton

9 Mar 2021 2:28 UTC

54 points

2 comments · 13 min read · LW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai

16 Apr 2024 21:16 UTC

367 points

83 comments · 12 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

Johnny Lin and Joseph Bloom

25 Mar 2024 21:17 UTC

89 points

7 comments · 7 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

StefanHex and Jett

20 Feb 2023 19:35 UTC

91 points

8 comments · 21 min read · LW link

Mysteries of mode collapse

janus

8 Nov 2022 10:37 UTC

281 points

57 comments · 14 min read · LW link · 1 review

[Question] What convincing warning shot could help prevent extinction from AI?

Charbel-Raphaël and cozyfractal

13 Apr 2024 18:09 UTC

103 points

18 comments · 2 min read · LW link

AXRP Episode 31 - Singular Learning Theory with Daniel Murfet

DanielFilan

7 May 2024 3:50 UTC

65 points

4 comments · 71 min read · LW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Georg Lange, Alex Makelov and Neel Nanda

29 Aug 2023 1:04 UTC

75 points

4 comments · 1 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli Järviniemi and evhub

6 May 2024 7:07 UTC

81 points

4 comments · 1 min read · LW link

(arxiv.org)

AI Control: Improving Safety Despite Intentional Subversion

Buck, Fabien Roger, ryan_greenblatt and Kshitij Sachan

13 Dec 2023 15:51 UTC

196 points

7 comments · 10 min read · LW link

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Quintin Pope

21 Mar 2023 0:06 UTC

356 points

225 comments · 39 min read · LW link