CLR’s recent work on multi-agent systems

JesseClifton · 9 Mar 2021 2:28 UTC
54 points
2 comments · 13 min read · LW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
366 points
83 comments · 12 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
89 points
7 comments · 7 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

20 Feb 2023 19:35 UTC
91 points
8 comments · 21 min read · LW link

Mysteries of mode collapse

janus · 8 Nov 2022 10:37 UTC
281 points
57 comments · 14 min read · LW link · 1 review

AXRP Episode 31 - Singular Learning Theory with Daniel Murfet

DanielFilan · 7 May 2024 3:50 UTC
64 points
4 comments · 71 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
152 points
36 comments · 45 min read · LW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

29 Aug 2023 1:04 UTC
75 points
4 comments · 1 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

6 May 2024 7:07 UTC
78 points
4 comments · 1 min read · LW link
(arxiv.org)

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
196 points
7 comments · 10 min read · LW link

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Quintin Pope · 21 Mar 2023 0:06 UTC
356 points
225 comments · 39 min read · LW link

A Longlist of Theories of Impact for Interpretability

Neel Nanda · 11 Mar 2022 14:55 UTC
127 points
36 comments · 5 min read · LW link · 2 reviews

Why Would AI “Aim” To Defeat Humanity?

HoldenKarnofsky · 29 Nov 2022 19:30 UTC
69 points
10 comments · 33 min read · LW link
(www.cold-takes.com)

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
184 points
75 comments · 10 min read · LW link

Searching for Searching for Search

Rubi J. Hudson · 14 Feb 2024 23:51 UTC
21 points
4 comments · 7 min read · LW link

Some background for reasoning about dual-use alignment research

Charlie Steiner · 18 May 2023 14:50 UTC
121 points
20 comments · 9 min read · LW link

What I mean by “alignment is in large part about making cognition aimable at all”

So8res · 30 Jan 2023 15:22 UTC
167 points
25 comments · 2 min read · LW link

Don’t Dismiss Simple Alignment Approaches

Chris_Leong · 7 Oct 2023 0:35 UTC
128 points
9 comments · 4 min read · LW link

Counting arguments provide no evidence for AI doom

27 Feb 2024 23:03 UTC
99 points
177 comments · 14 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
56 points
12 comments · 17 min read · LW link