CLR’s recent work on multi-agent systems

JesseClifton · 9 Mar 2021 2:28 UTC
54 points
2 comments · 13 min read · LW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
366 points
83 comments · 12 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
89 points
7 comments · 7 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

20 Feb 2023 19:35 UTC
91 points
8 comments · 21 min read · LW link

Mysteries of mode collapse

janus · 8 Nov 2022 10:37 UTC
281 points
57 comments · 14 min read · LW link · 1 review

AXRP Episode 31 - Singular Learning Theory with Daniel Murfet

DanielFilan · 7 May 2024 3:50 UTC
64 points
4 comments · 71 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
152 points
36 comments · 45 min read · LW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

29 Aug 2023 1:04 UTC
75 points
4 comments · 1 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

6 May 2024 7:07 UTC
78 points
4 comments · 1 min read · LW link
(arxiv.org)

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
196 points
7 comments · 10 min read · LW link

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Quintin Pope · 21 Mar 2023 0:06 UTC
356 points
225 comments · 39 min read · LW link

A Longlist of Theories of Impact for Interpretability

Neel Nanda · 11 Mar 2022 14:55 UTC
127 points
36 comments · 5 min read · LW link · 2 reviews

Why Would AI “Aim” To Defeat Humanity?

HoldenKarnofsky · 29 Nov 2022 19:30 UTC
69 points
10 comments · 33 min read · LW link
(www.cold-takes.com)

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
184 points
75 comments · 10 min read · LW link

Searching for Searching for Search

Rubi J. Hudson · 14 Feb 2024 23:51 UTC
21 points
4 comments · 7 min read · LW link

Some background for reasoning about dual-use alignment research

Charlie Steiner · 18 May 2023 14:50 UTC
121 points
20 comments · 9 min read · LW link

What I mean by “alignment is in large part about making cognition aimable at all”

So8res · 30 Jan 2023 15:22 UTC
167 points
25 comments · 2 min read · LW link

Don’t Dismiss Simple Alignment Approaches

Chris_Leong · 7 Oct 2023 0:35 UTC
128 points
9 comments · 4 min read · LW link

Counting arguments provide no evidence for AI doom

27 Feb 2024 23:03 UTC
99 points
177 comments · 14 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
56 points
12 comments · 17 min read · LW link