Interpreting the Learning of Deceit

RogerDearnaley · 18 Dec 2023 8:12 UTC
30 points
10 comments · 9 min read · LW link

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
185 points
77 comments · 10 min read · LW link

Linear infra-Bayesian Bandits

Vanessa Kosoy · 10 May 2024 6:41 UTC
27 points
0 comments · 1 min read · LW link
(arxiv.org)

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
41 points
2 comments · 8 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
156 points
37 comments · 45 min read · LW link

AI Safety Strategies Landscape

Charbel-Raphaël · 9 May 2024 17:33 UTC
21 points
0 comments · 42 min read · LW link

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

23 Dec 2023 2:44 UTC
106 points
6 comments · 22 min read · LW link

Summing up “Scheming AIs” (Section 5)

Joe Carlsmith · 9 Dec 2023 15:48 UTC
2 points
1 comment · 11 min read · LW link

Visualizing neural network planning

9 May 2024 6:40 UTC
4 points
0 comments · 5 min read · LW link

CLR’s recent work on multi-agent systems

JesseClifton · 9 Mar 2021 2:28 UTC
54 points
2 comments · 13 min read · LW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
367 points
83 comments · 12 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
89 points
7 comments · 7 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

20 Feb 2023 19:35 UTC
91 points
8 comments · 21 min read · LW link

Mysteries of mode collapse

janus · 8 Nov 2022 10:37 UTC
281 points
57 comments · 14 min read · LW link · 1 review

[Question] What convincing warning shot could help prevent extinction from AI?

13 Apr 2024 18:09 UTC
103 points
18 comments · 2 min read · LW link

AXRP Episode 31 - Singular Learning Theory with Daniel Murfet

DanielFilan · 7 May 2024 3:50 UTC
65 points
4 comments · 71 min read · LW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

29 Aug 2023 1:04 UTC
75 points
4 comments · 1 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

6 May 2024 7:07 UTC
81 points
4 comments · 1 min read · LW link
(arxiv.org)

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
196 points
7 comments · 10 min read · LW link

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Quintin Pope · 21 Mar 2023 0:06 UTC
356 points
225 comments · 39 min read · LW link