I would have shit in that alley, too

Declan Molony 18 Jun 2024 4:41 UTC
431 points
134 comments 4 min read LW link

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Andrew_Critch 14 Jun 2024 0:16 UTC
338 points
38 comments 4 min read LW link

My AI Model Delta Compared To Yudkowsky

johnswentworth 10 Jun 2024 16:12 UTC
276 points
102 comments 4 min read LW link

Getting 50% (SoTA) on ARC-AGI with GPT-4o

ryan_greenblatt 17 Jun 2024 18:44 UTC
262 points
49 comments 13 min read LW link

SAE feature geometry is outside the superposition hypothesis

jake_mendel 24 Jun 2024 16:07 UTC
221 points
17 comments 11 min read LW link

LLM Generality is a Timeline Crux

eggsyntax 24 Jun 2024 12:52 UTC
217 points
119 comments 7 min read LW link

Response to Aschenbrenner’s “Situational Awareness”

Rob Bensinger 6 Jun 2024 22:57 UTC
193 points
27 comments 3 min read LW link

My AI Model Delta Compared To Christiano

johnswentworth 12 Jun 2024 18:19 UTC
190 points
73 comments 4 min read LW link

Two easy things that maybe Just Work to improve AI discourse

jacobjacob 8 Jun 2024 15:51 UTC
189 points
35 comments 2 min read LW link

Humming is not a free $100 bill

Elizabeth 6 Jun 2024 20:10 UTC
183 points
6 comments 3 min read LW link
(acesounderglass.com)

Boycott OpenAI

PeterMcCluskey 18 Jun 2024 19:52 UTC
163 points
26 comments 1 min read LW link
(bayesianinvestor.com)

Announcing ILIAD — Theoretical AI Alignment Conference

5 Jun 2024 9:37 UTC
162 points
18 comments 2 min read LW link

Sycophancy to subterfuge: Investigating reward tampering in large language models

17 Jun 2024 18:41 UTC
161 points
22 comments 8 min read LW link
(arxiv.org)

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

21 Jun 2024 15:54 UTC
160 points
13 comments 8 min read LW link
(arxiv.org)

Formal verification, heuristic explanations and surprise accounting

Jacob_Hilton 25 Jun 2024 15:40 UTC
156 points
11 comments 9 min read LW link
(www.alignment.org)

The Incredible Fentanyl-Detecting Machine

sarahconstantin 28 Jun 2024 22:10 UTC
154 points
26 comments 7 min read LW link
(sarahconstantin.substack.com)

0. CAST: Corrigibility as Singular Target

Max Harms 7 Jun 2024 22:29 UTC
137 points
12 comments 8 min read LW link

Loving a world you don’t trust

Joe Carlsmith 18 Jun 2024 19:31 UTC
134 points
13 comments 33 min read LW link

How it All Went Down: The Puzzle Hunt that took us way, way Less Online

A* 2 Jun 2024 8:01 UTC
134 points
5 comments 5 min read LW link

Why I don’t believe in the placebo effect

transhumanist_atom_understander 10 Jun 2024 2:37 UTC
131 points
22 comments 9 min read LW link

[Question] What do coherence arguments actually prove about agentic behavior?

sunwillrise 1 Jun 2024 9:37 UTC
123 points
35 comments 6 min read LW link

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Erik Jenner 4 Jun 2024 15:50 UTC
120 points
14 comments 13 min read LW link

The Standard Analogy

Zack_M_Davis 3 Jun 2024 17:15 UTC
118 points
28 comments 12 min read LW link

AI catastrophes and rogue deployments

Buck 3 Jun 2024 17:04 UTC
118 points
16 comments 8 min read LW link

Anthropic’s Certificate of Incorporation

Zach Stein-Perlman 12 Jun 2024 13:00 UTC
115 points
4 comments 4 min read LW link

The Leopold Model: Analysis and Reactions

Zvi 14 Jun 2024 15:10 UTC
108 points
19 comments 57 min read LW link
(thezvi.wordpress.com)

Demystifying “Alignment” through a Comic

milanrosko 9 Jun 2024 8:24 UTC
106 points
19 comments 1 min read LW link

In favour of exploring nagging doubts about x-risk

owencb 25 Jun 2024 23:52 UTC
105 points
2 comments 1 min read LW link

Scaling and evaluating sparse autoencoders

leogao 6 Jun 2024 22:50 UTC
105 points
6 comments 1 min read LW link

On Dwarkesh’s Podcast with Leopold Aschenbrenner

Zvi 10 Jun 2024 12:40 UTC
101 points
7 comments 59 min read LW link
(thezvi.wordpress.com)

The Minority Coalition

Richard_Ngo 24 Jun 2024 20:01 UTC
99 points
7 comments 5 min read LW link
(www.narrativeark.xyz)

CIV: a story

Richard_Ngo 15 Jun 2024 22:36 UTC
98 points
6 comments 9 min read LW link
(www.narrativeark.xyz)

OpenAI #8: The Right to Warn

Zvi 17 Jun 2024 12:00 UTC
97 points
8 comments 34 min read LW link
(thezvi.wordpress.com)

Comments on Anthropic’s Scaling Monosemanticity

Robert_AIZI 3 Jun 2024 12:15 UTC
97 points
8 comments 7 min read LW link

Access to powerful AI might make computer security radically easier

Buck 8 Jun 2024 6:00 UTC
96 points
14 comments 6 min read LW link

Compact Proofs of Model Performance via Mechanistic Interpretability

24 Jun 2024 19:27 UTC
95 points
3 comments 8 min read LW link
(arxiv.org)

On Claude 3.5 Sonnet

Zvi 24 Jun 2024 12:00 UTC
95 points
14 comments 13 min read LW link
(thezvi.wordpress.com)

Ilya Sutskever created a new AGI startup

harfe 19 Jun 2024 17:17 UTC
95 points
35 comments 1 min read LW link
(ssi.inc)

Live Theory Part 0: Taking Intelligence Seriously

Sahil 26 Jun 2024 21:37 UTC
94 points
3 comments 8 min read LW link

Towards a Less Bullshit Model of Semantics

17 Jun 2024 15:51 UTC
94 points
44 comments 21 min read LW link

Takeoff speeds presentation at Anthropic

Tom Davidson 4 Jun 2024 22:46 UTC
92 points
0 comments 25 min read LW link

Just admit that you’ve zoned out

joec 4 Jun 2024 2:51 UTC
91 points
22 comments 2 min read LW link

Quotes from Leopold Aschenbrenner’s Situational Awareness Paper

Zvi 7 Jun 2024 11:40 UTC
91 points
10 comments 37 min read LW link
(thezvi.wordpress.com)

I’m a bit skeptical of AlphaFold 3

Oleg Trott 25 Jun 2024 0:04 UTC
87 points
14 comments 2 min read LW link

Detecting Genetically Engineered Viruses With Metagenomic Sequencing

jefftk 27 Jun 2024 14:01 UTC
87 points
10 comments 1 min read LW link
(naobservatory.org)

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

13 Jun 2024 10:04 UTC
84 points
10 comments 2 min read LW link
(arxiv.org)

[Paper] Stress-testing capability elicitation with password-locked models

4 Jun 2024 14:52 UTC
84 points
10 comments 12 min read LW link
(arxiv.org)

Actually, Power Plants May Be an AI Training Bottleneck.

Lao Mein 20 Jun 2024 4:41 UTC
83 points
13 comments 2 min read LW link

Corrigibility = Tool-ness?

28 Jun 2024 1:19 UTC
78 points
8 comments 9 min read LW link

Secondary forces of debt

KatjaGrace 27 Jun 2024 21:10 UTC
77 points
18 comments 2 min read LW link
(worldspiritsockpuppet.com)