I would have shit in that alley, too

Declan Molony · Jun 18, 2024, 4:41 AM
458 points
134 comments · 4 min read

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Andrew_Critch · Jun 14, 2024, 12:16 AM
357 points
38 comments · 4 min read

My AI Model Delta Compared To Yudkowsky

johnswentworth · Jun 10, 2024, 4:12 PM
280 points
103 comments · 4 min read

Getting 50% (SoTA) on ARC-AGI with GPT-4o

ryan_greenblatt · Jun 17, 2024, 6:44 PM
263 points
50 comments · 13 min read

SAE feature geometry is outside the superposition hypothesis

jake_mendel · Jun 24, 2024, 4:07 PM
228 points
17 comments · 11 min read

LLM Generality is a Timeline Crux

eggsyntax · Jun 24, 2024, 12:52 PM
218 points
119 comments · 7 min read

Response to Aschenbrenner’s “Situational Awareness”

Rob Bensinger · Jun 6, 2024, 10:57 PM
194 points
27 comments · 3 min read

My AI Model Delta Compared To Christiano

johnswentworth · Jun 12, 2024, 6:19 PM
191 points
73 comments · 4 min read

Two easy things that maybe Just Work to improve AI discourse

Bird Concept · Jun 8, 2024, 3:51 PM
190 points
35 comments · 2 min read

Humming is not a free $100 bill

Elizabeth · Jun 6, 2024, 8:10 PM
185 points
6 comments · 3 min read
(acesounderglass.com)

Boycott OpenAI

PeterMcCluskey · Jun 18, 2024, 7:52 PM
164 points
26 comments · 1 min read
(bayesianinvestor.com)

Announcing ILIAD — Theoretical AI Alignment Conference

Jun 5, 2024, 9:37 AM
163 points
18 comments · 2 min read

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Jun 21, 2024, 3:54 PM
163 points
13 comments · 8 min read
(arxiv.org)

Sycophancy to subterfuge: Investigating reward tampering in large language models

Jun 17, 2024, 6:41 PM
161 points
22 comments · 8 min read
(arxiv.org)

Formal verification, heuristic explanations and surprise accounting

Jacob_Hilton · Jun 25, 2024, 3:40 PM
156 points
11 comments · 9 min read
(www.alignment.org)

The Incredible Fentanyl-Detecting Machine

sarahconstantin · Jun 28, 2024, 10:10 PM
156 points
26 comments · 7 min read
(sarahconstantin.substack.com)

0. CAST: Corrigibility as Singular Target

Max Harms · Jun 7, 2024, 10:29 PM
147 points
14 comments · 8 min read

Loving a world you don’t trust

Joe Carlsmith · Jun 18, 2024, 7:31 PM
135 points
13 comments · 33 min read

How it All Went Down: The Puzzle Hunt that took us way, way Less Online

A* · Jun 2, 2024, 8:01 AM
135 points
5 comments · 5 min read

Why I don’t believe in the placebo effect

transhumanist_atom_understander · Jun 10, 2024, 2:37 AM
134 points
22 comments · 9 min read

The Standard Analogy

Zack_M_Davis · Jun 3, 2024, 5:15 PM
125 points
28 comments · 12 min read

[Question] What do coherence arguments actually prove about agentic behavior?

sunwillrise · Jun 1, 2024, 9:37 AM
123 points
39 comments · 6 min read

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Erik Jenner · Jun 4, 2024, 3:50 PM
121 points
14 comments · 13 min read

AI catastrophes and rogue deployments

Buck · Jun 3, 2024, 5:04 PM
120 points
16 comments · 8 min read

Anthropic’s Certificate of Incorporation

Zach Stein-Perlman · Jun 12, 2024, 1:00 PM
115 points
7 comments · 4 min read

The Leopold Model: Analysis and Reactions

Zvi · Jun 14, 2024, 3:10 PM
109 points
19 comments · 57 min read
(thezvi.wordpress.com)

Demystifying “Alignment” through a Comic

milanrosko · Jun 9, 2024, 8:24 AM
106 points
19 comments · 1 min read

Scaling and evaluating sparse autoencoders

leogao · Jun 6, 2024, 10:50 PM
106 points
6 comments · 1 min read

In favour of exploring nagging doubts about x-risk

owencb · Jun 25, 2024, 11:52 PM
105 points
2 comments

The Minority Coalition

Richard_Ngo · Jun 24, 2024, 8:01 PM
103 points
9 comments · 5 min read
(www.narrativeark.xyz)

Live Theory Part 0: Taking Intelligence Seriously

Sahil · Jun 26, 2024, 9:37 PM
103 points
3 comments · 8 min read

On Dwarkesh’s Podcast with Leopold Aschenbrenner

Zvi · Jun 10, 2024, 12:40 PM
102 points
7 comments · 59 min read
(thezvi.wordpress.com)

Access to powerful AI might make computer security radically easier

Buck · Jun 8, 2024, 6:00 AM
101 points
14 comments · 6 min read

CIV: a story

Richard_Ngo · Jun 15, 2024, 10:36 PM
98 points
6 comments · 9 min read
(www.narrativeark.xyz)

Comments on Anthropic’s Scaling Monosemanticity

Robert_AIZI · Jun 3, 2024, 12:15 PM
98 points
8 comments · 7 min read

OpenAI #8: The Right to Warn

Zvi · Jun 17, 2024, 12:00 PM
97 points
8 comments · 34 min read
(thezvi.wordpress.com)

Quotes from Leopold Aschenbrenner’s Situational Awareness Paper

Zvi · Jun 7, 2024, 11:40 AM
97 points
10 comments · 37 min read
(thezvi.wordpress.com)

Compact Proofs of Model Performance via Mechanistic Interpretability

Jun 24, 2024, 7:27 PM
96 points
4 comments · 8 min read
(arxiv.org)

On Claude 3.5 Sonnet

Zvi · Jun 24, 2024, 12:00 PM
95 points
14 comments · 13 min read
(thezvi.wordpress.com)

Ilya Sutskever created a new AGI startup

harfe · Jun 19, 2024, 5:17 PM
95 points
35 comments · 1 min read
(ssi.inc)

Towards a Less Bullshit Model of Semantics

Jun 17, 2024, 3:51 PM
94 points
44 comments · 21 min read

Takeoff speeds presentation at Anthropic

Tom Davidson · Jun 4, 2024, 10:46 PM
92 points
0 comments · 25 min read

Just admit that you’ve zoned out

joec · Jun 4, 2024, 2:51 AM
91 points
22 comments · 2 min read

I’m a bit skeptical of AlphaFold 3

Oleg Trott · Jun 25, 2024, 12:04 AM
87 points
14 comments · 2 min read

Detecting Genetically Engineered Viruses With Metagenomic Sequencing

jefftk · Jun 27, 2024, 2:01 PM
87 points
10 comments
(naobservatory.org)

[Paper] Stress-testing capability elicitation with password-locked models

Jun 4, 2024, 2:52 PM
85 points
10 comments · 12 min read
(arxiv.org)

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 13, 2024, 10:04 AM
84 points
10 comments · 2 min read
(arxiv.org)

Actually, Power Plants May Be an AI Training Bottleneck.

Lao Mein · Jun 20, 2024, 4:41 AM
83 points
13 comments · 2 min read

AI takeoff and nuclear war

owencb · Jun 11, 2024, 7:36 PM
80 points
6 comments · 11 min read
(strangecities.substack.com)

Secondary forces of debt

KatjaGrace · Jun 27, 2024, 9:10 PM
78 points
18 comments · 2 min read
(worldspiritsockpuppet.com)