The Waluigi Effect (mega-post)

Cleo Nardo · Mar 3, 2023, 3:22 AM
628 points · 188 comments · 16 min read

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Quintin Pope · Mar 21, 2023, 12:06 AM
359 points · 233 comments · 39 min read · 1 review

Shutting Down the Lightcone Offices

Mar 14, 2023, 10:47 PM
338 points · 103 comments · 17 min read · 2 reviews

Understanding and controlling a maze-solving policy network

Mar 11, 2023, 6:59 PM
333 points · 28 comments · 23 min read

The Parable of the King and the Random Process

moridinamael · Mar 1, 2023, 10:18 PM
312 points · 26 comments · 6 min read · 3 reviews

Pausing AI Developments Isn’t Enough. We Need to Shut it All Down by Eliezer Yudkowsky

jacquesthibs · Mar 29, 2023, 11:16 PM
291 points · 297 comments · 3 min read
(time.com)

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · Mar 13, 2023, 9:20 PM
265 points · 43 comments · 22 min read · 1 review

“Carefully Bootstrapped Alignment” is organizationally hard

Raemon · Mar 17, 2023, 6:00 PM
262 points · 23 comments · 11 min read · 1 review

Deep Deceptiveness

So8res · Mar 21, 2023, 2:51 AM
251 points · 60 comments · 14 min read · 1 review

Natural Abstractions: Key claims, Theorems, and Critiques

Mar 16, 2023, 4:37 PM
241 points · 26 comments · 45 min read · 3 reviews

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth Barnes · Mar 19, 2023, 12:25 AM
233 points · 54 comments · 8 min read
(evals.alignment.org)

An AI risk argument that resonates with NYTimes readers

Julian Bradshaw · Mar 12, 2023, 11:09 PM
212 points · 14 comments · 1 min read

Actually, Othello-GPT Has A Linear Emergent World Representation

Neel Nanda · Mar 29, 2023, 10:13 PM
211 points · 26 comments · 19 min read
(neelnanda.io)

The salt in pasta water fallacy

Thomas Sepulchre · Mar 27, 2023, 2:53 PM
204 points · 46 comments · 3 min read · 2 reviews

GPT-4 Plugs In

Zvi · Mar 27, 2023, 12:10 PM
198 points · 47 comments · 6 min read
(thezvi.wordpress.com)

Acausal normalcy

Andrew_Critch · Mar 3, 2023, 11:34 PM
195 points · 36 comments · 8 min read · 1 review

Why Not Just… Build Weak AI Tools For AI Alignment Research?

johnswentworth · Mar 5, 2023, 12:12 AM
183 points · 18 comments · 6 min read

ChatGPT (and now GPT4) is very easily distracted from its rules

dmcs · Mar 15, 2023, 5:55 PM
180 points · 42 comments · 1 min read

A rough and incomplete review of some of John Wentworth’s research

So8res · Mar 28, 2023, 6:52 PM
175 points · 18 comments · 18 min read

Anthropic’s Core Views on AI Safety

Zac Hatfield-Dodds · Mar 9, 2023, 4:55 PM
172 points · 39 comments · 2 min read
(www.anthropic.com)

A stylized dialogue on John Wentworth’s claims about markets and optimization

So8res · Mar 25, 2023, 10:32 PM
169 points · 22 comments · 8 min read

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger · Mar 13, 2023, 7:29 PM
166 points · 17 comments · 11 min read

Towards understanding-based safety evaluations

evhub · Mar 15, 2023, 6:18 PM
164 points · 16 comments · 5 min read

What would a compute monitoring plan look like? [Linkpost]

Orpheus16 · Mar 26, 2023, 7:33 PM
158 points · 10 comments · 4 min read
(arxiv.org)

Inside the mind of a superhuman Go model: How does Leela Zero read ladders?

Haoxing Du · Mar 1, 2023, 1:47 AM
157 points · 8 comments · 30 min read

AI: Practical Advice for the Worried

Zvi · Mar 1, 2023, 12:30 PM
155 points · 49 comments · 16 min read · 2 reviews
(thezvi.wordpress.com)

POC || GTFO culture as partial antidote to alignment wordcelism

lc · Mar 15, 2023, 10:21 AM
155 points · 13 comments · 7 min read · 2 reviews

Why Not Just Outsource Alignment Research To An AI?

johnswentworth · Mar 9, 2023, 9:49 PM
151 points · 50 comments · 9 min read · 1 review

GPT-4

nz · Mar 14, 2023, 5:02 PM
151 points · 150 comments · 1 min read
(openai.com)

Why I’m not into the Free Energy Principle

Steven Byrnes · Mar 2, 2023, 7:27 PM
149 points · 50 comments · 9 min read · 1 review

Comments on OpenAI’s “Planning for AGI and beyond”

So8res · Mar 3, 2023, 11:01 PM
148 points · 2 comments · 14 min read

Dan Luu on “You can only communicate one top priority”

Raemon · Mar 18, 2023, 6:55 PM
148 points · 18 comments · 3 min read
(twitter.com)

Remarks 1–18 on GPT (compressed)

Cleo Nardo · Mar 20, 2023, 10:27 PM
145 points · 35 comments · 31 min read

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger · Mar 9, 2023, 4:30 PM
142 points · 7 comments · 19 min read

Speed running everyone through the bad alignment bingo. $5k bounty for a LW conversational agent

ArthurB · Mar 9, 2023, 9:26 AM
140 points · 33 comments · 2 min read

Against LLM Reductionism

Erich_Grunewald · Mar 8, 2023, 3:52 PM
140 points · 17 comments · 18 min read
(www.erichgrunewald.com)

Conceding a short timelines bet early

Matthew Barnett · Mar 16, 2023, 9:49 PM
133 points · 17 comments · 1 min read

Good News, Everyone!

jbash · Mar 25, 2023, 1:48 PM
132 points · 23 comments · 2 min read

We have to Upgrade

Jed McCaleb · Mar 23, 2023, 5:53 PM
129 points · 35 comments · 2 min read

[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy

Mar 7, 2023, 11:55 AM
128 points · 13 comments · 5 min read
(drive.google.com)

High Status Eschews Quantification of Performance

niplav · Mar 19, 2023, 22:14 UTC
128 points · 36 comments · 5 min read

FLI open letter: Pause giant AI experiments

Zach Stein-Perlman · Mar 29, 2023, 4:04 UTC
126 points · 123 comments · 2 min read
(futureoflife.org)

How bad a future do ML researchers expect?

KatjaGrace · Mar 9, 2023, 4:50 UTC
122 points · 8 comments · 2 min read
(aiimpacts.org)

Manifold: If okay AGI, why?

Eliezer Yudkowsky · Mar 25, 2023, 22:43 UTC
120 points · 37 comments · 1 min read
(manifold.markets)

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher King · Mar 15, 2023, 0:29 UTC
116 points · 22 comments · 2 min read

Parasitic Language Games: maintaining ambiguity to hide conflict while burning the commons

Hazard · Mar 12, 2023, 5:25 UTC
115 points · 17 comments · 13 min read

“Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger (formerly: capybaralet) · Mar 18, 2023, 19:01 UTC
112 points · 49 comments · 1 min read · 1 review

GPT can write Quines now (GPT-4)

Andrew_Critch · Mar 14, 2023, 19:18 UTC
112 points · 30 comments · 1 min read

Here, have a calmness video

Kaj_Sotala · Mar 16, 2023, 10:00 UTC
111 points · 15 comments · 2 min read
(www.youtube.com)

“Liquidity” vs “solvency” in bank runs (and some notes on Silicon Valley Bank)

rossry · Mar 12, 2023, 9:16 UTC
108 points · 27 comments · 12 min read