How much do you believe your results?

Eric Neyman · 6 May 2023 20:31 UTC
476 points
17 comments · 15 min read · LW link · 3 reviews
(ericneyman.wordpress.com)

Steering GPT-2-XL by adding an activation vector

13 May 2023 18:42 UTC
437 points
97 comments · 50 min read · LW link

Statement on AI Extinction—Signed by AGI Labs, Top Academics, and Many Other Notable Figures

Dan H · 30 May 2023 9:05 UTC
372 points
77 comments · 1 min read · LW link
(www.safe.ai)

How to have Polygenically Screened Children

GeneSmith · 7 May 2023 16:01 UTC
354 points
127 comments · 27 min read · LW link

Book Review: How Minds Change

bc4026bd4aaa5b7fe · 25 May 2023 17:55 UTC
310 points
52 comments · 15 min read · LW link

Predictable updating about AI risk

Joe Carlsmith · 8 May 2023 21:53 UTC
289 points
25 comments · 36 min read · LW link · 1 review

My May 2023 priorities for AI x-safety: more empathy, more unification of concerns, and less vilification of OpenAI

Andrew_Critch · 24 May 2023 0:02 UTC
268 points
39 comments · 8 min read · LW link

Mental Health and the Alignment Problem: A Compilation of Resources (updated April 2023)

10 May 2023 19:04 UTC
255 points
54 comments · 21 min read · LW link

Announcing Apollo Research

30 May 2023 16:17 UTC
217 points
11 comments · 8 min read · LW link

Twiblings, four-parent babies and other reproductive technology

GeneSmith · 20 May 2023 17:11 UTC
189 points
33 comments · 6 min read · LW link

When is Goodhart catastrophic?

9 May 2023 3:59 UTC
179 points
28 comments · 8 min read · LW link

Decision Theory with the Magic Parts Highlighted

moridinamael · 16 May 2023 17:39 UTC
175 points
24 comments · 5 min read · LW link

Prizes for matrix completion problems

paulfchristiano · 3 May 2023 23:30 UTC
164 points
52 comments · 1 min read · LW link
(www.alignment.org)

Conjecture internal survey: AGI timelines and probability of human extinction from advanced AI

Maris Sala · 22 May 2023 14:31 UTC
155 points
5 comments · 3 min read · LW link
(www.conjecture.dev)

Request: stop advancing AI capabilities

So8res · 26 May 2023 17:42 UTC
153 points
24 comments · 1 min read · LW link

Advice for newly busy people

Severin T. Seehrich · 11 May 2023 16:46 UTC
149 points
3 comments · 5 min read · LW link

Sentience matters

So8res · 29 May 2023 21:25 UTC
143 points
96 comments · 2 min read · LW link

A brief collection of Hinton’s recent comments on AGI risk

Kaj_Sotala · 4 May 2023 23:31 UTC
143 points
9 comments · 11 min read · LW link

Clarifying and predicting AGI

Richard_Ngo · 4 May 2023 15:55 UTC
141 points
44 comments · 4 min read · LW link

Dark Forest Theories

Raemon · 12 May 2023 20:21 UTC
139 points
51 comments · 2 min read · LW link · 1 review

LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem

Steven Byrnes · 8 May 2023 19:35 UTC
137 points
37 comments · 15 min read · LW link

AGI safety career advice

Richard_Ngo · 2 May 2023 7:36 UTC
132 points
24 comments · 13 min read · LW link

Trust develops gradually via making bids and setting boundaries

Richard_Ngo · 19 May 2023 22:16 UTC
131 points
12 comments · 4 min read · LW link

Some background for reasoning about dual-use alignment research

Charlie Steiner · 18 May 2023 14:50 UTC
126 points
21 comments · 9 min read · LW link

Who regulates the regulators? We need to go beyond the review-and-approval paradigm

jasoncrawford · 4 May 2023 22:11 UTC
122 points
29 comments · 13 min read · LW link
(rootsofprogress.org)

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

9 May 2023 19:41 UTC
119 points
1 comment · 10 min read · LW link

From fear to excitement

Richard_Ngo · 15 May 2023 6:23 UTC
116 points
9 comments · 3 min read · LW link

Investigating Fabrication

LoganStrohl · 18 May 2023 17:46 UTC
112 points
14 comments · 16 min read · LW link

Retrospective: Lessons from the Failed Alignment Startup AISafety.com

Søren Elverlin · 12 May 2023 18:07 UTC
104 points
9 comments · 3 min read · LW link

Open Thread With Experimental Feature: Reactions

jimrandomh · 24 May 2023 16:46 UTC
101 points
189 comments · 3 min read · LW link

A Case for the Least Forgiving Take On Alignment

Thane Ruthenis · 2 May 2023 21:34 UTC
100 points
84 comments · 22 min read · LW link

Geoff Hinton Quits Google

Adam Shai · 1 May 2023 21:03 UTC
98 points
14 comments · 1 min read · LW link

Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes

1 May 2023 16:47 UTC
96 points
10 comments · 30 min read · LW link

Most people should probably feel safe most of the time

Kaj_Sotala · 9 May 2023 9:35 UTC
95 points
28 comments · 10 min read · LW link

Bayesian Networks Aren’t Necessarily Causal

Zack_M_Davis · 14 May 2023 1:42 UTC
95 points
37 comments · 8 min read · LW link

AI Safety in China: Part 2

Lao Mein · 22 May 2023 14:50 UTC
95 points
28 comments · 2 min read · LW link

DeepMind: Model evaluation for extreme risks

Zach Stein-Perlman · 25 May 2023 3:00 UTC
94 points
12 comments · 1 min read · LW link · 1 review
(arxiv.org)

What if they gave an Industrial Revolution and nobody came?

jasoncrawford · 17 May 2023 19:41 UTC
93 points
10 comments · 19 min read · LW link
(rootsofprogress.org)

Yoshua Bengio: How Rogue AIs may Arise

harfe · 23 May 2023 18:28 UTC
92 points
12 comments · 18 min read · LW link
(yoshuabengio.org)

Input Swap Graphs: Discovering the role of neural network components at scale

Alexandre Variengien · 12 May 2023 9:41 UTC
92 points
0 comments · 33 min read · LW link

Judgments often smuggle in implicit standards

Richard_Ngo · 15 May 2023 18:50 UTC
91 points
4 comments · 3 min read · LW link

An artificially structured argument for expecting AGI ruin

Rob Bensinger · 7 May 2023 21:52 UTC
91 points
26 comments · 19 min read · LW link

Coercion is an adaptation to scarcity; trust is an adaptation to abundance

Richard_Ngo · 23 May 2023 18:14 UTC
90 points
11 comments · 4 min read · LW link

An Analogy for Understanding Transformers

CallumMcDougall · 13 May 2023 12:20 UTC
89 points
6 comments · 9 min read · LW link

LessWrong Community Weekend 2023 [Applications now closed]

Henry Prowbell · 1 May 2023 9:08 UTC
89 points
0 comments · 6 min read · LW link

The bullseye framework: My case against AI doom

titotal · 30 May 2023 11:52 UTC
89 points
35 comments · 1 min read · LW link

Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies

26 May 2023 17:44 UTC
88 points
13 comments · 24 min read · LW link

Reacts now enabled on 100% of posts, though still just experimenting

Ruby · 28 May 2023 5:36 UTC
88 points
73 comments · 2 min read · LW link

New User’s Guide to LessWrong

Ruby · 17 May 2023 0:55 UTC
88 points
52 comments · 11 min read · LW link

Lessons learned from offering in-office nutritional testing

Elizabeth · 15 May 2023 23:20 UTC
86 points
11 comments · 14 min read · LW link
(acesounderglass.com)