Reliable Sources: The Story of David Gerard

TracingWoodgrains · 10 Jul 2024 19:50 UTC
381 points
53 comments · 43 min read · LW link

Universal Basic Income and Poverty

Eliezer Yudkowsky · 26 Jul 2024 7:23 UTC
281 points
131 comments · 9 min read · LW link

80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)

Raemon · 3 Jul 2024 20:34 UTC
272 points
71 comments · 1 min read · LW link

Superbabies: Putting The Pieces Together

sarahconstantin · 11 Jul 2024 20:40 UTC
215 points
37 comments · 10 min read · LW link
(sarahconstantin.substack.com)

Towards more cooperative AI safety strategies

Richard_Ngo · 16 Jul 2024 4:36 UTC
208 points
132 comments · 4 min read · LW link

Optimistic Assumptions, Longterm Planning, and “Cope”

Raemon · 17 Jul 2024 22:14 UTC
193 points
46 comments · 7 min read · LW link

Self-Other Overlap: A Neglected Approach to AI Alignment

30 Jul 2024 16:22 UTC
192 points
43 comments · 12 min read · LW link

Safety consultations for AI lab employees

Zach Stein-Perlman · 27 Jul 2024 15:00 UTC
181 points
4 comments · 1 min read · LW link

This is already your second chance

Malmesbury · 28 Jul 2024 17:13 UTC
174 points
13 comments · 8 min read · LW link

Decomposing Agency — capabilities without desires

11 Jul 2024 9:38 UTC
140 points
32 comments · 12 min read · LW link
(strangecities.substack.com)

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda · 7 Jul 2024 17:39 UTC
134 points
15 comments · 25 min read · LW link

“AI achieves silver-medal standard solving International Mathematical Olympiad problems”

gjm · 25 Jul 2024 15:58 UTC
133 points
38 comments · 2 min read · LW link
(deepmind.google)

On saying “Thank you” instead of “I’m Sorry”

Michael Cohn · 8 Jul 2024 3:13 UTC
130 points
16 comments · 3 min read · LW link

Pantheon Interface

8 Jul 2024 19:03 UTC
124 points
22 comments · 6 min read · LW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide · 22 Jul 2024 18:45 UTC
118 points
19 comments · 12 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
117 points
18 comments · 18 min read · LW link

You should go to ML conferences

Jan_Kulveit · 24 Jul 2024 11:47 UTC
110 points
13 comments · 4 min read · LW link

Introduction to French AI Policy

Lucie Philippon · 4 Jul 2024 3:39 UTC
110 points
12 comments · 6 min read · LW link

OthelloGPT learned a bag of heuristics

2 Jul 2024 9:12 UTC
108 points
10 comments · 9 min read · LW link

Most smart and skilled people are outside of the EA/rationalist community: an analysis

titotal · 12 Jul 2024 12:13 UTC
107 points
36 comments · 1 min read · LW link
(open.substack.com)

Transformer Circuit Faithfulness Metrics Are Not Robust

12 Jul 2024 3:47 UTC
104 points
5 comments · 7 min read · LW link
(arxiv.org)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

8 Jul 2024 22:24 UTC
103 points
28 comments · 5 min read · LW link

Poker is a bad game for teaching epistemics. Figgie is a better one.

rossry · 8 Jul 2024 6:05 UTC
102 points
47 comments · 11 min read · LW link
(blog.rossry.net)

A simple model of math skill

Alex_Altair · 21 Jul 2024 18:57 UTC
100 points
16 comments · 8 min read · LW link

Dialogue introduction to Singular Learning Theory

Olli Järviniemi · 8 Jul 2024 16:58 UTC
97 points
14 comments · 8 min read · LW link

I found >800 orthogonal “write code” steering vectors

15 Jul 2024 19:06 UTC
95 points
19 comments · 7 min read · LW link
(jacobgw.com)

Against Aschenbrenner: How ‘Situational Awareness’ constructs a narrative that undermines safety and threatens humanity

GideonF · 15 Jul 2024 18:37 UTC
93 points
17 comments · 21 min read · LW link
(forum.effectivealtruism.org)

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication

26 Jul 2024 0:33 UTC
93 points
1 comment · 13 min read · LW link

New page: Integrity

Zach Stein-Perlman · 10 Jul 2024 15:00 UTC
91 points
3 comments · 1 min read · LW link

AI #73: Openly Evil AI

Zvi · 18 Jul 2024 14:40 UTC
89 points
20 comments · 52 min read · LW link
(thezvi.wordpress.com)

Covert Malicious Finetuning

2 Jul 2024 2:41 UTC
88 points
4 comments · 3 min read · LW link

Re: Anthropic’s suggested SB-1047 amendments

RobertM · 27 Jul 2024 22:32 UTC
87 points
13 comments · 9 min read · LW link
(www.documentcloud.org)

Scalable oversight as a quantitative rather than qualitative problem

Buck · 6 Jul 2024 17:42 UTC
85 points
11 comments · 3 min read · LW link

Reflections on Less Online

Error · 7 Jul 2024 3:49 UTC
85 points
15 comments · 18 min read · LW link

Fluent, Cruxy Predictions

Raemon · 10 Jul 2024 18:00 UTC
85 points
14 comments · 14 min read · LW link

A simple case for extreme inner misalignment

Richard_Ngo · 13 Jul 2024 15:40 UTC
85 points
41 comments · 7 min read · LW link

What are you getting paid in?

Austin Chen · 17 Jul 2024 19:23 UTC
84 points
14 comments · 4 min read · LW link
(www.approachwithalacrity.com)

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

2 Jul 2024 13:17 UTC
81 points
7 comments · 12 min read · LW link

3C’s: A Recipe For Mathing Concepts

3 Jul 2024 1:06 UTC
80 points
5 comments · 7 min read · LW link

On the CrowdStrike Incident

Zvi · 22 Jul 2024 12:40 UTC
75 points
14 comments · 17 min read · LW link
(thezvi.wordpress.com)

Interpreting Preference Models w/ Sparse Autoencoders

1 Jul 2024 21:35 UTC
74 points
12 comments · 9 min read · LW link

LK-99 in retrospect

bhauth · 7 Jul 2024 2:06 UTC
72 points
21 comments · 3 min read · LW link
(www.bhauth.com)

D&D.Sci Scenario Index

23 Jul 2024 2:00 UTC
72 points
0 comments · 2 min read · LW link

Yoshua Bengio: Reasoning through arguments against taking AI safety seriously

Judd Rosenblatt · 11 Jul 2024 23:53 UTC
70 points
3 comments · 1 min read · LW link
(yoshuabengio.org)

Multiplex Gene Editing: Where Are We Now?

sarahconstantin · 16 Jul 2024 20:50 UTC
69 points
6 comments · 7 min read · LW link
(sarahconstantin.substack.com)

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

22 Jul 2024 16:17 UTC
69 points
0 comments · 16 min read · LW link

Brief notes on the Wikipedia game

Olli Järviniemi · 14 Jul 2024 2:28 UTC
68 points
9 comments · 4 min read · LW link

Indecision and internalized authority figures

Kaj_Sotala · 6 Jul 2024 10:10 UTC
67 points
1 comment · 2 min read · LW link
(kajsotala.fi)

Timaeus is hiring!

12 Jul 2024 23:42 UTC
67 points
6 comments · 2 min read · LW link

What and Why: Developmental Interpretability of Reinforcement Learning

Garrett Baker · 9 Jul 2024 14:09 UTC
67 points
4 comments · 6 min read · LW link