Reliable Sources: The Story of David Gerard

TracingWoodgrains · Jul 10, 2024, 7:50 PM
390 points
54 comments · 43 min read · LW link

Universal Basic Income and Poverty

Eliezer Yudkowsky · Jul 26, 2024, 7:23 AM
321 points
139 comments · 9 min read · LW link

80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)

Raemon · Jul 3, 2024, 8:34 PM
274 points
71 comments · LW link

Towards more cooperative AI safety strategies

Richard_Ngo · Jul 16, 2024, 4:36 AM
215 points
133 comments · 4 min read · LW link

Superbabies: Putting The Pieces Together

sarahconstantin · Jul 11, 2024, 8:40 PM
215 points
37 comments · 10 min read · LW link
(sarahconstantin.substack.com)

Self-Other Overlap: A Neglected Approach to AI Alignment

Jul 30, 2024, 4:22 PM
215 points
51 comments · 12 min read · LW link

Optimistic Assumptions, Longterm Planning, and “Cope”

Raemon · Jul 17, 2024, 10:14 PM
214 points
46 comments · 7 min read · LW link

This is already your second chance

Malmesbury · Jul 28, 2024, 5:13 PM
184 points
13 comments · 8 min read · LW link

Safety consultations for AI lab employees

Zach Stein-Perlman · Jul 27, 2024, 3:00 PM
181 points
4 comments · 1 min read · LW link

Decomposing Agency — capabilities without desires

Jul 11, 2024, 9:38 AM
153 points
32 comments · 12 min read · LW link
(strangecities.substack.com)

On saying “Thank you” instead of “I’m Sorry”

Michael Cohn · Jul 8, 2024, 3:13 AM
136 points
16 comments · 3 min read · LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda · Jul 7, 2024, 5:39 PM
135 points
16 comments · 25 min read · LW link

“AI achieves silver-medal standard solving International Mathematical Olympiad problems”

gjm · Jul 25, 2024, 3:58 PM
133 points
38 comments · 2 min read · LW link
(deepmind.google)

Pantheon Interface

Jul 8, 2024, 7:03 PM
126 points
22 comments · 6 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Jul 18, 2024, 2:15 PM
121 points
18 comments · 18 min read · LW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide · Jul 22, 2024, 6:45 PM
118 points
20 comments · 12 min read · LW link

You should go to ML conferences

Jan_Kulveit · Jul 24, 2024, 11:47 AM
112 points
13 comments · 4 min read · LW link

Introduction to French AI Policy

Lucie Philippon · Jul 4, 2024, 3:39 AM
111 points
12 comments · 6 min read · LW link

OthelloGPT learned a bag of heuristics

Jul 2, 2024, 9:12 AM
111 points
10 comments · 9 min read · LW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

Jul 8, 2024, 10:24 PM
109 points
37 comments · 5 min read · LW link

Most smart and skilled people are outside of the EA/rationalist community: an analysis

titotal · Jul 12, 2024, 12:13 PM
109 points
39 comments · LW link
(open.substack.com)

Poker is a bad game for teaching epistemics. Figgie is a better one.

rossry · Jul 8, 2024, 6:05 AM
106 points
47 comments · 11 min read · LW link
(blog.rossry.net)

Transformer Circuit Faithfulness Metrics Are Not Robust

Jul 12, 2024, 3:47 AM
104 points
5 comments · 7 min read · LW link
(arxiv.org)

I found >800 orthogonal “write code” steering vectors

Jul 15, 2024, 7:06 PM
102 points
19 comments · 7 min read · LW link
(jacobgw.com)

A simple model of math skill

Alex_Altair · Jul 21, 2024, 6:57 PM
101 points
16 comments · 8 min read · LW link

Dialogue introduction to Singular Learning Theory

Olli Järviniemi · Jul 8, 2024, 4:58 PM
100 points
15 comments · 8 min read · LW link

Against Aschenbrenner: How ‘Situational Awareness’ constructs a narrative that undermines safety and threatens humanity

GideonF · Jul 15, 2024, 6:37 PM
99 points
17 comments · 21 min read · LW link
(forum.effectivealtruism.org)

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication

Jul 26, 2024, 12:33 AM
93 points
2 comments · 13 min read · LW link

What are you getting paid in?

Austin Chen · Jul 17, 2024, 7:23 PM
92 points
14 comments · 4 min read · LW link
(www.approachwithalacrity.com)

New page: Integrity

Zach Stein-Perlman · Jul 10, 2024, 3:00 PM
91 points
3 comments · 1 min read · LW link

Reflections on Less Online

Error · Jul 7, 2024, 3:49 AM
89 points
15 comments · 18 min read · LW link

Covert Malicious Finetuning

Jul 2, 2024, 2:41 AM
89 points
4 comments · 3 min read · LW link

AI #73: Openly Evil AI

Zvi · Jul 18, 2024, 2:40 PM
89 points
20 comments · 52 min read · LW link
(thezvi.wordpress.com)

Re: Anthropic’s suggested SB-1047 amendments

RobertM · Jul 27, 2024, 10:32 PM
87 points
13 comments · 9 min read · LW link
(www.documentcloud.org)

Fluent, Cruxy Predictions

Raemon · Jul 10, 2024, 6:00 PM
86 points
14 comments · 14 min read · LW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

Jul 2, 2024, 1:17 PM
86 points
7 comments · 12 min read · LW link

Scalable oversight as a quantitative rather than qualitative problem

Buck · Jul 6, 2024, 5:42 PM
85 points
11 comments · 3 min read · LW link

A simple case for extreme inner misalignment

Richard_Ngo · Jul 13, 2024, 3:40 PM
84 points
41 comments · 7 min read · LW link

3C’s: A Recipe For Mathing Concepts

Jul 3, 2024, 1:06 AM
81 points
5 comments · 7 min read · LW link

On the CrowdStrike Incident

Zvi · Jul 22, 2024, 12:40 PM
75 points
14 comments · 17 min read · LW link
(thezvi.wordpress.com)

Interpreting Preference Models w/ Sparse Autoencoders

1 Jul 2024 21:35 UTC
74 points
12 comments · 9 min read · LW link

Multiplex Gene Editing: Where Are We Now?

sarahconstantin · 16 Jul 2024 20:50 UTC
73 points
6 comments · 7 min read · LW link
(sarahconstantin.substack.com)

D&D.Sci Scenario Index

23 Jul 2024 2:00 UTC
73 points
0 comments · 2 min read · LW link

LK-99 in retrospect

bhauth · 7 Jul 2024 2:06 UTC
72 points
21 comments · 3 min read · LW link
(www.bhauth.com)

Yoshua Bengio: Reasoning through arguments against taking AI safety seriously

Judd Rosenblatt · 11 Jul 2024 23:53 UTC
70 points
3 comments · 1 min read · LW link
(yoshuabengio.org)

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

22 Jul 2024 16:17 UTC
69 points
0 comments · 16 min read · LW link

Indecision and internalized authority figures

Kaj_Sotala · 6 Jul 2024 10:10 UTC
69 points
1 comment · 2 min read · LW link
(kajsotala.fi)

An AI Race With China Can Be Better Than Not Racing

niplav · 2 Jul 2024 17:57 UTC
69 points
34 comments · 11 min read · LW link

What and Why: Developmental Interpretability of Reinforcement Learning

Garrett Baker · 9 Jul 2024 14:09 UTC
68 points
4 comments · 6 min read · LW link

Brief notes on the Wikipedia game

Olli Järviniemi · 14 Jul 2024 2:28 UTC
68 points
9 comments · 4 min read · LW link