LessWrong Archive: July 2024
Reliable Sources: The Story of David Gerard · TracingWoodgrains · Jul 10, 2024, 7:50 PM · 390 points · 54 comments · 43 min read
Universal Basic Income and Poverty · Eliezer Yudkowsky · Jul 26, 2024, 7:23 AM · 321 points · 139 comments · 9 min read
80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly) · Raemon · Jul 3, 2024, 8:34 PM · 274 points · 71 comments
Towards more cooperative AI safety strategies · Richard_Ngo · Jul 16, 2024, 4:36 AM · 215 points · 133 comments · 4 min read
Superbabies: Putting The Pieces Together · sarahconstantin · Jul 11, 2024, 8:40 PM · 215 points · 37 comments · 10 min read · (sarahconstantin.substack.com)
Self-Other Overlap: A Neglected Approach to AI Alignment · Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg and AE Studio · Jul 30, 2024, 4:22 PM · 215 points · 51 comments · 12 min read
Optimistic Assumptions, Longterm Planning, and “Cope” · Raemon · Jul 17, 2024, 10:14 PM · 214 points · 46 comments · 7 min read
This is already your second chance · Malmesbury · Jul 28, 2024, 5:13 PM · 184 points · 13 comments · 8 min read
Safety consultations for AI lab employees · Zach Stein-Perlman · Jul 27, 2024, 3:00 PM · 181 points · 4 comments · 1 min read
Decomposing Agency — capabilities without desires · owencb and Raymond D · Jul 11, 2024, 9:38 AM · 153 points · 32 comments · 12 min read · (strangecities.substack.com)
On saying “Thank you” instead of “I’m Sorry” · Michael Cohn · Jul 8, 2024, 3:13 AM · 136 points · 16 comments · 3 min read
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 · Neel Nanda · Jul 7, 2024, 5:39 PM · 135 points · 16 comments · 25 min read
“AI achieves silver-medal standard solving International Mathematical Olympiad problems” · gjm · Jul 25, 2024, 3:58 PM · 133 points · 38 comments · 2 min read · (deepmind.google)
Pantheon Interface · NicholasKees and Sofia Vanhanen · Jul 8, 2024, 7:03 PM · 126 points · 22 comments · 6 min read
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill · Jul 18, 2024, 2:15 PM · 121 points · 18 comments · 18 min read
Efficient Dictionary Learning with Switch Sparse Autoencoders · Anish Mudide · Jul 22, 2024, 6:45 PM · 118 points · 20 comments · 12 min read
You should go to ML conferences · Jan_Kulveit · Jul 24, 2024, 11:47 AM · 112 points · 13 comments · 4 min read
Introduction to French AI Policy · Lucie Philippon · Jul 4, 2024, 3:39 AM · 111 points · 12 comments · 6 min read
OthelloGPT learned a bag of heuristics · jylin04, JackS, Adam Karvonen and Can · Jul 2, 2024, 9:12 AM · 111 points · 10 comments · 9 min read
Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, AlexMeinke, Owain_Evans and Marius Hobbhahn · Jul 8, 2024, 10:24 PM · 109 points · 37 comments · 5 min read
Most smart and skilled people are outside of the EA/rationalist community: an analysis · titotal · Jul 12, 2024, 12:13 PM · 109 points · 39 comments · (open.substack.com)
Poker is a bad game for teaching epistemics. Figgie is a better one. · rossry · Jul 8, 2024, 6:05 AM · 106 points · 47 comments · 11 min read · (blog.rossry.net)
Transformer Circuit Faithfulness Metrics Are Not Robust · Joseph Miller, bilalchughtai and William_S · Jul 12, 2024, 3:47 AM · 104 points · 5 comments · 7 min read · (arxiv.org)
I found >800 orthogonal “write code” steering vectors · Jacob G-W and TurnTrout · Jul 15, 2024, 7:06 PM · 102 points · 19 comments · 7 min read · (jacobgw.com)
A simple model of math skill · Alex_Altair · Jul 21, 2024, 6:57 PM · 101 points · 16 comments · 8 min read
Dialogue introduction to Singular Learning Theory · Olli Järviniemi · Jul 8, 2024, 4:58 PM · 100 points · 15 comments · 8 min read
Against Aschenbrenner: How ‘Situational Awareness’ constructs a narrative that undermines safety and threatens humanity · GideonF · Jul 15, 2024, 6:37 PM · 99 points · 17 comments · 21 min read · (forum.effectivealtruism.org)
A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication · johnswentworth and David Lorell · Jul 26, 2024, 12:33 AM · 93 points · 2 comments · 13 min read
What are you getting paid in? · Austin Chen · Jul 17, 2024, 7:23 PM · 92 points · 14 comments · 4 min read · (www.approachwithalacrity.com)
New page: Integrity · Zach Stein-Perlman · Jul 10, 2024, 3:00 PM · 91 points · 3 comments · 1 min read
Reflections on Less Online · Error · Jul 7, 2024, 3:49 AM · 89 points · 15 comments · 18 min read
Covert Malicious Finetuning · Tony Wang and dannyhalawi · Jul 2, 2024, 2:41 AM · 89 points · 4 comments · 3 min read
AI #73: Openly Evil AI · Zvi · Jul 18, 2024, 2:40 PM · 89 points · 20 comments · 52 min read · (thezvi.wordpress.com)
Re: Anthropic’s suggested SB-1047 amendments · RobertM · Jul 27, 2024, 10:32 PM · 87 points · 13 comments · 9 min read · (www.documentcloud.org)
Fluent, Cruxy Predictions · Raemon · Jul 10, 2024, 6:00 PM · 86 points · 14 comments · 14 min read
Decomposing the QK circuit with Bilinear Sparse Dictionary Learning · keith_wynroe and Lee Sharkey · Jul 2, 2024, 1:17 PM · 86 points · 7 comments · 12 min read
Scalable oversight as a quantitative rather than qualitative problem · Buck · Jul 6, 2024, 5:42 PM · 85 points · 11 comments · 3 min read
A simple case for extreme inner misalignment · Richard_Ngo · Jul 13, 2024, 3:40 PM · 84 points · 41 comments · 7 min read
3C’s: A Recipe For Mathing Concepts · johnswentworth and David Lorell · Jul 3, 2024, 1:06 AM · 81 points · 5 comments · 7 min read
On the CrowdStrike Incident · Zvi · Jul 22, 2024, 12:40 PM · 75 points · 14 comments · 17 min read · (thezvi.wordpress.com)
Interpreting Preference Models w/ Sparse Autoencoders · Logan Riggs and Jannik Brinkmann · 1 Jul 2024 21:35 UTC · 74 points · 12 comments · 9 min read
Multiplex Gene Editing: Where Are We Now? · sarahconstantin · 16 Jul 2024 20:50 UTC · 73 points · 6 comments · 7 min read · (sarahconstantin.substack.com)
D&D.Sci Scenario Index · aphyer and abstractapplic · 23 Jul 2024 2:00 UTC · 73 points · 0 comments · 2 min read
LK-99 in retrospect · bhauth · 7 Jul 2024 2:06 UTC · 72 points · 21 comments · 3 min read · (www.bhauth.com)
Yoshua Bengio: Reasoning through arguments against taking AI safety seriously · Judd Rosenblatt · 11 Jul 2024 23:53 UTC · 70 points · 3 comments · 1 min read · (yoshuabengio.org)
Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities · Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn and Jérémy Scheurer · 22 Jul 2024 16:17 UTC · 69 points · 0 comments · 16 min read
Indecision and internalized authority figures · Kaj_Sotala · 6 Jul 2024 10:10 UTC · 69 points · 1 comment · 2 min read · (kajsotala.fi)
An AI Race With China Can Be Better Than Not Racing · niplav · 2 Jul 2024 17:57 UTC · 69 points · 34 comments · 11 min read
What and Why: Developmental Interpretability of Reinforcement Learning · Garrett Baker · 9 Jul 2024 14:09 UTC · 68 points · 4 comments · 6 min read
Brief notes on the Wikipedia game · Olli Järviniemi · 14 Jul 2024 2:28 UTC · 68 points · 9 comments · 4 min read