LessWrong Archive: July 2024
Reliable Sources: The Story of David Gerard · TracingWoodgrains · Jul 10, 2024, 7:50 PM · 390 points · 54 comments · 43 min read
Universal Basic Income and Poverty · Eliezer Yudkowsky · Jul 26, 2024, 7:23 AM · 321 points · 139 comments · 9 min read
80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly) · Raemon · Jul 3, 2024, 8:34 PM · 274 points · 71 comments
Towards more cooperative AI safety strategies · Richard_Ngo · Jul 16, 2024, 4:36 AM · 215 points · 133 comments · 4 min read
Superbabies: Putting The Pieces Together · sarahconstantin · Jul 11, 2024, 8:40 PM · 215 points · 37 comments · 10 min read · (sarahconstantin.substack.com)
Self-Other Overlap: A Neglected Approach to AI Alignment · Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg and AE Studio · Jul 30, 2024, 4:22 PM · 215 points · 51 comments · 12 min read
Optimistic Assumptions, Longterm Planning, and “Cope” · Raemon · Jul 17, 2024, 10:14 PM · 214 points · 46 comments · 7 min read
This is already your second chance · Malmesbury · Jul 28, 2024, 5:13 PM · 184 points · 13 comments · 8 min read
Safety consultations for AI lab employees · Zach Stein-Perlman · Jul 27, 2024, 3:00 PM · 181 points · 4 comments · 1 min read
Decomposing Agency — capabilities without desires · owencb and Raymond D · Jul 11, 2024, 9:38 AM · 153 points · 32 comments · 12 min read · (strangecities.substack.com)
On saying “Thank you” instead of “I’m Sorry” · Michael Cohn · Jul 8, 2024, 3:13 AM · 136 points · 16 comments · 3 min read
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 · Neel Nanda · Jul 7, 2024, 5:39 PM · 135 points · 16 comments · 25 min read
“AI achieves silver-medal standard solving International Mathematical Olympiad problems” · gjm · Jul 25, 2024, 3:58 PM · 133 points · 38 comments · 2 min read · (deepmind.google)
Pantheon Interface · NicholasKees and Sofia Vanhanen · Jul 8, 2024, 7:03 PM · 126 points · 22 comments · 6 min read
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team · Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill · Jul 18, 2024, 2:15 PM · 121 points · 18 comments · 18 min read
Efficient Dictionary Learning with Switch Sparse Autoencoders · Anish Mudide · Jul 22, 2024, 6:45 PM · 118 points · 20 comments · 12 min read
You should go to ML conferences · Jan_Kulveit · Jul 24, 2024, 11:47 AM · 112 points · 13 comments · 4 min read
Introduction to French AI Policy · Lucie Philippon · Jul 4, 2024, 3:39 AM · 111 points · 12 comments · 6 min read
OthelloGPT learned a bag of heuristics · jylin04, JackS, Adam Karvonen and Can · Jul 2, 2024, 9:12 AM · 111 points · 10 comments · 9 min read
Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, AlexMeinke, Owain_Evans and Marius Hobbhahn · Jul 8, 2024, 10:24 PM · 109 points · 37 comments · 5 min read
Most smart and skilled people are outside of the EA/rationalist community: an analysis · titotal · Jul 12, 2024, 12:13 PM · 109 points · 39 comments · (open.substack.com)
Poker is a bad game for teaching epistemics. Figgie is a better one. · rossry · Jul 8, 2024, 6:05 AM · 106 points · 47 comments · 11 min read · (blog.rossry.net)
Transformer Circuit Faithfulness Metrics Are Not Robust · Joseph Miller, bilalchughtai and William_S · Jul 12, 2024, 3:47 AM · 104 points · 5 comments · 7 min read · (arxiv.org)
I found >800 orthogonal “write code” steering vectors · Jacob G-W and TurnTrout · Jul 15, 2024, 7:06 PM · 102 points · 19 comments · 7 min read · (jacobgw.com)
A simple model of math skill · Alex_Altair · Jul 21, 2024, 6:57 PM · 101 points · 16 comments · 8 min read
Dialogue introduction to Singular Learning Theory · Olli Järviniemi · Jul 8, 2024, 4:58 PM · 100 points · 15 comments · 8 min read
Against Aschenbrenner: How ‘Situational Awareness’ constructs a narrative that undermines safety and threatens humanity · GideonF · Jul 15, 2024, 6:37 PM · 99 points · 17 comments · 21 min read · (forum.effectivealtruism.org)
A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication · johnswentworth and David Lorell · Jul 26, 2024, 12:33 AM · 93 points · 2 comments · 13 min read
What are you getting paid in? · Austin Chen · Jul 17, 2024, 7:23 PM · 92 points · 14 comments · 4 min read · (www.approachwithalacrity.com)
New page: Integrity · Zach Stein-Perlman · Jul 10, 2024, 3:00 PM · 91 points · 3 comments · 1 min read
Reflections on Less Online · Error · Jul 7, 2024, 3:49 AM · 89 points · 15 comments · 18 min read
Covert Malicious Finetuning · Tony Wang and dannyhalawi · Jul 2, 2024, 2:41 AM · 89 points · 4 comments · 3 min read
AI #73: Openly Evil AI · Zvi · Jul 18, 2024, 2:40 PM · 89 points · 20 comments · 52 min read · (thezvi.wordpress.com)
Re: Anthropic’s suggested SB-1047 amendments · RobertM · Jul 27, 2024, 10:32 PM · 87 points · 13 comments · 9 min read · (www.documentcloud.org)
Fluent, Cruxy Predictions · Raemon · Jul 10, 2024, 6:00 PM · 86 points · 14 comments · 14 min read
Decomposing the QK circuit with Bilinear Sparse Dictionary Learning · keith_wynroe and Lee Sharkey · Jul 2, 2024, 1:17 PM · 86 points · 7 comments · 12 min read
Scalable oversight as a quantitative rather than qualitative problem · Buck · Jul 6, 2024, 5:42 PM · 85 points · 11 comments · 3 min read
A simple case for extreme inner misalignment · Richard_Ngo · Jul 13, 2024, 3:40 PM · 84 points · 41 comments · 7 min read
3C’s: A Recipe For Mathing Concepts · johnswentworth and David Lorell · Jul 3, 2024, 1:06 AM · 81 points · 5 comments · 7 min read
On the CrowdStrike Incident · Zvi · Jul 22, 2024, 12:40 PM · 75 points · 14 comments · 17 min read · (thezvi.wordpress.com)
Interpreting Preference Models w/ Sparse Autoencoders · Logan Riggs and Jannik Brinkmann · 1 Jul 2024 21:35 UTC · 74 points · 12 comments · 9 min read
Multiplex Gene Editing: Where Are We Now? · sarahconstantin · 16 Jul 2024 20:50 UTC · 73 points · 6 comments · 7 min read · (sarahconstantin.substack.com)
D&D.Sci Scenario Index · aphyer and abstractapplic · 23 Jul 2024 2:00 UTC · 73 points · 0 comments · 2 min read
LK-99 in retrospect · bhauth · 7 Jul 2024 2:06 UTC · 72 points · 21 comments · 3 min read · (www.bhauth.com)
Yoshua Bengio: Reasoning through arguments against taking AI safety seriously · Judd Rosenblatt · 11 Jul 2024 23:53 UTC · 70 points · 3 comments · 1 min read · (yoshuabengio.org)
Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities · Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn and Jérémy Scheurer · 22 Jul 2024 16:17 UTC · 69 points · 0 comments · 16 min read
Indecision and internalized authority figures · Kaj_Sotala · 6 Jul 2024 10:10 UTC · 69 points · 1 comment · 2 min read · (kajsotala.fi)
An AI Race With China Can Be Better Than Not Racing · niplav · 2 Jul 2024 17:57 UTC · 69 points · 34 comments · 11 min read
What and Why: Developmental Interpretability of Reinforcement Learning · Garrett Baker · 9 Jul 2024 14:09 UTC · 68 points · 4 comments · 6 min read
Brief notes on the Wikipedia game · Olli Järviniemi · 14 Jul 2024 2:28 UTC · 68 points · 9 comments · 4 min read