Archive: January 2024 (page 1)
There is way too much serendipity · Malmesbury · Jan 19, 2024, 7:37 PM · 376 points · 56 comments · 7 min read · LW link
Gentleness and the artificial Other · Joe Carlsmith · Jan 2, 2024, 6:21 PM · 313 points · 33 comments · 11 min read · LW link
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez · Jan 12, 2024, 7:51 PM · 305 points · 95 comments · 3 min read · LW link (arxiv.org)
The case for ensuring that powerful AIs are controlled · ryan_greenblatt and Buck · Jan 24, 2024, 4:11 PM · 275 points · 73 comments · 28 min read · LW link
MIRI 2024 Mission and Strategy Update · Malo · Jan 5, 2024, 12:20 AM · 223 points · 44 comments · 8 min read · LW link
Toward A Mathematical Framework for Computation in Superposition · Dmitry Vaintrob, jake_mendel and Kaarel · Jan 18, 2024, 9:06 PM · 204 points · 18 comments · 63 min read · LW link
The impossible problem of due process · mingyuan · Jan 16, 2024, 5:18 AM · 197 points · 64 comments · 14 min read · LW link
This might be the last AI Safety Camp · Remmelt and Linda Linsefors · Jan 24, 2024, 9:33 AM · 196 points · 34 comments · 1 min read · LW link
Introducing Alignment Stress-Testing at Anthropic · evhub · Jan 12, 2024, 11:51 PM · 182 points · 23 comments · 2 min read · LW link
Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · Jeremy Gillen and peterbarnett · Jan 26, 2024, 7:22 AM · 161 points · 60 comments · 57 min read · LW link
Making every researcher seek grants is a broken model · jasoncrawford · Jan 26, 2024, 4:06 PM · 159 points · 41 comments · 4 min read · LW link (rootsofprogress.org)
What’s up with LLMs representing XORs of arbitrary features? · Sam Marks · Jan 3, 2024, 7:44 PM · 158 points · 63 comments · 16 min read · LW link
Apologizing is a Core Rationalist Skill · johnswentworth · Jan 2, 2024, 5:47 PM · 156 points · 42 comments · 5 min read · LW link
Deep atheism and AI risk · Joe Carlsmith · Jan 4, 2024, 6:58 PM · 153 points · 22 comments · 27 min read · LW link
What good is G-factor if you’re dumped in the woods? A field report from a camp counselor. · Hastings · Jan 12, 2024, 1:17 PM · 145 points · 22 comments · 1 min read · LW link
Notice When People Are Directionally Correct · Chris_Leong · Jan 14, 2024, 2:12 PM · 136 points · 8 comments · 2 min read · LW link
Processor clock speeds are not how fast AIs think · Ege Erdil · Jan 29, 2024, 2:39 PM · 135 points · 55 comments · 2 min read · LW link
The case for training frontier AIs on Sumerian-only corpus · Alexandre Variengien, Charbel-Raphaël and Jonathan Claybrough · Jan 15, 2024, 4:40 PM · 130 points · 16 comments · 3 min read · LW link
Steering Llama-2 with contrastive activation additions · Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout · Jan 2, 2024, 12:47 AM · 125 points · 29 comments · 8 min read · LW link (arxiv.org)
An even deeper atheism · Joe Carlsmith · Jan 11, 2024, 5:28 PM · 125 points · 47 comments · 15 min read · LW link
A Shutdown Problem Proposal · johnswentworth and David Lorell · Jan 21, 2024, 6:12 PM · 125 points · 61 comments · 6 min read · LW link
Why I take short timelines seriously · NicholasKees · Jan 28, 2024, 10:27 PM · 122 points · 29 comments · 4 min read · LW link
The case for more ambitious language model evals · Jozdien · Jan 30, 2024, 12:01 AM · 117 points · 30 comments · 5 min read · LW link
Gender Exploration · sapphire · Jan 14, 2024, 6:57 PM · 117 points · 26 comments · 5 min read · LW link (open.substack.com)
Four visions of Transformative AI success · Steven Byrnes · Jan 17, 2024, 8:45 PM · 112 points · 22 comments · 15 min read · LW link
Practically A Book Review: Appendix to “Nonlinear’s Evidence: Debunking False and Misleading Claims” (ThingOfThings) · tailcalled · Jan 3, 2024, 5:07 PM · 111 points · 25 comments · 2 min read · LW link (thingofthings.substack.com)
Catching AIs red-handed · ryan_greenblatt and Buck · Jan 5, 2024, 5:43 PM · 110 points · 27 comments · 17 min read · LW link
' petertodd''s last stand: The final days of open GPT-3 research · mwatkins · Jan 22, 2024, 6:47 PM · 109 points · 16 comments · 45 min read · LW link
Being nicer than Clippy · Joe Carlsmith · Jan 16, 2024, 7:44 PM · 109 points · 32 comments · 27 min read · LW link
2023 in AI predictions · jessicata · Jan 1, 2024, 5:23 AM · 107 points · 35 comments · 5 min read · LW link
Deceptive AI ≠ Deceptively-aligned AI · Steven Byrnes · Jan 7, 2024, 4:55 PM · 96 points · 19 comments · 6 min read · LW link
Almost everyone I’ve met would be well-served thinking more about what to focus on · Henrik Karlsson · Jan 5, 2024, 9:01 PM · 96 points · 8 comments · 11 min read · LW link (www.henrikkarlsson.xyz)
RAND report finds no effect of current LLMs on viability of bioterrorism attacks · StellaAthena · Jan 25, 2024, 7:17 PM · 94 points · 14 comments · 1 min read · LW link (www.rand.org)
On the abolition of man · Joe Carlsmith · Jan 18, 2024, 6:17 PM · 90 points · 18 comments · 41 min read · LW link
The Aspiring Rationalist Congregation · maia · Jan 10, 2024, 10:52 PM · 86 points · 23 comments · 10 min read · LW link
Sparse Autoencoders Work on Attention Layer Outputs · Connor Kissane, robertzk, Arthur Conmy and Neel Nanda · Jan 16, 2024, 12:26 AM · 83 points · 9 comments · 18 min read · LW link
Some Vacation Photos · johnswentworth · Jan 4, 2024, 5:15 PM · 83 points · 0 comments · 1 min read · LW link
An Introduction To The Mandelbrot Set That Doesn’t Mention Complex Numbers · Yitz · Jan 17, 2024, 9:48 AM · 82 points · 11 comments · 9 min read · LW link
Palworld development blog post · bhauth · Jan 28, 2024, 5:56 AM · 82 points · 12 comments · 1 min read · LW link (note.com)
Survey of 2,778 AI authors: six parts in pictures · KatjaGrace · Jan 6, 2024, 4:43 AM · 80 points · 1 comment · 2 min read · LW link
[Repost] The Copenhagen Interpretation of Ethics · mesaoptimizer · Jan 25, 2024, 3:20 PM · 77 points · 4 comments · 5 min read · LW link (web.archive.org)
Universal Love Integration Test: Hitler · Raemon · Jan 10, 2024, 11:55 PM · 76 points · 65 comments · 9 min read · LW link
When “yang” goes wrong · Joe Carlsmith · Jan 8, 2024, 4:35 PM · 73 points · 6 comments · 13 min read · LW link
Epistemic Hell · rogersbacon · Jan 27, 2024, 5:13 PM · 71 points · 20 comments · 14 min read · LW link
We need a Science of Evals · Marius Hobbhahn and Jérémy Scheurer · Jan 22, 2024, 8:30 PM · 71 points · 13 comments · 9 min read · LW link
The True Story of How GPT-2 Became Maximally Lewd · Writer and Jai · Jan 18, 2024, 9:03 PM · 70 points · 7 comments · 6 min read · LW link (youtu.be)
InterLab – a toolkit for experiments with multi-agent interactions · Tomáš Gavenčiak, Ada Böhm and Jan_Kulveit · Jan 22, 2024, 6:23 PM · 69 points · 0 comments · 8 min read · LW link (acsresearch.org)
[Question] Will quantum randomness affect the 2028 election? · Thomas Kwa and habryka · Jan 24, 2024, 10:54 PM · 66 points · 52 comments · 1 min read · LW link
OpenAI’s Preparedness Framework: Praise & Recommendations · Orpheus16 · Jan 2, 2024, 4:20 PM · 66 points · 1 comment · 7 min read · LW link
The Perceptron Controversy · Yuxi_Liu · Jan 10, 2024, 11:07 PM · 65 points · 18 comments · 1 min read · LW link (yuxi-liu-wired.github.io)