Archive: December 2024 (page 1)
Alignment Faking in Large Language Models · ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck · Dec 18, 2024, 5:19 PM · 483 points · 75 comments · 10 min read · LW link
Review: Planecrash · L Rudolf L · Dec 27, 2024, 2:18 PM · 358 points · 45 comments · 21 min read · LW link (nosetgauge.substack.com)
Biological risk from the mirror world · jasoncrawford · Dec 12, 2024, 7:07 PM · 333 points · 38 comments · 7 min read · LW link (newsletter.rootsofprogress.org)
What Goes Without Saying · sarahconstantin · Dec 20, 2024, 6:00 PM · 331 points · 28 comments · 5 min read · LW link (sarahconstantin.substack.com)
The Field of AI Alignment: A Postmortem, and What To Do About It · johnswentworth · Dec 26, 2024, 6:48 PM · 295 points · 160 comments · 8 min read · LW link
By default, capital will matter more than ever after AGI · L Rudolf L · Dec 28, 2024, 5:52 PM · 288 points · 100 comments · 16 min read · LW link (nosetgauge.substack.com)
Orienting to 3 year AGI timelines · Nikola Jurkovic · Dec 22, 2024, 1:15 AM · 277 points · 51 comments · 8 min read · LW link
A Three-Layer Model of LLM Psychology · Jan_Kulveit · Dec 26, 2024, 4:49 PM · 217 points · 13 comments · 8 min read · LW link
Understanding Shapley Values with Venn Diagrams · Carson L · Dec 6, 2024, 9:56 PM · 214 points · 34 comments · LW link (medium.com)
Communications in Hard Mode (My new job at MIRI) · tanagrabeast · Dec 13, 2024, 8:13 PM · 204 points · 25 comments · 5 min read · LW link
Frontier Models are Capable of In-context Scheming · Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer and Mikita Balesni · Dec 5, 2024, 10:11 PM · 203 points · 24 comments · 7 min read · LW link
Shallow review of technical AI safety, 2024 · technicalities, Stag, Stephen McAleese, jordine and Dr. David Mathers · Dec 29, 2024, 12:01 PM · 185 points · 34 comments · 41 min read · LW link
When Is Insurance Worth It? · kqr · Dec 19, 2024, 7:07 PM · 173 points · 71 comments · 4 min read · LW link (entropicthoughts.com)
o1: A Technical Primer · Jesse Hoogland · Dec 9, 2024, 7:09 PM · 170 points · 19 comments · 9 min read · LW link (www.youtube.com)
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks · cloud, Jacob G-W, Evzen, Joseph Miller and TurnTrout · Dec 6, 2024, 10:19 PM · 165 points · 12 comments · 11 min read · LW link (arxiv.org)
Subskills of “Listening to Wisdom” · Raemon · Dec 9, 2024, 3:01 AM · 154 points · 29 comments · 42 min read · LW link
o3 · Zach Stein-Perlman · Dec 20, 2024, 6:30 PM · 154 points · 164 comments · 1 min read · LW link
“Alignment Faking” frame is somewhat fake · Jan_Kulveit · Dec 20, 2024, 9:51 AM · 151 points · 13 comments · 6 min read · LW link
What o3 Becomes by 2028 · Vladimir_Nesov · Dec 22, 2024, 12:37 PM · 147 points · 15 comments · 5 min read · LW link
The “Think It Faster” Exercise · Raemon · Dec 11, 2024, 7:14 PM · 144 points · 35 comments · 13 min read · LW link
Hire (or Become) a Thinking Assistant · Raemon · Dec 23, 2024, 3:58 AM · 137 points · 49 comments · 8 min read · LW link
The Dangers of Mirrored Life · Niko_McCarty and fin · Dec 12, 2024, 8:58 PM · 119 points · 9 comments · 29 min read · LW link (www.asimov.press)
The Dream Machine · sarahconstantin · Dec 5, 2024, 12:00 AM · 117 points · 6 comments · 12 min read · LW link (sarahconstantin.substack.com)
The o1 System Card Is Not About o1 · Zvi · Dec 13, 2024, 8:30 PM · 116 points · 5 comments · 16 min read · LW link (thezvi.wordpress.com)
Ablations for “Frontier Models are Capable of In-context Scheming” · AlexMeinke, Bronson Schoen, Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and rusheb · Dec 17, 2024, 11:58 PM · 115 points · 1 comment · 2 min read · LW link
AIs Will Increasingly Attempt Shenanigans · Zvi · Dec 16, 2024, 3:20 PM · 114 points · 2 comments · 26 min read · LW link (thezvi.wordpress.com)
How to replicate and extend our alignment faking demo · Fabien Roger · Dec 19, 2024, 9:44 PM · 113 points · 5 comments · 2 min read · LW link (alignment.anthropic.com)
Why I’m Moving from Mechanistic to Prosaic Interpretability · Daniel Tan · Dec 30, 2024, 6:35 AM · 113 points · 34 comments · 5 min read · LW link
Sorry for the downtime, looks like we got DDosd · habryka · Dec 2, 2024, 4:14 AM · 112 points · 13 comments · 1 min read · LW link
Takes on “Alignment Faking in Large Language Models” · Joe Carlsmith · Dec 18, 2024, 6:22 PM · 105 points · 7 comments · 62 min read · LW link
A shortcoming of concrete demonstrations as AGI risk advocacy · Steven Byrnes · Dec 11, 2024, 4:48 PM · 105 points · 27 comments · 2 min read · LW link
A breakdown of AI capability levels focused on AI R&D labor acceleration · ryan_greenblatt · Dec 22, 2024, 8:56 PM · 104 points · 5 comments · 6 min read · LW link
[Question] What are the strongest arguments for very short timelines? · Kaj_Sotala · Dec 23, 2024, 9:38 AM · 101 points · 79 comments · 1 min read · LW link
2024 Unofficial LessWrong Census/Survey · Screwtape · Dec 2, 2024, 5:30 AM · 101 points · 49 comments · 1 min read · LW link
The nihilism of NeurIPS · charlieoneill · Dec 20, 2024, 11:58 PM · 100 points · 7 comments · 4 min read · LW link
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models · Andrew Mack and TurnTrout · Dec 3, 2024, 9:19 PM · 100 points · 7 comments · 41 min read · LW link
Matryoshka Sparse Autoencoders · Noa Nabeshima · Dec 14, 2024, 2:52 AM · 98 points · 15 comments · 11 min read · LW link
MIRI’s 2024 End-of-Year Update · Rob Bensinger · Dec 3, 2024, 4:33 AM · 98 points · 2 comments · 4 min read · LW link
Should you be worried about H5N1? · gw · Dec 5, 2024, 9:11 PM · 89 points · 2 comments · 5 min read · LW link (www.georgeyw.com)
AIs Will Increasingly Fake Alignment · Zvi · Dec 24, 2024, 1:00 PM · 89 points · 0 comments · 52 min read · LW link (thezvi.wordpress.com)
Is “VNM-agent” one of several options, for what minds can grow up into? · AnnaSalamon · Dec 30, 2024, 6:36 AM · 89 points · 55 comments · 2 min read · LW link
Parable of the vanilla ice cream curse (and how it would prevent a car from starting!) · Mati_Roy · Dec 8, 2024, 6:57 AM · 89 points · 21 comments · 3 min read · LW link
🇫🇷 Announcing CeSIA: The French Center for AI Safety · Charbel-Raphaël · Dec 20, 2024, 2:17 PM · 88 points · 2 comments · 8 min read · LW link
Circling as practice for “just be yourself” · Kaj_Sotala · Dec 16, 2024, 7:40 AM · 86 points · 5 comments · 4 min read · LW link (kajsotala.fi)
Some arguments against a land value tax · Matthew Barnett · Dec 29, 2024, 3:17 PM · 83 points · 40 comments · 15 min read · LW link
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders · Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda · Dec 11, 2024, 6:30 AM · 82 points · 6 comments · 2 min read · LW link (www.neuronpedia.org)
Effective Evil’s AI Misalignment Plan · lsusr · Dec 15, 2024, 7:39 AM · 82 points · 9 comments · 3 min read · LW link
Testing which LLM architectures can do hidden serial reasoning · Filip Sondej · Dec 16, 2024, 1:48 PM · 81 points · 9 comments · 4 min read · LW link
Remap your caps lock key · bilalchughtai · Dec 15, 2024, 2:03 PM · 80 points · 18 comments · 1 min read · LW link
Best-of-N Jailbreaking · John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, Fazl, Henry Sleight, Ethan Perez and mrinank_sharma · Dec 14, 2024, 4:58 AM · 78 points · 5 comments · 2 min read · LW link (arxiv.org)