Archive: August 2023 (Page 1)
Against Almost Every Theory of Impact of Interpretability · Charbel-Raphaël · Aug 17, 2023, 6:44 PM · 329 points · 90 comments · 26 min read · LW link · 2 reviews
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research · evhub, Nicholas Schiefer, Carson Denison, and Ethan Perez · Aug 8, 2023, 1:30 AM · 318 points · 30 comments · 18 min read · LW link · 1 review
Dear Self; we need to talk about ambition · Elizabeth · Aug 27, 2023, 11:10 PM · 270 points · 28 comments · 8 min read · LW link · 2 reviews · (acesounderglass.com)
My current LK99 questions · Eliezer Yudkowsky · Aug 1, 2023, 10:48 PM · 206 points · 38 comments · 5 min read · LW link
Feedbackloop-first Rationality · Raemon · Aug 7, 2023, 5:58 PM · 203 points · 67 comments · 8 min read · LW link · 2 reviews
Large Language Models will be Great for Censorship · Ethan Edwards · Aug 21, 2023, 7:03 PM · 185 points · 14 comments · 8 min read · LW link · (ethanedwards.substack.com)
OpenAI API base models are not sycophantic, at any size · nostalgebraist · Aug 29, 2023, 12:58 AM · 183 points · 20 comments · 2 min read · LW link · (colab.research.google.com)
A list of core AI safety problems and how I hope to solve them · davidad · Aug 26, 2023, 3:12 PM · 165 points · 29 comments · 5 min read · LW link
Password-locked models: a stress case for capabilities evaluation · Fabien Roger · Aug 3, 2023, 2:53 PM · 156 points · 14 comments · 6 min read · LW link
ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks · Beth Barnes · Aug 1, 2023, 6:30 PM · 153 points · 12 comments · 5 min read · LW link · (evals.alignment.org)
Assume Bad Faith · Zack_M_Davis · Aug 25, 2023, 5:36 PM · 151 points · 63 comments · 7 min read · LW link · 3 reviews
The U.S. is becoming less stable · lc · Aug 18, 2023, 9:13 PM · 147 points · 68 comments · 2 min read · LW link
6 non-obvious mental health issues specific to AI safety · Igor Ivanov · Aug 18, 2023, 3:46 PM · 147 points · 24 comments · 4 min read · LW link
The “public debate” about AI is confusing for the general public and for policymakers because it is a three-sided debate · Adam David Long · Aug 1, 2023, 12:08 AM · 146 points · 30 comments · 4 min read · LW link
Responses to apparent rationalist confusions about game / decision theory · Anthony DiGiovanni · Aug 30, 2023, 10:02 PM · 142 points · 20 comments · 12 min read · LW link · 1 review
Inflection.ai is a major AGI lab · Nikola Jurkovic · Aug 9, 2023, 1:05 AM · 137 points · 13 comments · 2 min read · LW link
Ten Thousand Years of Solitude · agp · Aug 15, 2023, 5:45 PM · 136 points · 19 comments · 4 min read · LW link · (www.discovermagazine.com)
Invulnerable Incomplete Preferences: A Formal Statement · SCP · Aug 30, 2023, 9:59 PM · 134 points · 39 comments · 35 min read · LW link
Book Launch: “The Carving of Reality,” Best of LessWrong vol. III · Raemon · Aug 16, 2023, 11:52 PM · 131 points · 22 comments · 5 min read · LW link
When discussing AI risks, talk about capabilities, not intelligence · Vika · Aug 11, 2023, 1:38 PM · 124 points · 7 comments · 3 min read · LW link · (vkrakovna.wordpress.com)
Introducing the Center for AI Policy (& we’re hiring!) · Thomas Larsen · Aug 28, 2023, 9:17 PM · 123 points · 50 comments · 2 min read · LW link · (www.aipolicy.us)
Report on Frontier Model Training · YafahEdelman · Aug 30, 2023, 8:02 PM · 122 points · 21 comments · 21 min read · LW link · (docs.google.com)
Summary of and Thoughts on the Hotz/Yudkowsky Debate · Zvi · Aug 16, 2023, 4:50 PM · 105 points · 47 comments · 9 min read · LW link · (thezvi.wordpress.com)
Biosecurity Culture, Computer Security Culture · jefftk · Aug 30, 2023, 4:40 PM · 103 points · 11 comments · 2 min read · LW link · (www.jefftk.com)
A Theory of Laughter · Steven Byrnes · Aug 23, 2023, 3:05 PM · 102 points · 14 comments · 28 min read · LW link
[Question] Exercise: Solve “Thinking Physics” · Raemon · Aug 1, 2023, 12:44 AM · 101 points · 30 comments · 5 min read · LW link · 1 review
What’s A “Market”? · johnswentworth · Aug 8, 2023, 11:29 PM · 94 points · 16 comments · 10 min read · LW link
Biological Anchors: The Trick that Might or Might Not Work · Scott Alexander · Aug 12, 2023, 12:53 AM · 91 points · 3 comments · 33 min read · LW link · (astralcodexten.substack.com)
LTFF and EAIF are unusually funding-constrained right now · Linch and calebp99 · Aug 30, 2023, 1:03 AM · 90 points · 24 comments · 15 min read · LW link · (forum.effectivealtruism.org)
We Should Prepare for a Larger Representation of Academia in AI Safety · Leon Lang · Aug 13, 2023, 6:03 PM · 90 points · 14 comments · 5 min read · LW link
Problems with Robin Hanson’s Quillette Article On AI · DaemonicSigil · Aug 6, 2023, 10:13 PM · 89 points · 33 comments · 8 min read · LW link
Dating Roundup #1: This is Why You’re Single · Zvi · Aug 29, 2023, 12:50 PM · 87 points · 28 comments · 38 min read · LW link · (thezvi.wordpress.com)
Decomposing independent generalizations in neural networks via Hessian analysis · Dmitry Vaintrob and Nina Panickssery · Aug 14, 2023, 5:04 PM · 84 points · 4 comments · 1 min read · LW link
My checklist for publishing a blog post · Steven Byrnes · Aug 15, 2023, 3:04 PM · 84 points · 6 comments · 3 min read · LW link
Stepping down as moderator on LW · Kaj_Sotala · Aug 14, 2023, 10:46 AM · 82 points · 1 comment · 1 min read · LW link
The Low-Hanging Fruit Prior and sloped valleys in the loss landscape · Dmitry Vaintrob and Nina Panickssery · Aug 23, 2023, 9:12 PM · 82 points · 1 comment · 13 min read · LW link
Long-Term Future Fund: April 2023 grant recommendations · abergal, calebp99, Linch, habryka, Thomas Larsen, and Vaniver · Aug 2, 2023, 7:54 AM · 81 points · 3 comments · 50 min read · LW link
The God of Humanity, and the God of the Robot Utilitarians · Raemon · Aug 24, 2023, 8:27 AM · 79 points · 13 comments · 2 min read · LW link · 1 review
The Economics of the Asteroid Deflection Problem (Dominant Assurance Contracts) · moyamo · Aug 29, 2023, 6:28 PM · 78 points · 71 comments · 15 min read · LW link
Digital brains beat biological ones because diffusion is too slow · GeneSmith · Aug 26, 2023, 2:22 AM · 78 points · 21 comments · 5 min read · LW link
An Interpretability Illusion for Activation Patching of Arbitrary Subspaces · Georg Lange, Alex Makelov, and Neel Nanda · Aug 29, 2023, 1:04 AM UTC · 77 points · 4 comments · 1 min read · LW link
A Proof of Löb’s Theorem using Computability Theory · jessicata · Aug 16, 2023, 6:57 PM UTC · 76 points · 0 comments · 17 min read · LW link · (unstableontology.com)
Computational Thread Art · CallumMcDougall · Aug 6, 2023, 9:42 PM UTC · 76 points · 2 comments · 6 min read · LW link
A plea for more funding shortfall transparency · porby · Aug 7, 2023, 9:33 PM UTC · 73 points · 4 comments · 2 min read · LW link
AI Forecasting: Two Years In · jsteinhardt · Aug 19, 2023, 11:40 PM UTC · 72 points · 15 comments · 11 min read · LW link · (bounded-regret.ghost.io)
AI pause/governance advocacy might be net-negative, especially without a focus on explaining x-risk · Mikhail Samin · Aug 27, 2023, 11:05 PM UTC · 72 points · 9 comments · 6 min read · LW link
Aumann-agreement is common · tailcalled · Aug 26, 2023, 8:22 PM UTC · 71 points · 33 comments · 7 min read · LW link · 1 review
When Omnipotence is Not Enough · lsusr · Aug 25, 2023, 7:50 PM UTC · 71 points · 4 comments · 2 min read · LW link · 1 review
Modulating sycophancy in an RLHF model via activation steering · Nina Panickssery · Aug 9, 2023, 7:06 AM UTC · 69 points · 20 comments · 12 min read · LW link
3 levels of threat obfuscation · HoldenKarnofsky · Aug 2, 2023, 2:58 PM UTC · 69 points · 14 comments · 7 min read · LW link