LessWrong Archive: October 2023
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Zac Hatfield-Dodds · Oct 5, 2023, 9:01 PM · 288 points · 22 comments · 2 min read · 1 review · (transformer-circuits.pub)

Alignment Implications of LLM Successes: a Debate in One Act
Zack_M_Davis · Oct 21, 2023, 3:22 PM · 265 points · 56 comments · 13 min read · 2 reviews

Book Review: Going Infinite
Zvi · Oct 24, 2023, 3:00 PM · 242 points · 113 comments · 97 min read · 1 review · (thezvi.wordpress.com)

Announcing MIRI’s new CEO and leadership team
Gretta Duleba · Oct 10, 2023, 7:22 PM · 222 points · 52 comments · 3 min read

Thoughts on responsible scaling policies and regulation
paulfchristiano · Oct 24, 2023, 10:21 PM · 221 points · 33 comments · 6 min read

Labs should be explicit about why they are building AGI
peterbarnett · Oct 17, 2023, 9:09 PM · 210 points · 18 comments · 1 min read · 1 review

We’re Not Ready: thoughts on “pausing” and responsible scaling policies
HoldenKarnofsky · Oct 27, 2023, 3:19 PM · 200 points · 33 comments · 8 min read

Comp Sci in 2027 (Short story by Eliezer Yudkowsky)
sudo · Oct 29, 2023, 11:09 PM · 196 points · 24 comments · 10 min read · 1 review · (nitter.net)

Evaluating the historical value misspecification argument
Matthew Barnett · Oct 5, 2023, 6:34 PM · 190 points · 162 comments · 7 min read · 3 reviews

Announcing Timaeus
Jesse Hoogland, Daniel Murfet, Alexander Gietelink Oldenziel and Stan van Wingerden · Oct 22, 2023, 11:59 AM · 188 points · 15 comments · 4 min read

AI as a science, and three obstacles to alignment strategies
So8res · Oct 25, 2023, 9:00 PM · 187 points · 80 comments · 11 min read

Thomas Kwa’s MIRI research experience
Thomas Kwa, peterbarnett, Vivek Hebbar, Jeremy Gillen, Bird Concept and Raemon · Oct 2, 2023, 4:42 PM · 173 points · 53 comments · 1 min read

President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence
Tristan Williams · Oct 30, 2023, 11:15 AM · 171 points · 39 comments · (www.whitehouse.gov)

Architects of Our Own Demise: We Should Stop Developing AI Carelessly
Roko · Oct 26, 2023, 12:36 AM · 170 points · 75 comments · 3 min read

RSPs are pauses done right
evhub · Oct 14, 2023, 4:06 AM · 164 points · 73 comments · 7 min read · 1 review

Holly Elmore and Rob Miles dialogue on AI Safety Advocacy
Bird Concept, Robert Miles and Holly_Elmore · Oct 20, 2023, 9:04 PM · 162 points · 30 comments · 27 min read

Announcing Dialogues
Ben Pace · Oct 7, 2023, 2:57 AM · 155 points · 59 comments · 4 min read

Will no one rid me of this turbulent pest?
Metacelsus · Oct 14, 2023, 3:27 PM · 154 points · 23 comments · 10 min read · (denovo.substack.com)

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Simon Lermen and Jeffrey Ladish · Oct 12, 2023, 7:58 PM · 151 points · 29 comments · 14 min read

At 87, Pearl is still able to change his mind
rotatingpaguro · Oct 18, 2023, 4:46 AM · 148 points · 15 comments · 5 min read

Graphical tensor notation for interpretability
Jordan Taylor · Oct 4, 2023, 8:04 AM · 141 points · 11 comments · 19 min read

The 99% principle for personal problems
Kaj_Sotala · Oct 2, 2023, 8:20 AM · 139 points · 20 comments · 2 min read · (kajsotala.fi)

Comparing Anthropic’s Dictionary Learning to Ours
Robert_AIZI · Oct 7, 2023, 11:30 PM · 137 points · 8 comments · 4 min read

Don’t Dismiss Simple Alignment Approaches
Chris_Leong · Oct 7, 2023, 12:35 AM · 137 points · 9 comments · 4 min read

Response to Quintin Pope’s Evolution Provides No Evidence For the Sharp Left Turn
Zvi · Oct 5, 2023, 11:39 AM · 129 points · 29 comments · 9 min read

Goodhart’s Law in Reinforcement Learning
jacek, Joar Skalse, OliverHayman, charlie_griffin and Xingjian Bai · Oct 16, 2023, 12:54 AM · 126 points · 22 comments · 7 min read

Responsible Scaling Policies Are Risk Management Done Wrong
simeon_c · Oct 25, 2023, 11:46 PM · 123 points · 35 comments · 22 min read · 1 review · (www.navigatingrisks.ai)

I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines
307th · Oct 20, 2023, 4:37 PM · 122 points · 33 comments · 9 min read

Stampy’s AI Safety Info soft launch
steven0461 and Robert Miles · Oct 5, 2023, 10:13 PM · 120 points · 9 comments · 2 min read

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling
jdp · Oct 20, 2023, 7:32 AM · 119 points · 15 comments · 22 min read

unRLHF—Efficiently undoing LLM safeguards
Pranav Gade, Jeffrey Ladish and Simon Lermen · Oct 12, 2023, 7:58 PM · 117 points · 15 comments · 20 min read

A new intro to Quantum Physics, with the math fixed
titotal · Oct 29, 2023, 3:11 PM · 113 points · 24 comments · 17 min read · (titotal.substack.com)

The Witching Hour
Richard_Ngo · Oct 10, 2023, 12:19 AM · 113 points · 1 comment · 9 min read · (www.narrativeark.xyz)

Symbol/Referent Confusions in Language Model Alignment Experiments
johnswentworth · Oct 26, 2023, 7:49 PM · 112 points · 49 comments · 6 min read · 1 review

Improving the Welfare of AIs: A Nearcasted Proposal
ryan_greenblatt · Oct 30, 2023, 2:51 PM · 112 points · 7 comments · 20 min read · 1 review

Charbel-Raphaël and Lucius discuss interpretability
Mateusz Bagiński, Charbel-Raphaël and Lucius Bushnaq · Oct 30, 2023, 5:50 AM · 111 points · 7 comments · 21 min read

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation
Fabien Roger and Buck · Oct 23, 2023, 4:37 PM · 107 points · 3 comments · 8 min read

TOMORROW: the largest AI Safety protest ever!
Holly_Elmore · Oct 20, 2023, 6:15 PM · 105 points · 26 comments · 2 min read

Apply for MATS Winter 2023-24!
utilistrutil, Ryan Kidd and LauraVaughan · Oct 21, 2023, 2:27 AM · 104 points · 6 comments · 5 min read

Value systematization: how values become coherent (and misaligned)
Richard_Ngo · Oct 27, 2023, 7:06 PM · 102 points · 49 comments · 13 min read

What’s up with “Responsible Scaling Policies”?
habryka and ryan_greenblatt · Oct 29, 2023, 4:17 AM · 99 points · 9 comments · 20 min read · 1 review

Truthseeking when your disagreements lie in moral philosophy
Elizabeth and Tristan Williams · Oct 10, 2023, 12:00 AM · 99 points · 4 comments · 4 min read · (acesounderglass.com)

What’s Hard About The Shutdown Problem
johnswentworth · Oct 20, 2023, 9:13 PM · 98 points · 33 comments · 4 min read

[Question] Lying to chess players for alignment
Zane · Oct 25, 2023, 5:47 PM · 97 points · 54 comments · 1 min read

I don’t find the lie detection results that surprising (by an author of the paper)
JanB · Oct 4, 2023, 5:10 PM · 97 points · 8 comments · 3 min read

Investigating the learning coefficient of modular addition: hackathon project
Nina Panickssery and Dmitry Vaintrob · Oct 17, 2023, 7:51 PM · 94 points · 5 comments · 12 min read

Sam Altman’s sister claims Sam sexually abused her—Part 1: Introduction, outline, author’s notes
pythagoras5015 · Oct 7, 2023, 9:06 PM · 94 points · 108 comments · 8 min read

Trying to understand John Wentworth’s research agenda
johnswentworth, habryka and David Lorell · Oct 20, 2023, 12:05 AM · 93 points · 13 comments · 12 min read

You’re Measuring Model Complexity Wrong
Jesse Hoogland and Stan van Wingerden · Oct 11, 2023, 11:46 AM · 93 points · 17 comments · 13 min read

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper
Neel Nanda · Oct 23, 2023, 10:38 PM · 93 points · 12 comments · 9 min read