LessWrong’s (first) album: I Have Been A Good Bing
habryka
and
kave
Apr 1, 2024, 7:33 AM
571
points
181
comments
11
min read
LW
link
Transformers Represent Belief State Geometry in their Residual Stream
Adam Shai
Apr 16, 2024, 9:16 PM
419
points
100
comments
12
min read
LW
link
Thoughts on seed oil
dynomight
Apr 20, 2024, 12:29 PM
356
points
129
comments
17
min read
LW
link
(dynomight.net)
[April Fools’ Day] Introducing Open Asteroid Impact
Linch
Apr 1, 2024, 8:14 AM
337
points
29
comments
LW
link
(openasteroidimpact.org)
Express interest in an “FHI of the West”
habryka
Apr 18, 2024, 3:32 AM
268
points
41
comments
3
min read
LW
link
Paul Christiano named as US AI Safety Institute Head of AI Safety
Joel Burget
Apr 16, 2024, 4:22 PM
256
points
58
comments
1
min read
LW
link
(www.commerce.gov)
Refusal in LLMs is mediated by a single direction
Andy Arditi
,
Oscar Obeso
,
Aaquib111
,
wesg
and
Neel Nanda
Apr 27, 2024, 11:13 AM
246
points
95
comments
10
min read
LW
link
Funny Anecdote of Eliezer From His Sister
Noah Birnbaum
Apr 22, 2024, 10:05 PM
207
points
6
comments
2
min read
LW
link
[Question]
Examples of Highly Counterfactual Discoveries?
johnswentworth
Apr 23, 2024, 10:19 PM
195
points
102
comments
1
min read
LW
link
On Not Pulling The Ladder Up Behind You
Screwtape
Apr 26, 2024, 9:58 PM
189
points
21
comments
9
min read
LW
link
OMMC Announces RIP
Adam Scholl
and
aysja
Apr 1, 2024, 11:20 PM
189
points
5
comments
2
min read
LW
link
Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer
johnswentworth
and
David Lorell
Apr 18, 2024, 12:27 AM
185
points
21
comments
7
min read
LW
link
FHI (Future of Humanity Institute) has shut down (2005–2024)
gwern
Apr 17, 2024, 1:54 PM
176
points
22
comments
1
min read
LW
link
(www.futureofhumanityinstitute.org)
Reconsider the anti-cavity bacteria if you are Asian
Lao Mein
Apr 15, 2024, 7:02 AM
170
points
43
comments
4
min read
LW
link
Ironing Out the Squiggles
Zack_M_Davis
Apr 29, 2024, 4:13 PM
157
points
36
comments
11
min read
LW
link
Priors and Prejudice
MathiasKB
Apr 22, 2024, 3:00 PM
151
points
31
comments
7
min read
LW
link
Daniel Dennett has died (1942-2024)
kave
Apr 19, 2024, 4:17 PM
150
points
5
comments
1
min read
LW
link
(dailynous.com)
LLMs for Alignment Research: a safety priority?
abramdemski
Apr 4, 2024, 8:03 PM
145
points
24
comments
11
min read
LW
link
When is a mind me?
Rob Bensinger
Apr 17, 2024, 5:56 AM
144
points
130
comments
15
min read
LW
link
My experience using financial commitments to overcome akrasia
William Howard
Apr 15, 2024, 10:57 PM
137
points
33
comments
18
min read
LW
link
A Dozen Ways to Get More Dakka
Davidmanheim
Apr 8, 2024, 4:45 AM
134
points
11
comments
3
min read
LW
link
Simple probes can catch sleeper agents
Monte M
,
Carson Denison
,
Zac Hatfield-Dodds
,
David Duvenaud
,
Sam Bowman
,
Ethan Perez
and
evhub
Apr 23, 2024, 9:10 PM
133
points
21
comments
1
min read
LW
link
(www.anthropic.com)
RTFB: On the New Proposed CAIP AI Bill
Zvi
Apr 10, 2024, 6:30 PM
119
points
14
comments
34
min read
LW
link
(thezvi.wordpress.com)
Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Sam Marks
Apr 18, 2024, 4:17 PM
113
points
10
comments
12
min read
LW
link
A Selection of Randomly Selected SAE Features
CallumMcDougall
and
Joseph Bloom
Apr 1, 2024, 9:09 AM
109
points
2
comments
4
min read
LW
link
[Question]
What convincing warning shot could help prevent extinction from AI?
Charbel-Raphaël
and
cozyfractal
Apr 13, 2024, 6:09 PM
106
points
22
comments
2
min read
LW
link
The first future and the best future
KatjaGrace
Apr 25, 2024, 6:40 AM
106
points
12
comments
1
min read
LW
link
(worldspiritsockpuppet.com)
Carl Sagan, nuking the moon, and not nuking the moon
eukaryote
Apr 13, 2024, 4:08 AM
104
points
8
comments
6
min read
LW
link
(eukaryotewritesblog.com)
Sparsify: A mechanistic interpretability research agenda
Lee Sharkey
Apr 3, 2024, 12:34 PM
96
points
23
comments
22
min read
LW
link
MIRI’s April 2024 Newsletter
Harlan
Apr 12, 2024, 11:38 PM
95
points
0
comments
3
min read
LW
link
(intelligence.org)
Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry
Apr 29, 2024, 8:57 PM
93
points
8
comments
11
min read
LW
link
Partial value takeover without world takeover
KatjaGrace
Apr 5, 2024, 6:20 AM
89
points
23
comments
3
min read
LW
link
(worldspiritsockpuppet.com)
Rejecting Television
Declan Molony
Apr 23, 2024, 4:59 AM
89
points
10
comments
6
min read
LW
link
Constructability: Plainly-coded AGIs may be feasible in the near future
Épiphanie Gédéon
and
Charbel-Raphaël
Apr 27, 2024, 4:04 PM
85
points
13
comments
13
min read
LW
link
Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes
owencb
and
AI Impacts
Apr 16, 2024, 10:10 AM
82
points
12
comments
8
min read
LW
link
(blog.aiimpacts.org)
[Full Post] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda
,
Arthur Conmy
,
lewis smith
,
Senthooran Rajamanoharan
,
Tom Lieberum
,
János Kramár
and
Vikrant Varma
Apr 19, 2024, 7:06 PM
79
points
10
comments
8
min read
LW
link
A couple productivity tips for overthinkers
Steven Byrnes
Apr 20, 2024, 4:05 PM
78
points
13
comments
4
min read
LW
link
Best in Class Life Improvement
sapphire
Apr 4, 2024, 1:51 AM
78
points
20
comments
1
min read
LW
link
Coherence of Caches and Agents
johnswentworth
Apr 1, 2024, 11:04 PM
77
points
9
comments
11
min read
LW
link
Creating unrestricted AI Agents with Command R+
Simon Lermen
Apr 16, 2024, 2:52 PM
77
points
13
comments
5
min read
LW
link
Mid-conditional love
KatjaGrace
Apr 17, 2024, 4:00 AM
76
points
21
comments
2
min read
LW
link
(worldspiritsockpuppet.com)
AISC9 has ended and there will be an AISC10
Linda Linsefors
Apr 29, 2024, 10:53 AM
75
points
4
comments
2
min read
LW
link
Announcing Suffering For Good
Garrett Baker
Apr 1, 2024, 5:08 PM
75
points
5
comments
1
min read
LW
link
A gentle introduction to mechanistic anomaly detection
Erik Jenner
Apr 3, 2024, 11:06 PM
73
points
2
comments
11
min read
LW
link
A Gentle Introduction to Risk Frameworks Beyond Forecasting
pendingsurvival
Apr 11, 2024, 6:03 PM
73
points
10
comments
27
min read
LW
link
Prompts for Big-Picture Planning
Raemon
Apr 13, 2024, 3:04 AM
72
points
1
comment
3
min read
LW
link
[Summary] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda
,
Arthur Conmy
,
lewis smith
,
Senthooran Rajamanoharan
,
Tom Lieberum
,
János Kramár
and
Vikrant Varma
Apr 19, 2024, 7:06 PM
72
points
0
comments
3
min read
LW
link
Generalized Stat Mech: The Boltzmann Approach
David Lorell
and
johnswentworth
Apr 12, 2024, 5:47 PM
71
points
7
comments
20
min read
LW
link
LW Frontpage Experiments! (aka “Take the wheel, Shoggoth!”)
Ruby
and
RobertM
Apr 23, 2024, 3:58 AM
71
points
27
comments
5
min read
LW
link
How We Picture Bayesian Agents
johnswentworth
and
David Lorell
Apr 8, 2024, 6:12 PM
70
points
14
comments
7
min read
LW
link