I would have shit in that alley, too
Declan Molony
Jun 18, 2024, 4:41 AM
458
points
134
comments
4
min read
LW
link
Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)
Andrew_Critch
Jun 14, 2024, 12:16 AM
357
points
38
comments
4
min read
LW
link
My AI Model Delta Compared To Yudkowsky
johnswentworth
Jun 10, 2024, 4:12 PM
280
points
103
comments
4
min read
LW
link
Getting 50% (SoTA) on ARC-AGI with GPT-4o
ryan_greenblatt
Jun 17, 2024, 6:44 PM
263
points
50
comments
13
min read
LW
link
SAE feature geometry is outside the superposition hypothesis
jake_mendel
Jun 24, 2024, 4:07 PM
228
points
17
comments
11
min read
LW
link
LLM Generality is a Timeline Crux
eggsyntax
Jun 24, 2024, 12:52 PM
218
points
119
comments
7
min read
LW
link
Response to Aschenbrenner’s “Situational Awareness”
Rob Bensinger
Jun 6, 2024, 10:57 PM
194
points
27
comments
3
min read
LW
link
My AI Model Delta Compared To Christiano
johnswentworth
Jun 12, 2024, 6:19 PM
191
points
73
comments
4
min read
LW
link
Two easy things that maybe Just Work to improve AI discourse
Bird Concept
Jun 8, 2024, 3:51 PM
190
points
35
comments
2
min read
LW
link
Humming is not a free $100 bill
Elizabeth
Jun 6, 2024, 8:10 PM
185
points
6
comments
3
min read
LW
link
(acesounderglass.com)
Boycott OpenAI
PeterMcCluskey
Jun 18, 2024, 7:52 PM
164
points
26
comments
1
min read
LW
link
(bayesianinvestor.com)
Announcing ILIAD — Theoretical AI Alignment Conference
Nora_Ammann
and
Alexander Gietelink Oldenziel
Jun 5, 2024, 9:37 AM
163
points
18
comments
2
min read
LW
link
Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data
Johannes Treutlein
and
Owain_Evans
Jun 21, 2024, 3:54 PM
163
points
13
comments
8
min read
LW
link
(arxiv.org)
Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison
and
evhub
Jun 17, 2024, 6:41 PM
161
points
22
comments
8
min read
LW
link
(arxiv.org)
Formal verification, heuristic explanations and surprise accounting
Jacob_Hilton
Jun 25, 2024, 3:40 PM
156
points
11
comments
9
min read
LW
link
(www.alignment.org)
The Incredible Fentanyl-Detecting Machine
sarahconstantin
Jun 28, 2024, 10:10 PM
156
points
26
comments
7
min read
LW
link
(sarahconstantin.substack.com)
0. CAST: Corrigibility as Singular Target
Max Harms
Jun 7, 2024, 10:29 PM
147
points
14
comments
8
min read
LW
link
Loving a world you don’t trust
Joe Carlsmith
Jun 18, 2024, 7:31 PM
135
points
13
comments
33
min read
LW
link
How it All Went Down: The Puzzle Hunt that took us way, way Less Online
A*
Jun 2, 2024, 8:01 AM
135
points
5
comments
5
min read
LW
link
Why I don’t believe in the placebo effect
transhumanist_atom_understander
Jun 10, 2024, 2:37 AM
134
points
22
comments
9
min read
LW
link
The Standard Analogy
Zack_M_Davis
Jun 3, 2024, 5:15 PM
125
points
28
comments
12
min read
LW
link
[Question]
What do coherence arguments actually prove about agentic behavior?
sunwillrise
Jun 1, 2024, 9:37 AM
123
points
39
comments
6
min read
LW
link
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Erik Jenner
Jun 4, 2024, 3:50 PM
121
points
14
comments
13
min read
LW
link
AI catastrophes and rogue deployments
Buck
Jun 3, 2024, 5:04 PM
120
points
16
comments
8
min read
LW
link
Anthropic’s Certificate of Incorporation
Zach Stein-Perlman
Jun 12, 2024, 1:00 PM
115
points
7
comments
4
min read
LW
link
The Leopold Model: Analysis and Reactions
Zvi
Jun 14, 2024, 3:10 PM
109
points
19
comments
57
min read
LW
link
(thezvi.wordpress.com)
Demystifying “Alignment” through a Comic
milanrosko
Jun 9, 2024, 8:24 AM
106
points
19
comments
1
min read
LW
link
Scaling and evaluating sparse autoencoders
leogao
Jun 6, 2024, 10:50 PM
106
points
6
comments
1
min read
LW
link
In favour of exploring nagging doubts about x-risk
owencb
Jun 25, 2024, 11:52 PM
105
points
2
comments
LW
link
The Minority Coalition
Richard_Ngo
Jun 24, 2024, 8:01 PM
103
points
9
comments
5
min read
LW
link
(www.narrativeark.xyz)
Live Theory Part 0: Taking Intelligence Seriously
Sahil
Jun 26, 2024, 9:37 PM
103
points
3
comments
8
min read
LW
link
On Dwarkesh’s Podcast with Leopold Aschenbrenner
Zvi
Jun 10, 2024, 12:40 PM
102
points
7
comments
59
min read
LW
link
(thezvi.wordpress.com)
Access to powerful AI might make computer security radically easier
Buck
Jun 8, 2024, 6:00 AM
101
points
14
comments
6
min read
LW
link
CIV: a story
Richard_Ngo
Jun 15, 2024, 10:36 PM
98
points
6
comments
9
min read
LW
link
(www.narrativeark.xyz)
Comments on Anthropic’s Scaling Monosemanticity
Robert_AIZI
Jun 3, 2024, 12:15 PM
98
points
8
comments
7
min read
LW
link
OpenAI #8: The Right to Warn
Zvi
Jun 17, 2024, 12:00 PM
97
points
8
comments
34
min read
LW
link
(thezvi.wordpress.com)
Quotes from Leopold Aschenbrenner’s Situational Awareness Paper
Zvi
Jun 7, 2024, 11:40 AM
97
points
10
comments
37
min read
LW
link
(thezvi.wordpress.com)
Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC
,
rajashree
,
Adrià Garriga-alonso
and
Jason Gross
Jun 24, 2024, 7:27 PM
96
points
4
comments
8
min read
LW
link
(arxiv.org)
On Claude 3.5 Sonnet
Zvi
Jun 24, 2024, 12:00 PM
95
points
14
comments
13
min read
LW
link
(thezvi.wordpress.com)
Ilya Sutskever created a new AGI startup
harfe
Jun 19, 2024, 5:17 PM
95
points
35
comments
1
min read
LW
link
(ssi.inc)
Towards a Less Bullshit Model of Semantics
johnswentworth
and
David Lorell
Jun 17, 2024, 3:51 PM
94
points
44
comments
21
min read
LW
link
Takeoff speeds presentation at Anthropic
Tom Davidson
Jun 4, 2024, 10:46 PM
92
points
0
comments
25
min read
LW
link
Just admit that you’ve zoned out
joec
Jun 4, 2024, 2:51 AM
91
points
22
comments
2
min read
LW
link
I’m a bit skeptical of AlphaFold 3
Oleg Trott
Jun 25, 2024, 12:04 AM
87
points
14
comments
2
min read
LW
link
Detecting Genetically Engineered Viruses With Metagenomic Sequencing
jefftk
Jun 27, 2024, 2:01 PM
87
points
10
comments
LW
link
(naobservatory.org)
[Paper] Stress-testing capability elicitation with password-locked models
Fabien Roger
and
ryan_greenblatt
Jun 4, 2024, 2:52 PM
85
points
10
comments
12
min read
LW
link
(arxiv.org)
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij
,
Felix Hofstätter
,
Ollie J
,
Sam F. Brown
and
Francis Rhys Ward
Jun 13, 2024, 10:04 AM
84
points
10
comments
2
min read
LW
link
(arxiv.org)
Actually, Power Plants May Be an AI Training Bottleneck.
Lao Mein
Jun 20, 2024, 4:41 AM
83
points
13
comments
2
min read
LW
link
AI takeoff and nuclear war
owencb
Jun 11, 2024, 7:36 PM
80
points
6
comments
11
min read
LW
link
(strangecities.substack.com)
Secondary forces of debt
KatjaGrace
Jun 27, 2024, 9:10 PM
78
points
18
comments
2
min read
LW
link
(worldspiritsockpuppet.com)