Alignment Forum Archive

The case for countermeasures to memetic spread of misaligned values
Alex Mallen · May 28, 2025, 9:12 PM · 22 points · 1 comment · 7 min read

Formalizing Embeddedness Failures in Universal Artificial Intelligence
Cole Wyeth · May 26, 2025, 12:36 PM · 39 points · 0 comments · 1 min read · (arxiv.org)

Reward button alignment
Steven Byrnes · May 22, 2025, 5:36 PM · 50 points · 15 comments · 12 min read

Unexploitable search: blocking malicious use of free parameters
Jacob Pfau and Geoffrey Irving · May 21, 2025, 5:23 PM · 34 points · 16 comments · 6 min read

Modeling versus Implementation
Cole Wyeth · May 18, 2025, 1:38 PM · 27 points · 10 comments · 3 min read

Problems with instruction-following as an alignment target
Seth Herd · May 15, 2025, 3:41 PM · 48 points · 14 comments · 10 min read

Dodging systematic human errors in scalable oversight
Geoffrey Irving · May 14, 2025, 3:19 PM · 33 points · 3 comments · 4 min read

Working through a small tiling result
James Payor · May 13, 2025, 8:28 PM · 66 points · 9 comments · 5 min read

Measuring Schelling Coordination—Reflections on Subversion Strategy Eval
Graeme Ford · May 12, 2025, 7:06 PM · 5 points · 0 comments · 8 min read

Political sycophancy as a model organism of scheming
Alex Mallen and Vivek Hebbar · May 12, 2025, 5:49 PM · 39 points · 0 comments · 14 min read

AIs at the current capability level may be important for future safety work
ryan_greenblatt · May 12, 2025, 2:06 PM · 81 points · 2 comments · 4 min read

Highly Opinionated Advice on How to Write ML Papers
Neel Nanda · May 12, 2025, 1:59 AM · 58 points · 4 comments · 32 min read

Absolute Zero: Alpha Zero for LLM
alapmi · May 11, 2025, 8:42 PM · 21 points · 13 comments · 1 min read

Glass box learners want to be black box
Cole Wyeth · May 10, 2025, 11:05 AM · 46 points · 10 comments · 4 min read

Mind the Coherence Gap: Lessons from Steering Llama with Goodfire
eitan sprejer · May 9, 2025, 9:29 PM · 4 points · 1 comment · 6 min read

Slow corporations as an intuition pump for AI R&D automation
ryan_greenblatt and elifland · May 9, 2025, 2:49 PM · 91 points · 23 comments · 9 min read

Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI
Steven Byrnes · May 8, 2025, 9:11 PM · 24 points · 0 comments · 18 min read

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Buck and Julian Stastny · May 8, 2025, 7:06 PM · 75 points · 1 comment · 15 min read

An alignment safety case sketch based on debate
Marie_DB, Jacob Pfau, Benjamin Hilton and Geoffrey Irving · May 8, 2025, 3:02 PM · 55 points · 19 comments · 25 min read · (arxiv.org)

UK AISI’s Alignment Team: Research Agenda
Benjamin Hilton, Jacob Pfau, Marie_DB and Geoffrey Irving · May 7, 2025, 4:33 PM · 109 points · 2 comments · 11 min read