Towards mutually assured cooperation
mikko
12 Jun 2025 15:15 UTC
5
points
0
comments
2
min read
LW
link
Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems)
LawrenceC
11 Jun 2025 19:27 UTC
149
points
2
comments
16
min read
LW
link
Religion for Rationalists
Gordon Seidoh Worley
11 Jun 2025 19:05 UTC
13
points
37
comments
4
min read
LW
link
How to think with images
Dinkar Juyal
11 Jun 2025 15:49 UTC
5
points
2
comments
15
min read
LW
link
(dinkarjuyal.github.io)
Difficulties of Eschatological policy making [Linkpost]
Noosphere89
11 Jun 2025 14:12 UTC
11
points
3
comments
3
min read
LW
link
(jack-clark.net)
Hydra
Matrice Jacobine
11 Jun 2025 14:07 UTC
24
points
0
comments
1
min read
LW
link
(philosophybear.substack.com)
SafeRLHub: An Interactive Resource for RL Safety and Interpretability
Siya and deneille
11 Jun 2025 5:47 UTC
3
points
0
comments
7
min read
LW
link
More on policy arguments and the AB problem
Sniffnoy
11 Jun 2025 4:42 UTC
10
points
0
comments
4
min read
LW
link
the void
nostalgebraist
11 Jun 2025 3:19 UTC
147
points
28
comments
1
min read
LW
link
(nostalgebraist.tumblr.com)
Mech interp is not pre-paradigmatic
Lee Sharkey
10 Jun 2025 13:39 UTC
158
points
2
comments
12
min read
LW
link
Research Without Permission
Priyanka Bharadwaj
10 Jun 2025 7:33 UTC
26
points
1
comment
3
min read
LW
link
Some Human That I Used to Know (Filk)
Gordon Seidoh Worley
10 Jun 2025 4:29 UTC
11
points
3
comments
1
min read
LW
link
A quick list of reward hacking interventions
Alex Mallen
10 Jun 2025 0:58 UTC
6
points
0
comments
2
min read
LW
link
Ghiblification for Privacy
jefftk
10 Jun 2025 0:30 UTC
67
points
30
comments
1
min read
LW
link
(www.jefftk.com)
Personal Agents: AIs as trusted advisors, caretakers, and user proxies
JWJohnston
9 Jun 2025 21:26 UTC
2
points
0
comments
2
min read
LW
link
Causation, Correlation, and Confounding: A Graphical Explainer
Tim Hua
9 Jun 2025 20:46 UTC
9
points
2
comments
9
min read
LW
link
When is it important that open-weight models aren’t released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.
ryan_greenblatt
9 Jun 2025 19:19 UTC
63
points
10
comments
9
min read
LW
link
METR’s Observations of Reward Hacking in Recent Frontier Models
Daniel Kokotajlo
9 Jun 2025 18:03 UTC
97
points
6
comments
11
min read
LW
link
(metr.org)
Expectation = intention = setpoint
jimmy
9 Jun 2025 17:33 UTC
31
points
12
comments
13
min read
LW
link
Identifying “Deception Vectors” In Models
Stephen Martin
9 Jun 2025 17:30 UTC
5
points
0
comments
1
min read
LW
link
(arxiv.org)