AdamGleave
Karma: 913
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine
Feb 7, 2025, 3:57 AM
29 points, 0 comments, 10 min read
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng, Brendan Murphy, AdamGleave and Kellin Pelrine
Nov 1, 2024, 12:10 AM
18 points, 0 comments, 6 min read
(far.ai)
Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng
Jul 25, 2024, 10:00 PM
59 points, 8 comments, 2 min read
(arxiv.org)
Does robustness improve with scale?
ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng and AdamGleave
Jul 25, 2024, 8:55 PM
14 points, 0 comments, 1 min read
(far.ai)
Beyond the Board: Exploring AI Robustness Through Go
AdamGleave
Jun 19, 2024, 4:40 PM
41 points, 2 comments, 1 min read
(far.ai)
More people getting into AI safety should do a PhD
AdamGleave
Mar 14, 2024, 10:14 PM
60 points, 24 comments, 12 min read
(gleave.me)
2023 Alignment Research Updates from FAR AI
AdamGleave and EuanMcLean
Dec 4, 2023, 10:32 PM
18 points, 0 comments, 8 min read
(far.ai)
What’s new at FAR AI
AdamGleave and EuanMcLean
Dec 4, 2023, 9:18 PM
41 points, 0 comments, 5 min read
(far.ai)
Even Superhuman Go AIs Have Surprising Failure Modes
AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller and MichaelDennis
Jul 20, 2023, 5:31 PM
129 points, 22 comments, 10 min read
(far.ai)
AI Safety in a World of Vulnerable Machine Learning Systems
AdamGleave and EuanMcLean
Mar 8, 2023, 2:40 AM
70 points, 28 comments, 29 min read
(far.ai)
CIRL Corrigibility is Fragile
Rachel Freedman and AdamGleave
Dec 21, 2022, 1:40 AM
58 points, 8 comments, 12 min read
Introducing the Fund for Alignment Research (We’re Hiring!)
AdamGleave, Scott Emmons, Ethan Perez and Claudia Shi
Jul 6, 2022, 2:07 AM
62 points, 0 comments, 4 min read