RSS

AdamGleave

Karma: 913

Illu­sory Safety: Redteam­ing Deep­Seek R1 and the Strongest Fine-Tun­able Models of OpenAI, An­thropic, and Google

Feb 7, 2025, 3:57 AM
29 points
0 comments10 min readLW link

GPT-4o Guardrails Gone: Data Poi­son­ing & Jailbreak-Tuning

Nov 1, 2024, 12:10 AM
18 points
0 comments6 min readLW link
(far.ai)

Pac­ing Out­side the Box: RNNs Learn to Plan in Sokoban

Jul 25, 2024, 10:00 PM
59 points
8 comments2 min readLW link
(arxiv.org)

Does ro­bust­ness im­prove with scale?

Jul 25, 2024, 8:55 PM
14 points
0 comments1 min readLW link
(far.ai)

Beyond the Board: Ex­plor­ing AI Ro­bust­ness Through Go

AdamGleaveJun 19, 2024, 4:40 PM
41 points
2 comments1 min readLW link
(far.ai)

More peo­ple get­ting into AI safety should do a PhD

AdamGleaveMar 14, 2024, 10:14 PM
60 points
24 comments12 min readLW link
(gleave.me)

2023 Align­ment Re­search Up­dates from FAR AI

Dec 4, 2023, 10:32 PM
18 points
0 comments8 min readLW link
(far.ai)

What’s new at FAR AI

Dec 4, 2023, 9:18 PM
41 points
0 comments5 min readLW link
(far.ai)

Even Su­per­hu­man Go AIs Have Sur­pris­ing Failure Modes

Jul 20, 2023, 5:31 PM
129 points
22 comments10 min readLW link
(far.ai)