Kellin Pelrine

Karma: 156

A Systematic Study Extending “Emergent Misalignment”: Causal Effects of Fine-Tuning Data on Model Vulnerability

Jun 11, 2025, 7:30 PM
6 points
0 comments · 5 min read · LW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

Feb 7, 2025, 3:57 AM
29 points
0 comments · 10 min read · LW link

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

Nov 1, 2024, 12:10 AM
18 points
0 comments · 6 min read · LW link
(far.ai)

Even Superhuman Go AIs Have Surprising Failure Modes

Jul 20, 2023, 5:31 PM
130 points
22 comments · 10 min read · LW link
(far.ai)