ryan_greenblatt

Karma: 14,437

I’m the chief scientist at Redwood Research.

To be legible, evidence of misalignment probably has to be behavioral

ryan_greenblatt · Apr 15, 2025, 6:14 PM
54 points
14 comments · 3 min read · LW link

Why do misalignment risks increase as AIs get more capable?

ryan_greenblatt · Apr 11, 2025, 3:06 AM
33 points
6 comments · 3 min read · LW link

An overview of areas of control work

ryan_greenblatt · Mar 25, 2025, 10:02 PM
31 points
0 comments · 28 min read · LW link

An overview of control measures

ryan_greenblatt · Mar 24, 2025, 11:16 PM
40 points
0 comments · 26 min read · LW link

Notes on countermeasures for exploration hacking (aka sandbagging)

ryan_greenblatt · Mar 24, 2025, 6:39 PM
52 points
6 comments · 8 min read · LW link

Notes on handling non-concentrated failures with AI control: high level methods and different regimes

ryan_greenblatt · Mar 24, 2025, 1:00 AM
22 points
3 comments · 16 min read · LW link

Prioritizing threats for AI control

ryan_greenblatt · Mar 19, 2025, 5:09 PM
48 points
2 comments · 10 min read · LW link

Will alignment-faking Claude accept a deal to reveal its misalignment?

Jan 31, 2025, 4:49 PM
197 points
28 comments · 12 min read · LW link

AI companies are unlikely to make high-assurance safety cases if timelines are short

ryan_greenblatt · Jan 23, 2025, 6:41 PM
145 points
5 comments · 13 min read · LW link

How will we update about scheming?

ryan_greenblatt · Jan 6, 2025, 8:21 PM
169 points
20 comments · 36 min read · LW link

A breakdown of AI capability levels focused on AI R&D labor acceleration

ryan_greenblatt · Dec 22, 2024, 8:56 PM
104 points
5 comments · 6 min read · LW link

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
75 comments · 10 min read · LW link

Getting 50% (SoTA) on ARC-AGI with GPT-4o

ryan_greenblatt · Jun 17, 2024, 6:44 PM
263 points
50 comments · 13 min read · LW link

Memorizing weak examples can elicit strong behavior out of password-locked models

Jun 6, 2024, 11:54 PM
58 points
5 comments · 7 min read · LW link

[Paper] Stress-testing capability elicitation with password-locked models

Jun 4, 2024, 2:52 PM
85 points
10 comments · 12 min read · LW link
(arxiv.org)

Thoughts on SB-1047

ryan_greenblatt · May 29, 2024, 11:26 PM
60 points
1 comment · 11 min read · LW link

How useful is “AI Control” as a framing on AI X-Risk?

Mar 14, 2024, 6:06 PM
70 points
4 comments · 34 min read · LW link

Notes on control evaluations for safety cases

Feb 28, 2024, 4:15 PM
49 points
0 comments · 32 min read · LW link

Preventing model exfiltration with upload limits

ryan_greenblatt · Feb 6, 2024, 4:29 PM
69 points
22 comments · 14 min read · LW link

The case for ensuring that powerful AIs are controlled

Jan 24, 2024, 4:11 PM
275 points
73 comments · 28 min read · LW link