ryan_greenblatt

Karma: 12,430

I’m the chief scientist at Redwood Research.

AI companies are unlikely to make high-assurance safety cases if timelines are short

ryan_greenblatt · 23 Jan 2025 18:41 UTC
137 points
4 comments · 13 min read · LW link

How will we update about scheming?

ryan_greenblatt · 6 Jan 2025 20:21 UTC
149 points
19 comments · 36 min read · LW link

A breakdown of AI capability levels focused on AI R&D labor acceleration

ryan_greenblatt · 22 Dec 2024 20:56 UTC
102 points
5 comments · 6 min read · LW link

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
476 points
68 comments · 10 min read · LW link

Getting 50% (SoTA) on ARC-AGI with GPT-4o

ryan_greenblatt · 17 Jun 2024 18:44 UTC
262 points
50 comments · 13 min read · LW link

Memorizing weak examples can elicit strong behavior out of password-locked models

6 Jun 2024 23:54 UTC
58 points
5 comments · 7 min read · LW link

[Paper] Stress-testing capability elicitation with password-locked models

4 Jun 2024 14:52 UTC
85 points
10 comments · 12 min read · LW link
(arxiv.org)

Thoughts on SB-1047

ryan_greenblatt · 29 May 2024 23:26 UTC
59 points
1 comment · 11 min read · LW link

How useful is “AI Control” as a framing on AI X-Risk?

14 Mar 2024 18:06 UTC
70 points
4 comments · 34 min read · LW link

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
49 points
0 comments · 32 min read · LW link

Preventing model exfiltration with upload limits

ryan_greenblatt · 6 Feb 2024 16:29 UTC
69 points
22 comments · 14 min read · LW link

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
272 points
68 comments · 28 min read · LW link

Managing catastrophic misuse without robust AIs

16 Jan 2024 17:27 UTC
63 points
17 comments · 11 min read · LW link

Catching AIs red-handed

5 Jan 2024 17:43 UTC
106 points
27 comments · 17 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
57 points
10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
76 points
4 comments · 6 min read · LW link · 1 review

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
235 points
23 comments · 10 min read · LW link · 4 reviews

Auditing failures vs concentrated failures

11 Dec 2023 2:47 UTC
44 points
1 comment · 7 min read · LW link · 1 review

How useful is mechanistic interpretability?

1 Dec 2023 2:54 UTC
166 points
54 comments · 25 min read · LW link

Preventing Language Models from hiding their reasoning

31 Oct 2023 14:34 UTC
119 points
15 comments · 12 min read · LW link · 1 review