ryan_greenblatt

Karma: 12,430

I’m the chief scientist at Redwood Research.

AI companies are unlikely to make high-assurance safety cases if timelines are short

ryan_greenblatt · 23 Jan 2025 18:41 UTC
137 points
4 comments · 13 min read · LW link

How will we update about scheming?

ryan_greenblatt · 6 Jan 2025 20:21 UTC
149 points
19 comments · 36 min read · LW link

A breakdown of AI capability levels focused on AI R&D labor acceleration

ryan_greenblatt · 22 Dec 2024 20:56 UTC
102 points
5 comments · 6 min read · LW link

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
476 points
68 comments · 10 min read · LW link

Getting 50% (SoTA) on ARC-AGI with GPT-4o

ryan_greenblatt · 17 Jun 2024 18:44 UTC
262 points
50 comments · 13 min read · LW link

Memorizing weak examples can elicit strong behavior out of password-locked models

6 Jun 2024 23:54 UTC
58 points
5 comments · 7 min read · LW link

[Paper] Stress-testing capability elicitation with password-locked models

4 Jun 2024 14:52 UTC
85 points
10 comments · 12 min read · LW link
(arxiv.org)

Thoughts on SB-1047

ryan_greenblatt · 29 May 2024 23:26 UTC
59 points
1 comment · 11 min read · LW link

How useful is “AI Control” as a framing on AI X-Risk?

14 Mar 2024 18:06 UTC
70 points
4 comments · 34 min read · LW link

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
49 points
0 comments · 32 min read · LW link

Preventing model exfiltration with upload limits

ryan_greenblatt · 6 Feb 2024 16:29 UTC
69 points
22 comments · 14 min read · LW link

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
272 points
68 comments · 28 min read · LW link

Managing catastrophic misuse without robust AIs

16 Jan 2024 17:27 UTC
63 points
17 comments · 11 min read · LW link

Catching AIs red-handed

5 Jan 2024 17:43 UTC
106 points
27 comments · 17 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
57 points
10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
76 points
4 comments · 6 min read · LW link · 1 review

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
235 points
23 comments · 10 min read · LW link · 4 reviews

Auditing failures vs concentrated failures

11 Dec 2023 2:47 UTC
44 points
1 comment · 7 min read · LW link · 1 review

How useful is mechanistic interpretability?

1 Dec 2023 2:54 UTC
166 points
54 comments · 25 min read · LW link

Preventing Language Models from hiding their reasoning

31 Oct 2023 14:34 UTC
119 points
15 comments · 12 min read · LW link · 1 review