ryan_greenblatt (Karma: 12,430)
I’m the chief scientist at Redwood Research.
AI companies are unlikely to make high-assurance safety cases if timelines are short
ryan_greenblatt · 23 Jan 2025 18:41 UTC · 137 points · 4 comments · 13 min read · LW link
How will we update about scheming?
ryan_greenblatt · 6 Jan 2025 20:21 UTC · 149 points · 19 comments · 36 min read · LW link
A breakdown of AI capability levels focused on AI R&D labor acceleration
ryan_greenblatt · 22 Dec 2024 20:56 UTC · 102 points · 5 comments · 6 min read · LW link
Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck · 18 Dec 2024 17:19 UTC · 476 points · 68 comments · 10 min read · LW link
Getting 50% (SoTA) on ARC-AGI with GPT-4o
ryan_greenblatt · 17 Jun 2024 18:44 UTC · 262 points · 50 comments · 13 min read · LW link
Memorizing weak examples can elicit strong behavior out of password-locked models
Fabien Roger and ryan_greenblatt · 6 Jun 2024 23:54 UTC · 58 points · 5 comments · 7 min read · LW link
[Paper] Stress-testing capability elicitation with password-locked models
Fabien Roger and ryan_greenblatt · 4 Jun 2024 14:52 UTC · 85 points · 10 comments · 12 min read · LW link (arxiv.org)
Thoughts on SB-1047
ryan_greenblatt · 29 May 2024 23:26 UTC · 59 points · 1 comment · 11 min read · LW link
How useful is “AI Control” as a framing on AI X-Risk?
habryka and ryan_greenblatt · 14 Mar 2024 18:06 UTC · 70 points · 4 comments · 34 min read · LW link
Notes on control evaluations for safety cases
ryan_greenblatt, Buck and Fabien Roger · 28 Feb 2024 16:15 UTC · 49 points · 0 comments · 32 min read · LW link
Preventing model exfiltration with upload limits
ryan_greenblatt · 6 Feb 2024 16:29 UTC · 69 points · 22 comments · 14 min read · LW link
The case for ensuring that powerful AIs are controlled
ryan_greenblatt and Buck · 24 Jan 2024 16:11 UTC · 272 points · 68 comments · 28 min read · LW link
Managing catastrophic misuse without robust AIs
ryan_greenblatt and Buck · 16 Jan 2024 17:27 UTC · 63 points · 17 comments · 11 min read · LW link
Catching AIs red-handed
ryan_greenblatt and Buck · 5 Jan 2024 17:43 UTC · 106 points · 27 comments · 17 min read · LW link
Measurement tampering detection as a special case of weak-to-strong generalization
ryan_greenblatt, Fabien Roger and Buck · 23 Dec 2023 0:05 UTC · 57 points · 10 comments · 4 min read · LW link
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem
Ansh Radhakrishnan, Buck, ryan_greenblatt and Fabien Roger · 16 Dec 2023 5:49 UTC · 76 points · 4 comments · 6 min read · LW link · 1 review
AI Control: Improving Safety Despite Intentional Subversion
Buck, Fabien Roger, ryan_greenblatt and Kshitij Sachan · 13 Dec 2023 15:51 UTC · 235 points · 23 comments · 10 min read · LW link · 4 reviews
Auditing failures vs concentrated failures
ryan_greenblatt and Fabien Roger · 11 Dec 2023 2:47 UTC · 44 points · 1 comment · 7 min read · LW link · 1 review
How useful is mechanistic interpretability?
ryan_greenblatt, Neel Nanda, Buck and habryka · 1 Dec 2023 2:54 UTC · 166 points · 54 comments · 25 min read · LW link
Preventing Language Models from hiding their reasoning
Fabien Roger and ryan_greenblatt · 31 Oct 2023 14:34 UTC · 119 points · 15 comments · 12 min read · LW link · 1 review