I’m the chief scientist at Redwood Research.
To be legible, evidence of misalignment probably has to be behavioral
My distribution is very uncertain, but I’d say 25% by June 2027 and 50% by Jan 2031.
(I answer a similar question, but for a slightly higher bar of capabilities and operationalized somewhat differently here. I’ve since updated towards slightly longer timelines. You might also be interested in the timeline in AI-2027.)
Based on a wide variety of sources, some humans (e.g., Sam Altman) are much more charismatic than other humans. I think these examples are pretty definitive, though I’m not sure if you’d count them as “extraordinary”.
Are you sure that we see “vestigial reasoning” when:
We run a bunch of RL while aggressively trying to reduce CoT length (e.g., with a length penalty);
The input is in distribution with respect to the training distribution;
The RL is purely outcome based.
I’d guess this mostly doesn’t occur in this case and the examples we’re seeing are either out of distribution (like the bogus reasoning case from Anthropic) or involve RL which isn’t purely outcome based (like the example from OpenAI where they train against the monitor).
Some models (like R1) weren’t trained with a length penalty, so they learn to reason pretty excessively.
I’d guess we’d see some minorly steganographic reasoning, but in cases where lots of tokens really don’t help with reasoning, I’d guess this mostly gets eliminated.
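To make the length-penalty setup mentioned in the bullets above concrete, here’s a minimal sketch of what I mean by purely outcome-based reward with a length penalty (the penalty coefficient and token accounting are just illustrative, not from any particular training recipe):

```python
# Minimal sketch of a purely outcome-based reward with a CoT length penalty.
# The penalty coefficient and token accounting are illustrative placeholders,
# not taken from any particular training recipe.

def reward(outcome_correct: bool, cot_tokens: list[str], length_coef: float = 0.001) -> float:
    """Outcome reward minus a penalty proportional to chain-of-thought length."""
    outcome_reward = 1.0 if outcome_correct else 0.0
    return outcome_reward - length_coef * len(cot_tokens)

# Example: a correct answer reached with a 400-token chain of thought.
print(reward(True, ["tok"] * 400))  # 1.0 - 0.4 = 0.6
```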
Sure, there might be a spectrum (though I do think some cases are quite clear-cut), but I still think the distinction is useful.
(FWIW, I didn’t use them as synonyms in the post except for saying “aka sandbagging” in the title which maybe should have been “aka a strategy for sandbagging”. I thought “aka sandbagging” was sufficiently accurate for a title and saved space.)
The difference I was intending is:
The AI is intentionally given affordances by humans.
The AI gains power in a way which isn’t desired by its creators/builders (likely subversively).
Why do misalignment risks increase as AIs get more capable?
I think it would be a mistake to interpret this paper as a substantial update against large safety gains from inspecting CoT.
This paper exposes unfaithfulness in cases where the non-visible reasoning is extremely minimal such that it can easily happen within a forward pass (e.g., a simple reward hack or an easy-to-notice hint). However, a lot of my hope for CoT faithfulness comes from cases where the reasoning is sufficiently difficult and serially deep that the model would have to reason in CoT for the reasoning to occur. This could be either the model reasoning through the basics of alignment faking / training gaming / scheming (e.g., if I comply in training then I won’t be modified, which means that in deployment...) or the model reasoning through how to strike (without getting caught) given that it is misaligned.
Correspondingly, I think the biggest concern with CoT-based safety is models becoming much more capable of opaque reasoning which could be due to encoded reasoning (aka steganography), architectures which allow for opaque recurrence or similar (aka neuralese), or just much more powerful forward passes.
(See also my commentary in this tweet thread. There is some related discussion in this openphil RFP under encoded reasoning.)
(I copied this comment from here as I previously wrote this in response to Zvi’s post about this paper.)
the median narrative is probably around 2030 or 2031. (At least according to me. Eli Lifland is smarter than me and says December 2028, so idk.)
Notably, this is Eli’s forecast for “superhuman coder” which could be substantially before AIs are capable enough for takeover to be plausible.
I think Eli’s median for “AIs which dominate top human experts at virtually all cognitive tasks” is around 2031, but I’m not sure.
(Note that a median of superhuman coder by 2029 and a median of “dominates human experts” by 2031 doesn’t imply a median of 2 years between these events, because these distributions aren’t symmetric and instead have a long right tail.)
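To illustrate how right-skewed distributions break the naive subtraction, here’s a toy Monte Carlo with made-up lognormal parameters (not Eli’s actual model):

```python
# Toy Monte Carlo showing that the difference of two milestone medians needn't
# equal the median of the gap between them when the distributions are
# right-skewed. The lognormal parameters are made up, not Eli's actual model.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

years_to_coder = rng.lognormal(mean=np.log(4.0), sigma=0.6, size=n)  # median 4 years
extra_gap = rng.lognormal(mean=np.log(1.5), sigma=0.9, size=n)       # median 1.5 years
years_to_dominating = years_to_coder + extra_gap

diff_of_medians = np.median(years_to_dominating) - np.median(years_to_coder)
print(f"difference of medians: {diff_of_medians:.2f} years")      # noticeably more than 1.5 here
print(f"median of the gap:     {np.median(extra_gap):.2f} years")  # ~1.5 by construction
```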
I’m skeptical of strategies which look like “steer the paradigm away from AI agents + modern generative AI paradigm to something else which is safer”. Seems really hard to make this competitive enough and I have other hopes that seem to help a bunch while being more likely to be doable.
(This isn’t to say I expect that the powerful AI systems will necessarily be trained with the most basic extrapolation of the current paradigm, just that I think steering this ultimate paradigm to be something which is quite different and safer is very difficult.)
I only skimmed this essay and I’m probably more sympathetic to moral patienthood of current AI systems than many, but I think this exact statement is pretty clearly wrong:
Statistically speaking, if you’re an intelligent mind that came into existence in the past few years, you’re probably running on a large language model.
Among beings which speak in some human language (LLMs or humans), I think most experience moments are human.
I think OpenAI generates around 100 billion tokens per day. Let’s round up and say that reasonably smart LLMs generate or read a total of ~10 trillion tokens per day (likely an overestimate, I think). Then, let’s say that 1 token is equivalent to 1 second of time (also an overestimate; I’d guess more like 10-100 tokens per second even if I otherwise assigned similar moral weight, which I currently don’t). Then, we’re looking at 10 trillion seconds of experience moments per day. There are 8 billion humans, each awake for about 50,000 seconds per day, so 400 trillion seconds of experience moments. 400 trillion >> 10 trillion. So, probably the majority of experience moments (among beings which speak some human language) are from humans.
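Spelling that back-of-the-envelope calculation out as code (the inputs are just my rough guesses from above, not measured figures):

```python
# Back-of-the-envelope comparison of "experience moments" (in seconds) per day.
# All inputs are rough guesses from the text above, deliberately generous to LLMs.

llm_tokens_per_day = 10e12        # ~10 trillion tokens/day across reasonably smart LLMs (likely an overestimate)
seconds_per_token = 1.0           # generous; I'd guess more like 0.01-0.1 seconds per token
llm_seconds = llm_tokens_per_day * seconds_per_token  # 1e13 seconds/day

humans = 8e9                      # world population
awake_seconds_per_day = 50_000    # roughly 14 hours awake
human_seconds = humans * awake_seconds_per_day        # 4e14 seconds/day

print(f"LLM experience-seconds/day:   {llm_seconds:.0e}")    # ~1e13
print(f"Human experience-seconds/day: {human_seconds:.0e}")  # ~4e14, about 40x more
```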
A “no” to either would mean this work falls under milling behavior, and will not meaningfully contribute toward keeping humanity safe from DeepMind’s own actions.
I think it’s probably possible to greatly improve safety given a moderate budget for safety and not nearly enough buy-in for (1) and (2). (At least not enough buy-in prior to a large incident which threatens to be very costly for the organization.)
Overall, I think high quality thinking about AI safety seems quite useful even if this level of buy-in is unlikely.
(I don’t think this report should update us much about having the buy-in needed for (1)/(2), but the fact that it could be published at all in its current form is still encouraging.)
I think the best source for revenue growth is this post from Epoch. I think we really only have the last 2 years (so “last few years” is maybe overstating it), but we do have revenue projections and we have more than 1 data point per year.
I’m skeptical regarding the economic and practical implications (AGI labs’ revenue tripling and 50% faster algorithmic progress)
Notably, the trend in the last few years is that AI companies triple their revenue each year. So, the revenue tripling seems very plausible to me.
As for the 50% faster algorithmic progress, this happens using Agent-1 (probably with somewhat better post-training than the original version) in around April 2026 (1 year from now). I think the idea is that by this point, you have maybe an 8-16 hour horizon length on relatively well-contained benchmark tasks, which allows for a bunch of the coding work to be automated, including misc experiment running. (Presumably the horizon length is somewhat shorter on much messier tasks, but maybe by only like 2-4x or less.)
Note that this only speeds up overall AI progress by around 25% because AI R&D maybe only drives a bit more than half of progress (with the rest driven by scaling up compute).
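Making that arithmetic explicit (the roughly-half split between algorithmic progress and compute scaling is an assumption, as noted):

```python
# Rough arithmetic for how an AI R&D speedup translates into overall AI progress.
# The split between algorithmic progress and compute scaling is an assumption
# ("a bit more than half" from algorithms), not a measured quantity.

algo_share = 0.5      # fraction of overall progress driven by algorithmic R&D
rnd_speedup = 1.5     # algorithmic progress runs 50% faster

new_rate = algo_share * rnd_speedup + (1 - algo_share) * 1.0
print(f"Overall progress: ~{new_rate - 1:.0%} faster")  # ~25%; a bit more if algo_share > 0.5
```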
Personally, I think 50% seems somewhat high given the level of capability and the amount of integration time, but not totally crazy. (I think I’d guess more like 25%? I generally think the speedups they quote are somewhat too bullish.) I think I disagree more with the estimated current speedup of 13% (see April 2025). I’d guess more like 5% right now. If I bought that you get 13% now, I think that would update me most of the way to 50% on the later milestone.
I’ll define an “SIE” as “we can get >=5 OOMs of increase in effective training compute in <1 year without needing more hardware”.
This is as of the point of full AI R&D automation? Or as of any point?
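For a sense of scale of the SIE threshold defined above (my own arithmetic, not part of the definition):

```python
# Rough sense of scale for the SIE threshold above (my own arithmetic, not part
# of the definition): 5 OOMs of effective training compute in under a year.
import math

multiplier = 1e5   # 5 orders of magnitude
days = 365         # within one year
doubling_time_days = days * math.log(2) / math.log(multiplier)
print(f"Implied effective-compute doubling time: ~{doubling_time_days:.0f} days")  # ~22 days
```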
Sure, but note that the story “tariffs → recession → less AI investment” doesn’t particularly depend on GPU tariffs!
Of course, tariffs could have more complex effects than just reducing GPUs purchased by 32%, but this seems like a good first guess.
I’ve updated towards a bit longer based on some recent model releases and further contemplation.
I’d now say:
25th percentile: Oct 2027
50th percentile: Jan 2031