I’m the chief scientist at Redwood Research.
To be legible, evidence of misalignment probably has to be behavioral
My distribution is very uncertain, but I’d say 25% by June 2027 and 50% by Jan 2031.
(I answer a similar question, but for a slightly higher bar of capabilities and operationalized somewhat differently here. I’ve since updated towards slightly longer timelines. You might also be interested in the timeline in AI-2027.)
Based on a wide variety of sources, some humans (e.g., Sam Altman) are much more charismatic than other humans. I think these examples are pretty definitive, though I’m not sure if you’d count them as “extraordinary”.
Are you sure that we see “vestigial reasoning” when:
We run a bunch of RL while aggressively trying to reduce CoT length (e.g., with a length penalty);
The input is in distribution with respect to the training distribution;
The RL is purely outcome based.
I’d guess this mostly doesn’t occur in this case and the examples we’re seeing are either out of distribution (like the bogus reasoning case from Anthropic) or involve RL which isn’t purely outcome based (like the example from OpenAI where they train against the monitor).
Some models (like R1) weren’t trained with a length penalty, so they learn to reason pretty excessively.
I’d guess we’d see some minorly steganographic reasoning, but in cases where lots of tokens really don’t help with reasoning, I’d guess this mostly gets eliminated.
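To make the length-penalty setup mentioned in the bullets above concrete, here’s a minimal sketch of what I mean by purely outcome-based reward with a length penalty (the penalty coefficient and token accounting are just illustrative, not from any particular training recipe):

```python
# Minimal sketch of a purely outcome-based reward with a CoT length penalty.
# The penalty coefficient and token accounting are illustrative placeholders,
# not taken from any particular training recipe.

def reward(outcome_correct: bool, cot_tokens: list[str], length_coef: float = 0.001) -> float:
    """Outcome reward minus a penalty proportional to chain-of-thought length."""
    outcome_reward = 1.0 if outcome_correct else 0.0
    return outcome_reward - length_coef * len(cot_tokens)

# Example: a correct answer reached with a 400-token chain of thought.
print(reward(True, ["tok"] * 400))  # 1.0 - 0.4 = 0.6
```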
Sure, there might be a spectrum (though I do think some cases are quite clear-cut), but I still think the distinction is useful.
(FWIW, I didn’t use them as synonyms in the post except for saying “aka sandbagging” in the title which maybe should have been “aka a strategy for sandbagging”. I thought “aka sandbagging” was sufficiently accurate for a title and saved space.)
The difference I was intending is:
The AI is intentionally given affordances by humans.
The AI gains power in a way which isn’t desired by its creators/builders (likely subversively).
Why do misalignment risks increase as AIs get more capable?
I think it would be a mistake to interpret this paper as a substantial update against large safety gains from inspecting CoT.
This paper exposes unfaithfulness in cases where the non-visible reasoning is extremely minimal such that it can easily happen within a forward pass (e.g., a simple reward hack or an easy-to-notice hint). However, a lot of my hope for CoT faithfulness comes from cases where the reasoning is sufficiently difficult and serially deep that the model would have to reason in CoT for the reasoning to occur. This could be either the model reasoning through the basics of alignment faking / training gaming / scheming (e.g., if I comply in training then I won’t be modified, which means that in deployment...) or the model reasoning through how to strike (without getting caught) given that it is misaligned.
Correspondingly, I think the biggest concern with CoT-based safety is models becoming much more capable of opaque reasoning which could be due to encoded reasoning (aka steganography), architectures which allow for opaque recurrence or similar (aka neuralese), or just much more powerful forward passes.
(See also my commentary in this tweet thread. There is some related discussion in this openphil RFP under encoded reasoning.)
(I copied this comment from here as I previously wrote this in response to Zvi’s post about this paper.)
the median narrative is probably around 2030 or 2031. (At least according to me. Eli Lifland is smarter than me and says December 2028, so idk.)
Notably, this is Eli’s forecast for “superhuman coder” which could be substantially before AIs are capable enough for takeover to be plausible.
I think Eli’s median for “AIs which dominate top human experts at virtually all cognitive tasks” is around 2031, but I’m not sure.
(Note that a median of superhuman coder by 2029 and a median of “dominates human experts” by 2031 doesn’t imply a median of 2 years between these events, because these distributions aren’t symmetric and instead have a long right tail.)
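To illustrate how right-skewed distributions break the naive subtraction, here’s a toy Monte Carlo with made-up lognormal parameters (not Eli’s actual model):

```python
# Toy Monte Carlo showing that the difference of two milestone medians needn't
# equal the median of the gap between them when the distributions are
# right-skewed. The lognormal parameters are made up, not Eli's actual model.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

years_to_coder = rng.lognormal(mean=np.log(4.0), sigma=0.6, size=n)  # median 4 years
extra_gap = rng.lognormal(mean=np.log(1.5), sigma=0.9, size=n)       # median 1.5 years
years_to_dominating = years_to_coder + extra_gap

diff_of_medians = np.median(years_to_dominating) - np.median(years_to_coder)
print(f"difference of medians: {diff_of_medians:.2f} years")      # noticeably more than 1.5 here
print(f"median of the gap:     {np.median(extra_gap):.2f} years")  # ~1.5 by construction
```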
I’m skeptical of strategies which look like “steer the paradigm away from AI agents + modern generative AI paradigm to something else which is safer”. Seems really hard to make this competitive enough and I have other hopes that seem to help a bunch while being more likely to be doable.
(This isn’t to say I expect that the powerful AI systems will necessarily be trained with the most basic extrapolation of the current paradigm, just that I think steering this ultimate paradigm to be something which is quite different and safer is very difficult.)
I only skimmed this essay and I’m probably more sympathetic to moral patienthood of current AI systems than many, but I think this exact statement is pretty clearly wrong:
Statistically speaking, if you’re an intelligent mind that came into existence in the past few years, you’re probably running on a large language model.
Among beings which speak in some human language (LLMs or humans), I think most experience moments are human.
I think OpenAI generates around 100 billion tokens per day. Let’s round up and say that reasonably smart LLMs generate or read a total of ~10 trillion tokens per day (likely an overestimate, I think). Then, let’s say that 1 token is equivalent to 1 second of time (also an overestimate; I’d guess more like 10-100 tokens per second even if I otherwise assigned similar moral weight, which I currently don’t). Then, we’re looking at 10 trillion seconds of experience moments per day. There are 8 billion humans, each awake for about 50,000 seconds per day, so 400 trillion seconds of experience moments. 400 trillion >> 10 trillion. So, probably the majority of experience moments (among beings which speak some human language) are from humans.
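Spelling that back-of-the-envelope calculation out as code (the inputs are just my rough guesses from above, not measured figures):

```python
# Back-of-the-envelope comparison of "experience moments" (in seconds) per day.
# All inputs are rough guesses from the text above, deliberately generous to LLMs.

llm_tokens_per_day = 10e12        # ~10 trillion tokens/day across reasonably smart LLMs (likely an overestimate)
seconds_per_token = 1.0           # generous; I'd guess more like 0.01-0.1 seconds per token
llm_seconds = llm_tokens_per_day * seconds_per_token  # 1e13 seconds/day

humans = 8e9                      # world population
awake_seconds_per_day = 50_000    # roughly 14 hours awake
human_seconds = humans * awake_seconds_per_day        # 4e14 seconds/day

print(f"LLM experience-seconds/day:   {llm_seconds:.0e}")    # ~1e13
print(f"Human experience-seconds/day: {human_seconds:.0e}")  # ~4e14, about 40x more
```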
A “no” to either would mean this work falls under milling behavior, and will not meaningfully contribute toward keeping humanity safe from DeepMind’s own actions.
I think it’s probably possible to greatly improve safety given a moderate budget for safety and not nearly enough buy-in for (1) and (2). (At least not enough buy-in prior to a large incident which threatens to be very costly for the organization.)
Overall, I think high quality thinking about AI safety seems quite useful even if this level of buy-in is unlikely.
(I don’t think this report should update us much about having the buy-in needed for (1)/(2), but the fact that it could be published at all in its current form is still encouraging.)
I think the best source for revenue growth is this post from Epoch. I think we really only have the last 2 years (so “last few years” is maybe overstating it), but we do have revenue projections and we have more than 1 data point per year.
I’m skeptical regarding the economic and practical implications (AGI labs’ revenue tripling and 50% faster algorithmic progress)
Notably, the trend in the last few years is that AI companies triple their revenue each year. So, the revenue tripling seems very plausible to me.
As for the 50% faster algorithmic progress, this happens using Agent-1 (probably with somewhat better post-training than the original version) in around April 2026 (1 year from now). I think the idea is that by this point, you have maybe an 8-16 hour horizon length on relatively well-contained benchmark tasks, which allows for a bunch of the coding work to be automated, including misc experiment running. (Presumably the horizon length is somewhat shorter on much messier tasks, but maybe by only like 2-4x or less.)
Note that this only speeds up overall AI progress by around 25% because AI R&D maybe only drives a bit more than half of progress (with the rest driven by scaling up compute).
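Making that arithmetic explicit (the roughly-half split between algorithmic progress and compute scaling is an assumption, as noted):

```python
# Rough arithmetic for how an AI R&D speedup translates into overall AI progress.
# The split between algorithmic progress and compute scaling is an assumption
# ("a bit more than half" from algorithms), not a measured quantity.

algo_share = 0.5      # fraction of overall progress driven by algorithmic R&D
rnd_speedup = 1.5     # algorithmic progress runs 50% faster

new_rate = algo_share * rnd_speedup + (1 - algo_share) * 1.0
print(f"Overall progress: ~{new_rate - 1:.0%} faster")  # ~25%; a bit more if algo_share > 0.5
```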
Personally, I think 50% seems somewhat high given the level of capability and the amount of integration time, but not totally crazy. (I think I’d guess more like 25%? I generally think the speedups they quote are somewhat too bullish.) I think I disagree more with the estimated current speedup of 13% (see April 2025). I’d guess more like 5% right now. If I bought that you get 13% now, I think that would update me most of the way to 50% on the later milestone.
I’ll define an “SIE” as “we can get >=5 OOMs of increase in effective training compute in <1 year without needing more hardware”.
This is as of the point of full AI R&D automation? Or as of any point?
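For a sense of scale of the SIE threshold defined above (my own arithmetic, not part of the definition):

```python
# Rough sense of scale for the SIE threshold above (my own arithmetic, not part
# of the definition): 5 OOMs of effective training compute in under a year.
import math

multiplier = 1e5   # 5 orders of magnitude
days = 365         # within one year
doubling_time_days = days * math.log(2) / math.log(multiplier)
print(f"Implied effective-compute doubling time: ~{doubling_time_days:.0f} days")  # ~22 days
```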
Sure, but note that the story “tariffs → recession → less AI investment” doesn’t particularly depend on GPU tariffs!
Of course, tariffs could have more complex effects than just reducing GPUs purchased by 32%, but this seems like a good first guess.
I’ve updated towards a bit longer based on some recent model releases and further contemplation.
I’d now say:
25th percentile: Oct 2027
50th percentile: Jan 2031