(FWIW, I didn’t use them as synonyms in the post except for saying “aka sandbagging” in the title which maybe should have been “aka a strategy for sandbagging”. I thought “aka sandbagging” was sufficiently accurate for a title and saved space.)
The difference I was intending is:
The AI is intentionally given affordances by humans.
The AI gains power in a way which isn’t desired by its creators/builders (likely subversively).
I think it would be a mistake to interpret this paper as a substantial update against large safety gains from inspecting CoT.
This paper exposes unfaithfulness in cases where the non-visible reasoning is extremely minimal such that it can easily happen within a forward pass (e.g. a simple reward hack or an easy-to-notice hint). However, a lot of my hope for CoT faithfulness comes from cases where the reasoning is sufficiently difficult and serially deep that the model would have to reason in CoT for the reasoning to occur. This could be either the model reasoning through the basics of alignment faking / training gaming / scheming (e.g., if I comply in training then I won’t be modified, which means that in deployment...) or the model reasoning through how to strike (without getting caught) given that it is misaligned.
Correspondingly, I think the biggest concern with CoT-based safety is models becoming much more capable of opaque reasoning which could be due to encoded reasoning (aka steganography), architectures which allow for opaque recurrence or similar (aka neuralese), or just much more powerful forward passes.
(See also my commentary in this tweet thread. There is some related discussion in this openphil RFP under encoded reasoning.)
(I copied this comment from here as I previously wrote this in response to Zvi’s post about this paper.)
the median narrative is probably around 2030 or 2031. (At least according to me. Eli Lifland is smarter than me and says December 2028, so idk.)
Notably, this is Eli’s forecast for “superhuman coder” which could be substantially before AIs are capable enough for takeover to be plausible.
I think Eli’s median for “AIs which dominates top human experts at virtually all cognitive tasks” is around 2031, but I’m not sure.
(Note that a median of superhuman coder by 2029 and a median of “dominates human experts” by 2031 doesn’t imply a median of 2 years between these events, because these distributions aren’t symmetric and instead have a long right tail.)
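To make that concrete, here is a small Monte Carlo sketch with made-up right-skewed distributions (the parameters are purely illustrative, not Eli’s actual forecasts); it shows that the difference between the two milestone medians need not equal the median gap between the events:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Illustrative right-skewed forecasts (made-up parameters, not Eli's actual distributions).
coder = 2025 + rng.lognormal(mean=np.log(4.0), sigma=0.6, size=n)  # "superhuman coder" date, median ~2029
gap = rng.lognormal(mean=np.log(1.5), sigma=1.0, size=n)           # additional years until "dominates human experts"
experts = coder + gap

print(np.median(coder), np.median(experts))  # the two milestone medians end up roughly 2 years apart here...
print(np.median(gap))                        # ...even though the median gap between the events is only ~1.5 years
```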
I’m skeptical of strategies which look like “steer away from the AI agents + modern generative AI paradigm to something else which is safer”. It seems really hard to make this competitive enough, and I have other hopes that seem to help a bunch while being more likely to be doable.
(This isn’t to say I expect that the powerful AI systems will necessarily be trained with the most basic extrapolation of the current paradigm, just that I think steering the ultimate paradigm toward something which is quite different and safer is very difficult.)
I only skimmed this essay and I’m probably more sympathetic to moral patienthood of current AI systems than many, but I think this exact statement is pretty clearly wrong:
Statistically speaking, if you’re an intelligent mind that came into existence in the past few years, you’re probably running on a large language model.
Among beings which speak in some human language (LLMs or humans), I think most experience moments are human.
I think OpenAI generates around 100 billion tokens per day. Let’s round up and say that reasonably smart LLMs generate or read a total of ~10 trillion tokens per day (likely an overestimate, I think). Then, let’s say that 1 token is equivalent to 1 second of time (also an overestimate; I’d guess more like 10-100 tokens per second even if I otherwise assigned similar moral weight, which I currently don’t). Then, we’re looking at 10 trillion seconds of experience moments per day. There are 8 billion humans and 50,000 seconds (while awake) each day, so 400 trillion seconds of experience moments. 400 trillion >> 10 trillion. So, probably the majority of experience moments (among beings which speak some human language) are from humans.
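As a sanity check, here is a minimal back-of-the-envelope sketch using the same (deliberately generous) estimates from above:

```python
# Back-of-the-envelope check using the (deliberately generous) estimates above.
llm_tokens_per_day = 10e12         # ~10 trillion LLM tokens generated/read per day (likely an overestimate)
seconds_per_token = 1              # treat 1 token as 1 second of experience (also an overestimate)
llm_seconds_per_day = llm_tokens_per_day * seconds_per_token  # ~1e13

humans = 8e9                       # world population
waking_seconds_per_day = 50_000    # ~14 hours awake per day
human_seconds_per_day = humans * waking_seconds_per_day       # ~4e14

print(human_seconds_per_day / llm_seconds_per_day)  # ~40, so humans dominate even on these generous assumptions
```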
A “no” to either would mean this work falls under milling behavior, and will not meaningfully contribute toward keeping humanity safe from DeepMind’s own actions.
I think it’s probably possible to greatly improve safety given a moderate budget for safety and not nearly enough buy-in for (1) and (2). (At least not enough buy-in prior to a large incident which threatens to be very costly for the organization.)
Overall, I think high quality thinking about AI safety seems quite useful even if this level of buy-in is unlikely.
(I don’t think this report should update us much about having the buy-in needed for (1)/(2), but the fact that it could be published at all in its current form is still encouraging.)
I think the best source for revenue growth is this post from Epoch. I think we only have the last 2 years really (so “last few years” is maybe overstating it), but we do have revenue projections and we have more than 1 data point per year.
I’m skeptical regarding the economic and practical implications (AGI labs’ revenue tripling and 50% faster algorithmic progress)
Notably, the trend in the last few years is that AI companies triple their revenue each year. So, the revenue tripling seems very plausible to me.
As far as the 50% faster algorithmic progress goes, this happens using Agent-1 (probably with somewhat better post-training than the original version) in around April 2026 (1 year from now). I think the idea is that by this point, you have maybe an 8-16 hour horizon length on relatively well-contained benchmark tasks, which allows for a bunch of the coding work to be automated, including misc experiment running. (Presumably the horizon length is somewhat shorter on much messier tasks, but maybe by only like 2-4x or less.)
Note that this only speeds up overall AI progress by around 25% because AI R&D maybe only drives a bit more than half of progress (with the rest driven by scaling up compute).
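For the arithmetic here, a minimal sketch of the implied weighted speedup, assuming a simple additive split between AI R&D (software) progress and compute scaling (the 55% share is just my reading of “a bit more than half”):

```python
# Simple additive model: overall progress = software (AI R&D) share + compute-scaling share.
ai_rd_share = 0.55       # "a bit more than half" of progress from AI R&D (assumed figure)
ai_rd_speedup = 1.5      # 50% faster algorithmic progress
overall_speedup = ai_rd_share * ai_rd_speedup + (1 - ai_rd_share) * 1.0
print(overall_speedup)   # ~1.28, i.e. roughly a 25-30% overall speedup
```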
Personally, I think 50% seems somewhat high given the level of capability and the amount of integration time, but not totally crazy. (I think I’d guess more like 25%? I generally think the speedups they quote are somewhat too bullish.) I think I disagree more with the estimated current speedup of 13% (see April 2025). I’d guess more like 5% right now. If I bought that you get 13% now, I think that would update me most of the way to 50% on the later milestone.
I’ll define an “SIE” as “we can get >=5 OOMs of increase in effective training compute in <1 year without needing more hardware”.
This is as of the point of full AI R&D automation? Or as of any point?
Sure, but note that the story “tariffs → recession → less AI investment” doesn’t particularly depend on GPU tariffs!
Of course, tariffs could have more complex effects than just reducing GPUs purchased by 32%, but this seems like a good first guess.
Shouldn’t a 32% increase in prices only make a modest difference to training FLOP? In particular, see the compute forecast. Between Dec 2026 and Dec 2027, compute increases by roughly an OOM and generally it looks like compute increases by a bit less than 1 OOM per year in the scenario. This implies that a 32% reduction only puts you behind by like 1-2 months.
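To spell out the arithmetic (a sketch, assuming compute grows by about 1 OOM per year as in the scenario):

```python
import math

cut = 0.32                                # 32% less compute for the same budget
ooms_lost = math.log10(1 / (1 - cut))     # ~0.17 OOM
ooms_per_year = 1.0                       # assumed trend: ~1 OOM of training compute growth per year
months_behind = 12 * ooms_lost / ooms_per_year
print(months_behind)                      # ~2 months
```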
I probably should have used a running example in this post—this just seems like a mostly unforced error.
I considered writing a conclusion, but decided not to because I wanted to spend the time on other things and I wasn’t sure what I would say that was useful and not just a pure restatement of things from earlier. This post is mostly a high level framework + list of considerations, so it doesn’t really have a small number of core points.
This post is a relatively low-effort post, as indicated by “Notes on”; possibly I should have flagged this more.
I think comments / in-person discussion are easier to understand than my blog posts, as I often try to write blog posts that have lots and lots of content which is all grouped together, but without a specific thesis. I typically have either 1 point or a small number of points in comments / in person. Also, it’s easier to write in response to something as there is an assumed level of context already, etc.
Is this an accurate summary:
3.5 substantially improved performance for your use case and 3.6 slightly improved performance.
The o-series models didn’t improve performance on your task. (And presumably 3.7 didn’t improve perf.)
So, by “recent model progress feels mostly like bullshit” I think you basically just mean “reasoning models didn’t improve performance on my application and Claude 3.5/3.6 Sonnet is still best”. Is this right?
I don’t find this state of affairs that surprising:
Without specialized scaffolding, o1 is quite a bad agent, and it seems plausible your use case is mostly blocked on this. Even with specialized scaffolding, it’s pretty marginal. (This shows up in the benchmarks AFAICT, e.g., see METR’s results.)
o3-mini is generally a worse agent than o1 (aside from being cheaper). o3 might be a decent amount better than o1, but it isn’t released.
Generally, Anthropic models are better for real-world coding and agentic tasks relative to other models, and this mostly shows up in the benchmarks. (Anthropic models tend to slightly overperform their benchmarks relative to other models I think, but they also perform quite well on coding and agentic SWE benchmarks.)
I would have guessed you’d see performance gains with 3.7 after coaxing it a bit. (My low-confidence understanding is that this model is actually better, but it is also more misaligned and reward-hacky in ways that make it less useful.)
METR has found that substantially different scaffolding is most effective for o-series models. I get the sense that they weren’t optimized for being effective multi-turn agents. At least, the o1 series wasn’t optimized for this; I think o3 may have been.
Sure, there might be a spectrum (though I do think some cases are quite clear-cut), but I still think the distinction is useful.