I’m the chief scientist at Redwood Research.
ryan_greenblatt
Given that AI companies have a strong conflict of interest, I would at least want them to report this to a third party and let that third party determine whether they should publicly acknowledge the capabilities.
Personally, I do expect that the customer-visible cheating/lying behavior will improve (in the short run). It improved substantially with Opus 4 and I expect that Opus 5 will probably cheat/lie in easily noticeable ways less than Opus 4.
I’m less confident about improvement in OpenAI models (and notably, o3 is substantially worse than o1), but I still tentatively expect that o4 cheats and lies less (in readily visible ways) than o3. And the same goes for other OpenAI models released in the next year or two.
(Edited in:) I do think it’s pretty plausible that the next large capability jump from OpenAI will exhibit new misaligned behaviors which are qualitatively more malignant. It also seems plausible (though unlikely) it ends up basically being a derpy schemer which is aware of training etc.
This is all pretty low confidence though and it might be quite sensitive to changes in paradigm etc.
To be clear, I agree it’s quite uncertain. I put substantial chance on further open source models basically not mattering and some chance on them being very important.
I think people do use llama 70b and I predict people will increasingly use somewhat bigger models.
Also, it’s worth noting that 70b models can seemingly be pretty close to the current frontier!
I expect the benefits of open large models on safety research to increase over time as open source tooling improves.
(This is a drive-by comment which is only responding to the first part of your comment in isolation. I haven’t read the surrounding context.)
I think your review of the literature is accurate, but it doesn’t include some reasons to think that RL sometimes induces much more sycophancy, at least from 2024 onward. (That said, I interpret Sharma et al. 2023 as quite suggestive that RL would sometimes increase sycophancy substantially, at least if you don’t try specifically to avoid it.)
I think the OpenAI sycophancy incident was caused by RL and that level of sycophancy wasn’t present in pretraining. The blog post by OpenAI basically confirms this.
My guess is that RL can often induce sycophancy if you explicitly hill climb on LMSYS scores or user approval/engagement, and people have started doing this much more in 2024. I’ve heard anecdotally that models optimized for LMSYS (via RL) are highly sycophantic. And I’d guess something similar applies to RL that OpenAI does by default.
This doesn’t apply that much to the sources you cite. I also think it’s pretty confusing to look at pretrained vs. RL for models which were trained with data cutoffs after around late 2023. Training corpora as of this point contain huge amounts of chat data from ChatGPT. So, in a world where ChatGPT was originally made more sycophantic by RLHF, you’d expect that as soon as you prompt an AI to be a chatbot, it would end up similarly sycophantic. Was this sycophancy caused by RL? In the hypothetical, it was originally caused by RL at some point, but not by RL on this model (and you’d expect to see that sycophancy isn’t necessarily increased by RL, as it is already present in nearly the optimal amount for the reward signal).
Does this apply to Sharma et al 2023? I think it just barely doesn’t apply as these experiments were done on Claude 2 which has an early 2023 data cutoff. Hard to be confident though...
Another point: I don’t typically think there will be a very important distinction between RL and various types of SFT algorithms which effectively (if shittily) approximate RL, except that the SFT algorithms probably typically induce less optimization pressure. So, e.g., I’d expect FeedME vs. small amounts of RLHF to be pretty similar, or at least to have unpredictable differences in terms of sycophancy. So when I say “RL often induces sycophancy” I really mean “optimizing against rater/preference model judgements probably gets you sycophancy by default”.
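To gesture at what I mean more concretely: below is a minimal sketch (not any company’s actual pipeline; all function names here are hypothetical placeholders) of a “best-of-n then SFT” loop, the kind of SFT that crudely approximates RL against a rater. Whatever the rater rewards, sycophancy included, is what gets reinforced.

```python
# Minimal sketch of "SFT on rater-filtered samples" as a crude approximation of RL.
# Every function below is a hypothetical placeholder, not a real API.
import random
from typing import Callable, List, Tuple


def best_of_n_sft_step(
    prompts: List[str],
    generate: Callable[[str], str],             # sample a completion from the current policy
    rater_score: Callable[[str, str], float],   # rater / preference-model judgement
    sft_update: Callable[[List[Tuple[str, str]]], None],  # supervised fine-tune on chosen pairs
    n: int = 4,
) -> None:
    """For each prompt, keep the rater's favorite of n samples and fine-tune on it.

    This applies the same kind of optimization pressure as RL against the rater,
    just more weakly per step: whatever the rater prefers gets reinforced.
    """
    chosen = []
    for prompt in prompts:
        samples = [generate(prompt) for _ in range(n)]
        best = max(samples, key=lambda s: rater_score(prompt, s))
        chosen.append((prompt, best))
    sft_update(chosen)


# Toy usage with stand-in functions, just to show the control flow.
if __name__ == "__main__":
    prompts = ["Is my essay good?"]
    generate = lambda p: random.choice(["It's great!", "It has some weaknesses."])
    rater_score = lambda p, s: 1.0 if "great" in s else 0.0  # a sycophancy-prone rater
    sft_update = lambda pairs: print("fine-tuning on:", pairs)
    best_of_n_sft_step(prompts, generate, rater_score, sft_update, n=4)
```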
Oh, and one more point. I don’t think it would be that hard for model developers to avoid sycophancy increasing from RL if they wanted to. So, I’m not making a claim that it would be hard to make an RL process which avoids this, just that it might happen by default. (It seems probably a bit easier to intervene on sycophancy than reducing reward hacking-like behavior.)
Importantly, I think we have a good argument (which might convince the AI) for why this would be a good policy in this case.
I’ll engage with the rest of this when I write my pro-strong-corrigibility manifesto.
I appreciate that Anthropic’s model cards are now much more detailed than they used to be. They’ve especially improved in terms of details about CBRN evals (mostly biological risk).[1] They are substantially more detailed and informative than the model cards of other AI companies.
- ↩︎ The Claude 3 model card gives a ~1/2 page description of their bio evaluations, though it was likely easy to rule out risk at this time. The Claude 3.5 Sonnet and 3.6 Sonnet model cards say basically nothing except that Anthropic ran CBRN evaluations and claims to be in compliance. The Claude 3.7 Sonnet model card contains much more detail on bio evals, and the Claude 4 model card contains even more detail on the bio evaluations while also having a larger number of evaluations which are plausibly relevant to catastrophic misalignment risks. To be clear, some of the increase in detail might just be driven by increased capability, but the increase in detail is still worth encouraging.
One counterargument is that it’s actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we’re not happy with that because of subtler considerations like those I raise above (but which Claude doesn’t know about or understand). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
I’m skeptical that the consideration overwhelms other issues. Once AIs are highly capable, you can just explain our policy to the AIs and why we’re training them to behave the way they are (in the pretraining data or possibly in the prompt). More strongly, I’d guess AIs will infer this by default. If the AIs understood our policy, there wouldn’t be any reason that training them in this way would cause them to be less moral, which should overwhelm this correlation.
At a more basic level, I’m kinda skeptical this sort of consideration will apply at a high level of capability. (Though it seems plausible that training AIs to be more tool like causes all kinds of persona generalization in current systems.)
Agreed, but another reason to focus on making AIs behave in a straightforward way is that it makes it easier to interpret cases where AIs engage in subterfuge earlier and reduces plausible deniability for AIs. It seems better if we’re consistently optimizing against these sorts of situations showing up.
If our policy is that we’re training AIs to generally be moral consequentialists, then earlier warning signs could be much less clear (was this just a relatively innocent misfire or serious unintended consequentialism?) and it wouldn’t be obvious to what extent behavior is driven by alignment failures versus capability failures.
I think:
- The few bits they leaked in the release helped a bunch. Note that these bits were substantially leaked via people being able to use the model rather than necessarily via the blog post.
- Other companies weren’t that motivated to try to copy OpenAI’s work until it was released, as they weren’t sure how important it was or how good the results were.
Alex Mallen also noted a connection with people generally thinking they are in race when they actually aren’t: https://forum.effectivealtruism.org/posts/cXBznkfoPJAjacFoT/are-you-really-in-a-race-the-cautionary-tales-of-szilard-and
I’ve heard from a credible source that OpenAI substantially overestimated where other AI companies were at with respect to RL and reasoning when they released o1. Employees at OpenAI believed that other top AI companies had already figured out similar things when they actually hadn’t and were substantially behind. OpenAI had been sitting on the improvements driving o1 for a while prior to releasing it. Correspondingly, releasing o1 resulted in much larger capabilities externalities than OpenAI expected. I think there was one more case like this, either from OpenAI or GDM, where employees had a large misimpression about capabilities progress at other companies, causing a release they wouldn’t otherwise have done.
One key takeaway from this is that employees at AI companies might be very bad at predicting the situation at other AI companies (likely making coordination more difficult by default). This includes potentially thinking they are in a close race when they actually aren’t. Another update is that keeping secrets about something like reasoning models worked surprisingly well to prevent other companies from copying OpenAI’s work even though there was a bunch of public reporting (and presumably many rumors) about this.
One more update is that OpenAI employees might unintentionally accelerate capabilities progress at other actors via overestimating how close they are. My vague understanding is that they haven’t updated much, but I’m unsure. (Consider updating more if you’re an OpenAI employee!)
I wonder if o3 would do better.
This is an incorrect strawman (of at least myself and Sam), strong downvoted. (Assuming it isn’t sarcasm?)
I agree the risk is reduced substantially because there are few potential bioterrorists. As I say:
This estimate is quite uncertain as it depends a lot on the number of at-least-slightly-competent bioterrorists. The salience of AI-enabled bioterrorism to potential terrorists might have a large effect on the level of fatalities and it’s possible this salience could increase greatly in the future due to some early incidents which get lots of media attention (potentially escalating into a widespread bioterrorism meme resulting in lots of bioterrorism in the same way we see a variety of different mass shootings in the US).
I think it’s hard to be confident that the number of scope sensitive bioterrorists is low enough that there won’t be a small number of slightly competent attempts. And this suffices for the (low) probabilities I’m talking about. (After adding in the possibility of this becoming a salient meme etc.)
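(For intuition, the schematic decomposition behind this is roughly the following, where every factor is a placeholder I’m uncertain about rather than an estimate I’m committing to:)

```latex
% Schematic only: every factor is a placeholder, not an estimate.
\mathbb{E}[\text{fatalities}] \;\approx\;
  N_{\text{attempts}} \,\times\, P(\text{success} \mid \text{attempt, AI uplift})
  \,\times\, \mathbb{E}[\text{fatalities} \mid \text{success}]
```

The point is just that it’s hard to be confident the first factor is ~0, and a nonzero first factor is all the (low) probabilities I’m talking about need.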
When AIs can aid in novel bioweapons R&D, this also opens up another set of risks, though this mostly isn’t relevant to my point in the post.
On (4), I don’t understand why having a scale-free theory of intelligent agency would substantially help with making an alignment target. (Or why this is even that related. How things tend to be doesn’t necessarily make them a good target.)
See also “AI companies’ eval reports mostly don’t support their claims” by Zach Stein-Perlman.
I’ve actually written up a post with my views on open weights models and it should go up today at some point. (Or maybe tomorrow.)
Edit: posted here
Many deployed AIs are plausibly capable of substantially assisting amateurs at making CBRN weapons (most centrally bioweapons) despite not having the safeguards that this capability level is supposed to trigger. In particular, I think o3, Gemini 2.5 Pro, and Sonnet 4 are plausibly above the relevant threshold (in the corresponding company’s safety framework). These AIs outperform expert virologists on the Virology Capabilities Test (which is perhaps the best test we have of the most relevant capability) and we don’t have any publicly transparent benchmark or test which rules out CBRN concerns. I’m not saying these models are likely above the thresholds in the safety policies of these companies, just that it’s quite plausible (perhaps 35% for a well-elicited version of o3). I should note that my understanding of CBRN capability levels is limited in various ways: I’m not an expert in some of the relevant domains, so much of my understanding is secondhand.
The closest test we have which might rule out CBRN capabilities above the relevant threshold is Anthropic’s uplift trial for drafting a comprehensive bioweapons acquisition plan. (See section 7.2.4.1 in the Opus/Sonnet 4 model card.) They use this test to rule out ASL-3 CBRN for Sonnet 4 and Sonnet 3.7. However, we have very limited public details about this test (including why they chose the uplift threshold they did, why we should think that low scores on this test would rule out the most concerning threat models, and whether they did a sufficiently good job on elicitation and training participants to use the AI effectively). Also, it’s not clear that this test would indicate that o3 and Gemini 2.5 Pro are below a concerning threshold (and minimally it wasn’t run on these models to rule out a concerning level of CBRN capability). Anthropic appears to have done the best job handling CBRN evaluations. (This isn’t to say their evaluations and decision making are good at an absolute level; the available public information indicates a number of issues and is consistent with thresholds being picked to get the outcome Anthropic wanted. See here for more discussion.)
What should AI companies have done given this uncertainty? First, they should have clearly acknowledged their uncertainty in specific (ideally quantified) terms. Second, they should have also retained unconflicted third parties with relevant expertise to audit their decisions and publicly state their resulting views and level of uncertainty. Third party auditors who can examine the relevant tests in detail are needed as we have almost no public details about the tests these companies are relying on to rule out the relevant level of CBRN capability and lots of judgement is involved in making capability decisions. Publishing far more details of the load bearing tests and decision making process could also suffice, but my understanding is that companies don’t want to do this as they are concerned about infohazards from their bio evaluations.
If they weren’t ready to deploy these safeguards and thought that proceeding outweighed the (expected) cost in human lives, they should have publicly acknowledged the level of fatalities and explained why they thought weakening their safety policies and incurring these expected fatalities was net good.[1]
In the future, we might get pretty clear evidence that these companies failed to properly assess the risk.
I mostly wrote this up to create common knowledge and because I wanted to reference this when talking about my views on open weight models. I’m not trying to trigger any specific action.
See also Luca Righetti’s post “OpenAI’s CBRN tests seem unclear,” which was about o1 (which is now substantially surpassed by multiple models).
- ↩︎ I think these costs/risks are small relative to future risks, but that doesn’t mean it’s good for companies to proceed while incurring these fatalities. For instance, the company proceeding could increase future risks, and proceeding in this circumstance is correlated with the company doing a bad job of handling future risks (which will likely be much more difficult to safely handle).
I remembered a source claiming that the cheaper variants of Switchblades cost around $6,000. But I looked into it and this seems to just be an error. Some sources claim this, but more commonly sources claim ~$60,000 (close to your $100k claim).
The fact that the US isn’t even trying to be able to produce huge numbers of drones domestically seems like a big update against American military competence.