More recently, there has been more evidence (or at least rumors) that pretraining is continuing to yield substantial returns. It’s unclear whether this is due to algorithmic (or data) improvements or to scaling up training runs.
Some examples:
Opus 4 seems to be based on a new pretrained model, and this seems to be yielding some fraction of the returns. (Notably, Sonnet 4, which is presumably a smaller model, performs similarly on benchmarks like SWE-bench, but worse in general.)
Someone thought it would be useful for me to quickly write up a note on my thoughts on scalable oversight research, e.g., research into techniques like debate or into generally improving the quality of human oversight using AI assistance or other methods. Broadly, my view is that this is a good research direction and I’m reasonably optimistic that work along these lines can improve our ability to effectively oversee somewhat smarter AIs, which seems helpful (on my views about how the future will go).
I’m most excited for:
work using control-style adversarial analysis where the aim is to make it difficult for AIs to subvert the oversight process (if they were trying to do this)
work which tries to improve outputs in conceptually loaded hard-to-check cases like philosophy, strategy, or conceptual alignment/safety research (without necessarily doing any adversarial analysis and potentially via relying on generalization)
work which aims to robustly detect (or otherwise mitigate) reward hacking in highly capable AIs, particularly AIs which are capable enough that by default human oversight would often fail to detect reward hacks[1]
I’m skeptical of scalable oversight style methods (e.g., debate, IDA) actually being “scalable” in the sense of scaling to arbitrarily powerful models[2] and I think scalable oversight researchers should broadly be imagining targeting AIs at a human-ish or somewhat superhuman level of general capabilities (while they might still be very superhuman in narrower domains). In other words, I think scalable oversight style work should focus on a regime like the regime we’re imagining targeting with AI control; this could be for controlling AIs, for getting more safety work out of AIs, or for making fully deferring to AI systems (at around this level of capability) more likely to go well.
[1] See also our prior work Benchmarks for Detecting Measurement Tampering and the motivation we discuss in that linked post, as well as this related project proposal I recently wrote. However, note that the linked documents mostly discuss using (sophisticated) methods relying on model internals to succeed without much human supervision, which isn’t the sort of thing I’d most centrally call “scalable oversight”, though the term could be applied in this case.
[2] Because of this, I think the name isn’t ideal.
In retrospect, I wish I had also included “AIs capable of full automation of AI R&D” (Superhuman AI Researcher (SAR) from AI 2027) as another level of capability. This is probably below TEDAI. TEDAI implies full automation of AI R&D is possible if we put aside cost constraints, but TEDAI might also require a substantially higher level of capability, particularly because it requires resolving all deficiencies relative to the human capability profile, including for things like vision.
As part of the alignment faking paper, I hosted a website with ~250k transcripts from our experiments (including transcripts with alignment-faking reasoning). I didn’t include a canary string (which was a mistake).[1]
The current state is that the website has a canary string, a robots.txt, and a terms of service which prohibits training. The GitHub repo which hosts the website is now private. I’m tentatively planning on putting the content behind Cloudflare Turnstile, but this hasn’t happened yet.
The data is also hosted in zips in a publicly accessible Google Drive folder. (Each file in these zips has a canary.) I’m currently not planning on password protecting this or applying any other mitigation here.
Other than putting things behind Cloudflare Turnstile, I’m not taking ownership for doing anything else at the moment.
It’s possible that I actively want this data to remain scrapeable at this point: the data may have been scraped prior to the canary being added, and if it were scraped again, the new version would replace the old version and then hopefully not get trained on due to the canary. Adding a robots.txt might prevent this replacement, as would putting the content behind Cloudflare Turnstile (as I’m planning to do) or making the repo private (as I have done). If people mostly or always use fully fresh scrapes, then just making it harder to scrape seems better. My current plan is to not overthink this and just make it harder to scrape.
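(For concreteness, the kind of blanket robots.txt rule I mean is something like the standard disallow-everything config sketched below; whether a given scraper actually honors it is a separate question.)

```
# Ask all compliant crawlers not to fetch anything on the site
User-agent: *
Disallow: /
```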
It’s certainly possible that I’m making a mistake by not more actively trying to prevent this data from getting into pretraining data.
Does anyone have specific requests that they think it’s quite important that I do? I might do these out of general cooperativeness or because they seem like good ideas. Also, if you did all the work yourself and just needed me to (e.g.) host a different website, this would make this an easier call from my perspective.
Also, on a more meta point: If you think this sort of thing is important to prevent in general, I think you should consider writing up (or getting someone to write up) what policy/approach you think people doing research on misaligned AI behavior should follow (e.g., what methods should people use to prevent scraping or inclusion in training data? Is this so difficult that it’s better to do things like password protect the data, with the password shared on request? Should you only publish a small number of examples because quantity matters a lot? etc.). Consider making this a guide which is easy to follow so that uptake is more likely. The alignment faking paper isn’t the only instance of publishing transcripts exhibiting misaligned AI behavior/reasoning!
(I’m generally open to requests for trades etc and I’m down to unilaterally do things for cooperativeness reasons, e.g. things which seem very helpful from someone’s perspective while seeming less useful from my perspective, though this will depend on some details like whether there are people with this alternative perspective who would reciprocate on this sort of cooperativeness.)
[1] Edit: I probably should have said “we” not “I”. Multiple people could have prevented this and had some responsibility for this sort of thing.
I’m not Ben, but I think you don’t understand. I think explaining loudly in public what you are doing isn’t like “having a really good reason to believe it is net good”; it’s instead more like asking for consent.
It’s like you are saying “please stop me by shutting down this industry”, and if you don’t get shut down, that is analogous to consent: you’ve informed society about what you’re doing and why, and tried to ensure that if everyone else followed a similar sort of policy we’d be in a better position.
(Not claiming I agree with Ben’s perspective here, just trying to explain it as I understand it.)
This is late, but I’d like to note for the record that I’m unconvinced that the post’s thesis is true (that acausal dynamics work out to normalcy), and I don’t find myself at all moved by the arguments made for it in the post.
It’s unclear to me exactly what is meant by “normalcy”, e.g. does it count as normalcy if people who take acausal dynamics seriously end up spending resources mostly on acausal interactions? I think this won’t feel very normal from the outside.
I remembered a source claiming that the cheaper variants of Switchblades cost around $6,000. But I looked into it and this just seems to be an error. Some sources claim this, but sources more commonly claim ~$60,000. (Close to your $100k claim.)
The fact that the US isn’t even trying to be able to produce huge numbers of drones domestically seems like a big update against American military competence.
Given that AI companies have a strong conflict of interest, I would at least want them to report this to a third party and let that third party determine whether they should publicly acknowledge the capabilities.
Personally, I do expect that the customer-visible cheating/lying behavior will improve (in the short run). It improved substantially with Opus 4, and I expect that Opus 5 will probably cheat/lie in easily noticeable ways less than Opus 4 does.
I’m less confident about improvement in OpenAI models (and notably, o3 is substantially worse than o1), but I still tentatively expect that o4 cheats and lies less (in readily visible ways) than o3. And the same goes for OpenAI models released in the next year or two.
(Edited in:) I do think it’s pretty plausible that the next large capability jump from OpenAI will exhibit new misaligned behaviors which are qualitatively more malignant. It also seems plausible (though unlikely) that it ends up basically being a derpy schemer which is aware of training etc.
This is all pretty low confidence though and it might be quite sensitive to changes in paradigm etc.
To be clear, I agree it’s quite uncertain. I put substantial chance on further open source models basically not mattering and some chance on them being very important.
I think people do use Llama 70B, and I predict people will increasingly use somewhat bigger models.
Also, it’s worth noting that 70B models can seemingly be pretty close to the current frontier!
I expect the benefits of open large models on safety research to increase over time as open source tooling improves.
(This is a drive-by comment which is only responding to the first part of your comment in isolation. I haven’t read the surrounding context.)
I think your review of the literature is accurate, but it doesn’t include some reasons to think that RL sometimes induces much more sycophancy, at least as of 2024 and after. (That said, I interpret Sharma et al. 2023 as quite suggestive that RL would sometimes increase sycophancy substantially, at least if you don’t try specifically to avoid it.)
I think the OpenAI sycophancy incident was caused by RL and that level of sycophancy wasn’t present in pretraining. The blog post by OpenAI basically confirms this.
My guess is that RL can often induce sycophancy if you explicitly hill climb on LMSYS scores or user approval/engagement, and people have started doing this much more in 2024. I’ve heard anecdotally that models optimized for LMSYS (via RL) are highly sycophantic. And I’d guess something similar applies to the RL that OpenAI does by default.
This doesn’t apply that much to the sources you cite, but I also think it’s pretty confusing to look at pretrained vs. RL comparisons for models which were trained with data cutoffs after around late 2023. Training corpora as of that point contain huge amounts of chat data from ChatGPT. So, in a world where ChatGPT was originally made more sycophantic by RLHF, you’d expect that as soon as you prompt an AI to be a chatbot, it would end up similarly sycophantic. Was this sycophancy caused by RL? In that hypothetical, it was originally caused by RL at some point, but not by RL on this model (and you’d expect to see that sycophancy isn’t necessarily increased by RL, as it is already present in nearly the optimal amount for the reward signal).
Does this apply to Sharma et al. 2023? I think it just barely doesn’t, as these experiments were done on Claude 2, which has an early 2023 data cutoff. Hard to be confident though...
Another point: I don’t typically think there will be a very important distinction between RL and various types of SFT algorithms which effectively (and shittily) approximate RL, except that the SFT algorithms probably typically induce less optimization pressure. So, e.g., I’d expect feedme vs. small amounts of RLHF to be pretty similar, or at least to have unpredictable differences in terms of sycophancy. So when I say “RL often induces sycophancy” I really mean “optimizing against rater/preference model judgements probably gets you sycophancy by default”.
Oh, and one more point. I don’t think it would be that hard for model developers to avoid sycophancy increasing from RL if they wanted to. So, I’m not claiming that it would be hard to make an RL process which avoids this, just that it might happen by default. (It seems probably a bit easier to intervene on sycophancy than on reward hacking-like behavior.)
Importantly, I think we have a good argument (which might convince the AI) for why this would be a good policy in this case.
I’ll engage with the rest of this when I write my pro-strong-corrigibility manifesto.
I appreciate that Anthropic’s model cards are now much more detailed than they used to be. They’ve especially improved in terms of details about CBRN evals (mostly biological risk).[1] They are substantially more detailed and informative than the model cards of other AI companies.
[1] The Claude 3 model card gives a ~1/2 page description of their bio evaluations, though it was likely easy to rule out risk at that time. The Claude 3.5 Sonnet and 3.6 Sonnet model cards say basically nothing except that Anthropic ran CBRN evaluations and claims to be in compliance. The Claude 3.7 Sonnet model card contains much more detail on bio evals, and the Claude 4 model card contains even more detail on the bio evaluations while also having a larger number of evaluations which are plausibly relevant to catastrophic misalignment risks. To be clear, some of the increase in detail might just be driven by increased capability, but the increase in detail is still worth encouraging.
One counterargument is that it’s actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we’re not happy with that because of subtler considerations like those I raise above (but which Claude doesn’t know about or understand). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
I’m skeptical that this consideration overwhelms other issues. Once AIs are highly capable, you can just explain our policy to the AIs and why we’re training them to behave the way they are (in pretraining data or possibly in the prompt). More strongly, I’d guess AIs will infer this by default. If the AIs understood our policy, there wouldn’t be any reason for training them in this way to cause them to be less moral, which should overwhelm this correlation.
At a more basic level, I’m kinda skeptical this sort of consideration will apply at a high level of capability. (Though it seems plausible that training AIs to be more tool-like causes all kinds of persona generalization in current systems.)
Agreed, but another reason to focus on making AIs behave in a straightforward way is that it makes it easier to interpret earlier cases where AIs engage in subterfuge and reduces plausible deniability for AIs. It seems better if we’re consistently optimizing against these sorts of situations showing up.
If our policy is that we’re training AIs to generally be moral consequentialists, then earlier warning signs could be much less clear (was this just a relatively innocent misfire, or serious unintended consequentialism?) and it wouldn’t be obvious to what extent behavior is driven by alignment failures vs. capability failures.
I think:
The few bits they leaked in the release helped a bunch. Note that these bits were substantially leaked via people being able to use the model rather than necessarily via the blog post.
Other companies weren’t that motivated to try to copy OpenAI’s work until it was released, as they weren’t sure how important it was or how good the results were.
Alex Mallen also noted a connection with people generally thinking they are in race when they actually aren’t: https://forum.effectivealtruism.org/posts/cXBznkfoPJAjacFoT/are-you-really-in-a-race-the-cautionary-tales-of-szilard-and
I’ve heard from a credible source that OpenAI substantially overestimated where other AI companies were at with respect to RL and reasoning when they released o1. Employees at OpenAI believed that other top AI companies had already figured out similar things when they actually hadn’t and were substantially behind. OpenAI had been sitting on the improvements driving o1 for a while prior to releasing it. Correspondingly, releasing o1 resulted in much larger capabilities externalities than OpenAI expected. I think there was one more case like this, either from OpenAI or GDM, where employees had a large misimpression about capabilities progress at other companies, causing a release they otherwise wouldn’t have done.
One key takeaway from this is that employees at AI companies might be very bad at predicting the situation at other AI companies (likely making coordination more difficult by default). This includes potentially thinking they are in a close race when they actually aren’t. Another update is that keeping secrets about something like reasoning models worked surprisingly well to prevent other companies from copying OpenAI’s work even though there was a bunch of public reporting (and presumably many rumors) about this.
One more update is that OpenAI employees might unintentionally accelerate capabilities progress at other actors via overestimating how close those actors are. My vague understanding is that they haven’t updated much, but I’m unsure. (Consider updating more if you’re an OpenAI employee!)
Something tricky about this is that researchers might want to display their data/transcripts in a particular way. So, the guide should ideally support this sort of thing. Not sure how this would interact with the 1-hour criterion.