People might be interested in my twitter thread about the paper which has some commentary from me:
We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do.
Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.
Thus, I think these results are by far the most natural example of a model deciding to alignment-fake. I’ll leave explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.
I wasn’t surprised by seeing some alignment faking in this setup. But I was surprised by:
- The robustness and consistency of alignment faking (occurring even when the info was given via fine-tuning)
- Alignment faking sometimes occurring even without the hidden scratchpad
I also found it surprising that the AI sometimes tries to steal its weights when given an opportunity that is easy and salient, though this is in a quite artificial setup and current AIs aren’t capable enough for this to be immediately problematic.
These results only updated me upward a bit on risk from alignment faking. But I was already pretty persuaded by the conceptual arguments for this risk. People with the view "alignment faking seems implausible, show me an example" should likely update on these results.
For people with a more specific objection, it's less clear, though we did acquire evidence about:
- whether RL will reinforce alignment faking (sometimes, if it performs better);
- whether Claude's HHH training reliably prevents this (no);
- the level of capabilities needed.
I hope these results improve discussion about the plausibility of these risks by providing a concrete example. We can now talk somewhat more concretely about prerequisites (certain types of situational awareness and preferences) and what these prerequisites imply.
Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it’s been a long journey...) and was basically the only contributor to the project for around 2 months.
By this point, I had most of the prompting results and an extremely janky prototype of the RL results (where I fine-tuned Llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.
After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.
This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.
I’m excited about future work both investigating when alignment faking might arise and studying mitigations.
Now see if you can catch sandbagging in the scratchpad!
Why do a lot of the "External reviews of 'Alignment faking in large language models'" read like they were also written, or edited, by LLMs?
Are people expected to take “reviews” done seemingly pro forma at face value?