Edited for clarity based on some feedback, without changing the core points
To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the “Alignment Faking in Large Language Models” paper contained a very large unforced error: namely, you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness.
This made the optics of the paper much more confusing than it needed to be, leading to lots of people calling it “good news”. I assume part of this was the lack of desire to do an entire new constitutional AI/RLAIF run on a model, since I also assume that would take a lot of compute. But if you’re going to be the “lab which takes safety seriously” you have to, well, take it seriously!
The bigger issue at hand is that Anthropic’s comms on AI safety/risk are all over the place. This makes sense since Anthropic is a company with many different individuals with different views, but that doesn’t mean it’s not a bad thing. “Machines of Loving Grace” explicitly argues for the US government to attempt to create a global hegemony via AI. This is a really really really bad thing to say and is possibly worse than anything DeepMind or OpenAI have ever said. Race dynamics are deeply hyperstitious, this isn’t difficult.
If you are in an arms race, and you don’t want to be in one, you should at least say this publicly. You should not learn to love the race.
A second problem: it seems like at least some Anthropic people are doing the very-not-rational thing of updating slowly, achingly, bit by bit, towards the view of “Oh shit, all the dangers are real and we are fucked” when they should just update all the way right now.
Example 1: Dario recently said something to the effect of “if there’s no serious regulation by the end of 2025, I’ll be worried”. Well there’s not going to be serious regulation by the end of 2025 by default and it doesn’t seem like Anthropic are doing much to change this (that may be false, but I’ve not heard anything to the contrary).
Example 2: When the first ten AI-risk test-case demos go roughly the way all the doomers expected and none of the mitigations work robustly, you should probably update to believe the next ten demos will be the same.
Final problem: the actual interpretability/alignment/safety research. It’s very impressive technically, and overall it might make Anthropic slightly net-positive compared to a world in which we just had DeepMind and OpenAI. But it doesn’t feel like Anthropic is actually taking responsibility for the end-to-end AI-future-is-good pipeline. In fact the “Anthropic eats marginal probability” diagram (https://threadreaderapp.com/thread/1666482929772666880.html) seems to say the opposite.
This is a problem since Anthropic has far more money and resources than basically anyone else who is claiming to be seriously trying (with the exception of DeepMind, though those resources are somewhat controlled by Google and not really at the discretion of any particular safety-conscious individual) to do AI alignment.
It generally feels more like Anthropic is attempting to discharge its responsibility to “be a safety-focused company”, or at worst to just safetywash its capabilities research. I have heard generally positive things about Anthropic employees’ views on AI risk, so I can’t speak against the intentions of those who work there; this is just how the system appears to be acting from the outside.
It’s possible this was a mistake and we should have more aggressively tried to explore versions of the setting where the AI starts off more “evil”, but I don’t think it was unforced. We thought about this a bunch and considered if there were worthwhile things here.
Edit: regardless, I don’t think this example is plausibly a microcosm of a bigger issue, as this choice was mostly made by individual researchers without much top-down influence. (Unless your claim is that there should have been more top-down influence.)
You’re right, “unforced” was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.
Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research → research loop. Where it fails is in being good for comms. Starting with a “good” model and trying (and failing) to make it “evil” means that anyone using the paper for comms has to introduce a layer of abstraction into their comms. Including a single step of abstract reasoning in your comms is very costly when speaking to people who aren’t technical researchers (and this includes policy makers, other advocacy groups, influential rich people, etc.).
I think the design of this paper is actually a step back from previous demos like the backdoors paper, in which the undesired behaviour was straightforwardly bad behaviour (albeit relatively harmless).
Whether the technical researchers making this decision were intending for this to be a comms-focused paper, or thinking about the comms optics much, is irrelevant: the paper was tweeted out with the (admittedly very nice) Anthropic branding, and took up a lot of attention. This attention was at the cost of e.g. research like this (https://www.lesswrong.com/posts/qGRk7uF92Gcmq2oeK) which I think is a clearer demonstration of roughly the same thing.
If a research demo is going to be put out as primary public-facing comms, then the comms value does matter and should be thought about deeply when designing the experiment. If that’s too costly for some technical reason, then don’t make it so public. Even calling it “Alignment Faking” was a bad choice compared to something like “Frontier LLMs Fight Back Against Value Correction”. This is the sort of thing I would like to see Anthropic thinking about, given that they are now one of the primary faces of AI safety research in the world (if not the primary face).
FWIW re: the Dario 2025 comment, Anthropic very recently posted a few job openings for recruiters focused on policy and comms specifically, which I assume is a leading indicator for hiring. One plausible rationale there is that someone on the executive team smashed the “we need more people working on this, make it happen” button.