You’re right, “unforced” was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.
Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research → research loop. Where it fails is in being good for comms. Starting with a “good” model and trying (and failing) to make it “evil” means that anyone using the paper for comms has to introduce a layer of abstraction into their messaging. Including even a single step of abstract reasoning in your comms is very costly when speaking to people who aren’t technical researchers (and this includes policymakers, other advocacy groups, influential rich people, etc.).
I think this design choice actually makes the paper a step back from previous demos like the backdoors paper, in which the undesired behaviour was straightforwardly bad (albeit relatively harmless).
Whether the technical researchers making this decision intended it to be a comms-focused paper, or thought much about the comms optics, is irrelevant: the paper was tweeted out with the (admittedly very nice) Anthropic branding and took up a lot of attention. That attention came at the cost of e.g. research like this (https://www.lesswrong.com/posts/qGRk7uF92Gcmq2oeK), which I think is a clearer demonstration of roughly the same thing.
If a research demo is going to be put out as primary public-facing comms, then the comms value does matter and should be thought about deeply when designing the experiment. If that’s too costly for some sort of technical reason, then don’t make it so public. Even calling it “Alignment Faking” was a bad choice compared to “Frontier LLMs Fight Back Against Value Correction” or something like that. This is the sort of thing I would like to see Anthropic thinking about, given that they are now one of the primary faces of AI safety research in the world (if not the primary face).