In particular, your argument that putting material into the world about LLMs potentially becoming misaligned may cause problems—I agree that that’s true, but what’s the alternative? Never talking about risks from AI? That seems like it plausibly turns out worse.
I think this depends somewhat on the threat model. How scared are you of the character instantiated by the model vs the language model itself? If you’re primarily scared that the character would misbehave, and not worried about the language model misbehaving except insofar as it reifies a malign character, then maybe making the training data not give the model any reason to expect such a character to be malign would reduce the risk to negligible, and that sure would be easier if no one had ever thought of the idea that powerful AI could be dangerous. But if you’re also worried about the language model itself misbehaving, independently of whether it predicts that its assigned character would misbehave (for instance, the classic example of turning the world into computronium that it can use to better predict the behavior of the character), then this doesn’t seem feasible to solve without talking about it. In that case, the decrease in risk of model misbehavior from publicly discussing AI risk is probably worth the increase in risk of the character misbehaving that such discussion would cause (which is probably easier to solve anyway).
I don’t understand outer vs inner alignment especially well, but I think this at least roughly tracks that distinction. If a model does a great job of instantiating a character like we told it to, and that character kills us, then the goal we gave it was catastrophic, and we failed at outer alignment. If the model, in the process of being trained on how to instantiate the character, also kills us for reasons other than that it predicts the character would do so, then the process we set up for achieving the given goal also ended up optimizing for something else undesirable, and we failed at inner alignment.
This post claims that Anthropic is embarrassingly far behind Twitter AI psychologists at skills that are possibly critical to Anthropic’s mission. This suggests to me that Anthropic should be trying to recruit from the Twitter AI psychologist circle.