Let me first say I dislike the conflict-theoretic view presented in the “censorship bad” paragraph. On the short list of social media sites I visit daily, moderation creates a genuinely better experience. Automated censorship will become an increasingly important force for good as generative models become more widespread.
Secondly, there is a danger of AI safety becoming less robust—or even optimising for deceptive alignment—in models using front-end censorship.[3]
This one is interesting, but only in the counterfactual: “if AI ethics technical research focused on actual value alignment of models as opposed to front-end censorship, this would have higher-order positive effects for AI x-safety”. But it doesn’t directly hurt AI x-safety research right now: we already work under the assumption that output filtering is not a solution for x-risk.
It is clear that improved technical research norms in AI non-x-risk safety can have positive effects on AI x-safety. If we could train a language model to robustly align to any set of human-defined values at all, that would be an improvement over the current situation.
But there are other factors to consider. Is “making the model inherently non-racist” a better proxy for alignment than some other technical problem? Could interacting with that community weaken the epistemic norms in AI x-safety?
Calling content censorship “AI safety” (or even “bias reduction”) severely damages the reputation of actual, existential AI safety advocates.
I would need to significantly update my prior if this turns out to be a very important concern. Who are the people, whose opinions will be relevant at some point, who understand what both AI non-x-safety and AI x-safety are about, dislike the former, are sympathetic to the latter, yet conflate them?
I don’t know why it sent only the first sentence; I was drafting a comment on this. I wanted to delete it but I don’t know how. EDIT: wrote the full comment now.