When there’s little incentive against classifying harmless documents, and immense cost to making a mistake in the other direction, I’d expect overclassification to be rampant in these bureaucracies.
Your analysis of the default incentives is correct. However, if there is any institution that has noticed the mounds of skulls, it is the DoD. Overclassification and classification for inappropriate reasons (explicitly enumerated in written guidance: avoiding embarrassment, covering up wrongdoing) are not allowed, and the DoD audits classified data to identify and correct overclassification.
It’s possible they’re not doing enough to fight against the natural incentive gradient toward overclassification, but they’re trying hard enough that I wouldn’t expect positive EV from disregarding all the rules.
It may be time to revisit this question. With Owain Evans et al. discovering a generalized evil vector in LLMs, and older work like [Pretraining Language Models with Human Preferences](https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences) that could use a follow-up, AI in the current paradigm seems ripe for some experimentation with parenting practices in pretraining. Perhaps something like affect markers for the text that goes in, or pretraining on children’s literature before going on to the more technically and morally complex text?
I haven’t run any experiments of my own, but this doesn’t seem obviously stupid to me.