More generally, I think arguments that human feedback is failing should ideally be of the form:
“Human labelers (with AI assistance) fail to notice this sort of bad behavior. Also, either this or nearby stuff can’t just be resolved with trivial and obvious countermeasures like telling human labelers to be on the look out for this bad behavior.”
More generally, I think arguments that human feedback is failing should ideally be of the form:
“Human labelers (with AI assistance) fail to notice this sort of bad behavior. Also, either this or nearby stuff can’t just be resolved with trivial and obvious countermeasures like telling human labelers to be on the look out for this bad behavior.”
See Meta-level oversight evaluation for how I think you should evaluate oversight in general.