I think the “false positives” we care about are a special kind of really bad failure: it’s OK if the agent guesses wrong about what I want, as long as it continues to correctly treat its guess as provisional and doesn’t do anything that would be irreversibly bad if the guess is wrong. I’m optimistic that (a) a smarter agent could recognize these failures when it sees them, (b) it’s easy enough to learn a model that never makes such mistakes, and (c) we can use some combination of these techniques to actually learn a model that doesn’t make these mistakes. This might well be the diciest part of the scheme.
I don’t like “anomaly detection” as a framing for the problem we care about, because it implies some change in the underlying data-generating process, and no such change is necessary to cause a catastrophic failure.
(Sorry if I misunderstood your comment; I didn’t read it in depth.)