The general strategy I’m describing for anomaly detection is:
1. Search for an explanation of a model behavior (like “answers questions coherently”) on the training set.
2. Given a new input, search for a sub-explanation that explains almost all of the training set behavior but doesn’t explain the behavior on the new input.
3. If you can find one, then call the behavior potentially anomalous (e.g. because on the training set coherence follows from the logical structure of the world, but on the deceptive alignment point it follows from a desire to avoid looking incoherent). I’ll sketch these steps in code below.
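To make the shape of the procedure concrete, here is a minimal Python sketch. Everything named here (`explains`, `sub_explanations`, the coverage threshold) is a hypothetical placeholder; actually defining and searching for explanations of model behavior is the open research problem, not an existing API.

```python
from typing import Any, Callable, Iterable, Sequence


def is_potentially_anomalous(
    training_inputs: Sequence[Any],
    new_input: Any,
    full_explanation: Any,
    explains: Callable[[Any, Any], bool],          # hypothetical: does this (sub-)explanation account for behavior on x?
    sub_explanations: Callable[[Any], Iterable[Any]],  # hypothetical: candidate sub-explanations of the full explanation
    coverage: float = 0.99,                        # what counts as "almost all" of the training set
) -> bool:
    """Steps 2-3 of the strategy: flag the new input if some sub-explanation
    covers almost all of the training-set behavior while failing to cover the
    behavior on the new input."""
    n = len(training_inputs)
    for sub in sub_explanations(full_explanation):
        # How much of the training set this candidate sub-explanation accounts for.
        covered = sum(explains(sub, x) for x in training_inputs)
        if covered >= coverage * n and not explains(sub, new_input):
            # Almost all training behavior is accounted for without the new
            # behavior, so the new behavior may rest on a different mechanism.
            return True
    return False
```

The brute-force loop is only there to show the decision rule; in the actual proposal the searches in steps 1 and 2 are where the work lives, which is what the NP point below is about.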
The first two steps are solving problems in NP (a proposed explanation or sub-explanation can be checked once it’s found, even if finding it is hard), so I think it’s reasonable to expect them to be easier from an alignment perspective. (We also don’t imagine solving them using the same AI techniques applied on the object level, and if we did then I would agree that we would face further deceptive alignment problems, but it still seems helpful to ask your AI to do something that is both tractable and formally verifiable.)
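One way to read the NP point is as a searcher/verifier split: an untrusted search process (possibly another AI) proposes an explanation or a distinguishing sub-explanation, and we only run the comparatively cheap check on what it proposes, rather than trusting the proposer. A minimal sketch under the same assumptions as above (the `explains` helper is still a hypothetical placeholder):

```python
from typing import Any, Callable, Sequence


def verify_distinguishing_sub_explanation(
    candidate: Any,
    training_inputs: Sequence[Any],
    new_input: Any,
    explains: Callable[[Any, Any], bool],  # same hypothetical checker as above
    coverage: float = 0.99,
) -> bool:
    """Verifier side of the NP framing: accept a proposed sub-explanation only
    if it actually explains almost all training behavior but not the behavior
    on the new input, without trusting whatever search process proposed it."""
    covered = sum(explains(candidate, x) for x in training_inputs)
    return covered >= coverage * len(training_inputs) and not explains(candidate, new_input)
```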
My sense is that you probably don’t think this kind of strategy can work, and that instead anomaly detection requires something more like training a second AI to tell us if something is weird. I’d agree that this doesn’t sound like progress.