I’m somewhat confused, but it does seem like there are two AIs when you talk about doing automated anomaly detection for deceptive alignment. If I attempt to read your mind, I get a lot of disjoint possibilities. Some of them are:
We probably agree but you don’t quite know what I’m talking about either, or
You don’t think anomaly detection counts as “an AI,” maybe because you expect it to not involve much learning or model-building (where I would expect it to involve model-building), or
You expect anomaly detection to require cleverness, but think that cleverness will all be located in one place, so that we’re really talking about one AI reflecting on itself.
The general strategy I’m describing for anomaly detection is:
Search for an explanation of a model behavior (like “answers questions coherently”) on the training set.
Given a new input, look for a sub-explanation that explains almost all of the training set behavior and also explains the behavior on the new input.
If you can't find one, then call the behavior potentially anomalous (e.g. because on the training set coherence follows from the logical structure of the world, whereas on the input where deceptive alignment kicks in it follows from a desire to avoid looking incoherent). A toy sketch of this loop is below.
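To pin down how the three steps fit together, here is a minimal Python sketch. Every name in it (detect_anomaly, find_explanation, find_subexplanation, the Input and Explanation types) is a hypothetical placeholder introduced for illustration; actually constructing and searching over explanations is the open research problem, so the sketch only fixes the control flow, not the hard part.

```python
from typing import Callable, Iterable, Optional, TypeVar

Input = TypeVar("Input")
Explanation = TypeVar("Explanation")


def detect_anomaly(
    training_inputs: Iterable[Input],
    new_input: Input,
    # Hypothetical search routines; finding workable implementations of
    # these is the actual research problem, not something this sketch solves.
    find_explanation: Callable[[Iterable[Input]], Explanation],
    find_subexplanation: Callable[
        [Explanation, Iterable[Input], Input], Optional[Explanation]
    ],
) -> bool:
    """Return True if the model's behavior on `new_input` looks anomalous."""
    # Step 1: search for an explanation of the behavior (e.g. "answers
    # questions coherently") across the training set.
    explanation = find_explanation(training_inputs)

    # Step 2: search for a sub-explanation that still accounts for almost
    # all of the training-set behavior and also accounts for the behavior
    # on the new input.
    sub = find_subexplanation(explanation, training_inputs, new_input)

    # Step 3: if no such sub-explanation can be found, the behavior on the
    # new input doesn't arise "for the usual reasons", so flag it.
    return sub is None
```

Note that on this framing the flag is the default outcome when no explanation is found, so a search routine that fails or refuses produces a false positive rather than quietly letting a weird input through.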
The first two steps are search problems in NP: they ask for a certificate (an explanation) whose validity can be checked, so I think it's reasonable to expect them to be easier from an alignment perspective. (We also don't imagine solving them using the same AI techniques applied on the object level, and if we did then I would agree that we'd have further deceptive alignment problems; but it still seems helpful to ask your AI to do something that is both tractable and formally verifiable.)
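One way to illustrate the verifiability point, again with hypothetical names and under my own reading of it: the search for explanations can be delegated to an untrusted model, because whatever it returns has to pass a fixed check before it is allowed to certify the new input as normal.

```python
from typing import Callable, Iterable, Optional, TypeVar

Input = TypeVar("Input")
Explanation = TypeVar("Explanation")


def delegated_detect_anomaly(
    training_inputs: Iterable[Input],
    new_input: Input,
    # Untrusted search: e.g. a capable model asked to find a sub-explanation
    # covering almost all training behavior and the new behavior.
    propose: Callable[[Iterable[Input], Input], Optional[Explanation]],
    # Trusted check: the "verify a certificate" half of the NP problem,
    # which we can audit independently of how the proposal was produced.
    verify: Callable[[Explanation, Iterable[Input], Input], bool],
) -> bool:
    """Flag `new_input` unless a verified explanation certifies it as normal."""
    candidate = propose(training_inputs, new_input)
    if candidate is not None and verify(candidate, training_inputs, new_input):
        return False  # behavior certified as arising for the usual reasons
    return True  # no accepted certificate: conservatively flag as anomalous
```

Under the (strong) assumption that the verifier is sound, the untrusted proposer can only cause false positives here, by failing to find a certificate, not false negatives; that is the sense in which asking for something tractable and formally verifiable helps.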
My sense is that you probably don’t think this kind of strategy can work, and that instead anomaly detection requires something more like training a second AI to tell us if something is weird. I’d agree that this doesn’t sound like progress.