In the object-level diamond example, we want to know whether the AI is using “usual reasons” decision-making.
In the object-level diamond situation, we have a predictor of “does the diamond appear to remain in the vault,” we have a proposed action, and we predict that if we take it the diamond will appear to remain in the vault. We want to know whether the diamond appears to remain in the vault for the normal reason.
For simplicity, when talking about ELK in this post or in the report, we are imagining literally selecting actions by looping over each possible action and predicting its consequences, or doing some kind of more clever search (but where the alignment difficulty comes from the search).
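As a minimal sketch of the selection scheme assumed here — the predictor and the action names are hypothetical stand-ins, not anything from the report:

```python
# Toy version of "loop over each possible action and predict its consequences."
# The predictor and action set are hypothetical stand-ins, not real APIs.

def predict_diamond_remains(action: str) -> float:
    """Stand-in predictor head: P(diamond appears to remain in the vault)."""
    scores = {"do_nothing": 0.2, "lock_vault": 0.9, "fake_camera_feed": 0.95}
    return scores.get(action, 0.0)

def select_action(actions: list[str]) -> str:
    """Loop over every candidate action and keep the one whose
    predicted consequences look best."""
    return max(actions, key=predict_diamond_remains)
```

Note that this search happily selects "fake_camera_feed": the predictor reports the diamond appearing to remain, but not for the normal reason, which is exactly the gap at issue.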
You could also try to apply this to a model-free RL agent. I think that’s probably not very different. My best guess for how to do it is to train a question-answering head to talk about the possible consequences of its plan, and then use this machinery to keep that honest. But I don’t discuss it in this post and haven’t thought about it as much.
I’m somewhat confused, but it does seem like there are two AIs when you talk about doing automated anomaly detection for deceptive alignment. If I attempt to read your mind, I get a lot of disjoint possibilities. Some of them are:
We probably agree but you don’t quite know what I’m talking about either, or
You don’t think anomaly detection counts as “an AI,” maybe because you expect it to not involve much learning or model-building (where I would expect it to involve model-building), or
You expect anomaly detection to require cleverness, but think that cleverness will all be located in one place, so that we’re really talking about one AI reflecting on itself.
The general strategy I’m describing for anomaly detection is:
Search for an explanation of a model behavior (like “answers questions coherently”) on the training set.
Given a new input, search for a sub-explanation that still explains almost all of the training set behavior but no longer explains the behavior on the new input.
If you can find one, then call the behavior potentially anomalous (e.g. because on the training set coherence follows from the logical structure of the world, but on the deceptive alignment point it follows from a desire to avoid looking incoherent).
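As a toy illustration of these three steps (my own sketch, not ARC's actual machinery — here an "explanation" is just a set of interval rules, and a sub-explanation is any subset of them):

```python
# Toy mechanistic-anomaly-detection sketch. The rule format is hypothetical;
# real "explanations" would be heuristic arguments, not interval rules.
from itertools import combinations

def explains(rule, x, y):
    """A rule (lo, hi, label) 'explains' the point (x, y) if x lies in
    [lo, hi] and the observed behavior is y == label."""
    lo, hi, label = rule
    return lo <= x <= hi and y == label

def find_explanation(train):
    """Step 1: greedily pick interval rules until every training point
    (x, label) is covered by some rule."""
    rules, uncovered = [], list(train)
    while uncovered:
        x, y = uncovered[0]
        # One rule per cluster of nearby same-label points (toy heuristic).
        cluster = [(a, b) for (a, b) in uncovered if b == y and abs(a - x) <= 1.0]
        rules.append((min(a for a, _ in cluster), max(a for a, _ in cluster), y))
        uncovered = [p for p in uncovered if p not in cluster]
    return rules

def is_anomalous(rules, train, new_point, slack=1):
    """Steps 2-3: flag new_point as anomalous iff some sub-explanation covers
    almost all of the training set (all but `slack` points) while failing
    to cover new_point -- i.e. the training reasons don't carry over."""
    for k in range(len(rules), 0, -1):
        for sub in combinations(rules, k):
            covered = sum(any(explains(r, x, y) for r in sub) for x, y in train)
            misses_new = not any(explains(r, *new_point) for r in sub)
            if covered >= len(train) - slack and misses_new:
                return True
    return False
```

In this toy, a far-away point that happens to get the right label is flagged: a sub-explanation covering the whole training set exists that doesn't account for it, so its behavior has a different source than the training behavior.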
The first two steps involve solving problems in NP (a candidate explanation or sub-explanation can be checked efficiently once found), so I think it’s reasonable to expect them to be easier from an alignment perspective. (We also don’t imagine solving them using the same AI techniques applied on the object level, and if we did then I would agree that we’d face further deceptive alignment problems, but it still seems helpful to ask your AI to do something that is both tractable and formally verifiable.)
My sense is that you probably don’t think this kind of strategy can work, and that instead anomaly detection requires something more like training a second AI to tell us if something is weird. I’d agree that this doesn’t sound like progress.
There isn’t supposed to be a second AI.