While noting down a highly lukewarm hot take about ELK, I thought of a plan for a “heist”:
Create a copy of your diamond, then forge evidence both of swapping my forgery for the diamond in your vault and of you covering up that swap. Use PR to damage your reputation and convince the public that I in fact hold the real diamond. Then sell my new original for fat stacks of cash. This could make a fun heist movie, one where the audience is left with room for doubt about whether the filmed heist was staged or actually happened.
Anyhow, I feel like there’s something fishy about the shift in meta-level that happens when the focus of this post moves from the diamond staying put for the usual reasons to the AI making a decision for the usual reasons. In the object-level diamond example, we want to know that the AI is using “usual reasons” type decision-making. If we need a second AI to tell us that the first AI is using “usual reasons” type reasoning, we might need a third AI to tell us whether the second AI is using “usual reasons” type reasoning when inspecting the first AI, or whether it might be tricked by an edge case. If we don’t need the third AI, then it feels like we should already have a good enough grasp of how to construct reasoning systems that we shouldn’t need the second AI either.
In the object-level diamond example, we want to know that the AI is using “usual reasons” type decision-making.
There isn’t supposed to be a second AI.
In the object-level diamond situation, we have a predictor of “does the diamond appear to remain in the vault”; we have a proposed action, and we predict that if we take it the diamond will appear to remain in the vault; and we want to know whether the diamond appears to remain in the vault for the normal reason.
For simplicity, when talking about ELK in this post or in the report, we are imagining literally selecting actions by looping over each possible action and predicting its consequences, or doing some kind of more clever search (but where the alignment difficulty comes from the search).
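To make that setup concrete, here is a minimal sketch of the “loop over actions and predict consequences” picture. Every name in it (PredictedOutcome, choose_action, diamond_appears_safe) is hypothetical and purely illustrative; nothing here comes from the post or the report.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical stand-ins for illustration only: an "outcome" just records
# whether the diamond *appears* to remain in the vault.

@dataclass
class PredictedOutcome:
    diamond_appears_safe: bool

def choose_action(
    predict: Callable[[str], PredictedOutcome],
    candidate_actions: Iterable[str],
) -> Optional[Tuple[str, PredictedOutcome]]:
    """Loop over candidate actions and keep one whose predicted consequence
    is 'the diamond appears to remain in the vault'."""
    for action in candidate_actions:
        outcome = predict(action)
        if outcome.diamond_appears_safe:
            return action, outcome
    return None

# ELK asks the further question this loop can't answer on its own:
# when diamond_appears_safe is predicted, is that because the diamond really
# stays put, or because the sensors are predicted to be fooled?
```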
You could also try to apply this to a model-free RL agent. I think that’s probably not very different. My best guess for how to do it is to train a question-answering head to talk about the possible consequences of the agent’s plan, and then use this same machinery to keep that head honest. But I don’t discuss it in this post and haven’t thought about it as much.
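For the model-free case, one way to picture “a question-answering head” is the sketch below: a shared trunk feeding both a policy head and a head that answers questions about consequences. The class, layer sizes, and names are all made up for illustration, and the sketch only shows the architecture, not the machinery that would keep the head honest.

```python
import torch
import torch.nn as nn

class AgentWithQAHead(nn.Module):
    """Hypothetical model-free agent with an extra question-answering head."""

    def __init__(self, obs_dim: int, n_actions: int, question_dim: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Policy head: picks actions directly (model-free, no explicit world model).
        self.policy_head = nn.Linear(hidden, n_actions)
        # QA head: answers yes/no questions about the consequences of the plan.
        self.qa_head = nn.Sequential(
            nn.Linear(hidden + question_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs: torch.Tensor, question: torch.Tensor):
        h = self.trunk(obs)
        action_logits = self.policy_head(h)
        answer_logit = self.qa_head(torch.cat([h, question], dim=-1))
        return action_logits, answer_logit
```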
I’m somewhat confused, but it does seem like there are two AIs when you talk about doing automated anomaly detection for deceptive alignment. If I attempt to read your mind, I get a lot of disjoint possibilities. Some of them are:
We probably agree but you don’t quite know what I’m talking about either, or
You don’t think anomaly detection counts as “an AI,” maybe because you expect it to not involve much learning or model-building (where I would expect it to involve model-building), or
You expect anomaly detection to require cleverness, but think that cleverness will all be located in one place, so that we’re really talking about one AI reflecting on itself.
The general strategy I’m describing for anomaly detection is:
Search for an explanation of a model behavior (like “answers questions coherently”) on the training set.
Given a new input, look for a sub-explanation that explains almost all of the training set behavior and also explains the behavior on the new input.
If you can’t find one, then call the behavior potentially anomalous (e.g. because on the training set coherence follows from the logical structure of the world, but on the deceptive alignment point it follows from a desire to avoid looking incoherent). A rough code sketch of this check is below.
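Here is that sketch. Everything in it is hypothetical: the representation of an “explanation” as a set of pieces, and the predicate explains(expl, x), are placeholders for whatever formal notion of (heuristic) explanation ends up being used, not ARC’s actual machinery.

```python
from typing import Callable, Iterable, Sequence

# Step 1 (finding full_explanation on the training set) is assumed to have been
# done already by some search process; this sketch only covers steps 2-3.

Explanation = frozenset  # a set of "explanation pieces"; sub-explanations are subsets

def coverage(expl: Explanation, explains: Callable, train_inputs: Sequence) -> float:
    """Fraction of the training-set behavior accounted for by expl."""
    return sum(explains(expl, x) for x in train_inputs) / len(train_inputs)

def is_anomalous(
    full_explanation: Explanation,
    candidate_subexplanations: Iterable[Explanation],
    explains: Callable[[Explanation, object], bool],
    train_inputs: Sequence,
    new_input: object,
    threshold: float = 0.99,
) -> bool:
    """Steps 2-3: look for a sub-explanation that covers almost all of the
    training behavior *and* the behavior on the new input; flag if none exists."""
    for sub in candidate_subexplanations:
        if (sub <= full_explanation
                and coverage(sub, explains, train_inputs) >= threshold
                and explains(sub, new_input)):
            return False  # the behavior happens "for the normal reason"
    return True  # potentially anomalous (e.g. the deceptive-alignment case)
```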
The first two steps are solving problems in NP, so I think it’s reasonable to expect them to be easier from an alignment perspective. (We also don’t imagine solving them using the same AI techniques applied on the object level; if we did, then I would agree that we have further deceptive alignment problems. Still, it seems helpful to ask your AI to do something that is both tractable and formally verifiable.)
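Spelled out, the pattern the NP point relies on is propose-and-verify: the expensive search can be untrusted because we only act on proposals that pass a cheap, trusted check. The sketch below is again purely illustrative; propose and verify are hypothetical stand-ins, with verify playing the role of the coverage-plus-explains check above.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def checked_search(propose: Callable[[], Optional[T]], verify: Callable[[T], bool]) -> Optional[T]:
    """Accept a certificate only if the trusted verifier accepts it. A failed or
    dishonest search just means no certificate, which downstream is treated
    conservatively (the behavior gets flagged as potentially anomalous)."""
    candidate = propose()  # untrusted, potentially very clever search (the in-NP part)
    if candidate is not None and verify(candidate):  # trusted, cheap verification
        return candidate
    return None
```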
My sense is that you probably don’t think this kind of strategy can work, and that instead anomaly detection requires something more like training a second AI to tell us if something is weird. I’d agree that this doesn’t sound like progress.