Preventing capability gains (e.g. situational awareness) that lead to deception
Note: I’m at the crackpot idea stage of thinking about how model editing could be useful for alignment.
One worry about deception is that the AI will likely develop a world model good enough to understand it is in a training loop before its inner values are fully aligned.
The thing is, if the model were aligned, then at some point we'd consider it useful for the model to have a world model good enough to recognize that it is a model. Well, what if you prevent the model from gaining situational awareness until it has properly embedded aligned values? In other words, you don't lobotomize the model permanently so that it can never gain situational awareness (which would be uncompetitive); you only lobotomize it until we are confident it is aligned and won't suddenly become deceptive once it gains situational awareness.
I'm imagining a scenario where situational awareness is a module in the network (or at least something you can remove without completely destroying the model), and where we have interpretability tools powerful enough to be confident that the trained model is aligned. Once you are confident this is the case, you might be in a world where you are no longer worried about situational awareness.
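To make the model-editing step a bit more concrete, here is a minimal sketch of what "ablate until verified aligned" could look like, under the strong (and currently unrealistic) assumption that interpretability work has localized situational awareness to specific submodules. The toy model, the `suspect_modules` names, and the `alignment_verified` flag are all hypothetical placeholders, not a real localization or verification method.

```python
# Sketch: zero-ablate hypothetical "situational awareness" submodules until
# we are confident the model's values are aligned, then restore them.
import torch
import torch.nn as nn


def ablate_modules(model: nn.Module, module_names: list[str]):
    """Register forward hooks that zero the outputs of the named submodules,
    removing their contribution until the hooks are removed."""
    handles = []
    for name, module in model.named_modules():
        if name in module_names:
            handles.append(
                module.register_forward_hook(lambda m, inp, out: torch.zeros_like(out))
            )
    return handles


def restore_modules(handles):
    """Remove the ablation hooks, re-enabling the submodules."""
    for h in handles:
        h.remove()


# Toy stand-in for a trained model; in the real setting this would be a large
# network and `suspect_modules` would come from interpretability tools.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
suspect_modules = ["0"]  # hypothetical submodule(s) implicated in situational awareness

handles = ablate_modules(model, suspect_modules)

alignment_verified = False  # placeholder for a verification step we don't yet have
if alignment_verified:
    restore_modules(handles)  # only re-enable the capability once values check out
```

The point of the sketch is just the ordering: the capability is suppressed by default and only restored after an (assumed) alignment check passes, rather than removed forever.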
Anyway, I expect there are issues with this, but I wanted to write it up here so I can remove it from another post I'm writing. I'd need to think about this type of stuff a lot more to add it to that post, so I'm leaving it here for now.