As in, maybe we can just fake the deployment environment and test the AI's behavior in this case?
Quick dialogue:
Bob: Can’t we just simulate what the deployment environment looks like upfront and see if the AI does anything bad?
Alice: Well, for one, that might be quite hard to do realistically if your AI is quite smart and is trying to figure out whether it is being tested. You can, of course, also review some of the AI's actual behavior in deployment on an ongoing basis. (Putting privacy and logistical concerns aside.)
Bob: Yeah, why doesn't that solve our problem?
Alice: Well, the AI can also realize you might do this, so it can try to pick the strategy which has the greatest chance of achieving its misaligned aims. In this case, it could either take misaligned actions suddenly or rarely, so that you don't spot them before it is too late, or it could frequently take subtly bad actions that help it achieve its aims.
Bob: Ok, but we can probably make this pretty hard for the AI to pull off, right? It sounds kinda hard for the AI to get away with something.
Alice: Maybe; that is the hope of the AI control agenda. We can generalize these concerns to: “suppose the AI were an adversary trying to subvert our safety measures (i.e. an insider threat), would that cause problems?” I think it's pretty unclear how easy it would be to ensure control, and I currently expect at least some aspects to be reasonably hard. In particular, I expect it will at the very least require substantial effort to evaluate that AIs capable of automating research are controlled. I also expect control to be basically impossible for very superhuman models[1].
That said, it does seem pretty plausible that if we did align smart models, those models could control somewhat smarter models.
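To make the control picture from the dialogue a bit more concrete, here is a minimal toy sketch of one kind of setup it gestures at: an untrusted model proposes actions, a weaker trusted monitor scores how suspicious each action looks, and a limited auditing budget is spent on the most suspicious actions plus a random sample. All names, thresholds, and numbers below are hypothetical and purely illustrative, not a real implementation of any particular control protocol.

```python
import random

# Toy sketch of a control-style monitoring loop (illustrative assumptions only).
# An "untrusted" policy proposes actions, a weaker "trusted" monitor assigns a
# suspicion score in [0, 1], and a small fraction of actions gets audited
# (e.g. by humans) before execution.

AUDIT_BUDGET = 0.02          # fraction of actions we can afford to audit at random
SUSPICION_THRESHOLD = 0.8    # monitor score above which we always audit


def untrusted_policy(task: str) -> dict:
    """Stand-in for the powerful, possibly-misaligned model."""
    return {"task": task, "action": f"proposed solution for {task}"}


def trusted_monitor(action: dict) -> float:
    """Stand-in for a weaker trusted model scoring suspiciousness in [0, 1]."""
    return random.random()


def run_with_control(tasks: list[str]) -> tuple[list, list]:
    executed, audited = [], []
    for task in tasks:
        action = untrusted_policy(task)
        suspicion = trusted_monitor(action)
        # Audit anything clearly suspicious, plus a random sample so the
        # untrusted model can't be sure low-suspicion actions go unchecked.
        if suspicion > SUSPICION_THRESHOLD or random.random() < AUDIT_BUDGET:
            audited.append((action, suspicion))
        else:
            executed.append(action)
    return executed, audited


if __name__ == "__main__":
    done, flagged = run_with_control([f"task-{i}" for i in range(100)])
    print(f"executed {len(done)} actions, sent {len(flagged)} for auditing")
```

The hard part, as Alice notes, is not writing a loop like this but evaluating whether the whole setup actually holds up against a model that is deliberately trying to look unsuspicious.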