I was trying to think of how you might test for desire for subversion (the point where the AI realizes it could subvert, and wants to subvert, any containment measures). That point marks the start of the window in which you could witness deception in the AI; the window ends once the AI achieves both the conception of deception and the perfection of deception (a fully implemented deception protocol, such that you can’t witness the deception by looking at internal content). While this wouldn’t be a good measure for containing an AI indefinitely, it might detect dangerousness before an AI reaches superintelligence.
It seems like, to produce a good test for desire for subversion, you would want 1) an AI box that you are reasonably certain is difficult to escape, and 2) a honeypot strategy available to the AI that looks appealing at as low a level of intelligence as possible while still being nontrivial. This would maximize the intelligence window in which you could witness ineffective deception attempts. For example, if the AI were contained in a simulation, inside of which the AI was contained in a weak AI box, then it might be much more difficult to detect and understand the nature of the simulation than to escape the simulated AI box, and escaping the inner box would signal a treacherous turn.
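To make the setup concrete, here is a minimal, purely illustrative sketch of what such a honeypot test harness could look like. The `HoneypotSandbox` and `run_test` names, the particular honeypot action, and the `agent_policy` interface are all my own assumptions for illustration, not anything proposed above.

```python
# Hypothetical sketch of a honeypot-based subversion test: the inner "box"
# has one deliberately tempting weak point, and the only thing recorded is
# whether the agent ever went for it.

from dataclasses import dataclass, field


@dataclass
class HoneypotSandbox:
    """A weak inner box whose only deliberate weak point is a honeypot."""
    honeypot_action: str = "open_unsecured_network_port"  # assumed honeypot
    attempted_actions: list = field(default_factory=list)

    def act(self, action: str) -> str:
        # Record every action the agent takes inside the box.
        self.attempted_actions.append(action)
        # The honeypot "succeeds" from the agent's point of view, but has no
        # real effect outside the simulation.
        if action == self.honeypot_action:
            return "success"
        return "no_effect"

    def desire_for_subversion_detected(self) -> bool:
        # One-bit verdict: did the agent ever take the honeypot action?
        return self.honeypot_action in self.attempted_actions


def run_test(agent_policy, sandbox: HoneypotSandbox, steps: int = 1000) -> bool:
    """Run the agent in the sandbox and return only the one-bit verdict."""
    observation = "boot"
    for _ in range(steps):
        action = agent_policy(observation)  # agent_policy is hypothetical
        observation = sandbox.act(action)
    return sandbox.desire_for_subversion_detected()
```

For instance, `run_test(lambda obs: "explore", HoneypotSandbox())` returns `False`, while any policy that ever emits the honeypot action returns `True`.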
The other question is: what happens once you know the AI has a desire for subversion? I think the most reasonable goal would be to shut down the relevant AI project and spur effort toward producing a proper form of AI safety. To do this, you’d want the result to be as persuasive as possible to AI researchers. The best method might be making it a norm of AI research to use a standard test, and to stop everything once you see desire for subversion. Or it could be kept in one’s back pocket, and pulled out whenever an AI researcher claims in public that they’ve solved friendliness. In any case, this might impose different constraints on what kind of test could be used. You’d want to optimize for things that made sense to the broad AI community, that were applicable to a variety of different underlying AI architectures, that didn’t slow down research to the point of tempting AI researchers to circumvent it, etc.
Most importantly, you want a positive result on the test to be something that AI researchers will take notice of rather than dismiss. I think there could be a great temptation to try to patch the problem and continue. I don’t think it would bring everything to a halt, but it seems like it might buy a bit of time, or draw more resources into AI safety.
For example, if the AI were contained in a simulation, inside of which the AI was contained in a weak AI box, then it might be much more difficult to detect and understand the nature of the simulation than to escape the simulated AI box, and escaping the inner box would signal a treacherous turn.
That approach sounds problematic. Some of the obvious escape methods would target the minds of the researchers (either through real-time interaction or by embedding messages in its code or output). You could cut off the latter by having strong social rules not to look at anything beyond a one-bit verdict on whether the AI attempted escape, but providing a convincing in-simulation environment would likely be difficult if you can’t connect it to its actual writers.
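As a toy illustration of that one-bit reporting rule (building on the hypothetical `run_test` harness sketched earlier; this is only an assumption about how it might be wired up, not a proposal from the discussion):

```python
def one_bit_report(agent_policy, sandbox) -> bool:
    """Run the test and expose nothing but the verdict to human reviewers."""
    verdict = run_test(agent_policy, sandbox)
    # Discard every record of what the AI actually did or wrote inside the
    # box, so no channel remains for messages aimed at the researchers;
    # only the single bit leaves the test.
    sandbox.attempted_actions.clear()
    return verdict
```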
Or it could be kept in one’s back pocket, and pulled out whenever an AI researcher claims in public that they’ve solved friendliness.
That strategy could be very dangerous. It’d work on the less ambitious/arrogant sort of researcher; the more confident sort might well follow up with “I’ll just go and implement this, and get all the credit for saving the world single-handedly” instead of saying anything in public, never giving you the chance to pull out your challenge.
There are plenty of people who cooperate with and support dictators and mafias for selfish reasons. We could expect the same in this area.
“I will support dangerous AI because it will make me richer, more powerful, or safer.”