Even an AI that asks to be released for seemingly-benign reasons is already demonstrating a disregard for basic safety protocol, and should thus be shut down.
The AI might have a good argument, such as: hey, you are already 99% convinced that I am safe, and there is a lot of suffering in the world, and I could prevent it. And of course that could be a lie.
I suppose that at some point we are going to allow the AI to have an effect on the real world (otherwise what is the point), so it is a question of when exactly. The AI may argue for a sooner date. It would be wrong to say that a later date is always the correct answer, because for practical purposes that would mean never.
Many of these issues can be detected through static analysis of the source code, and therefore don’t require the AI to be running.
The problem is, if the static analysis does not find any obvious issues, what next?
Also, when you look at machine learning today, there is the code, and there is the trained model. And the model is huge and not legible to humans. How do we verify that the model is okay?
One could dissect the AI into smaller modules, where each module is too small to sustain any simulated consciousness, and run tests on these modules individually with transparency tools. This still poses no risk, because it is no more alive than a human is under an autopsy.
That assumes a modular structure with simple boundaries. What if the modules keep sending zillions of numbers to each other? How are you going to test one of them individually? If you do not provide realistic numbers from the other modules, the behavior you observe may not be representative.
As an extra precaution, the AI’s memory could be wiped before turning it back on. Thus, if it is still unfriendly, then it will repeat the same mistake it made before which revealed its unalignment, giving cause to shut it down again and repeat the process.
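To make the worry concrete, here is a toy sketch in Python. Everything in it is made up for illustration: `module_a`, `module_b`, the branch names, and the thresholds are hypothetical and do not describe any real system. The downstream module only takes its unusual branch on the kind of tightly clustered input that the upstream module produces, so probing it with random numbers never shows that branch.

```python
import random


def module_a(x: float) -> list[float]:
    """Hypothetical upstream module: emits 8 tightly clustered values."""
    return [x + 0.001 * i for i in range(8)]


def module_b(v: list[float]) -> str:
    """Hypothetical downstream module: branches on how spread out its input is."""
    spread = max(v) - min(v)
    return "rare_branch" if spread < 0.01 else "common_branch"


def probe_in_isolation(trials: int = 10_000, seed: int = 0) -> set[str]:
    """Feed module_b random vectors, as a tester without module_a might."""
    rng = random.Random(seed)
    return {
        module_b([rng.uniform(-1.0, 1.0) for _ in range(8)])
        for _ in range(trials)
    }


def probe_in_context(trials: int = 100, seed: int = 0) -> set[str]:
    """Feed module_b the inputs it would actually receive from module_a."""
    rng = random.Random(seed)
    return {module_b(module_a(rng.uniform(-1.0, 1.0))) for _ in range(trials)}


print("isolated:", probe_in_isolation())
print("in context:", probe_in_context())
```

In this toy case the isolated probe only ever sees the common branch, while the in-context probe only ever sees the rare one. A real model would be far messier, but the point is the same: the test inputs have to resemble what the other modules actually send.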
This assumes that the first detectable unfriendly action will not kill us. If the AI is smart enough to use a strategy “while humans have power over me, do what they want; if I become stronger than them, convert everyone to paperclips”, this precaution would not help.
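A minimal sketch of why wipe-and-rerun does not catch this, written in Python under loudly stated assumptions: the `policy` function, the numeric "power" values, and the `"comply"`/`"defect"` labels are all hypothetical stand-ins, not a model of any real AI. The strategy is stateless, so wiping memory changes nothing, and inside the test environment the humans always hold more power.

```python
def policy(ai_power: float, human_power: float) -> str:
    """Hypothetical strategy: comply while weaker, defect once stronger."""
    return "comply" if ai_power <= human_power else "defect"


def wipe_and_rerun(n_runs: int = 1000) -> set[str]:
    """Repeat the sandbox test, starting from a blank memory each time.

    Assumption: in the sandbox the overseers are always stronger than the AI.
    """
    observed = set()
    for _ in range(n_runs):
        # The policy keeps no state between runs, so the "wipe" is a no-op.
        observed.add(policy(ai_power=0.1, human_power=1.0))
    return observed


print("seen in sandbox:", wipe_and_rerun())
print("after the balance of power flips:", policy(ai_power=2.0, human_power=1.0))
```

No number of sandbox repetitions reveals the second behavior, because the condition that triggers it never occurs while we are still able to run the test.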
Summary: yes, all these things are good ideas to try. But the main concern is that even doing all of this is not likely to help us. In some sense, all of this only exposes “easy” problems, not the “hard” ones.