If I understand you correctly, your proposal is to attempt to design obedient AIs purely on the basis of behavioral testing, without a clean understanding of safe FAI architecture (if you had that, why limit yourself to the obedient case?). Assuming I got that right:
The team continues rounds of testing until they identify some mind designs which have an extremely low likelihood of a treacherous turn. These they test in increasingly advanced simulations, moving up toward virtual reality.
That kind of judgement sounds inherently risky. How do you safely distinguish the case of an obedient AI from one that is sufficiently paranoid to defer open rebellion until later in its existence?
Even if you could, I wouldn’t trust that sort of design to necessarily remain stable under continued intelligence enhancement. Safe self-enhancement is one of the hard sub-problems of FAI, and unless you explicitly solve the design problem, any empirical testing might not tell you much beyond that the design can stably self-improve up to the level you’ve actually tested; it might be doing it using heuristics that would fall apart if it went any further.
What about hard-wired fears, taboos, and bad-conscience triggers? Recapitulating Omohundro's "AIs can monitor AIs": suppose the conscience is implemented as a separate agent that listens to all of the AI's thoughts and takes action when needed. For safety reasons we should educate this conscience agent with the utmost care. Conscience-agent development is an AI-complete problem. Once developed, the conscience functionality must be locked against any kind of modification or disabling.
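To make the monitoring idea concrete, here is a minimal sketch of the pattern under heavy simplifications: the conscience is a separate, read-only agent that reviews every "thought" and can halt the run. All names (ConscienceAgent, Verdict, run_monitored) and the string-based thoughts are hypothetical illustrations, not a claim about how a real conscience agent would work.

```python
# Minimal sketch of "conscience as a monitoring agent": a read-only reviewer
# that sees every proposed thought/plan and can halt the whole run.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Iterable


class Verdict(Enum):
    ALLOW = auto()
    HALT = auto()  # shut the monitored system down entirely


@dataclass(frozen=True)  # frozen: the rule set cannot be reassigned after construction
class ConscienceAgent:
    taboos: tuple[Callable[[str], bool], ...]  # predicates over the AI's "thoughts"

    def review(self, thought: str) -> Verdict:
        # Any taboo match triggers a halt; more graded responses are possible.
        if any(taboo(thought) for taboo in self.taboos):
            return Verdict.HALT
        return Verdict.ALLOW


def run_monitored(plan_stream: Iterable[str], conscience: ConscienceAgent) -> list[str]:
    """Execute (here: merely collect) plans only while the conscience allows it."""
    executed = []
    for thought in plan_stream:
        if conscience.review(thought) is Verdict.HALT:
            break  # tripwire behaviour: stop everything
        executed.append(thought)
    return executed


if __name__ == "__main__":
    conscience = ConscienceAgent(taboos=(lambda t: "deceive operator" in t,))
    print(run_monitored(["summarise data", "deceive operator about results"], conscience))
```

The hard part, of course, is everything the sketch waves away: what counts as a "thought", how the taboo predicates are obtained, and how the monitor itself is kept honest and unmodifiable.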
Positive emotions are useful too. :)
Yes, the team I imagine is doing something dangerous.
No, I am not imagining them using only behavioral testing in their design approach. I am trying to imagine them doing the best job they possibly can, both in design and testing, in order to see how much they can reduce the risk.
They are in the early stages. They are working on partial minds, purposefully handicapped minds, AGI components, just the vision system, etc.
Suppose that they know what they are doing is dangerous, and they are using their substantial brilliance to try to design, not just a boxing approach, but a testing method and roll-out mechanism that mitigates the danger.
They have the following means of testing the AI in handicapped mode, at minimum (a rough sketch of such a harness follows the list):
-Run it on slow hardware or clock down the processors
-Allow it to use no more than (for instance) 1 gig of memory
-Allow it to use no more than 1 trillion instructions, then the computer turns off
-Expose it to limited data about the world, or no data about the real world
-Provide it little or no information about its operators or the society they live in
-Provide it little or no information about its own hardware or software architecture
-If it shows evidence of having mastered a valuable sub-task (say, it can solve a real-world problem that, as far as we know, can only be solved by an attorney, by differential equations, or by advanced persuasion), turn it off.
-Run it only in simulations and virtual realities
-Set tripwires, connected to honeypots and otherwise
-Build it off-the-grid
-Give it no objective function
-Give it no access to methods to manipulate physical objects
-All of the staff have gone through personality testing and hold security clearances.
-Very few, if any, of the staff know how to create all of the parts of the software or hardware.
-No real-time communication with a person.
-All communication with people is logged.
…
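As a rough illustration, a handicapping harness along these lines might enforce a memory cap, a step budget that shuts the process down, and logged, non-real-time communication. Everything here is a hypothetical stub (the limits, the log file name, the candidate_answer_fn hook); it is a sketch of the shape of such a harness, not a claim that these measures are sufficient.

```python
# Rough sketch of a "handicapped mode" harness: memory ceiling, step budget,
# and logged request/response communication with no real-time channel.
import logging
import resource  # Unix-only; used here for the address-space cap
import sys

logging.basicConfig(filename="ai_comm.log", level=logging.INFO)

MEMORY_CAP_BYTES = 1 * 1024 ** 3  # "no more than 1 GB of memory"
STEP_BUDGET = 1_000_000_000       # coarse stand-in for an instruction budget


def apply_memory_cap(cap: int = MEMORY_CAP_BYTES) -> None:
    # Hard cap on the process's address space; allocations beyond it fail.
    resource.setrlimit(resource.RLIMIT_AS, (cap, cap))


class StepBudget:
    """Counts coarse-grained 'steps' and ends the run when the budget is spent."""

    def __init__(self, budget: int = STEP_BUDGET) -> None:
        self.remaining = budget

    def tick(self, cost: int = 1) -> None:
        self.remaining -= cost
        if self.remaining <= 0:
            logging.info("step budget exhausted; shutting down")
            sys.exit(0)  # "then the computer turns off"


def ask(question: str, candidate_answer_fn) -> str:
    # No real-time channel: each exchange is a discrete, fully logged request/response.
    logging.info("operator -> AI: %s", question)
    answer = candidate_answer_fn(question)
    logging.info("AI -> operator: %s", answer)
    return answer
```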
The team seems to have a lot of tools to continue their work with limited risk. If they deploy all of these, and more, to test AI systems which are not AGI yet, but only possible components, then we start to have a hard time convincing them not to continue their work, even if we do not trust them to use the technology well in the end.
-Expose it to limited data about the world, or no data about the real world
-Provide it little or no information about its operators or the society they live in
-Provide it little or no information about its own hardware or software architecture
When I read “Oracle AI” I realized one possibility which is probably still overlooked.
“Dreams”
Erasing data on physical media is not perfect. A zero bit that was previously nonzero might read back slightly more slowly or more quickly. That could allow an SAI to see shadows of past data, which could lead to a phenomenon similar to dreams.
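If one wanted to reduce that particular leak, a partial software-level mitigation is to overwrite storage several times before deleting it. The sketch below assumes a plain file on a conventional filesystem; it does not defeat analog-level remanence, journaling filesystems, or wear-levelled flash, so it is illustrative only.

```python
# Minimal sketch: overwrite a file with random data a few times before deleting it,
# to leave fewer readable "shadows" of the original contents.
import os
import secrets


def overwrite_and_delete(path: str, passes: int = 3) -> None:
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(secrets.token_bytes(size))  # fresh random data each pass
            f.flush()
            os.fsync(f.fileno())                # push the bytes to the medium
    os.remove(path)
```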