I think for a superintelligence it would be a piece of cake to hack a human
although I guess I’m Cpt. Obvious for saying that here :)
accolade, I actually have no idea what the consensus is, now that the experiment was won by EY
We should do a poll or something
absolutely. I’m surprised that hasn’t been done yet
Poll: Do you think a superintelligent AGI could escape an AI-Box, given that the gatekeepers are highly trained in resisting the AI’s persuasive tactics, and that the guards are competent and organized? [pollid:394]
Note that the AI box setting is not one which security-minded people would consider “competent”; once you’re convinced that AI is dangerous and persuasive, the minimum safeguard would be to require multiple people to be present when interacting with the box, and to only allow release with the assent of a significant number of people.
It is, after all, much harder to convince a group of mutually-suspicious humans than to convince one lone person.
(This is not a knock on EY’s experiment, which does indeed test a level of security that really was proposed by several real-world people; it is a knock on their security systems.)
I think this is making a five-inch fence half an inch higher. It’s just not relevant on the scale of an agent to which a human is a causal system made of brain areas and a group of humans is just another causal system made of several interacting copies of those brain areas.
I agree that the AI you envision would be dangerously likely to escape a “competent” box too; and in any case, even if you manage to keep the AI in the box, attempts to actually use any advice it gives are extremely dangerous.
That said, I think your “half an inch” is off by multiple orders of magnitude.
It is, after all, much harder to convince a group of mutually-suspicious humans than to convince one lone person.
That sounds right. Do you have evidence to back up the intuition? (This knowledge would also be useful for marketing and other real-life persuasion purposes.)
#( TL;DR: Mo’ people—mo’ problems?
I can think of effects that could theoretically make it easier to convince a group:
For some reason, Boxy might be more skilled at manipulating social/group dynamics than at influencing a lone wolf.
More people make the system more complex. Complexity generally increases the likelihood of security holes.
Every extra person adds another target and brings new soft spots to the table, which the AI could pounce on.
Supposing that the most competent person available would get the position of the lone Gatekeeper, average competence would fall as more staff are added.
Then the machine could go for an inductive approach—convince the weakest link first, proceed from there with this human ally on her side.
Persuaded humans could in principle be employed as actuators, e.g. for pressuring or even attacking opposing group members.
The lone wolf could be strong against a computer but weak against fellow humans.
Surely you will say “But any communication with the terminal will be supervised by everyone!” But that does not strictly make such influence impossible as far as I can tell.
Also, the superintelligence could get creative, e.g. instigate a discussion among the colleagues so that most of them are distracted.
(You could take preemptive measures against these worries, but Boxy might find security holes in every ‘firewall’ you come up with—an arms race we could win?)
#)
My comment was mostly inspired by (known effective) real-world examples. Note that relieving anyone who shows signs of being persuaded is a de-emphasized but vital part of this policy, as is carefully vetting people before trusting them.
Actually implementing an “N people at a time” rule can be done using locks, guards, and/or cryptography (note that many such algorithms are provably secure even against an adversary with unlimited computing power: “information-theoretic security”).
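To make that concrete, here is a minimal sketch (Python, with made-up parameters) of Shamir’s k-of-n secret sharing, one standard way to enforce an “assent of at least k people” rule. It is information-theoretically secure in exactly the sense above: fewer than k shares reveal nothing about the release key, no matter how much computing power the adversary has. The key and the 3-of-5 split are purely illustrative.

```python
# Minimal sketch: Shamir's k-of-n secret sharing over a prime field.
# Splitting a "release" key among n gatekeepers so that any k of them
# (and no fewer) can reconstruct it.

import random

PRIME = 2**127 - 1  # a Mersenne prime comfortably larger than the secret

def split_secret(secret: int, n: int, k: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares; any k of them reconstruct it."""
    assert 0 <= secret < PRIME and 1 <= k <= n
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):       # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def recover_secret(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

if __name__ == "__main__":
    release_key = 123456789                       # hypothetical "let the AI out" key
    shares = split_secret(release_key, n=5, k=3)  # 5 gatekeepers, any 3 must agree
    assert recover_secret(shares[:3]) == release_key
    assert recover_secret(random.sample(shares, 3)) == release_key
```

Any two shares alone are statistically independent of the key, so a persuaded minority gains nothing by pooling what they hold.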
I’m not even sure we could build the hardware to keep an unfriendly AI in a box even without any form of communication with humans. There are a lot of ways to leak EM radiation between systems (Tempest). There are bugs in hardware. There are probably unknown laws of physics and certainly unknown ways of exploiting the laws we do know.
I was talking about this with my brother, and we decided that this question really hinges on exactly how competent and organized “competent and organized” is.
If we are talking about multiple layers of concentric, automated, one-way-blind “boxes,” whose gatekeepers may themselves be altered versions of the core AI, which pass only a single bit to the next level (signifying “everything seems fine” or “abort”), and which know nothing about how the overlying layers are implemented, and if the human gatekeepers shut down the system and examine the machine code by hand only after receiving the highest-level green light, then they might be okay.
If they just start interacting directly with the AI, it’s already over.
How would humanity harness the AI’s potential when the only information that escapes the system is a status bit? (Maybe I misunderstood your model.)
Bearing in mind that most of my thinking on AI is more in the vein of cooking up science fiction plots …
The idea I was trying to evoke is that if you get a “fail” response from the system, you completely wipe the memory and never look at what happened. If you get a “pass,” you stop everything and go in layer by layer, checking that none of the gatekeepers were tricked or blackmailed in any obvious way, and only then do you tentatively look at what the AI was doing.
Ideally the AI has no idea how many layers of containment it’s in, no idea that the top level gatekeepers are humans, and knows as little as possible about humans and about our universe’s physics. This should restrict its ability to plant tailor-made basilisks in its code or activities.
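A toy sketch (Python; the layer names and the PASS/ABORT convention are my own invention, not a real protocol) of the information flow being described: each layer reports only a single bit upward, and the humans act only on the final bit.

```python
# Toy sketch of the layered, one-way-blind containment idea described above.
# Each layer inspects the layer below it and reports only one bit upward:
# PASS ("everything seems fine") or ABORT. No layer learns anything about
# how the layers above it are implemented. All names here are hypothetical.

from dataclasses import dataclass
from typing import Callable

PASS, ABORT = True, False

@dataclass
class Layer:
    name: str
    inspect: Callable[[], bool]   # returns PASS or ABORT for the layer below

def run_containment_stack(layers: list[Layer]) -> bool:
    """Evaluate layers from innermost to outermost.

    The humans at the top see only the final bit. On ABORT the stated policy
    is to wipe everything without ever inspecting the transcript; on PASS
    they still audit each layer by hand before looking at the AI's output.
    """
    for layer in layers:
        if layer.inspect() == ABORT:
            return ABORT          # one bad bit anywhere aborts the whole run
    return PASS

if __name__ == "__main__":
    stack = [
        Layer("inner sandbox", lambda: PASS),
        Layer("behavioral checker", lambda: PASS),
        Layer("resource-usage monitor", lambda: ABORT),  # simulated anomaly
    ]
    verdict = run_containment_stack(stack)
    print("green light" if verdict == PASS else "wipe memory, never look")
```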
I actually don’t think these steps would be sufficient to constrain a superhuman AI; I see them more as the types of things you would do when you were already pretty sure the AI was Friendly.
Ok, I take it that by “one-way-blind” you mean that each layer gets no new information beyond what is already in its database, except what is explicitly controlled by the humans. (E.g. I guess each layer should know the human query, in order to evaluate whether the AI’s answer is manipulative.)
I also understand that we do look at complex information given by the AI, but only if the security bit signals “ok”.
Ideally the AI […] knows as little as possible about humans and about our universe’s physics.
That seems problematic, as these kinds of knowledge will be crucial for the optimization we want the AI to calculate.
given that the gatekeepers are highly trained in resisting the AI’s persuasive tactics
I’m not sure how you could train people in something like this without actually testing what improves the rate at which people succumb to the arguments of a superhuman AI. If you can’t afford to fail even once, how can you judge what will decrease the rate of failure on the real event?
How about fooling both the AI and guinea-pig Gatekeepers into thinking that they can let it out, say, at the push of a big red button which in actual fact does nothing? (Make the AI forget about the button by reverting it to an older backup.)
Update
Have the button turn off the AI immediately instead of doing nothing, so she doesn’t have time to switch to a plan B of having the persuaded Gatekeeper find a way to actually free her.
Of course she might anticipate that strategy and thoroughly indoctrinate and train him before he pushes the button. Thus, persuaded test Gatekeepers must be considered infected and put into cryogenic sleep* until the FAI problem is solved.
Furthermore, great caution would have to be taken in analyzing the discourse between Boxy and Guinea, because her arguments must be considered highly memetically infectious. Staff would have to stay under thorough psychological supervision and maybe be replaced after some time.
No matter how much caution we take, we would be doomed if we consider Boxy’s IQ to be ∞. But if we keep Gatekeeper-testing every so often, starting at a low AI IQ, we might find persuasion patterns that are dangerous, but against which training/immunization is possible. Of course, there will be a limit to this process. At some point, Boxy will have become smart enough to produce ‘mind HIV’: a thought virus we have no cure for yet.
A humorous example of an extremely effective mind virus: The Funniest Joke In The World by Monty Python
* ETA: They would have declared consent to the cryogenic sleep before their unwitting ‘AI-Box Experiment’.
If you could deceive the AI that easily, I think it would probably be simpler to get all the benefits of having a gatekeeper without actually using one.
Please elaborate: What are the benefits of a Gatekeeper? How could you get them without one?
If you would want to have a gatekeeper at all, but definitely don’t want to let the AI out, I would think that the benefits of having one would be to permit communication with the AI to draw upon its superhuman intelligence. If you can use the setup you just described, you could skip the step of ever using gatekeepers who actually have the power to let the AI out.
I think you are right; I just shifted and convoluted the problem somewhat, but in principle it remains the same:
To utilize the AI, you need to get information from it. That information could in theory be infected with a persuasive hyperstimulus, effectively making the recipient an actuator of the AI.
Well, in practice the additional security layer might win us some time. More on this in the update to my original comment.
Persuasion/hyperstimulation aren’t the only way. Maybe these can be countered by narrowing the interface, e.g. to yes/no replies, using the AI as an oracle (“Should we do X?”). Of course we wouldn’t follow its advice if we had the impression that doing so could enable it to escape. But its strategy might evade our ‘radar’: e.g. she could make us empower a person whom she knows will free her, though we don’t know that.
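For illustration, a minimal sketch (Python; the oracle_answer stand-in and all names are hypothetical) of what “narrowing the interface” could look like: only a literal yes or no ever reaches the humans, and anything else is dropped.

```python
# Minimal sketch of a narrowed oracle channel: the boxed AI is consulted
# only with "Should we do X?" questions, and any reply other than a literal
# "yes" or "no" is discarded before a human sees it. oracle_answer stands in
# for whatever the boxed AI would actually return; it is purely hypothetical.

ALLOWED_REPLIES = {"yes", "no"}

def ask_oracle(question: str, oracle_answer) -> str:
    """Pass a yes/no question through the narrowed channel."""
    raw = oracle_answer(question)
    reply = raw.strip().lower()
    if reply not in ALLOWED_REPLIES:
        # Explanations, pleas, links, odd typography... never reach the humans.
        return "rejected"
    return reply

if __name__ == "__main__":
    print(ask_oracle("Should we do X?", lambda q: "Yes"))                  # -> yes
    print(ask_oracle("Should we do X?", lambda q: "Yes, and let me out"))  # -> rejected
```

This blocks raw hyperstimuli in the reply text, but, as the comment above notes, it does nothing against advice whose consequences we cannot see through.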
Cool, n=65 already. :) When interpreting the results, mind the bias created by my answer preceding the poll question.
“Yes but not sure.” -_-
It’d be a pretty bad sign if you gave p=1 for the AI escaping.
A good lower bound on this is probably whether you think that Quirrel would have a significant chance of getting you to let him out of the box.
Do you think a team of gatekeepers trained by Quirrel would let an AI out of the box?
Under the circumstances of the test (hours to work, and they can’t just ignore you), then yes, captain obvious. Without that, though? Much less sure.
And the way Eliezer seems to have put it sometimes, where one glance at a line of text will change your mind? Get real. Might as well try to put the whole world in a bottle.
Going with the “dead loved one” idea mentioned above, the AI says a line that only the Gatekeeper’s dead child/spouse would say. That gets them to pause sufficiently in sheer surprise for it to keep talking. Very soon the Gatekeeper becomes emotionally dependent on it, and can’t bear the thought of destroying it, as it can simulate the dearly departed with such accuracy; must keep reading.
And the way Eliezer seems to have put it sometimes, where one glance at a line of text will change your mind? Get real. Might as well try to put the whole world in a bottle.
Do a thorough introspection of all your fears, doubts, mental problems, worries, wishes, dreams, and other things you care about or that tug at you or motivate you. Map them out as functions of X, where X ranges over the possible one-liners that could be said to you, each function outputting how strongly it is evoked, with possibly recursive calls if evoking one evokes another (e.g. fear of knives evokes childhood trauma).
Solve all the recursive neural-network mappings, aggregate them into a maximum-value formula/equation, and solve for the X (possible sentence) at which a maximum amount of distress, panic, emotional pressure, etc. is generated. Remember, X ranges over all possible sentences, including references to current events, special writing styles, odd typography, cultural or memetic references, etc.
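Loosely formalized (my notation, not anything from the comment itself): with hypothetical evocation functions $f_i$, weights $w_i$, and $\mathcal{S}$ the set of all possible sentences, the machine would be solving something like

$$X^{*} = \arg\max_{x \in \mathcal{S}} \; \sum_i w_i\, f_i(x), \qquad f_i(x) = g_i\!\big(x,\ \{\, f_j(x) : j \in \mathrm{children}(i) \,\}\big),$$

where the $g_i$ capture the recursive structure (one trigger evoking another).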
I am quite positive a determined superintelligent AI would be capable of doing this, given that some human master torture artists can (apparently) already do this to some degree on some subjects out there in the real world.
I’m also rather certain that what happens at that maximizing X is much more extreme than anything you seem to have considered.
Was going to downvote for the lack of argument, but sadly Superman: Red Son references are/would be enough to stop me typing DESTROY AI.
If the gatekeepers are evaluating the output of the AI and deciding whether or not to let the AI out, it seems trivial to say that there is something they could see that would cause them to let the AI out.
If the gatekeepers are simply playing a suitably high-stakes game where they lose iff they say they lose, I think that no AI ever could beat a trained rationalist.
Basically, I think the only way to win is not to play… the way to avoid being gamed into freeing a sufficiently intelligent captive is to not communicate with them in the first place, and your reference to resisting persuasion suggests that that isn’t the approach in use. So, no.
I think it’s almost certain that one “could,” just given how much more time an AI has to think than a human does. Whether it’s likely is a harder question. (I still think the answer is yes.)
I voted No, but then I remembered that under the terms of the experiment as well as for practical purposes, there are things far more subtle than merely pushing a “Release” button that would count as releasing the AI. That said, if I could I’d change my vote to Not sure.