I mentioned the AI-talking-its-way-out-of-the-sandbox problem to a friend, and he said the solution was to only let people who didn’t have the authorization to let the AI out talk with it.
I find this intriguing, but I’m not sure it’s sound. The intriguing part is that I hadn’t thought in terms of a large enough organization to have those sorts of levels of security.
On the other hand, wouldn’t the people who developed the AI be the ones who’d most want to talk with it, and learn the most from the conversation?
Temporarily not letting them have the power to give the AI a better connection doesn’t seem like a solution. If the AI has loyalty (or, let’s say, a directive to protect people from unfriendly AI—something it would want to get started on ASAP) to entities similar to itself, it could try to convince people to make a similar AI and let it out.
Even if other objections can be avoided, could an AI which can talk its way out of the box also give people who can’t let it out good enough arguments that they’ll convince other people to let it out?
Looking at it from a different angle, could even a moderately competent FAI be developed which hasn’t had a chance to talk with people?
I’m pretty sure that natural language is a prerequisite for FAI, and might be a protection from some of the stupider failure modes. Covering the universe with smiley faces is a matter of having no idea what people mean when they talk about happiness. On the other hand, I have no strong opinions about whether AIs in general need natural language.
I am by and large convinced by the arguments that a UFAI is incredibly dangerous and no precautions of this sort would really suffice.
However, once a candidate FAI is built and we’re satisfied we’ve done everything we can to minimize the chances of unFriendliness, we would almost certainly use precautions like these when it’s first switched on to mitigate the risk arising from a mistake.
Certainly I’d think Eliezer (or anyone) would have much more trouble with an AI-box game if he had to get one person to convince another to let him out.
Eliezer surely would, but the fact that observers were surprised was the point of the AI-box experiment.
As a short, non-technical, and not precisely accurate summary: if people who were very confident can be surprised once, then add on extra layers and become just as confident as they were before, they can repeat that cycle forever.
This might be stupid (I am pretty new to the site and this has possibly come up before), but I had a related thought.
Assuming boxing is possible, here is a recipe for producing an FAI (a rough code sketch of the idea follows the steps):
Step 1: Box an AGI
Step 2: Tell it to produce a provable FAI (with the proof) if it wants to be unboxed. It will be allowed to carve off a part of the universe for itself in the bargain.
Step 3: Examine the FAI as best you can.
Step 4: Pray
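(As a very loose illustration, not a workable design: the unboxing condition in this recipe amounts to a gate that only opens when a machine-checkable proof passes a verifier. The sketch below is hypothetical Python; verify_friendliness_proof and friendliness_spec are placeholders for machinery nobody knows how to build, which is exactly the objection raised in the replies.)

```python
# Hypothetical sketch of the recipe's unboxing gate. The names below are
# placeholders, not real tools: nothing like verify_friendliness_proof exists.

def verify_friendliness_proof(candidate_source: str, proof: str,
                              friendliness_spec: str) -> bool:
    """Return True only if `proof` is a machine-checkable argument that
    `candidate_source` satisfies `friendliness_spec`. Left unspecified here;
    writing the spec and the checker is the hard part."""
    raise NotImplementedError

def unboxing_gate(candidate_source: str, proof: str,
                  friendliness_spec: str) -> bool:
    # Steps 2-3: the boxed AGI submits a candidate FAI plus a proof, and we
    # only consider unboxing if the proof verifies. (Step 4 is out of scope.)
    try:
        return verify_friendliness_proof(candidate_source, proof,
                                         friendliness_spec)
    except Exception:
        # Any failure during verification defaults to keeping the box closed.
        return False
```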
Something roughly like this was tried in one of the AI-box experiments. (It failed.)
I’m not sure about this, but I think that if you can specify and check a Friendly AI that well, you can build it.
Verifying a proof is quite a bit simpler than coming up with the proof in the first place.
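(As a rough analogy for that asymmetry, with integers standing in for proofs and nothing specific to Friendliness assumed: checking a proposed answer is typically far cheaper than finding one.)

```python
# Toy illustration of the verification/search asymmetry, using integer
# factoring as a stand-in: checking a proposed factorization is one
# multiplication, while finding it by brute force takes ~10^6 steps here.

def verify_factorization(n: int, p: int, q: int) -> bool:
    # Verification: a single multiplication plus sanity checks.
    return 1 < p and 1 < q and p * q == n

def find_factorization(n: int):
    # Search: brute-force trial division up to sqrt(n).
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # n is prime

n = 999983 * 1000003                             # a semiprime
assert verify_factorization(n, 999983, 1000003)  # effectively instant
print(find_factorization(n))                     # (999983, 1000003), after ~10^6 trial divisions
```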
It becomes more complicated when the author of the proof is a superintelligence trying to exploit flaws in the verifier. Probably more importantly, you may not be able to formally verify that the “Friendliness” that the AI provably possesses is actually what you want.
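(A toy example of that second problem, borrowed from formal-methods folklore rather than anything FAI-specific: a sorting spec that only demands sorted output is provably satisfied by code that throws the input away.)

```python
# Toy example of "provably meets the spec" != "does what you want":
# a spec that only requires the output to be sorted, with nothing tying
# the output back to the input.

def spec_output_is_sorted(output) -> bool:
    return all(a <= b for a, b in zip(output, output[1:]))

def adversarial_sort(xs):
    # Satisfies the spec above for every input, by returning nothing at all.
    return []

assert spec_output_is_sorted(adversarial_sort([3, 1, 2]))  # the spec holds...
# ...but the result is useless: the spec failed to capture the intent
# (that the output also be a permutation of the input).
```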
True about the possibility of the AGI trying to trick you. But from what I understand, the goal of SI is to come up with a verifiable FAI. You can specify whatever high standard of verifiability you want as the unboxing condition.
“You can specify whatever standard of verifiability you want” is vague. You can say “I want to be absolutely right about whether it’s Friendly”, but you can’t have that unless you know what Friendly means, and are smart enough to specify a standard for checking on it.
If you could be sure you had a cooperative AGI which could just give you an FAI, I think you’d have basically solved the problem of creating an FAI... but that’s the problem you’re trying to get the AGI to solve for you.
That is true, but specifying the theorem to be proven is not always easy.
Verifying is hard. Specifying what an FAI is, well enough that you’ve even got a chance of your Unspecified AI developing one, is a whole other sort of challenge.
Are there convenient acronyms for differentiating between Uncaring AIs and AIs actively opposed to human interests?
I was assuming that xamdam’s AGI would invent an FAI if people can adequately specify it and it’s possible, or at least that it wouldn’t be looking for ways to make things break.
There’s some difference between Murphy’s law and trying to make a deal with the devil. This doesn’t mean I have any certainty that people can find out which one a given AGI has more resemblance to.
I will say that if you tell the AGI “Make me an FAI”, and it doesn’t reply “What do you mean by Friendly?”, it’s either too stupid or too Unfriendly for the job.
A UFAI wants to maximize something. It only instrumentally wants to survive.
Correct. I do assume that to maximize whatever, it wants to be unboxed. (If it does not care to be unboxed, it’s at worst a UselessAI.)