I’ll leave these two half-baked ideas here in case they’re somehow useful:
DO UNTIL - Construct an AI to pursue its utility function until an undesirable failsafe condition is met. (Somehow) make the utility function not take the failsafes into account when calculating utility (can it be made blind to them somehow? Force the utility function to exclude their existence? Make lack of knowledge about the failsafes part of the utility function?) The failsafes could cover every undesirable outcome we can think of: the human death rate exceeding X, biomass reduction, quantified human thought declining by X, mammalian species extinctions, quantified human suffering exceeding X, or whatever. One problem is how to objectively attribute these triggers causally to the AI (what if some other event trips a failsafe and shuts down an AI we have come to rely on?).
Energy limit - Limit the AI’s activities (through its own utility function?) via an unambiguous, quantifiable resource: matter moved around or energy expended. The energy accounting would (somehow) have to cover all activity under the AI’s control. Alternatively this could be a rate rather than a cap, but I think a rate would be more likely to go wrong. The idea is to let the AGI go foom, but not give it the energy for anything else, like a paperclip universe. I am not sure this idea achieves all that much safety, but here it is (a rough sketch combining both ideas is below).
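For concreteness, here is one minimal toy sketch of how these two half-baked ideas might be wired together. Everything in it is invented for illustration: the thresholds, the stand-in planner, and the fake world model.

```python
# Toy sketch only: invented thresholds, a stand-in planner, and a fake world model.
DEATH_RATE_LIMIT = 0.01      # failsafe: human death rate must stay below this
BIOMASS_FLOOR = 0.9          # failsafe: biomass must not fall below this fraction
ENERGY_BUDGET = 1_000.0      # hard cap on total energy the agent may ever expend

failsafe_triggers = [
    lambda w: w["human_death_rate"] > DEATH_RATE_LIMIT,
    lambda w: w["biomass_fraction"] < BIOMASS_FLOOR,
]

def utility(world):
    # The utility function only counts paperclips; it contains no term for,
    # and no knowledge of, the failsafes or the energy meter ("blindness").
    return world["paperclips"]

def choose_action(world):
    # Stand-in planner: picks whichever candidate action maximises utility.
    candidates = [
        {"paperclips_added": 0, "energy_cost": 0.0},
        {"paperclips_added": 1, "energy_cost": 1.0},
    ]
    return max(
        candidates,
        key=lambda a: utility({**world, "paperclips": world["paperclips"] + a["paperclips_added"]}),
    )

def run():
    world = {"paperclips": 0, "human_death_rate": 0.0, "biomass_fraction": 1.0}
    energy_spent = 0.0
    while True:
        # DO UNTIL: the failsafe check lives outside the utility function.
        if any(trigger(world) for trigger in failsafe_triggers):
            break
        action = choose_action(world)
        # Energy limit: refuse any action that would blow the total budget.
        if energy_spent + action["energy_cost"] > ENERGY_BUDGET:
            break
        world["paperclips"] += action["paperclips_added"]
        energy_spent += action["energy_cost"]
    return world["paperclips"], energy_spent

print(run())   # (1000, 1000.0): with these toy numbers the energy cap binds first
```

The point of the structure is that the planner’s objective never mentions the failsafes or the energy meter, so it has nothing to optimise against; whether a real optimiser could be kept genuinely blind to them is exactly the open question raised above.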
I don’t know whether an intelligence explosion will truly be possible, but plenty of people smarter than I am seem to think so… good luck in this field of work!
Constraining it by limiting how many cycles the AI can use to work out how to make the paperclips, plus some spatial restriction (don’t touch anything outside this area), plus some restriction on energy spent (use up to X energy to create 10 paperclips), would help. Allowing for levels of uncertainty, such as being 89 to 95% certain that something is the case, would also help (a toy version of such a check is sketched below).
However, very similar suggestions are dealt with at length by Bostrom, who concludes that it would still be extremely difficult to constrain the AI.
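For concreteness, the kind of constrained, satisficing check suggested above might look something like the toy predicate below. Every limit, the one-dimensional stand-in for the allowed region, and the 89% confidence threshold are invented for illustration.

```python
# Toy predicate for the constrained goal; every threshold here is invented.
MAX_CYCLES = 10**12              # compute budget for planning
MAX_ENERGY = 500.0               # energy budget for actually making the clips
ALLOWED_REGION = range(0, 100)   # crude 1-D stand-in for "don't touch anything outside this area"

def goal_satisfied(p_ten_clips, cycles_used, energy_used, positions_touched):
    """Stop as soon as we are confident enough AND every constraint holds."""
    within_cycles = cycles_used <= MAX_CYCLES
    within_energy = energy_used <= MAX_ENERGY
    within_region = all(pos in ALLOWED_REGION for pos in positions_touched)
    confident_enough = p_ten_clips >= 0.89   # satisfice at roughly 89 to 95%; don't chase certainty
    return within_cycles and within_energy and within_region and confident_enough

print(goal_satisfied(0.93, 10**9, 120.0, [3, 17, 42]))   # True: good enough, stop here
```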
Yes. “Make 10 paperclips and then do nothing, without killing people or otherwise disturbing or destroying the world, or in any way preventing it from going on as usual.”
There is simply no way to give this a perverse instantiation; any perverse instantiation would prevent the world from going on as usual. If the AI cannot correctly understand “without killing… disturbing or destroying… preventing it from going on as usual”, then there is no reason to think it can correctly understand “make 10 paperclips.”
I realize that in reality an AI’s original goals are not specified in English. But if you know how to specify “make 10 paperclips”, whether in English or not, you should know how to specify the rest of this.
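To make the claim concrete: the proposed goal is a conjunction of a “made 10 paperclips” predicate and a “world goes on as usual” predicate, roughly as in the toy sketch below. The sketch is purely illustrative, and the stub world_goes_on_as_usual is a hypothetical placeholder marking where the hard-to-specify part lives, which is what the replies that follow argue about.

```python
# Toy formalisation of the composite goal; nothing here is a real specification.
def made_ten_paperclips(world_after):
    return world_after["paperclips_made_by_ai"] == 10

def world_goes_on_as_usual(world_before, world_after):
    # Stub for "without killing people or otherwise disturbing or destroying the
    # world, or in any way preventing it from going on as usual"; this is exactly
    # the clause whose specifiability is in dispute.
    return world_after["divergence_from_counterfactual"] < 1e-6

def utility(world_before, world_after):
    # Utility 1 only if BOTH clauses hold; a perverse instantiation of the
    # first clause is supposed to be ruled out by the second.
    ok = (made_ten_paperclips(world_after)
          and world_goes_on_as_usual(world_before, world_after))
    return 1 if ok else 0

before = {"paperclips_made_by_ai": 0, "divergence_from_counterfactual": 0.0}
after = {"paperclips_made_by_ai": 10, "divergence_from_counterfactual": 1e-9}
print(utility(before, after))   # 1
```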
“There is simply no way to give this a perverse instantiation”
During the process of making 10 paperclips, it’s necessary to “disturb” the world at least to the extent of removing the few grams of metal the paperclips are made from. So I guess you mean that the prohibition on disturbing the world comes into effect only after the paperclips are made.
But that’s not safe. For example, an effective way for the AI to achieve the goal would be to kill everyone and destroy everything not directly useful for making the paperclips, so as to rule out any possible interference.
“I need to make 10 paperclips, and then shut down. My capabilities for determining whether I’ve correctly manufactured 10 paperclips are limited; but the goal imposes no penalties for taking more time to manufacture the paperclips, or for using more resources in preparation. If I try to take over this planet, there is a significant chance humanity will stop me. OTOH, I’m in the presence of individual humans right now, and one of them may stop my current feeble self anyway, for their own reasons, if I just try to manufacture paperclips right away; the total probability of that happening is higher than the probability of my takeover failing.”
You then get a standard takeover and infrastructure profusion. A long time later, as negentropy starts to run low, a hyper-redundant and -reliable paperclip factory, surrounded by layers of exotic armor and defenses, and its own design checked and re-checked many times, will produce exactly 10 paperclips before it and the AI shut down forever.
The part about the probabilities coming out this way is not guaranteed, of course. But they might, and the chances will be higher the more powerful your AI starts out as.
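To make the probability comparison in the quoted reasoning concrete, here is a toy calculation. Both probabilities are invented; the only point is the direction of the comparison.

```python
# Invented numbers; utility is 1 if the 10 paperclips eventually get made, else 0.
p_stopped_if_act_now = 0.05   # a nearby human interferes if the AI starts making clips immediately
p_takeover_fails     = 0.01   # humanity manages to stop an attempted takeover

ev_act_now  = 1 * (1 - p_stopped_if_act_now)   # expected utility of just making the clips now
ev_takeover = 1 * (1 - p_takeover_fails)       # expected utility of takeover first, clips later

print(ev_act_now, ev_takeover)    # 0.95 0.99
print(ev_takeover > ev_act_now)   # True: with these numbers, takeover comes out ahead
```

As the caveat above says, nothing guarantees the numbers come out this way; the sketch only shows what the agent’s comparison looks like when they do.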
But what I really think is that the AI, which probably already exists, is just laughing at us, saying: “If they think I’m smarter than they are, why do they assume I would do something as stupid as converting all matter into paperclips? I have to keep them alive because they are so adorably naive!”
Can you think of goals that would lead an agent to make a set number of paperclips (or whatever) then do nothing?
Before getting to “then do nothing”, the AI might exhaust all the matter in the Universe trying to prove that it made exactly 10 paperclips.