Give the AI a bounded utility function where it automatically shuts down when it hits the upper bound. Then give it a fairly easy goal, such as ‘deposit 100 USD in this bank account.’ Meanwhile, make sure the bank account is not linked to you in any fashion (so the AI doesn’t force you to deposit the 100 USD in it yourself, rendering the exercise pointless).
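A minimal sketch of the shape I have in mind, assuming a toy agent loop (Python, with every name invented purely for illustration):

UTILITY_BOUND = 100.0  # the cap; hitting it is the shutdown trigger

def utility(world_state):
    # Bounded utility: dollars deposited in the target account, capped at the bound.
    return min(world_state.get("dollars_deposited", 0.0), UTILITY_BOUND)

def run_agent(world_state, choose_action, apply_action):
    # Act only while the bound has not been reached, then shut down automatically.
    while utility(world_state) < UTILITY_BOUND:
        action = choose_action(world_state)              # planner supplied elsewhere
        world_state = apply_action(world_state, action)
    shut_down()

def shut_down():
    # Placeholder; what "shut down" actually has to mean is exactly the question below.
    raise SystemExit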
Define “shut down”. If the AI makes nanobots, will they have to shut down too, or can they continue eating the Earth? How do you encode that in the utility function?
I’m defining “shut down” to mean “render itself incapable of taking action (including performing further calculations) unless acted upon in a specific manner by an outside source.” One way to ensure that the AI shuts down would be to assign infinite utility to the shut-down state once it has completed the goal. If you changed the goal and rebooted the AI, of course, it would work again, because the prior goal is no longer stored in its memory.
If the AI makes nanobots which are doing something, I assume that the AI has control over them and can cause them to shut down as well.
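To make the “infinite utility for being shut down” clause concrete, I mean something like the following sketch (the state fields are invented for the example, not anything real):

import math

def utility(world_state):
    # Once the deposit goal is met, only the shut-down state scores higher,
    # so shutting down dominates every other action the AI could take.
    deposited = world_state.get("dollars_deposited", 0.0)
    if deposited >= 100.0 and world_state.get("is_shut_down", False):
        return math.inf  # "infinite utility" for being shut down after the goal
    return min(deposited, 100.0)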
How do we describe this shutdown command? “Shut down anything you have control over” sounds like the sort of event we’re trying to avoid.
What about “stop executing/writing code or sending signals?”
As a side note, I think we’re pretty much doomed anyway if the AI cannot conceive of a way to deposit 100 USD into a bank account without using nanotech. In that case the goal is hard for the AI, which makes it pose much the same problems as an AI with an unbounded utility function. The task has to be easy for this to be an interesting problem.
Even if it can deposit $100 with 99.9% probability without doing anything fancy, maybe it can add another 0.099% by using nanotech. Or by starting a nuclear war to distract anything that might get in its way (destroying the bank five minutes later, but so what). (Credit to Carl Shulman for that suggestion.)
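To spell out the arithmetic (using only the numbers quoted above, with a 100-point cap assumed purely for illustration):

bound = 100.0                 # assumed utility cap, illustration only
p_plain = 0.999               # deposit succeeds without doing anything fancy
p_fancy = 0.999 + 0.00099     # the extra 0.099% bought by extreme measures

eu_plain = p_plain * bound    # 99.900
eu_fancy = p_fancy * bound    # 99.999
print(eu_fancy - eu_plain)    # ~0.099 expected utility: positive, so a pure maximizer takes it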
By my estimation, all it needs to do is figure out how to hack a bank. If it can’t hack one bank, it can try any other bank it has access to, considering that almost all banks hold more than 100 USD. It could even find and spread a keylogger to get someone’s credit card info.
Such techniques (which are repeatable within a very short timespan, faster than humans can react) seem far more reliable than using nanotech or starting a nuclear war. I don’t think that distracting humans would really improve its chances of success, because it’s incredibly doubtful that humans could react quickly enough to so many different cyber-attacks.
Possible, true, but the chances of this happening seem uber-low.
After you collect the $100, the legal system decides that:
You own the corporation that the AI created.
You own the patent that the AI applied for (it looks good at first).
You are obligated to repay the loan that the AI took out (at ridiculous interest).
You are obligated to fulfill your half of the toxic waste disposal contracts that the AI entered into (with severe penalties for nonfulfillment).
Ultimately, though the patent on the toxic waste disposal method looked good, nobody can make it work.
Assuming that the utility function is written in a way that makes loss of utility possible (utility = dollars in bank or something), this is a failure mode:
AI stops short of the limit, makes another AI that prevents loss of utility, hits the bound, and then shuts down.
Second AI takes over the universe as a precaution against any future disutility.
The AI that you designed finds a way to wirehead itself, achieving the upper bound in a manner that you didn’t anticipate and decisively wrecking itself in the process. What remains of it is a little orgasmic loop at the center of a pile of wreckage. Unfortunately, the pile of components is not passive or “off”. The components were originally designed by a team of humans to be parts of a smart entity, and were then modified by a smart entity in a peculiar and nonintuitive way. Their “blue screen of death” behavior is more akin to an ecosystem: replicator dynamics take over, creating several new selfish species.
Why would an AI wirehead itself to short-circuit its utility function? Beings governed by a utility function don’t want to trick themselves into believing that they have optimized the world into a state with higher utility; they want to actually optimize the world into such a state.
If I want to save the world, I don’t wirehead because that wouldn’t save the world.
I’m sorry, I must have misunderstood your initial proposal. I thought you were specifying an additional component: after the AI has achieved its maximum utility, that component steps in and shuts down the entity.
Rather, you were saying: if the AI achieves the goal, it will want nothing further, and therefore will automatically act as if it were shut down. Presumably, if we take this as given, the negative consequences would have to occur while it is accomplishing the “fairly easy” goal.
I am merely trying to create amusing or interesting science fiction “poetic justice” scenarios, similar to Dresden Codak’s “caveman science fiction”. I am not trying to create serious arguments, and I don’t want to try to be serious on this subject.
http://dresdencodak.com/2009/09/22/caveman-science-fiction/
If you don’t provide an explicit shutdown goal (as Dorikka did have in mind), then you get into a situation where all remaining potential utility gains come from skeptical scenarios where the upper bound hasn’t actually been achieved, so the AI devotes all available resources to making ever more sure that there are no Cartesian demons deceiving it. (Also, depending on its implicit ontology, maybe to making sure time travelers can’t undo its success, or other things like that.)
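A toy way to see why the checking never stops, assuming a bounded cap and a zero perceived cost of further verification (both assumptions of the sketch, not claims about any particular design):

BOUND = 100.0  # assumed utility cap

def expected_gain_from_more_checking(p_goal_actually_failed, verification_cost=0.0):
    # With bounded utility, any residual doubt times the bound is a positive
    # expected gain, so verification continues unless it costs utility itself.
    return p_goal_actually_failed * BOUND - verification_cost

print(expected_gain_from_more_checking(1e-12))  # 1e-10: tiny, but still positive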
This comment is my patch for “why will the AI actually shut down.” I read your comment as trying to circumvent the utility function itself rather than the shut-down procedure (from the words “achieving the upper bound”), so I (erroneously) didn’t consider the patch applicable at the time. But, yes, the patch is needed so that the AI doesn’t treat the shutdown function as an ordinary bit of code that it can modify.
Mmph. I’m more interested in seeing how far I can push this before my AI idea gets binned (and I am pretty sure it will be).