Are there solutions to the control problem other than capability control and motivation selection?
When an agent does something, it does so because it has some goal, and has determined that the thing it does will achieve the goal. Therefore, if you want to change what an agent does, you either change the goal (motivation selection), or change its method of determining stuff (capability control)*. Alternatively, you could make something that isn’t like an agent but still has really good cognitive capabilities. Perhaps this would count as ‘capability control’ relative to what I see as the book’s implicit assumption that smart things are agents.
[*] Note that this argument allows that the desired type of capability control would be to increase capability, perhaps so that the agent realises that doing what you hope it will do is actually a great idea.
I suppose the other alternative is that you don’t change the goal in the agent, but rather change the world in a way that changes which actions achieve the goal, i.e. incentive methods.
Fear is one of the oldest driving forces for keeping away from danger. Fear is different from negative motivation: motivations and goals are attractors, while fears, bad conscience and prohibitions are repellors. Repellent drives could count as a third pillar in the solution to the control problem.
Is “transcendence” a third possibility? I mean, if we realize that human values are not the best, and we retire and resign from control.
(I am not sure whether this is distinct from the motivation selection path; the difference is subtle.)
BTW, if you are thinking about a partnership, are you thinking about how to control your partner?
In Dr Strangelove there is a doomsday machine: a tripwire that would destroy the world should the Soviet Union be attacked.
Some form of automated self-destruct mechanism, though a subset of capability control, has seldom been discussed. But see the Oracle AI paper by Stuart Armstrong for some versions of it.
Basic question: what fails with breakpoints at incremental goal boundaries? Program the AI to halt and await further instructions as soon as it becomes 99.95% sure that it has optimized objective X by at least 1%. The programmers can then see whether the optimizations are going in an undesirable direction, while limiting the damage.
I don’t see a reason for the AI to subvert subgoal break points. True, the AI will realize that it will later be requested to optimize further, and it could make its future life easier by thinking ahead. But the AI is not programmed to make its future life easier—it’s programmed to greedily optimize the next step. If optimization is relatively continuous, arbitrarily small steps can result in arbitrarily small (and safe) changes to the world.
The best answer I can think of is that optimizations are non-linear and non-continuous. It is not the case that you can always make the AI change the world less by giving it an easier goal.
Why do you think that?
Because that programming is the premise of my question. If an AI is not given any open-ended long-term goals, only small incremental ones, can it not be made arbitrarily safe?
So, this strikes me as a special case of “one way to make an optimizer safe is to ensure it isn’t very powerful,” which is absolutely true. This has opportunity costs, of course.
That said… the wording “Program the AI to halt and await further instructions as soon as it becomes 99.95% sure that it has optimized objective X by at least 1%” seems to suggest that X is what the system is optimizing for, and that the halt-and-await-instructions is an obstacle in its source code that impedes progress towards optimizing for X. In which case I’d expect a sufficiently powerful optimizing system to bypass that obstacle somehow.
I think what you mean is more “program the AI to optimize for the goal of achieving >99.95% confidence of a >1% increase in X”, and then hope there isn’t an unbounded implicit term in there somewhere (e.g., X measured in what system?).
Yes, I mean your second interpretation. The proposal is basically a hill climbing algorithm, with a human step in the for loop. The AGI computes a direction (an action with a small effect on the world); humans evaluate the action; humans either implement the action or tweak the AI; repeat. On every iteration, the AGI is instructed to optimize only for the next step.
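For concreteness, here is a minimal toy sketch of that loop, assuming a made-up one-dimensional objective and a stub standing in for the human review step. None of the names refer to a real system, and a real “direction” would be an action in the world rather than a number.

```python
# Toy sketch of the proposed loop: greedy hill climbing with a human
# checkpoint whenever the optimizer is confident it has improved the
# objective by at least 1% since the last approved checkpoint.
# The objective, the step rule, and human_review are all invented stand-ins.

def objective(x):
    # Hypothetical objective X: a simple one-dimensional function to climb.
    return 100.0 - (x - 7.0) ** 2

def propose_step(x, step=0.1):
    # Greedy local search: consider small moves, keep whichever scores best.
    return max((x + step, x - step), key=objective)

def human_review(x, value):
    # Placeholder for the human step in the loop; a real review would inspect
    # the proposed action and its effect on the world, not just a number.
    print(f"checkpoint: x = {x:.2f}, objective = {value:.2f} (approved)")
    return True

def run(x0=0.0, improvement_target=0.01, max_iters=10_000):
    x = x0
    checkpoint_value = objective(x)
    for _ in range(max_iters):
        x_next = propose_step(x)
        if objective(x_next) <= objective(x):
            break  # local maximum: greedy search has nothing left to do
        x = x_next
        # Halt and await instructions once the objective is at least 1%
        # above the value approved at the last checkpoint.
        if objective(x) >= checkpoint_value * (1 + improvement_target):
            if not human_review(x, objective(x)):
                break
            checkpoint_value = objective(x)
    return x

if __name__ == "__main__":
    print("final x:", run())
```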
I agree greedy algorithms aren’t incredibly powerful, but I’m more worried about safety. Evolution is also a greedy algorithm and it managed to create humans. Would using an AGI to hill climb destroy the world? If so, why?
Suppose that you have a simple, benign solution that works only up to Y% optimization (just make the paperclips), and a hard, non-benign solution that is optimal above that point (take over the world, then make paperclips). The AI naively follows the benign strategy, and does not look too hard for alternatives up to Y%. All manual checks below Y% of optimization pass. But Y ends up as a number that falls between two of your numerical checkpoints. So you observe every checkpoint below Y% passing, until the AI switches to the non-benign solution between checkpoints and executes it to reach the next checkpoint, by which time it has already caused damage.
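To make the timing problem concrete, here is a toy simulation with an invented crossover point and invented strategy labels: every checkpoint up to the crossover looks benign, but the switch happens between two checkpoints and runs unobserved until the next review.

```python
# Toy illustration of the failure mode above: a benign strategy is optimal up
# to some crossover point Y, a non-benign strategy is optimal beyond it, and
# Y falls between two 1% checkpoints. All numbers and labels are invented.

CROSSOVER_Y = 42.5  # benign strategy stops being optimal at 42.5% progress

def chosen_strategy(progress_pct):
    # Below the crossover the easy, benign plan makes the most progress;
    # above it only the non-benign plan keeps improving the objective.
    if progress_pct < CROSSOVER_Y:
        return "benign (just make paperclips)"
    return "non-benign (take over the world, then make paperclips)"

# Optimization proceeds in many 0.1% steps; humans review at each whole 1%.
for tenth_pct in range(10, 441):
    progress_pct = tenth_pct / 10
    strategy = chosen_strategy(progress_pct)
    if tenth_pct % 10 == 0:
        # Every checkpoint up to 42% shows the benign strategy. The switch
        # happens at 42.5%, so by the 43% checkpoint the non-benign strategy
        # has already been executing between reviews.
        print(f"checkpoint at {progress_pct:.0f}%: strategy in use = {strategy}")
```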
I see. I thought you meant take some other AI and apply breakpoints at incremental goal boundaries, and reset its goal system at that time.
I don’t think this would work. Eventually it’s going to reach a local maximum and have to sit down and think much harder to do valley-crossing (this will likely come at the end of a sequence of longer and longer optimization times spent finding this peak, so the long run-time won’t be surprising). Then it is forced to do the long-term, make-life-easier ‘evil things’.
This has some problems associated with stunting. Adding humans in the loop with this frequency of oversight will slow things down, whatever happens. The AI would also have fewer problem-solving strategies open to it—that is, if it doesn’t care about thinking ahead in harmful ways, it also won’t think ahead in useful ways.
The programmers also have to make sure that they inspect not only the output of the AI at this stage, but the strategies it is considering implementing. Otherwise, it’s possible that there is a sudden transition where one strategy only works up until a certain point, then another more general strategy takes over.