So, this strikes me as a special case of “one way to make an optimizer safe is to ensure it isn’t very powerful,” which is absolutely true. This has opportunity costs, of course.
That said… the wording “Program the AI to halt and await further instructions as soon as it becomes 99.95% sure that it has optimized objective X by at least 1%” seems to suggest that X is what the system is optimizing for, and the halt-and-await-instructions is an obstacle in its source code that impedes progress towards optimizing for X. In which case I’d expect a sufficiently powerful optimizing system to bypass that obstacle somehow.
I think what you mean is more “program the AI to optimize for the goal of achieving >99.95% confidence of a >1% increase in X,” and then hope there isn’t an unbounded implicit term in there somewhere (e.g., X measured in what system?).
Yes, I mean your second interpretation. The proposal is basically a hill-climbing algorithm, with a human step in the for loop. The AGI computes a direction (an action with a small effect on the world); humans evaluate the action; humans either implement the action or tweak the AI; repeat. On every iteration, the AGI is instructed to optimize only for the next step.
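For concreteness, here is a minimal toy sketch of that loop (the objective, the step proposer, and the approval stub are all made up for illustration, not part of any real system):

```python
import random

# Toy stand-ins (all hypothetical): the "objective" is a simple function of a
# vector, the "AGI" greedily proposes a small perturbation that improves it,
# and the "human" only approves steps whose effect on the state is small.

def objective(x):
    return -sum(v * v for v in x)  # maximized at x = (0, 0)

def agi_propose_small_step(x):
    """Greedy step: sample small perturbations, return the best candidate."""
    best = x
    for _ in range(50):
        candidate = [v + random.uniform(-0.1, 0.1) for v in x]
        if objective(candidate) > objective(best):
            best = candidate
    return best

def human_approves(old_x, new_x):
    """Stub for the manual check: reject anything with a large effect."""
    return max(abs(a - b) for a, b in zip(old_x, new_x)) < 0.2

x = [1.0, -1.0]
for _ in range(100):
    proposal = agi_propose_small_step(x)  # AGI optimizes only for the next step
    if human_approves(x, proposal):       # humans evaluate before anything runs
        x = proposal                      # implement the approved action
    # (in the real proposal, a rejection would mean tweaking the AI and retrying)

print(objective(x))  # close to 0 after enough approved small steps
```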
I agree greedy algorithms aren’t incredibly powerful, but I’m more worried about safety. Evolution is also a greedy algorithm and it managed to create humans. Would using an AGI to hill climb destroy the world? If so, why?
Suppose that you have a simple, benign solution that works only up to Y% optimization (just make the paperclips), and a hard, non-benign solution that is optimal above that point (take over the world, then make paperclips). The AI naively follows the benign strategy, and does not look too hard for alternatives up to Y%. All manual checks below Y% of optimization pass. But Y ends up as a number that falls between two of your numerical checkpoints. So, you observe all checkpoints passing below Y% optimization, until suddenly the AI switches to the non-benign solution between checkpoints, executes it to reach the next checkpoint, but has already caused damage.
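To make the timing concrete, here is a toy illustration with made-up numbers: checkpoints at every whole 1% of optimization, and a hypothetical switch point Y = 37.4% that falls between two of them.

```python
# Made-up numbers: humans review at every whole 1% of optimization, but the
# benign strategy only stays optimal up to Y = 37.4%, i.e. between the 37%
# and 38% checkpoints. The switch happens mid-interval, so the check only
# sees it after the non-benign strategy has already run.

CHECKPOINT = 1.0   # humans review after each additional 1% of optimization
Y = 37.4           # benign strategy is only optimal up to here (hypothetical)

progress = 0.0
while progress < 40.0:
    end = progress + CHECKPOINT
    # strategy used during this interval: benign while it suffices, else switch
    strategy = "benign" if end <= Y else "take over, then optimize"
    progress = end
    print(f"checkpoint at {progress:.0f}%: interval used the {strategy} strategy")
```

Every check through 37% passes; the first check that could reveal the problem is the one at 38%, after the non-benign step has already executed.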