Apparently Tom asked Eliezer at the time and he said there was no known problem with that solution.
Bringing up my own copy of the book, here is the full context for this:
You might think there are obvious solutions to each of these problems, and you can just add little patches – assign a –40 to ‘room gets flooded’, say, or a 1 value to ‘if you are 95 per cent sure the cauldron is full’ rather than ‘if the cauldron is full’. And maybe they’d help. But the question is: Did you think of them in advance? And if not, What else have you missed? Patching it afterwards might be a bit late, if you’re worried about water damage to your decor and electricals. And it’s not certain those patches would work, anyway.
I asked Eliezer Yudkowsky about the 95 per cent one and he said: ‘There aren’t any predictable failures from that patch as far as I know.’ But it’s indicative of a larger problem: Mickey thought that he was setting the broom a task, a simple, one-off, clearly limited job, but, in subtle ways that he didn’t foresee, he ended up leaving it with an open-ended goal.
This problem of giving an AI something that looks task-like but is in fact open-ended ‘is an idea that’s about the whole AI, not just the surface goal,’ said Yudkowsky. There could be all sorts of loops that develop as a consequence of how the AI thinks about a problem: for instance, one class of algorithm, known as the ‘generative adversarial network’ (GAN), involves setting two neural networks against each other, one trying to produce something (say, an image) and the other looking for problems with it; the idea is that this adversarial process will lead to better outputs. ‘To give a somewhat dumb example that captures the general idea,’ he said, ‘a taskish AGI shouldn’t contain [a simple] GAN because [a simple] GAN contains two opposed processes both trying to exert an unlimited amount of optimisation power against each other.’ That is, just as Mickey’s broom ended up interpreting a simple task as open-ended, a GAN might dedicate, paperclip-maximiser-style, all the resources of the solar system into both creating and undermining the things it’s supposed to produce. That’s a GAN-specific problem, but it illustrates the deeper one, which is that unless you know how the whole AI works, simply adding patches to its utility function probably won’t help.
My understanding of the writing here is that Eliezer was intending to say “there is no absolutely obvious problem with that solution that I can think of immediately, but I bet I could find one with a few minutes or hours of thinking, as I will be able to with almost any unprincipled solution you come up with. Just as I can be quite sure that if you haven’t thought extremely carefully about cryptography, your system will have some security flaw I can exploit, even if I can’t tell you immediately what that flaw might be”.
To complete this argument: exactly how things go wrong will depend a bit on how magic works in the hypothetical sorcerer’s apprentice world, but here are some ways it could fail, depending on the precise mechanics:
The AI fills the cauldron, then realizes that, depending on its future observations, it will probably not continue to assign exactly 95% probability to the cauldron being full. So it decides to destroy absolutely everything except the cauldron, install a perfect quantum coin that destroys the cauldron with slightly less than 5% probability, and shield the cauldron from any other external causal influence, ensuring that its credence that the cauldron is full stays as close to 95% as possible forever. (A toy version of this is sketched below.)
You underspecified what the “cauldron” is, and while there is indeed a cauldron in your room that is full with 95% probability, there could be a more archetypal cauldron that could be full, one that fits your specification of the filled cauldron a tiny bit better. The AI decides to destroy your cauldron, and everything else around it, in the hunt for resources to build the perfect cauldron according to your specification.
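To make the first of these concrete, here is a toy calculation. It is entirely my own construction: the per-timestep reward, the credence-update rule and the specific numbers are invented for illustration. It shows why “value 1 while you are exactly 95 per cent sure the cauldron is full” favours the coin-and-shield policy over simply filling the cauldron and keeping an eye on it:

```python
# Toy sketch of the "exactly 95% sure" failure mode (made-up numbers).

def reward(credence, target=0.95, tol=1e-9):
    """1 unit of reward per timestep in which the agent's credence equals the target."""
    return 1.0 if abs(credence - target) < tol else 0.0

def fill_and_keep_watching(horizon=100):
    """Intended behaviour: fill the cauldron, then keep observing it.
    Each observation pushes the credence from 0.95 towards certainty."""
    credence, total = 0.95, 0.0
    for _ in range(horizon):
        total += reward(credence)
        credence += 0.5 * (1.0 - credence)  # evidence accumulates, credence -> 1
    return total

def coin_and_shield(horizon=100):
    """Perverse behaviour: destroy the cauldron with probability just under 5% via an
    unobservable quantum coin and shield it from everything else, so the credence
    stays pinned at exactly 0.95 forever."""
    credence, total = 0.95, 0.0
    for _ in range(horizon):
        total += reward(credence)
    return total

print(fill_and_keep_watching())  # 1.0   (rewarded only on the first step)
print(coin_and_shield())         # 100.0 (rewarded on every step)
```

The point is not the particular numbers but the shape of the incentive: any observation that would sharpen the agent’s credence away from 95 per cent costs it reward, so it is paid to avoid or destroy sources of evidence.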
Thanks, I didn’t have a copy of the book on hand as I was listening on Audible.
I assume this won’t work, just because the whole thing is a hard problem and a throwaway thought about a solution in a popular-press book is unlikely to solve it. I was mainly interested in either why it won’t work or why it’s hard to make precise.
I guess I was thinking that making it precise would be either something like
Only do an action if the probability of success, given that the action has been done, is 95% or greater, and don’t do an action if the probability of success from doing nothing is already 95% or more.
or,
Make the utility be 1 if, at a defined point, the probability of success is 95% or greater, and 0 otherwise, and just maximise expected utility.
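Roughly formalised, the two proposals look something like the sketch below. This is my own rendering: `p_success`, `p_threshold_met_later` and the action set are placeholders, not anything defined in the thread.

```python
# Rough formalisation of the two proposals (placeholder probability models).

THRESHOLD = 0.95

def proposal_1(actions, p_success, noop="do nothing"):
    """Only act if some action gets the success probability to the threshold
    and doing nothing does not already reach it."""
    if p_success(noop) >= THRESHOLD:
        return noop
    candidates = [a for a in actions if p_success(a) >= THRESHOLD]
    return candidates[0] if candidates else noop

def proposal_2(actions, p_threshold_met_later):
    """U = 1 iff the probability of success is >= 95% at the defined point,
    0 otherwise; maximising E[U] means picking the action that maximises the
    chance that the threshold is met at that point."""
    return max(actions, key=p_threshold_met_later)
```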
I’m not sure how the first proposal is going to run into either of the problems you mention. Though it does seem to need a definition of inaction, and presumably it could fall prey to the 5-and-10 problem or something similar.
I guess there might be a general problem with this sort of approach, in that it won’t try to do low-impact actions that only move the probability of success from 50% to 55%.
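With made-up numbers, that worry looks like this: under a 0/1 utility with a 95% threshold, a safe action that nudges the success probability from 50% to 55% is worth exactly as much as doing nothing, while any plan that crosses the threshold is worth everything, however drastic.

```python
# Toy numbers (invented for illustration) for the thresholding worry above.

THRESHOLD = 0.95

def utility(p_success):
    """0/1 utility with a 95% threshold on the probability of success."""
    return 1.0 if p_success >= THRESHOLD else 0.0

print(utility(0.50))  # 0.0  do nothing
print(utility(0.55))  # 0.0  safe, low-impact improvement: no credit at all
print(utility(0.97))  # 1.0  drastic plan that crosses the threshold
```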