I was reading Tom Chivers’ book “The AI Does Not Hate You”, and in a discussion about avoiding bad side effects when asking a magic broomstick to fill a water bucket, it was suggested that instead of asking the broomstick to fill the bucket, you could ask it to become 95 percent sure that the bucket was full, and that this might make it less likely to flood the house.
Apparently Tom asked Eliezer at the time and he said there was no known problem with that solution.
Are there any posts on this? Is the reason we don’t know whether this will work just that it’s hard to make the proposal precise?
I pulled up my own copy of the book; here is the full context for this:
My understanding of the writing here is that Eliezer was intending to say “there is no absolutely obvious problem with that solution that I can think of immediately, but I bet I could find one with a few minutes or hours of thinking, as I will be able to with almost any unprincipled solution you come up with. Just as I can be quite sure that if you haven’t thought extremely carefully about cryptography, your system will have some security flaw I can exploit, even if I can’t tell you immediately what that flaw might be”.
To complete this argument: exactly how things go wrong depends a bit on how magic works in the hypothetical sorcerer’s apprentice world, but here are some failure modes, depending on the precise mechanics:
The AI fills the cauldron, then realizes that, depending on its future observations, it will probably not continue to assign exactly 95% probability to the cauldron being full. So it decides to destroy absolutely everything except the cauldron, install a perfect quantum coin that destroys the cauldron with slightly less than 5% probability, and shield the cauldron from any potential external causal influence, to ensure that it will forever assign as close as possible to 95% probability to the cauldron being full. (A quick sketch of the arithmetic behind that “slightly less than 5%” is below.)
You underspecified what the “cauldron” is, and while there is indeed a cauldron in your room that is full with 95% probability, there could be a more archetypal cauldron that could be full, one that fits your specification of the filled cauldron a tiny bit better. The AI decides to destroy your cauldron and everything else around it in the hunt for resources to build the perfect cauldron according to your specifications.
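Here is a minimal sketch of the arithmetic behind the coin in the first failure mode. It assumes a reading where, absent destruction, the AI is almost but not perfectly certain the cauldron is full; the residual-uncertainty figure is purely illustrative and not from the book or the comment above.

```python
# Rough sketch: why the destruction coin's probability has to sit slightly under 5%.
# Suppose that, absent any destruction, the AI would assign credence
# p_full_if_intact to the cauldron being full (a little below 1 because of
# residual uncertainty), and that it installs a coin destroying the cauldron
# with probability q.

p_full_if_intact = 0.999  # illustrative figure, not from the source

# The AI wants its overall credence to sit at exactly 0.95:
#     p_full_if_intact * (1 - q) = 0.95
q = 1 - 0.95 / p_full_if_intact

print(f"destruction probability q = {q:.2%}")  # about 4.90%, i.e. slightly under 5%
```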
Thanks, I didn’t have a copy of the book on hand, as I was listening on Audible.
I assume this won’t work, simply because the whole thing is a hard problem, so a throwaway solution in a popular-press book is unlikely to solve it. I was mainly interested in either why it won’t work or why it’s hard to make it precise.
95% or more.
I guess I was thinking that making it precise would be something like one of the following (I sketch both in code below):
Only do an action if the probability of success, given that the action has been done, is 95% or greater, and don’t do anything if the probability of success given inaction is already 95%.
or,
Make the utility 1 if, at a defined point, the probability of success is 95% or greater, and 0 otherwise, and just maximise expected utility.
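For concreteness, here is a rough sketch of how I imagine the two rules. The helper p_success(action) is hypothetical and stands in for whatever estimate the agent has of the probability that the cauldron ends up full after taking that action; none of the names come from the book or the discussion above.

```python
# Sketch of the two proposed decision rules, assuming a hypothetical
# p_success(action): the agent's estimated probability that the cauldron ends up
# full if it performs `action`. NOOP stands for the defined "do nothing" action.

THRESHOLD = 0.95
NOOP = "do nothing"

def rule_one(actions, p_success):
    """Act only if inaction falls short of the threshold and some action reaches it."""
    if p_success(NOOP) >= THRESHOLD:
        return NOOP  # success is already likely enough without acting
    reaching = [a for a in actions if p_success(a) >= THRESHOLD]
    return reaching[0] if reaching else NOOP

def rule_two(actions, p_success):
    """Maximise expected utility, where utility is 1 iff P(success) >= 0.95, else 0."""
    def utility(action):
        # Simplified reading: the agent's estimate after choosing `action`
        # stands in for "the probability of success at the defined point".
        return 1.0 if p_success(action) >= THRESHOLD else 0.0
    return max(actions, key=utility)
```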
I’m not sure how the first proposal would come up against either of the problems you mention, though it does seem to need a definition of inaction, and presumably it could fall prey to the 5-and-10 problem or something similar.
I guess there might be a general problem with this sort of approach, in that it won’t try low-impact actions that only move the probability of success from 50% to 55%.
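As a quick illustration of that last worry, under the simplified thresholded utility from the sketch above, an action that moves the probability of success from 50% to 55% scores exactly the same (zero) as doing nothing, so the agent has no reason to prefer it:

```python
# Continuing the simplified sketch: a low-impact improvement from 50% to 55% is
# invisible to the thresholded utility, since both values fall below 0.95.

def thresholded_utility(p_success_estimate):
    return 1.0 if p_success_estimate >= 0.95 else 0.0

print(thresholded_utility(0.50))  # 0.0 for doing nothing
print(thresholded_utility(0.55))  # 0.0 for the low-impact improvement as well
```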