Since this was originally presented as a game, I will wait two days before posting my answers, which have an md5sum of 4b059edc26cbccb3ff4afe11d6412c47.
And the text with that md5sum. (EDIT: Argh, markdown formatting messed that up. Put a second space after the period in “… utility function into a larger domain. Any information it finds...”. There should be exactly one newline after the last nonblank line, and the line feeds should be Unix-style.)
(1) When the AI finds the documentation indicating that it gets INT_MAX for doing something, it will assign that claim some probability p &lt; 1, which means that it will conclude that doing it is worth p*INT_MAX utility, not INT_MAX utility as intended. To collect the remaining (1-p)*INT_MAX utility, it will do something else, outside the box, which might be Unfriendly.
(2) It might conclude that integer overflow in its utility function is a bug, and “repair” itself by extrapolating its utility function into a larger domain. Any information it finds about integer overflows in general will support this conclusion.
(3) Since the safeguard involves a number right on the edge of integer overflow, it may interact unpredictably with other calculations, bugs, and utility function-based safeguards. For example, if it decides that the INT_MAX reward is actually noisy, and that it will actually receive INT_MAX+1 or INT_MAX-1 utility with equal probability, then the expected-value calculation involves 2*INT_MAX, which overflows to a negative number.
1 and 3 seem correct, but 2 seems strange to me. It is close to the common confusion that a paperclip maximizer will realize that its programmers didn’t really want it to maximize paperclips. Similarly, the AI shouldn’t care whether or not the integer overflow in this case is a bug.