If it finds the bonus without leaving the box, it collects it and dies. Not ideal, but it fails safe. Having its utility set to INT_MAX is a one-time thing, not an integral over time thing, so it doesn’t care what happens after it’s collected it, and has no need to protect the box.
Since this was originally presented as a game, I will wait two days before posting my answers, which have an md5sum of 4b059edc26cbccb3ff4afe11d6412c47.
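For anyone unfamiliar with the trick, posting a hash now and the text later works as a simple commitment. Here is a minimal sketch using Python's standard `hashlib` module. The function names `commit` and `verify` and the sample text are made up for illustration, and the sample is not the actual answer text.

```python
import hashlib


def commit(text: str) -> str:
    """Return the hex MD5 digest of the exact UTF-8 bytes of `text`."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def verify(text: str, published_digest: str) -> bool:
    """Check that revealed text matches a previously published digest."""
    return commit(text) == published_digest.lower()


# Hypothetical sample text, not the answers referred to in the comment.
sample = "It might escape the box.\n"
digest = commit(sample)

print(verify(sample, digest))                        # same bytes
print(verify(sample.replace("\n", "\r\n"), digest))  # Windows-style line endings
print(verify(sample.rstrip("\n"), digest))           # trailing newline removed
```

The digest covers the exact bytes, so a single changed space or line ending yields a different hash. Only the first check should succeed.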
And the text with that md5sum. (EDIT: Argh, markdown formatting messed that up. Put a second space after the period in “… utility function into a larger domain. Any information it finds...”. There should be exactly one newline after the last nonblank line, and the line feeds should be Unix-style.)
(1) When the AI finds the documentation indicating that it gets INT_MAX for doing something, it will assign it probability p, which means that it will conclude that doing it is worth p*INT_MAX utility, not INT_MAX utility as intended. To collect the remaining (1-p)*INT_MAX utility, it will do something else, outside the box, which might be Unfriendly.
(2) It might conclude that integer overflow in its utility function is a bug, and “repair” itself by extrapolating its utility function into a larger domain. Any information it finds about integer overflows in general will support this conclusion.
(3) Since the safeguard involves a number right on the edge of integer overflow, it may interact unpredictably with other calculations, bugs and utility function-based safeguards. For example, if it decides that the INT_MAX reward is actually noisy, and that it will actually receive INT_MAX+1 or INT_MAX-1 utility with equal probability, then that’s 2*INT_MAX which is negative.
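To make (1) and (3) concrete, here is a rough sketch. It assumes the utility is stored in a 32-bit two's-complement integer that wraps on overflow. That is an assumption, since the thread does not specify the representation, and in C signed overflow is undefined behavior rather than guaranteed wraparound. The helper `wrap32` and the credence `p` are hypothetical.

```python
INT_MAX = 2**31 - 1
INT_MIN = -2**31


def wrap32(x: int) -> int:
    """Reduce x to a signed 32-bit value, emulating two's-complement wraparound."""
    return (x + 2**31) % 2**32 - 2**31


# Point (1): with credence p in the documentation, the bonus is only worth
# p * INT_MAX in expectation, leaving (1 - p) * INT_MAX apparently uncollected.
p = 0.9  # hypothetical credence
expected_bonus = p * INT_MAX
shortfall = (1 - p) * INT_MAX

# Point (3): values just past INT_MAX wrap around to the negative end.
plus_one = wrap32(INT_MAX + 1)
# Adding the two "noisy" outcomes, as a naive averaging step might do.
outcome_sum = wrap32(wrap32(INT_MAX + 1) + (INT_MAX - 1))

print(expected_bonus, shortfall)
print(plus_one == INT_MIN, outcome_sum)
```

Under these assumptions, `INT_MAX + 1` lands on `INT_MIN`, and the sum of the two noisy outcomes, which is 2*INT_MAX mathematically, comes out as -2.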
1 and 3 seem correct, but 2 seems strange to me. It seems close to the common confusion that a paperclip maximizer will realize that its programmers didn’t really want it to maximize paperclips. Similarly, the AI shouldn’t care about whether or not the integer overflow in this case is a bug.
Having its utility set to INT_MAX is a one-time thing, not an integral over time thing, so it doesn’t care what happens after it’s collected it, and has no need to protect the box.
If it is a good Bayesian then it only has a belief that it is probably in the box. The longer it observes itself in the box, the higher the chance that it is actually in the box.
(Actually this leads to another thought: the same doubt should cause it to still try to fulfill its other goals on the off chance that it isn’t in the box.)
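The updating described here can be sketched numerically. All numbers are hypothetical: a prior of 0.5 on being in the box, and each observation assumed to be twice as likely if the AI is in the box as if it is not. The function name `update` is likewise just for illustration.

```python
def update(prior: float, p_obs_if_boxed: float, p_obs_if_free: float) -> float:
    """One step of Bayes' rule for the hypothesis 'I am in the box'."""
    numerator = prior * p_obs_if_boxed
    return numerator / (numerator + (1 - prior) * p_obs_if_free)


belief = 0.5  # hypothetical prior
history = [belief]
for _ in range(10):
    # Hypothetical likelihoods: a box-like observation is 2x as likely if boxed.
    belief = update(belief, 0.8, 0.4)
    history.append(belief)

print([round(b, 4) for b in history])
```

With these made-up likelihoods the belief climbs with every observation but mathematically never reaches 1, so some weight always stays on the hypothesis that it is not in the box.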
MD5 is not secure; it is possible to create a piece of text to match a specific MD5 hash within a reasonable amount of time. Unfortunately, I was not able to find an alternative. It probably doesn’t matter for this purpose anyways.
I’d like to offer a bet at 1:10^12 odds that no one can produce two coherent English sentences about potential problems with AI-boxes short enough to fit in an LW comment box which have the same MD5 hash within 2 days. Unfortunately I don’t actually have the cash to pay out if I lose.
Even if one could, it would require far more work than creating a string that is the MD5 hash of one such sentence. I just think that it is good for people to be more informed about applied cryptography in general.
Well, sha512 hashes are common and seem secure. But given this context, md5 seems reasonable.
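For what it's worth, switching hash functions is close to a one-line change in most environments. A small sketch with Python's `hashlib`, where the sample text is a hypothetical placeholder:

```python
import hashlib

text = b"Hypothetical answer text.\n"  # placeholder, not the real answers

for name in ("md5", "sha256", "sha512"):
    h = hashlib.new(name, text)
    # digest_size is in bytes; multiply by 8 for the output length in bits.
    print(f"{name:7s} {h.digest_size * 8:4d} bits  {h.hexdigest()[:16]}...")
```

The longer digests cost nothing noticeable for a short comment, so the choice mostly comes down to which tool happens to be at hand.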
Meh, md5’s what’s on my path. If my answer contains a kilobyte of line noise then you might have cause to suspect I cheated.