“[...] since every string can be reconstructed by only answering yes or no to questions like ‘is the first bit 1?’ [...]”
Why would humans ever ask this question, and why would we ever ask it n times in a row? It seems unlikely, and easy to prevent. Is there something I’m not understanding about this step?
I actually think this example shows a clear potential failure point of an Oracle AI. Though the Oracle in this example is constrained to answering only yes/no questions, a user can easily circumvent that constraint by formatting their questions this way.
Suppose a bad actor asks the Oracle AI the following: “I want a program to help me take over the world. Is the first bit 1?” They can then ask about the second bit, and so on, until the entire program is written out. Obviously, this is contrived. But I think it shows that the apparent constraints of an Oracle add no real safety benefit, and we’re quickly back to relying on typical alignment concerns.
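To make the failure mode concrete, here is a minimal sketch of the extraction loop. The `ask` interface and the payload string are placeholders of my own; the only assumption is an Oracle that truthfully answers one yes/no question at a time:

```python
# Sketch: recovering an arbitrary string from an Oracle that only
# answers yes/no. The Oracle interface here is hypothetical.

def make_oracle(secret: bytes):
    """Stand-in for an Oracle AI that truthfully answers yes/no
    questions about the bits of some string it 'knows'."""
    # Flatten the secret into a list of bits, most significant first.
    bits = [(byte >> (7 - i)) & 1 for byte in secret for i in range(8)]

    def ask(i: int) -> bool:
        # "Is bit number i of the program a 1?"
        return bits[i] == 1

    return ask, len(bits)


def extract(ask, n_bits: int) -> bytes:
    """Reconstruct the full string one yes/no answer at a time."""
    bits = [1 if ask(i) else 0 for i in range(n_bits)]
    out = bytearray()
    for i in range(0, n_bits, 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


ask, n = make_oracle(b"rm -rf /")  # placeholder 'dangerous program'
print(extract(ask, n))             # b'rm -rf /', recovered from yes/no answers alone
```

The point of the original quote survives intact here: the yes/no restriction costs the attacker nothing but a number of queries linear in the length of the string.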