I actually think this example shows a clear potential failure point of an Oracle AI. Though the Oracle in this example is constrained to answering only yes/no questions, a user can easily circumvent that restriction by reformatting their request as a series of such questions.
Suppose a bad actor asks the Oracle AI the following: “I want a program to help me take over the world. Is the first bit 1?” They can then ask for the next bit, and the next, until the entire program is written out. Obviously, this is contrived. But I think it shows that the apparent constraints of an Oracle add no real safety benefit, and we're quickly thrown back on the usual alignment concerns.
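To make the bit-extraction loop concrete, here is a minimal sketch. The `ask_oracle` function is a hypothetical stand-in for the yes/no Oracle interface (it just reads bits from a fixed payload), but the extraction logic is the same whether the answers come from a toy function or a real Oracle.

```python
# Hypothetical payload standing in for whatever program the Oracle would produce.
PAYLOAD = b"print('hello')"

def ask_oracle(bit_index: int) -> bool:
    """Pretend Oracle: answers 'is bit i of the payload equal to 1?' (MSB first)."""
    byte = PAYLOAD[bit_index // 8]
    return bool((byte >> (7 - bit_index % 8)) & 1)

def extract_program(num_bits: int) -> bytes:
    """Recover an arbitrary bitstring one yes/no answer at a time."""
    bits = [ask_oracle(i) for i in range(num_bits)]
    out = bytearray()
    for i in range(0, num_bits, 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | int(b)
        out.append(byte)
    return bytes(out)

if __name__ == "__main__":
    recovered = extract_program(len(PAYLOAD) * 8)
    assert recovered == PAYLOAD
    print(recovered.decode())  # the full program, rebuilt from yes/no answers only
```

The point is just that yes/no answers form a perfectly good binary channel: the question format limits bandwidth, not what can ultimately be communicated.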