Fix a measure over strings, such as one that distributes mass geometrically over string lengths, then uniformly across strings of the same length.
Let L be a set of strings for which we assume that a random output from any 1/c portion of L (by measure) is safe. Any safe low-bandwidth Oracle query trivially fits this with L={0,1}^bandwidth and c=2^bandwidth, since a 1/c portion is then a single string, but this also lets us examine certificates for any low-bandwidth Oracle query like “Is this theorem provable?” or “Does there exist an AI with this safety property?”.
Then we make the Oracle prove its own safety given the above assumption, and sample an answer.
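To make the setup concrete, here is a minimal Python sketch of the sampling step on a finite L. It only checks the measure condition (the chosen subset lies in L and carries at least 1/c of L's mass) and then samples in proportion to the measure; the safety proof itself is not modelled. The names `length_then_uniform_mass`, `admissible` and `sample_answer` are hypothetical, and the measure is the "half the remaining mass per length, uniform within a length" example, which is an assumption rather than the only possible choice.

```python
import random
from fractions import Fraction


def length_then_uniform_mass(s: str) -> Fraction:
    """Assumed measure: length n gets mass 2^-(n+1), split evenly over the 2^n bit strings."""
    n = len(s)
    return Fraction(1, 2 ** (n + 1)) / (2 ** n)


def admissible(subset, L, c, mass=length_then_uniform_mass) -> bool:
    """True if subset lies inside L and carries at least 1/c of L's total mass."""
    L = set(L)
    subset = set(subset)
    if not subset <= L:
        return False
    return sum(map(mass, subset)) * c >= sum(map(mass, L))


def sample_answer(subset, L, c, rng, mass=length_then_uniform_mass) -> str:
    """Sample one answer from the Oracle's subset, weighted by the measure."""
    if not admissible(subset, L, c, mass):
        raise ValueError("subset is not a 1/c portion of L")
    items = sorted(set(subset))
    weights = [float(mass(s)) for s in items]
    return rng.choices(items, weights=weights, k=1)[0]


# The trivial low-bandwidth case: L = {0,1}^3 and c = 2^3.
L = [format(i, "03b") for i in range(8)]
answer = sample_answer(["101"], L, 8, random.Random(0))
```

With c = 2^bandwidth the Oracle may submit a single string, so this reduces to an ordinary low-bandwidth answer; a smaller c forces it to spread mass over more strings.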
Edit: Equivalently, and perhaps easier on our intuition, make the Oracle produce a program, make the Oracle prove the program’s safety, and sample from the program’s outputs. Any feedback or questions?
Edit: This generalizes my elsethread idea of minimizing proof length to make the answer unique. If the measure assigns half the remainder to each string length and L is all stubs starting with some prefix, c=2 will let it submit just the prefix.
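The c=2 claim can be checked by direct arithmetic. The sketch below assumes bit strings, reads "stubs" as strings, and takes "half the remainder to each string length" to mean that length n gets mass 2^-(n+1); the function names are hypothetical. Under those assumptions a prefix of length k has mass 2^-(2k+1), the set L of all its extensions has mass 2^-(2k), and so the prefix alone is exactly half of L.

```python
from fractions import Fraction


def string_mass(s: str) -> Fraction:
    """Length n gets mass 2^-(n+1), split evenly over the 2^n bit strings of that length."""
    n = len(s)
    return Fraction(1, 2 ** (n + 1)) / (2 ** n)


def prefix_set_mass(prefix: str, max_len: int) -> Fraction:
    """Mass of all extensions of prefix up to length max_len (a truncated sum)."""
    k = len(prefix)
    total = Fraction(0)
    for n in range(k, max_len + 1):
        extensions = 2 ** (n - k)  # number of length-n strings starting with prefix
        total += extensions * Fraction(1, 2 ** (n + 1)) / (2 ** n)
    return total


def prefix_set_mass_limit(prefix: str) -> Fraction:
    """Closed form of the infinite sum: 2^-(2k) for a prefix of length k."""
    return Fraction(1, 4 ** len(prefix))


prefix = "101"
share = string_mass(prefix) / prefix_set_mass_limit(prefix)  # prefix's share of L's mass
gap = prefix_set_mass_limit(prefix) - prefix_set_mass(prefix, 20)  # truncation error
```

The truncated sum is included only as a sanity check on the closed form; the remaining gap shrinks by half with each extra length.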
We essentially assume that the safety assumption behind quantilization is strong enough to hold up to adversarial optimization over preference orderings. At the same time, this seems to give the Oracle as much slack/power as possible while preserving our “bits of optimization pressure” proxy of its power.
… you’re going to just keep watching how far I can develop this model without feedback, aren’t you? :(
Edit: One example of an L where we worry that randomization won’t land us at a safe element, even though concerted human effort might, is the set of programs that predict a sequence well.
By the Curry-Howard correspondence, the set of proofs for some theorems might also be malevolent!
Can you develop this model in more detail?