Submission to reduce the bandwidth of an oracle query whose set of possible answers we can mathematically define (though perhaps not check membership in).
Make the Oracle prove (to a trusted, dumb proof-checker program) that its answer distribution is upper bounded by the default prior over strings, restricted to the set of possible answers and renormalized to total mass c for some constant c. Then let it answer.
c=1 only works if the Oracle has fully mastered the set of answers, and lets it exert only one bit of optimization pressure: deciding whether it wants us to receive an answer at all. Each doubling of c makes its job easier, but gives it an extra bit of choice if it does answer.
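A minimal sketch of the checker's constraint, assuming a finite answer set and an explicitly enumerated distribution; the name check_answer_distribution is hypothetical, and the real checker would verify a proof supplied by the Oracle rather than enumerate anything:

```python
import math

def check_answer_distribution(answer_dist, prior, answer_set, c):
    """Verify that a proposed answer distribution is pointwise upper bounded by
    c times the prior restricted to the answer set and renormalized.
    (In the protocol itself this bound would be established by a proof the
    Oracle hands to a dumb checker, not by enumerating the distribution.)"""
    restricted_mass = sum(prior[s] for s in answer_set)
    for s, p in answer_dist.items():
        if s not in answer_set:
            return False          # mass outside the allowed answer set
        if p > c * prior[s] / restricted_mass + 1e-12:
            return False          # exceeds the scaled, renormalized prior
    return math.isclose(sum(answer_dist.values()), 1.0)

# With c = 1 the only admissible distribution is the restricted prior itself,
# so the Oracle's only freedom is whether to answer at all; each doubling of c
# adds roughly one bit of choice over which answers receive its mass.
```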
If the Oracle can cooperate with itself across episodes, running this repeatedly with escalating c until it works is of course disastrous, as it uses each bit of optimization pressure directly to make us give it more. Even if we abort after the first try in order to build a better system, the Oracle may acausally cooperate with whatever AI conquers the world because we couldn't make it answer; but that outcome is hardly worse than not having run the Oracle at all.
Can you develop this model in more detail?
Fix a measure over strings, such as one that distributes mass geometrically over string lengths, then uniformly across strings of the same length.
Let L be a set of strings for which we assume that a random output drawn from any subset carrying at least a 1/c fraction of L's measure is safe. Any safe low-bandwidth Oracle query trivially works for this with L={0,1}^bandwidth and c=2^bandwidth, but this also lets us examine certificates for any low-bandwidth Oracle query like “Is this theorem provable?” or “Does there exist an AI with this safety property?”.
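A toy sketch of this setup, under the assumption that L is finite so the restriction can be enumerated; prior_mass and restricted_prior are hypothetical helpers for illustration, not part of any proposed implementation:

```python
import itertools

def prior_mass(s):
    """Measure from above: length n gets total mass 2^-(n+1) (half of the
    remaining mass per length), split uniformly over the 2^n strings of that
    length, so each string of length n has mass 2^-(2n+1)."""
    n = len(s)
    return 2 ** -(2 * n + 1)

def restricted_prior(L):
    """Restrict the prior to an answer set L and renormalize."""
    total = sum(prior_mass(s) for s in L)
    return {s: prior_mass(s) / total for s in L}

# Trivial instance: L = {0,1}^3 with c = 2^3. The restricted prior is uniform
# on L, so a bound of c times it lets the Oracle put all of its mass on any
# single 3-bit string -- an ordinary 3-bit low-bandwidth query.
L = ["".join(bits) for bits in itertools.product("01", repeat=3)]
print(restricted_prior(L))  # each of the 8 strings gets mass 1/8
```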
Then we make the Oracle prove its own safety given the above assumption, and sample an answer.
Edit: Equivalently, and perhaps easier on our intuition, make the Oracle produce a program, make the Oracle prove the program’s safety, and sample from the program’s outputs. Any feedback or questions?
Edit: This generalizes my elsethread idea of minimizing proof length to make the answer unique. If the measure assigns half of the remaining mass to each successive string length and L is the set of all strings starting with some prefix, c=2 will let it submit just the prefix.
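A worked check of that claim (a sketch, assuming the measure above where a string of length n has mass 2^{-(2n+1)}, and taking L to be the prefix p together with all of its extensions):

```latex
\mu(L) = \sum_{n \ge |p|} 2^{\,n-|p|} \cdot 2^{-(2n+1)}
       = 2^{-|p|-1} \sum_{n \ge |p|} 2^{-n}
       = 2^{-2|p|},
\qquad
\frac{\mu(p)}{\mu(L)} = \frac{2^{-(2|p|+1)}}{2^{-2|p|}} = \frac{1}{2}.
```

So the prefix p itself carries exactly half of the restricted, renormalized prior, and c=2 admits the distribution that puts all of its mass on p.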
We essentially assume that the safety assumption behind quantilization is strong enough to hold up to adversarial optimization over preference orderings. At the same time, this seems to give the Oracle as much slack/power as possible while preserving our “bits of optimization pressure” proxy of its power.
… you’re going to just keep watching how far I can develop this model without feedback, aren’t you? :(
Edit: One example of an L where we worry that, even though concerted human effort might land us at a safe element, randomization won't, is the set of programs that predict a sequence well.
By the Curry-Howard correspondence, the set of proofs for some theorems might also be malevolent!