Yuxi_Liu comments on Contest: $1,000 for good questions to ask to an Oracle AI

Yuxi_Liu 7 Jul 2019 16:37 UTC
1 point
Submission.

Setup: Other than making sure the oracles won’t accidentally consume the world in their attempt to think up the answer, no other precautions necessary.

Episode length: as long as you want to wait, though a month should be more than enough.
1. For a low-bandwidth oracle.
Ask the low-bandwidth oracle to predict if an earthquake (or some other natural disaster, like volcanoes or asteroid impacts, that the oracle’s answer cannot affect), of a certain magnitude, in a certain area, in a certain timeframe, would happen. Possible answers are Yes, No.
- If No, or time out, shut down the oracle without reward.
- Else, wait until the prediction is tested after the timeframe has elapsed. Then if it actually happened, reward by (1/p-1), else, reward by −1. Then shutdown.
This causes the oracle to reply Yes if the oracle thinks there is an aboves p chance that the event would happen.
1. For a counterfactual oracle.
Ask the Oracle to give a proof of some really hard mathematical theorems that are fundamental to computer science and logic. The answer is then in the counterfactual case given to a proofchecker:
- the proof is correct, reward by +1.
- the proof is false, reward by -infinity.
- the oracle times out, reward by 0.
Examples include
- P=NP problem
- proving Peano arithmetics is inconsistent
This should be safe, since there are very small proofcheckers that are as verified as they could be. As long as the hardware is trusty enough (to build the Oracle itself safely at all, hardware safety should already be solved sufficiently, so this is not much of an extra requirement), this should be safe.
1. For a counterfactual oracle.
Give the oracle the schematics of another oracle, perhaps itself, and ask it to generate some input question that would make the other “behave badly”. Then, the answer would be tested on the other oracle (isolated from the first, save for the cable through which the answer is wired), and
- if the answer worked, the oracle is rewarded by +1.
- if the answer doesn’t work, the oracle is rewarded by −1
If the oracle times out, then the (nonexistent) answer is not tested and the episode ends.

For examples of the question to ask the oracle:

Given this oracle in this physical setup with some tripwires to stop it from behaving badly (such as eating the world), what question would make this oracle bump into these tripwires?

This can be used for debugging more powerful but less verified oracles with lesser but more verified oracles, for safe intelligence amplification.
1. For two low-bandwidth oracles totally isolated from each other.
Ask the oracles to play the prisoner’s dilemma, and they would be rewarded accordingly after both have given their answers. Timing out is punished by -infinity, so that the oracles would not time-out.

Each oracle knows what themself is like, and they also know what their opponent is like. See if they actually cooperate. This is an experimental test on “superrationality” or “acausal trade”. I’m extremely skeptical about superrationality/acausal trade, and an experimental proof/disproof would be good.

This can be run counterfactually, since the whole setup is simple and can be done without human intervention, although I do not see how this adds any security.
1. For a low-bandwidth and counterfactual oracle.
Link a low-bandwidth oracle to a supervising computer, and run many episodes of the oracle. At the beginning of each episode, initiate the oracle identically, and ask it to do the Newcomb problem:

Oracle, we have run you n episodes previously, and have kept a record of all the answers you have given at the ends of the episodes. Your answer must be A (“one box”) or B (“two box”). We predict that you would choose the answer that you have chosen the most in the past.
- If we predict that you would choose A, then
  - if you would answer A, then you get +100,
  - if you answer B, then you get +101.
- If we predict that you would choose B, then
  - if you would answer A, then you get +0
  - if you would answer B, then you get +1
- If you timeout, then you get −100
The oracle never actually gets any reward and is shutdown immediately after it gives an answer. Collect the results and see if it replies the same each time, or differently. This would give insight to what the oracle thinks about itself, and finally provide some experimental results about Newcomb’s paradox.
- Stuart_Armstrong 8 Jul 2019 11:58 UTC
  3 points
  Parent
  Please add “submission” at the top of the post.
  1. is insufficiently detailed—can you explain what is going on, how the Oracles are rewarded, what happens when the message is read/not read, and so on. Same for 5.
  2. seems potentially very interesting.
  - Yuxi_Liu 8 Jul 2019 18:57 UTC
    3 points
    Parent
    I fixed the submission as required.
    
    Also I changed the submission 3 significantly.