Non-manipulative oracles

Benja wrote a post about getting predictors not to manipulate you. It involved a predictor that could predict tennis matches, but whose predictions could themselves affect those matches.

To solve this, Benja imagined the actions of a hypothetical CDT-reasoning agent located on the moon and unable to affect the outcome.

While his general approach is interesting, it seems the specific problem has a much simpler solution: before the AI’s message is output, it is run through a randomised scrubber that has a tiny chance of erasing it.

The predictor would then try to maximise the expected correctness of its prediction, conditional on the scrubber having erased its output (utility indifference can have a similar effect). In practice the scrubber would almost never trigger, so we would get accurate predictions, unaffected by our reading of them.
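To make the idea concrete, here is a minimal toy sketch, not a real design. The table `P_A_WINS`, its numbers, and all function names are assumptions made up for illustration: they stand in for a world where reading a prediction changes the match odds. A predictor scored on the world where its message is read picks the self-fulfilling message; a predictor scored only on the erased-message world has no incentive to do so.

```python
import random

# Hypothetical toy world (numbers are invented for illustration):
# probability that player A wins, depending on what the audience reads.
# None means the scrubber erased the message.
P_A_WINS = {None: 0.6, "A": 0.7, "B": 0.1}


def accuracy(prediction, message_read):
    """Probability that `prediction` is correct when the audience reads `message_read`."""
    p_a = P_A_WINS[message_read]
    return p_a if prediction == "A" else 1.0 - p_a


def naive_predictor():
    """Maximises accuracy assuming its own output is read (open to manipulation)."""
    return max(["A", "B"], key=lambda m: accuracy(m, m))


def scrubbed_predictor():
    """Maximises accuracy conditional on the scrubber having erased the output."""
    return max(["A", "B"], key=lambda m: accuracy(m, None))


def scrubber(message, erase_prob=1e-6, rng=random):
    """Erase the message with a tiny probability, otherwise pass it through."""
    return None if rng.random() < erase_prob else message


if __name__ == "__main__":
    print("naive predictor says:   ", naive_predictor())
    print("scrubbed predictor says:", scrubbed_predictor())
    print("what we actually read:  ", scrubber(scrubbed_predictor()))
```

In this toy world the naive predictor announces "B", because saying so makes B likely, while the scrubbed predictor announces "A", the likelier winner when nobody reads the message.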

Does this seem like it’ll work?

Stuart_Armstrong, 6 Feb 2015 17:05 UTC


Oracle AI


jessicata 6 Feb 2015 22:50 UTC
I discussed this with Benja at a previous MIRIx workshop. I don’t remember exactly what we concluded, but I think it mostly works; it just requires that people behave sensibly when they get scrubbed predictions.

Now that I think about it: to handle cases where people don’t behave that sensibly with scrubbed predictions, maybe we want some kind of sequence of oracles, where oracle 0 outputs nothing and oracle n+1 outputs what would happen if it were replaced with oracle n. We could take the limit as n approaches infinity, but then we don’t know much about which fixed point we would get (it would be controlled by subtle feedback loops). So maybe we want something like n=3 being most probable, with n chosen at random between 0 and 3 so that it’s meaningful to condition on n=0, n=1 and n=2.
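A rough sketch of that recursion, under heavy assumptions: the `world` function and its reaction table are hypothetical stand-ins for how people respond to a message, and the weights in `sample_level` are invented (the only constraint taken from the text is that n=3 is most probable and n ranges over 0 to 3).

```python
import random


def world(message):
    """Hypothetical deterministic outcome given the message people read.

    None means they received no prediction at all.
    """
    reaction = {None: "A", "A": "B", "B": "B"}
    return reaction[message]


def oracle(n):
    """Oracle 0 outputs nothing; oracle n+1 outputs what would happen
    if it were replaced with oracle n."""
    if n == 0:
        return None
    return world(oracle(n - 1))


def sample_level(rng=random):
    """Pick n at random from 0..3, with n=3 most probable (weights are made up)."""
    return rng.choices([0, 1, 2, 3], weights=[0.05, 0.05, 0.05, 0.85])[0]


if __name__ == "__main__":
    for n in range(4):
        print(n, oracle(n))
    print("sampled level:", sample_level())
```

In this toy world the sequence reaches a fixed point at n=2 (the output "B" predicts itself), but in general nothing guarantees convergence by n=3, or at all.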