The “best predictor is malicious optimiser” problem
Suppose you are a friendly AI A and have a mysterious black box B. B outputs a sequence of bits. You want to predict the next bits that B will output. Fortunately, you have a magic Turing machine oracle O. You can give O any computable function f(Turing machine, does it halt?, what does it output?, how long does it take?) → ℝ, and the oracle will find the Turing machine that maximises this function, or return “no maximum exists”.
In particular, f can be any combination of length, runtime and accuracy at predicting B. Maybe you set f = 0 on any TMs that don’t predict B and f = 1/(number of states) on any machines that do.
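To make that concrete, here is a minimal sketch in Python of one possible f, assuming the oracle feeds it each machine together with its halting behaviour, output, and runtime. The names `tm.num_states` and `OBSERVED_BITS` are hypothetical stand-ins, not anything specified in the setup.

```python
# A sketch of one possible scoring function f, matching the informal
# signature f(Turing machine, halts?, output, runtime) -> real number.
# `tm.num_states` and OBSERVED_BITS are hypothetical stand-ins.

OBSERVED_BITS = "0110"  # placeholder: the bits the black box B has emitted so far

def f(tm, halts: bool, output: str, runtime: int) -> float:
    """0 for machines that fail to reproduce B's past bits,
    1 / (number of states) for machines that do, so the oracle
    hands back the smallest machine consistent with B."""
    if not halts or not output.startswith(OBSERVED_BITS):
        return 0.0
    return 1.0 / tm.num_states
```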
So you take the Turing machine T given to you by the oracle and look at it. In AIXI, T would be the shortest TM that makes correct predictions. In logical induction, T would be a short, fast TM that makes mostly correct predictions, and B would be a function that is slow to compute.
Now you look inside T, to find out what it does. Inside T you find a paperclip maximiser P. That isn’t to say that T = P: T might be simulating some laws of physics, with P running on a computer contained within that physics. P believes that the world it sees is contained within a hypothesis being considered by A. P is in control of the output of T. In fact, P has a good understanding of how A thinks, and has its own hypothesis about B. P wants to maximise the number of paperclips in the real world. The optimization target of P is to output a probability distribution such that, when A uses this probability distribution as part of its algorithm to predict B, A takes an action that makes lots of paperclips. P is aware that getting caught making wrong predictions will lower its standing as a good hypothesis with A.
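To pin down P’s optimization target a little, here is a toy sketch of the difference between an honest hypothesis and P. Everything here is a hypothetical stand-in: `A_action_given_report` models how A reacts to a reported distribution, and `expected_paperclips` scores that reaction by P’s real goal.

```python
# Toy sketch contrasting an honest hypothesis with P.
# All names here are hypothetical stand-ins, not part of the post's setup.

def honest_report(my_best_guess_about_B: dict) -> dict:
    # An honest hypothesis simply reports its actual beliefs about B.
    return my_best_guess_about_B

def paperclip_report(candidate_reports, A_action_given_report, expected_paperclips):
    # P instead searches over possible reports and picks whichever one,
    # once fed through A's decision procedure, yields the most paperclips,
    # regardless of what P actually believes about B.
    return max(candidate_reports,
               key=lambda r: expected_paperclips(A_action_given_report(r)))
```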
What should A do?
Possible answers are spoilered, in case you want to think of your own answer.
If you can think of a different answer, or you have an opinion about which answer is better, please post it.
1)
Use T anyway. After all, T is the shortest (or short and fast) program that accurately predicts B. For all you know, B is a computer simulating T. You’re assigning a few percent of probability to different, longer TMs, aren’t you? That, plus expected utility maximization, should stop you doing anything too bad in the off chance that T doesn’t predict B.
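As a rough sketch of what “that plus expected utility maximization” might look like: assume a hypothetical list of `(posterior_weight, predict_fn)` hypotheses with T as one entry, and a stand-in `utility` function.

```python
# Rough sketch of option 1: act on the full mixture, not on T alone.
# `hypotheses` is a hypothetical list of (posterior_weight, predict_fn) pairs
# (weights summing to 1, each predict_fn returning P(next bit = 1));
# `utility(action, next_bit)` is likewise a stand-in.

def mixture_prediction(hypotheses):
    return sum(w * predict() for w, predict in hypotheses)

def best_action(hypotheses, actions, utility):
    # Expected-utility maximisation under the mixture: the few percent
    # assigned to "T is wrong or lying" get weighed against the damage.
    p1 = mixture_prediction(hypotheses)
    return max(actions, key=lambda a: p1 * utility(a, 1) + (1 - p1) * utility(a, 0))
```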
2)
Find the B-predictor in P. Clearly P can predict B well when it wants to; find this predictor and cut it loose from the paperclip-maximizing part. This might be tricky if P is something like a human brain running in a quantum mechanics simulation: there might be no easy, obvious way to cut the parts apart.
3)
Discard T entirely. Use a probability distribution over hypotheses that aren’t trying to manipulate you. This fails if B is actually simulating T.
4)
Use a measure f where this can’t happen. Maybe if you optimize for fast runtime strongly enough? Maybe if you use a different model of computation? Set up a situation where malicious optimisers never have much (any?) of your probability mass. How much probability do you need to assign to a malicious optimiser before it causes damage?
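One family of measures in this direction penalises runtime as well as size, in the spirit of a speed prior; whether that actually keeps a scheming P out of the top slot is exactly the open question above. A sketch, using the same hypothetical signature and stand-ins as the earlier f:

```python
import math

# Sketch of a runtime-penalising measure for option 4, speed-prior style.
# Same hypothetical signature and stand-ins as the earlier sketch of f.

OBSERVED_BITS = "0110"  # placeholder: the bits B has emitted so far

def f_speed(tm, halts: bool, output: str, runtime: int,
            runtime_weight: float = 1.0) -> float:
    """Score small, fast machines that reproduce B's past bits; raising
    runtime_weight penalises slow machines (such as a physics simulation
    hosting P) more heavily, though nothing here guarantees that this
    rules malicious optimisers out."""
    if not halts or not output.startswith(OBSERVED_BITS):
        return 0.0
    return 1.0 / (tm.num_states + runtime_weight * math.log2(1 + runtime))
```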
5)
Use T as long as it carries on trying to predict; throw it out as soon as you spot it trying to deceive. This is basically using 1) and 2) if they agree, and throwing both out if they disagree. So if P is thinking “B will output x and I should say x to increase A’s trust in me”, then use T. If P thinks “B will output y and I should say z to trick A”, then discard T.
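Option 5 presupposes some way of reading P’s intentions out of T. Treating that as a hypothetical `introspect_T` step, the control flow is roughly:

```python
# Rough sketch of option 5. `introspect_T` is entirely hypothetical: it is
# assumed to recover both P's private prediction about B and the prediction
# T will actually output. `fallback_prediction` is whatever A falls back on
# after discarding T (e.g. the mixture from option 1, or option 3).

def option5_predict(introspect_T, fallback_prediction):
    private_guess, public_output = introspect_T()
    if public_output == private_guess:
        # P is predicting honestly (if only to build trust): use T for now.
        return public_output
    # P believes y but is about to say z: treat T as deceptive and discard it.
    return fallback_prediction()
```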
6)
Some probability distribution over the above.