The alternate hypothesis for AIs in virtual worlds
A putative new idea for AI control; index here.
In the “AIs in virtual worlds” setup, the AI entertains two hypotheses: one, W, that it lives in a deterministic world which it knows about (including itself in that world), and W′, an alternate hypothesis that the world is “like” W but contains some stochastic effects that correspond to us messing up the world. If the AI ever believes strongly enough in W′, it shuts down.
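To make the shutdown rule concrete, here is a minimal sketch (my own illustration, not part of the original proposal) of the Bayesian update the AI might run. Under W every bit matches the deterministic prediction, so a single unambiguous flip sends P(W′) to 1; the flip probability and the shutdown threshold below are made-up parameters.

```python
def posterior_w_prime(observations, prior_w_prime=1e-6, flip_prob=1e-3):
    """Bayes-update P(W') from a stream of observations, where each
    observation is True if a bit matched W's deterministic prediction
    and False if it was flipped."""
    p = prior_w_prime
    for matched in observations:
        like_w = 1.0 if matched else 0.0              # W allows no flips at all
        like_wp = (1 - flip_prob) if matched else flip_prob
        p = like_wp * p / (like_wp * p + like_w * (1 - p))
    return p

# A single unambiguous flip drives P(W') to 1, which triggers shutdown.
observations = [True] * 1000 + [False]
print(posterior_w_prime(observations) > 0.99)   # True -> shut down
```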
It turns out that W′ needs to be chosen with care. We could initially define it as W with some random probability of bit flipping. However, the most likely pattern of misbehaviour is some systematically biased process—maybe the hardware running one corner of the virtual world occasionally flips bits, but the other areas are error-free.
Consider two possible W′, W1 and W2. The hypothesis W1 has a single uniform prior over the probability of any bit flipping; the hypothesis W2 has independent uniform priors over the flip probability of each individual bit. If one corner of the virtual world flips bits more than others, then the AI will detect this faster with W′=W2 than with W′=W1, as W2 fits that pattern more accurately (of course, the AI will conclude P(W′)=1 in either case if it ever sees an unambiguous bit flip, but while there is still uncertainty, the probability of W2 will grow faster).
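As a rough check of that claim, here is a sketch comparing the marginal likelihoods of the two versions of W′. The assumptions are mine: uniform priors are modelled as Beta(1,1), bits are grouped into four equal regions rather than treated individually (to keep the example small), and the faulty corner's 5% flip rate is invented. The Bayes factor comes out heavily in favour of W2 when flips are spatially concentrated.

```python
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence_single_rate(flips, trials):
    """W1: one uniform prior over a single global flip rate."""
    return log_beta(flips + 1, trials - flips + 1)

def log_evidence_per_region(region_counts):
    """W2: an independent uniform prior over each region's flip rate."""
    return sum(log_beta(k + 1, n - k + 1) for k, n in region_counts)

# Four regions, 1000 bits each; only the first region flips bits (5%).
regions = [(50, 1000), (0, 1000), (0, 1000), (0, 1000)]
total_flips = sum(k for k, _ in regions)
total_trials = sum(n for _, n in regions)

log_bf = (log_evidence_per_region(regions)
          - log_evidence_single_rate(total_flips, total_trials))
print(f"log Bayes factor (W2 vs W1): {log_bf:.1f}")   # strongly favours W2
```

The intuition: W1 must explain both the clean regions and the faulty corner with one flip rate, and that mismatch is exactly what slows its posterior growth.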
In fact, we should probably allow W′ to contain almost all structured algorithms that could randomly flip bits.
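One way to picture this (again just a sketch, with made-up models and weights) is to let W′ itself be a mixture over many structured flip models, with the shutdown rule watching the total posterior mass of the mixture:

```python
def update(prior_w_prime, models, observations):
    """models: list of (weight, flip_prob_fn) pairs, weights summing to 1.
    observations: (location, timestep, matched) triples, where `matched`
    is False when a bit disagreed with W's deterministic prediction.
    Returns the posterior P(W')."""
    # Likelihood under W: 1 if nothing ever flipped, else 0.
    like_w = 1.0 if all(m for _, _, m in observations) else 0.0

    # Likelihood under the W' mixture: weighted average over flip models.
    like_wp = 0.0
    for weight, flip_prob in models:
        l = 1.0
        for loc, t, matched in observations:
            p = flip_prob(loc, t)
            l *= (1 - p) if matched else p
        like_wp += weight * l

    numer = like_wp * prior_w_prime
    return numer / (numer + like_w * (1 - prior_w_prime))

models = [
    (0.5, lambda loc, t: 1e-6),                         # uniform background noise
    (0.5, lambda loc, t: 1e-3 if loc < 100 else 1e-9),  # one faulty corner
]
obs = [(5, 0, False), (5000, 1, True), (7, 2, False)]
print(update(1e-6, models, obs))   # -> 1.0, since W cannot explain any flip
```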
But what happens if the bit flipping happens in response to something the AI does? Maybe each time it runs a major calculation, it puts strain on the hardware, increasing the probability of bit flips.
Unfortunately, we can’t allow the AI to entertain this (true) hypothesis. If it does, it may form a hypothesis about our world and set out to test it, in ways that could be dangerous. If it reasons “are there humans affecting my world? Let’s put a seductive honeypot here to find out”, then we have a problem.
So it seems that W′ can only include hypotheses about random changes in its world that are not causally connected to it. We’d probably want to restrict acausal connections as well, so the “random” elements of W′ can’t include complex AIs doing the bit flipping.
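A toy way to express that restriction (my own framing; the field names are illustrative, not part of the proposal) is to only admit noise hypotheses into W′ whose flip probability is a function of location and time alone, never of the AI's actions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class NoiseHypothesis:
    name: str
    # flip probability as a function of (location, timestep) only
    flip_prob: Callable[[int, int], float]
    # if True, the hypothesis also conditions on the AI's action history,
    # i.e. it is causally connected to the AI -- these must be excluded
    depends_on_ai_actions: bool = False

def admissible(h: NoiseHypothesis) -> bool:
    """Only action-independent noise models may enter W'."""
    return not h.depends_on_ai_actions

candidates = [
    NoiseHypothesis("uniform background flips", lambda loc, t: 1e-6),
    NoiseHypothesis("one faulty corner", lambda loc, t: 1e-3 if loc < 100 else 0.0),
    NoiseHypothesis("flips under heavy computation", lambda loc, t: 1e-3,
                    depends_on_ai_actions=True),
]

w_prime = [h for h in candidates if admissible(h)]
print([h.name for h in w_prime])   # the action-dependent hypothesis is excluded
```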
Fortunately, the correct W′ seems to be the sort of thing we can test successfully before fully unleashing a dangerously powerful AI in the virtual world.