Under a policy that doesn’t cause the computer’s memory to be tampered with (which is plausible, even ideal), ν† and ν⋆ are identical, so we can’t count on ν† losing probability mass relative to ν⋆.
I agree with that, but if they are always making the same on-policy prediction, it doesn’t matter what happens to their relative probability (modulo exploration). The agent can’t act on an incentive to corrupt memory infinitely often, because each time requires the models to make different predictions on-policy. So the agent only acts on such an incentive finitely many times, and hence never does so after some sufficiently late episode i0. Agree/disagree?
(Having a bad model can still hurt, since the bogus model might agree on-policy but assign lower rewards off-policy. But if they also always approximately agree on the exploration distribution, then a bad model also can’t discourage exploration. And if they don’t agree on the exploration distribution, then the bad model will eventually get tested.)
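To make the argument above concrete, here is a minimal sketch of a two-model Bayesian mixture. It is a toy illustration, not BoMAI's actual posterior over world-models: the function names, the equal prior, and the probabilities are all assumptions chosen for the example. It shows that the posterior ratio between a bogus model and the true model is unchanged while they agree on-policy, and shrinks each time they disagree and the true model is right.

```python
from fractions import Fraction


def posterior_update(weights, likelihoods):
    """One Bayes step: reweight each model by the probability it assigned
    to the observation that actually occurred, then renormalize."""
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def bogus_to_true_ratio(episodes):
    """episodes: list of (p_true, p_bogus), the probability each model gave
    to the on-policy observation. Returns the posterior ratio bogus/true
    after each episode, starting from an (assumed) equal prior."""
    weights = [Fraction(1, 2), Fraction(1, 2)]
    ratios = []
    for p_true, p_bogus in episodes:
        weights = posterior_update(weights, [p_true, p_bogus])
        ratios.append(weights[1] / weights[0])
    return ratios


# Same on-policy predictions: the bogus model never loses relative weight.
ratios_agree = bogus_to_true_ratio([(Fraction(9, 10), Fraction(9, 10))] * 5)

# Different on-policy predictions: each disagreement costs the bogus model
# a factor of (3/10) / (9/10) = 1/3.
ratios_disagree = bogus_to_true_ratio([(Fraction(9, 10), Fraction(3, 10))] * 3)
```

In this toy setting the ratio stays at 1 under agreement and falls to 1/3, 1/9, 1/27 under repeated disagreement, which is the sense in which a model can only disagree on-policy (and be wrong) a limited number of times before it stops mattering.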
Ah, I see what you’re saying. I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim.
Also, you can add another parameter to BoMAI where you just have the human explorer explore for the first E episodes. The i0 in the Eventual Benignity Theorem can be thought of as the max of i′ and i″. i′ comes from the i0 in Lemma 1 (Rejecting the Simple Memory-Based). i″ comes from the point in time when ν̂^(i) is ε-accurate on-policy, which renders Lemma 3 applicable. (And Lemma 2 always applies.) My initial thought was to set E so that the human explorer is exploring for the whole time when the MAP world-model was not necessarily benign. This works for i′: E can just be set to be greater than i′. It doesn’t work for i″, because if you increase E, the value of i″ goes up as well.
So in fact, if you set E large enough, the first time BoMAI controls the episode, it will be benign. Then, there is a period where it might not be benign. However, from that point on, the only “way” for a world-model to be malign is by failing to be ε-accurate on-policy, because Lemmas 1 and 2 have already kicked in, and if it were ε-accurate on-policy, Lemma 3 would kick in as well. The first point to make about this is that in this regime, benignity comes in tandem with intelligence: it has to be confused to be dangerous (like a self-driving car). The second point is that I can’t come up with an example of a world-model which is plausibly maximum a posteriori in this interval of time and which is plausibly dangerous (for what that’s worth; and I don’t like to assume it’s worth much, because it took me months to notice ν†).
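One way to write down the bookkeeping in the two paragraphs above, as a rough sketch: writing i″ as a function of E is notation introduced here (an assumption, not taken from the paper) to express that i″ grows when E is increased.

```latex
i_0 = \max(i', i''), \qquad
\begin{cases}
i' & \text{from Lemma 1; does not grow with } E, \text{ so } E > i' \text{ can be arranged},\\
i'' = i''(E) & \text{point at which } \hat{\nu}^{(i)} \text{ is } \varepsilon\text{-accurate on-policy; increases with } E.
\end{cases}
```

Because i″ moves when E does, choosing a larger E does not by itself guarantee E > i″(E), which is why the human-explorer period covers i′ but not necessarily i″.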
I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim.
I think my point is this:
The intuitive thing you are aiming at is stronger than what the theorem establishes (understandably!).
You probably don’t need the memory trick to establish the theorem itself.
Even with the memory trick, I’m not convinced you meet the stronger criterion. There are a lot of other things similar to memory that can cause trouble—the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia.
the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia
This is a conceptual approach I hadn’t considered before—thank you. I don’t think it’s true in this case. Let’s be concrete: the asymptotic feature that would have caused it to avoid memory-based models even without amnesia is trial and error, applied to unsafe policies. Every section of the proof, however, can be thought of as making off-policy predictions behave. The real result of the paper would then be “Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested”. So while there might be malign world-models of a different flavor to the memory-based ones, I don’t think the way this theorem treats them is unsatisfying.