the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia
This is a conceptual approach I hadn’t considered before—thank you. I don’t think it’s true in this case. Let’s be concrete: the asymptotic feature that would have caused it to avoid memory-based models even without amnesia is trial and error, applied to unsafe policies. Every section of the proof, however, can be thought of as making off-policy predictions behave. The real result of the paper would then be “Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested”. So while there might be malign world-models of a different flavor to the memory-based ones, I don’t think the way this theorem treats them is unsatisfying.