Actually, I’m not sure if the world model I described is memory-based. EDIT: Never mind, see Michael’s comment below, the non-benign ones are memory-based.
The rewards it outputs are correct, except when it says “the reward is zero”, but those exceptions are not causally dependent on outside-world features that causally depend on the actions of the episode. It also satisfies Lemma 3. So it seems that none of the theoretical arguments prohibit this world model?
(I think this is because you are thinking of how the world model can cause BoMAI to have bad incentives, whereas here I’m postulating that the world model itself has malicious goals and is manipulating BoMAI.)