The actions are inferred from the argmax, but they are also inputs to the prediction models. Thus AIXI is not constrained to avoid updating on its own actions, which allows it to entertain the correct world models for one-boxing, for example. If its world models have learned that Omega never lies and is always correct, those same world models will learn the predictive shortcut that the box contents are completely predictable from the action output channel, and thus it will correctly estimate that the one-box branch has the higher payout.
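A minimal sketch of that predictive shortcut (all names here are my own, purely illustrative): the world model predicts the opaque box's contents directly from the agent's own action channel, because a model that has learned "Omega is always right" treats the action as a perfect predictor of the hidden state. The argmax then feeds each candidate action back through the model.

```python
def world_model(action):
    """Return (box_contents, reward) predicted for a candidate action.

    The model conditions the opaque box on the action itself: the
    shortcut an AIXI-style predictor can learn when Omega never errs.
    """
    box = 1_000_000 if action == "one_box" else 0
    reward = box if action == "one_box" else box + 1_000
    return box, reward

# The argmax over actions feeds each candidate action into the model:
best = max(["one_box", "two_box"], key=lambda a: world_model(a)[1])
# best == "one_box", since the model predicts the full box only on that branch
```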
My understanding is that CDT explicitly disallows acausal predictions: it disallows models that update on the agent's own future actions, which is exactly the update needed for one-boxing.
| Action | Box Empty | Box Full |
| --- | --- | --- |
| one_box | disallowed | allowed |
| two_box | allowed | disallowed |
In EDT/AIXI the world model is allowed to update the hidden box state conditional on the action chosen, even though this is 'acausal'. It's equivalent to simply observing, correctly, that the agent gets higher reward in the subset of the multiverse where it decides to one-box.
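The contrast can be sketched with the same payoff table evaluated two ways (a toy illustration, with payoffs and names I've chosen for concreteness): EDT conditions the hidden box state on the action, the 'acausal' update above, while CDT holds the box state fixed under some prior and only intervenes on the action.

```python
# (action, box_full) -> reward, using the standard Newcomb payoffs
PAYOFF = {
    ("one_box", True): 1_000_000,
    ("one_box", False): 0,
    ("two_box", True): 1_001_000,
    ("two_box", False): 1_000,
}

def edt_value(action):
    # EDT: the box state updates on the action, since the model
    # trusts Omega's prediction of the agent.
    box_full = (action == "one_box")
    return PAYOFF[(action, box_full)]

def cdt_value(action, p_full):
    # CDT: the box state is causally independent of the action,
    # so average over a fixed prior p_full instead of conditioning.
    return p_full * PAYOFF[(action, True)] + (1 - p_full) * PAYOFF[(action, False)]

edt_choice = max(["one_box", "two_box"], key=edt_value)           # "one_box"
cdt_choice = max(["one_box", "two_box"],
                 key=lambda a: cdt_value(a, 0.5))                 # "two_box"
```

For any fixed `p_full`, two-boxing dominates under the CDT evaluation; the choices only diverge because EDT is allowed the action-conditional update on the hidden state.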
> The actions are inferred from the argmax, but they are also inputs to the prediction models. Thus AIXI is not constrained to avoid updating on its own actions, which allows it to entertain the correct world models for one-boxing, for example. If its world models have learned that Omega never lies and is always correct, those same world models will learn the predictive shortcut that the box contents are completely predictable from the action output channel, and thus it will correctly estimate that the one-box branch has the higher payout.
The actions in themselves being "inputs to the prediction models" does not distinguish CDT from EDT.
(To be continued, leaving now.)