Recall the definition of AIXI: A will try to infer a simple program which takes A’s outputs as input and provides A’s inputs as output, and then choose utility maximizing actions with respect to that program.
I don’t think this is an accurate description of AIXI.
At time step n, AIXI uses Solomonoff induction to infer a probabilistic mixture of programs (not just one simple program) that take no input and produce a sequence of at least n+T triples, where T is a fixed time horizon. The triples have the form (percept(t), action(t), reward(t)). (Your model doesn’t mention any reward channel, so it can’t be describing anything like AIXI.)
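As a minimal sketch of those mixture weights, in standard Solomonoff-prior notation (adapted to the triple-sequence formulation above, rather than Hutter’s action-conditional one):

$$\xi(x_{1:n}) \;=\; \sum_{p \,:\, U(p) = x_{1:n}*} 2^{-\ell(p)}$$

where U is a universal monotone Turing machine, ℓ(p) is the length of program p in bits, x_{1:n} is the recorded sequence of triples, and U(p) = x_{1:n}* means that p’s output begins with x_{1:n}; each program consistent with the history contributes its weight 2^{-ℓ(p)} to the mixture.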
In more detail: at time step n, the AIXI agent has a recorded history of n triples (percept(0), action(0), reward(0)), …, (percept(n-1), action(n-1), reward(n-1)) and a new percept(n). It runs all the syntactically correct programs and filters out those whose outputs are inconsistent with the recorded history and the new percept (or that enter a non-halting dead end before producing enough output). For each remaining program, AIXI adds up the predicted future rewards up to the time horizon (possibly with time discounting) and multiplies the sum by the program’s weight, 2^(-length). It then sums these weighted rewards over all consistent programs that output the same action(n), and executes the action(n) with the highest expected cumulative future reward (note that the maximization is over actions, not over individual programs).
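To make the procedure concrete, here is a minimal runnable sketch in Python. It is purely illustrative: the model list, action names, rewards, and bit lengths are made up, and the enumeration of all programs on a universal Turing machine (which is what makes real AIXI uncomputable) is replaced by a small hand-written list of candidate triple sequences.

```python
# Minimal sketch of the decision rule described above (all names hypothetical).
# Real AIXI runs *every* program on a universal Turing machine, which is
# uncomputable; here "programs" are a tiny hand-written list of candidate
# environment models, each paired with an assumed length in bits.

def is_consistent(seq, history, new_percept, horizon):
    """Filtering step: drop models that contradict the recorded history or
    the new percept, or that produce too little output for the horizon."""
    n = len(history)
    if len(seq) < n + horizon:          # "not enough output" -> filtered out
        return False
    if seq[:n] != history:              # must reproduce the history exactly
        return False
    return seq[n][0] == new_percept     # and predict the new percept(n)

def choose_action(models, history, new_percept, horizon, discount=1.0):
    """Sum each surviving model's (discounted) future reward, weighted by
    2^-length, grouped by the action(n) the model outputs; return the
    action with the highest expected cumulative future reward."""
    n = len(history)
    scores = {}                          # candidate action(n) -> score
    for seq, length_bits in models:
        if not is_consistent(seq, history, new_percept, horizon):
            continue
        weight = 2.0 ** (-length_bits)
        future = sum(discount ** t * seq[n + t][2] for t in range(horizon))
        action_n = seq[n][1]
        scores[action_n] = scores.get(action_n, 0.0) + weight * future
    return max(scores, key=scores.get) if scores else None

# Made-up usage: triples are (percept, action, reward).
history = [("hot", "wait", 0)]
models = [
    ([("hot", "wait", 0), ("hot", "fan_on", 1), ("cool", "wait", 1)], 5),
    ([("hot", "wait", 0), ("hot", "fan_off", 0), ("hot", "wait", 0)], 4),
]
print(choose_action(models, history, "hot", horizon=2))  # -> fan_on
```

Note that the shorter 4-bit model gets the larger weight, yet fan_on still wins because that model’s predicted future reward is zero: each action’s score is summed across all consistent models, which is exactly why the argmax is over actions rather than over any single program.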