Consider this ugly ASCII version of the expression for AIXI found in this paper by Marcus Hutter:
a_k := arg max[a_k, SUM[o_k*r_k … max[a_m, SUM[o_m*r_m, (r_k +...+ r_m) SUM[q:U(q,a_1...a_m) = o_1*r_1..o_m*r_m, 2^-l(q)] ]]...]] .
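For readability, here is the same expression typeset in LaTeX. It is meant as a direct transliteration of the ASCII line, assuming that o_k*r_k stands for the percept pair (observation, reward) at step k rather than a product, and that l(q) is the length of program q.

```latex
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       (r_k + \cdots + r_m)
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```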
What I was thinking was to replace the inner sum for the Solomonoff prior, SUM[q:..., 2^-l(q)], with a repeat of the interleaved maxes and SUMs.
SUM[q:U(q,a_1...a_m)=o_1*r_1..o_m*r_m, max[a_k, SUM[o_k*r_k … max[a_m, SUM[o_m*r_m, (r_k + … + r_m)]]...]] ] .
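Typeset in LaTeX, the replacement term would look roughly like this. It is only a transliteration of the ASCII line, with the same reading of o_k*r_k as a percept pair; it does not try to resolve how the inner maxes over a_k ... a_m relate to the actions that appear in the condition on q.

```latex
\sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m}
    \max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
    (r_k + \cdots + r_m)
```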
Now that I write it out explicitly, I see that, while it isn’t circular, it’s definitely double-counting. I’m not sure that’s a problem, though. First, for each deterministic program q that models the environment, it calculates the expected reward assuming that q alone is the true environment. Then it weights every q by that reward and acts to maximize the expected reward under the weighted combination of all q.
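To make the double-counting concrete, here is a minimal Python sketch on a made-up finite class of two environments. It is one possible reading of the idea, not Hutter's construction. The environment names, program lengths, reward values and the horizon of 2 are all hypothetical, and the percept is reduced to the reward alone. The function `first_action` runs the interleaved max/sum with whatever weight it is given per q, so the same code covers both the 2^-l(q) weighting and the reward-based weighting.

```python
from itertools import product

ACTIONS = (0, 1)
HORIZON = 2  # m in the expressions above, with k = 1

# Hypothetical toy environments: name -> (program length l(q), reward function).
# Each reward function sees the actions so far and returns the latest reward.
ENVS = {
    "short": (1, lambda acts: 1 if acts[-1] == 1 else 0),
    "long": (4, lambda acts: 3 if acts[-1] == 0 else 0),
}


def rewards(name, actions):
    """Reward sequence that environment `name` emits for an action sequence."""
    step = ENVS[name][1]
    return [step(actions[: t + 1]) for t in range(len(actions))]


def best_total(name):
    """Best total reward achievable if `name` alone were the true environment."""
    return max(sum(rewards(name, acts)) for acts in product(ACTIONS, repeat=HORIZON))


def first_action(weight):
    """Expectimax over the toy class; `weight` maps an environment name to its weight."""

    def action_value(actions, alive):
        # Sum over percepts: group the still-consistent environments by their reward.
        groups = {}
        for name in alive:
            groups.setdefault(ENVS[name][1](actions), []).append(name)
        return sum(value(actions, group) for group in groups.values())

    def value(actions, alive):
        if len(actions) == HORIZON:
            # Environments in `alive` agree on the whole reward history here.
            return sum(rewards(alive[0], actions)) * sum(weight(n) for n in alive)
        return max(action_value(actions + (a,), alive) for a in ACTIONS)

    return max(ACTIONS, key=lambda a: action_value((a,), list(ENVS)))


prior_choice = first_action(lambda name: 2.0 ** -ENVS[name][0])  # 2^-l(q) weights
reward_choice = first_action(best_total)  # each q weighted by its own best reward
```

With these particular numbers the two weightings pick different first actions, because the reward of the "long" environment enters once as its weight and again through the (r_k + ... + r_m) factor. That is a property of this toy class only and says nothing about whether the double-counting matters in general.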