I think there is something off with the formulas that use policies: if you already choose the policy p with p(x_<k) = y_<k y_k, then you cannot choose a y_k in the argmax. Also, for the Solomonoff prior you must sum over all programs q with q(y_{1:m_k}) = x_{1:m_k}. Could you maybe expand on the proof of Lemma 1 a little bit? I am not sure I get what you mean yet.
The argmax comes before choosing a policy. In that formula, there is already a value for y_k before you consider all the policies p such that p(x_<k) = y_<k y_k.

As for summing over all programs for the Solomonoff prior: didn't I do that?
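To make the ordering explicit, here is a rough schematic of the shape I mean (a sketch only: EU names the expected-utility term, the inner operation is assumed to be a supremum over the consistent policies, and the program weighting is assumed to be the standard AIXI-style one):

    y_k := argmax_{y_k} sup { EU(p) : p(x_<k) = y_<k y_k }

The argmax binds y_k first; only then are the policies p quantified over, restricted to those that agree with the fixed history y_<k followed by the candidate y_k. And the Solomonoff-prior term is the sum over all consistent programs,

    sum over { q : q(y_{1:m_k}) = x_{1:m_k} } of 2^-l(q),

with l(q) the length of q, so every such program is already summed over.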
Look at any finite observation sequence. There exists some action you could output in response to that sequence that would allow you to get arbitrarily close to the supremum expected utility with suitable responses to the other finite observation sequences (for instance, you could get within 1⁄2 of the supremum). Now look at another finite observation sequence. There exists some action you could output in response to that, without changing your response to the previous finite observation sequence, such that you can get arbitrarily close to the supremum (within 1⁄4). Look at a third finite observation sequence. There exists some action you could output in response to that, without changing your responses to the previous 2, that would allow you to get within 1⁄8 of the supremum. And keep going in some fashion that will eventually consider every finite observation sequence. At each step n, you will be able to specify a policy that gets you within 2^-n of the supremum, and these policies converge to the policy that the agent actually implements.
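A slightly more formal version of that construction (a sketch only, with names introduced here rather than taken from the post: σ_1, σ_2, σ_3, ... an enumeration of all finite observation sequences, EU(p) the expected utility of policy p, S = sup_p EU(p), and the assumption that only finitely many actions are available in response to each sequence):

Step 1: since the action set is finite, S = max over actions a of sup { EU(p) : p(σ_1) = a }, so some action a_1 satisfies sup { EU(p) : p(σ_1) = a_1 } = S. In particular, some policy with p(σ_1) = a_1 gets within 2^-1 of S.

Step n: having fixed a_1, ..., a_{n-1} so that sup { EU(p) : p(σ_i) = a_i for all i < n } = S, the same finite-maximum argument gives an action a_n with sup { EU(p) : p(σ_i) = a_i for all i <= n } = S, and hence a policy respecting all n choices that gets within 2^-n of S.

Limit: every finite observation sequence σ_i eventually has its response fixed to a_i, so the step-n policies converge pointwise to the policy σ_i -> a_i; per the argument above, that limit is the policy the agent actually implements, and at each step n the chosen policy was within 2^-n of the supremum.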
I hope that helps. If you still don’t know what I mean, could you describe where you’re stuck?
I get that now, thanks.