Ah, the formal statement was something like: “if policy A isn’t the argmax policy, then the successor policy B must be in the policy space of the future argmax, and the action selected by policy A is computed so that the relevant equality holds.”
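If I try to write that out (with notation I’m inventing here, none of which is from the original statement: $\Pi_n$ is the policy space searched over on day $n$, $V_n$ is the day-$n$ value estimate, and $a \frown B$ is the policy that plays action $a$ and then follows $B$):

$$A \neq \operatorname*{argmax}_{\pi \in \Pi_n} V_n(\pi) \;\implies\; B \in \Pi_{n+1} \;\text{ and }\; V_n(A) = V_n\big(a_A \frown B\big),$$

where $a_A$ is the action A actually selects, computed so that the equality holds.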
Yeah, I am assuming fast feedback: that it is resolved on day n.
What I meant was that the computation isn’t extremely long in the sense of description length, not in the sense of computation time. Also, we aren’t doing policy search over the set of all Turing machines; we’re doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on).
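A minimal sketch of the kind of restricted search I mean (the mini-language, step budget, and value estimate below are all stand-ins I’m making up for illustration, not the actual construction):

```python
# Toy sketch: argmax over a finite, growable class of policies that are
# guaranteed to halt, instead of over all Turing machines. Everything
# here (mini-language, step budget, value estimate) is a stand-in.

def run_policy(program, history, step_budget=10_000):
    """Interpret a straight-line policy-program on an observation history.

    program: list of (opcode, argument) pairs. Halting is guaranteed by
    construction (finite program, hard budget); exceeding the budget
    falls back to a default action.
    """
    acc = 0
    for steps, (op, arg) in enumerate(program):
        if steps >= step_budget:
            return 0                     # budget exhausted: default action
        if op == "const":
            acc = arg
        elif op == "read" and history:   # look at a past observation
            acc += history[arg % len(history)]
        elif op == "add":
            acc += arg
    return acc % 2                       # binary actions for the sketch

def select_action(policy_class, value_estimate, history):
    """Argmax over the restricted class; policies outside the class are
    simply never candidates, and the class can be extended over time."""
    best = max(policy_class, key=lambda p: value_estimate(p, history))
    return run_policy(best, history)

# Example: two trivial policies, scored by a dummy value estimate.
policies = [
    [("const", 1)],                      # always plays action 1
    [("read", 0), ("add", 1)],           # reacts to a past observation
]
action = select_action(policies, lambda p, h: len(p), history=[0, 1, 1])
```

The point is that halting is guaranteed by how the class is built, not by analyzing arbitrary programs, so the class can keep growing without ever admitting a non-halting member.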
Also, I’m less confident in conditional future-trust for all conditionals than I used to be; I’ll try to crystallize where I think it goes wrong.
Wouldn’t the set of all action sequences have lower description length than some large finite set of policies? There’s also the potential problem that every policy in the large finite set you’re searching over could be quite far from optimal.
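To spell out the comparison (back-of-the-envelope, in notation I’m making up: $\mathcal{A}$ is the action alphabet, $T$ the horizon, $\ell(\pi)$ the length of policy $\pi$’s description): the set of all action sequences is cheap to describe *as a set*, since you only need the alphabet and the horizon,

$$K\big(\mathcal{A}^{\leq T}\big) = O(\log|\mathcal{A}| + \log T),$$

whereas an explicitly enumerated finite set of policies $\Pi$ can cost up to about $\sum_{\pi \in \Pi} \ell(\pi)$ bits if its members don’t share a short generating rule.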