Finding the action a that optimizes a reward function r(a) is NP-hard for general (polynomial-time) r; the decision version, "is there an a with r(a) ≥ t?", is NP-complete. If the reward function r is itself able to use an oracle for NP, then the problem is complete for NP^NP, and so on up the polynomial hierarchy. The analogy is loose because you aren't really getting the optimal a at each step.
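To make the correspondence concrete, here is a minimal worked sketch (my formalization, not the original's): it assumes r is polynomial-time computable and t is a given threshold, and it spells out why each extra oracle level climbs one rung of the polynomial hierarchy.

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
% Decision version of reward optimization: given a poly-time reward
% function r and a threshold t, does some action a achieve r(a) >= t?
% Guessing a and evaluating r(a) puts this in NP:
\[
  \exists a \;\; r(a) \ge t \quad \in \quad \mathrm{NP}
\]
% Hardness: take r(a) = 1 iff assignment a satisfies a fixed CNF
% formula, with t = 1; the question is then exactly SAT, so the
% decision problem is NP-complete.

% If r may itself query an NP oracle, the same existential question
% sits one level higher:
\[
  \exists a \;\; r^{\mathrm{NP}}(a) \ge t
  \quad \in \quad \mathrm{NP}^{\mathrm{NP}} = \Sigma_2^p
\]
% Iterating the construction climbs the polynomial hierarchy:
\[
  \Sigma_1^p = \mathrm{NP}, \qquad
  \Sigma_{k+1}^p = \mathrm{NP}^{\Sigma_k^p}
\]
\end{document}
```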