Note: I skimmed this post because I didn’t want to think about maths right now.
Take a Q-value table for, I don’t know, a 3 state MDP. Apply TD-0 a ton until you get convergence. How has this system not learnt its base objective? Similairly, train any system to completion. How has it not learnt its base objective? If learning the base objective is impossible, then I think I’ve got a contradiction on my hands.
Note: I skimmed this post because I didn’t want to think about maths right now.
Take a Q-value table for, I don’t know, a 3 state MDP. Apply TD-0 a ton until you get convergence. How has this system not learnt its base objective? Similairly, train any system to completion. How has it not learnt its base objective? If learning the base objective is impossible, then I think I’ve got a contradiction on my hands.