“Note that we can represent sequential decision problems in this framework (e.g. Sudoku), elements of A would then be vectors of individual actions.”
Unless the environment is deterministic, you want to consider policies rather than vectors of actions. On a related note, instead of a uniform distribution over actions, we might consider a uniform distribution over programs for a prefix-free universal Turing machine. This solves your repeated-game paradox in the following sense: the program that always picks 9 has some fixed nonzero probability, independent of T, and does better than your agent for any T, so your agent’s score is bounded.
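
To make the boundedness argument concrete, here is a toy sketch in Python. It rests on assumptions that are mine, not taken from the post: each round you pick a digit 0-9 and the payoff is that digit, the agent always picks 9, and the score is -log2 of the prior mass on alternatives that do at least as well as the agent. The 20-bit length for the "always pick 9" program is a made-up number for illustration.

```python
import math

def score_uniform_actions(T, n_actions=10):
    """Score under a uniform prior over length-T action vectors.

    Assumed score: -log2 of the prior mass on alternatives that do at
    least as well as the agent. Only the all-9 vector qualifies, so
    that mass is n_actions**-T and the score grows linearly in T.
    """
    return T * math.log2(n_actions)

def score_bound_program_prior(const_program_bits):
    """Upper bound on the same score under a 2**-length program prior.

    The 'always pick 9' program alone contributes 2**-const_program_bits
    of prior mass for every T, so the score cannot exceed its length.
    """
    return float(const_program_bits)

HYPOTHETICAL_PROGRAM_BITS = 20  # illustrative assumption, not a real measurement

for T in (1, 10, 100):
    print(T, score_uniform_actions(T), score_bound_program_prior(HYPOTHETICAL_PROGRAM_BITS))
```

Under the uniform-over-actions prior the score grows without bound as T increases, while under the program prior it is capped by the length of the constant program no matter how large T gets.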