Consider the set of state-action pairs (s, a), where s ∈ S_train and a is the action that would be taken by the trained agent f(θ*) in state s.
Optimal policies are often explicitly stochastic: from a game-theoretic perspective, in the presence of other agents that are learning about you as you are learning about them, it is optimal to behave randomly rather than always being consistent in one’s choices. This is another line of generalisation that you don’t consider in the conclusion of the post.
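To make this concrete, here is a minimal sketch (my own illustration, not something from the post): in matching pennies, any deterministic policy is fully exploitable by an opponent that learns about you, whereas the uniformly random mixed policy is not.

```python
# Matching pennies: the row player wins +1 on a match, loses -1 on a mismatch.
# Any deterministic policy is fully exploitable; the uniform mixed policy is not.
ACTIONS = ("heads", "tails")

def payoff(me, opponent):
    """Row player's payoff: +1 on a match, -1 on a mismatch."""
    return 1.0 if me == opponent else -1.0

def exploitability(policy):
    """Best value an adapting opponent can extract against a fixed policy.

    `policy` maps each action to the probability of playing it.
    """
    return max(
        -sum(policy[a] * payoff(a, opp) for a in ACTIONS)
        for opp in ACTIONS
    )

deterministic = {"heads": 1.0, "tails": 0.0}
uniform_random = {"heads": 0.5, "tails": 0.5}

print(exploitability(deterministic))   # 1.0 -> fully exploitable
print(exploitability(uniform_random))  # 0.0 -> unexploitable (the equilibrium)
```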
I think the big flaw of the theoretical setup in this article is that it ignores the learning dynamics completely and treats them as “magic”, and is therefore not grounded in any learning theory. Grounding the setup in a learning theory immediately changes it: you can no longer talk about “the agent’s behaviour on the training data”, because hysteresis and continual learning change the agent (either its model states/parameters or its inferential states/variables/activations) every time the agent is “shown” the training data, including the very first, “original” demonstration, when the agent was actually trained. In real, complex agents (including gargantuan DNNs) hysteresis is important and cannot be ignored, so the actual learning dynamics (e.g., as observed at checkpoints) are relevant for determining how the agent will behave in the future. On the other hand, the notion of a fixed “reward function” that is “selected” during training becomes incoherent: rather, we should talk about a distribution of possible “reward functions”, and even the whole training trajectory doesn’t allow anyone, including the agent itself, to determine which particular reward function from that distribution the agent is “using”. Ontologically (or epistemologically), such an object simply doesn’t exist.
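A toy sketch of that last point (again my own illustration, not taken from the article): several candidate reward functions can rationalise the same observed training behaviour, so the trajectory pins down at best a set (or distribution) of reward functions, never a single one.

```python
# Toy reward-identifiability check: which candidate reward functions are
# consistent with the agent's observed behaviour on the training states?
STATES = [0, 1, 2]
ACTIONS = [0, 1]  # 0 = "left", 1 = "right"

# Observed behaviour of the trained agent on the training states.
observed_policy = {0: 1, 1: 1, 2: 0}

# Hypothetical candidate reward functions: reward[s][a].
candidates = {
    "R_a": {0: [0.0, 1.0], 1: [0.0, 2.0], 2: [5.0, 0.0]},
    "R_b": {0: [0.1, 0.9], 1: [0.3, 0.7], 2: [0.8, 0.2]},  # different values, same argmax
    "R_c": {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 1.0]},  # disagrees on states 0 and 2
}

def greedy_action(reward, s):
    """Action that (myopically) maximises the candidate reward in state s."""
    return max(ACTIONS, key=lambda a: reward[s][a])

def consistent(reward, policy):
    """Does greedily maximising `reward` reproduce the observed policy?"""
    return all(greedy_action(reward, s) == policy[s] for s in STATES)

print({name: consistent(r, observed_policy) for name, r in candidates.items()})
# {'R_a': True, 'R_b': True, 'R_c': False}
# R_a and R_b explain the data equally well: the "true" reward is underdetermined.
```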
Other learning-theoretic considerations are also important and relevant: e.g., some reward functions may not be effectively learnable by some algorithms (inductive biases), which opens up the (at least theoretical) possibility of designing algorithms that cannot learn a “bad” reward function (or, in the ontology I suggest above, if P is the probability distribution over reward functions that we infer for the agent at the end of the training process, P(f_bad) is either infinitesimal or zero).
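One way to read that suggestion numerically (an assumption of mine, not a claim about any existing algorithm): if the algorithm’s inductive bias is modelled as a prior that assigns zero mass to the “bad” reward function, then no amount of training evidence can raise its posterior probability above zero.

```python
# Bayes rule over a finite set of hypothetical candidate reward functions.
def posterior(prior, likelihood):
    unnormalised = {f: prior[f] * likelihood[f] for f in prior}
    z = sum(unnormalised.values())
    return {f: p / z for f, p in unnormalised.items()}

# "f_bad" is the reward function we never want learned; the prior (inductive
# bias) excludes it outright.
prior = {"f_good": 0.6, "f_ok": 0.4, "f_bad": 0.0}
likelihood = {"f_good": 0.5, "f_ok": 0.2, "f_bad": 0.9}  # even if data "favours" f_bad...

print(posterior(prior, likelihood))
# {'f_good': ~0.79, 'f_ok': ~0.21, 'f_bad': 0.0} -> P(f_bad) stays exactly zero
```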