I like the clarity of this post very much! Still, we should be aware that all this hinges on what exactly we mean by “the model”.
If “the model” refers only to one or more functions, such as a policy function pi(s), a state-value function V(s), and/or a state-action value function Q(s,a), but not to the training algorithm, then everything you write is fine. This is how RL theory uses the word “model”.
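To make that distinction concrete, here is a minimal tabular Q-learning sketch (my own toy illustration, not from the post; all names are hypothetical). The “model” here is just the table Q, which maps (state, action) pairs to values and never receives the reward directly; only the learning-algorithm step consumes the reward signal:

```python
# Toy illustration: the "model" is the Q-table; only the update rule sees the reward.

def make_q_table(states, actions):
    # The "model": a plain mapping from (state, action) to an estimated value.
    return {(s, a): 0.0 for s in states for a in actions}

def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    # The learning algorithm: this is the only place `reward` is ever touched.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

def greedy_policy(Q, s, actions):
    # Using the "model" to act: no reward signal appears here.
    return max(actions, key=lambda a: Q[(s, a)])
```

In the narrow sense, “the model” is `Q` and `greedy_policy`; in the broader sense, people fold `q_update` in as well, and then “the model” does see the reward.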
But some people here also use “the model” in a broader sense that potentially includes the learning algorithm which adjusts those functions, and in that case “the model” does see the reward signal. A better and more common term for the combination of model and learning algorithm is “agent”, but some people seem a little sloppy in distinguishing “model” from “agent”. One can of course also imagine architectures in which the distinction is less clear, e.g., when the whole “AI system” consists of still more components, such as several “agents”, each of which uses a different “model”. Some actor-critic systems, for example, can be interpreted as systems consisting of two agents (an actor and a critic). One can also imagine hierarchical systems in which a parameterized learning algorithm in a low-level component is adjusted by a (hyper-)policy function on a higher level, which is in turn learned by a second-level learning algorithm, which might itself be hyperparameterized by an even higher-level learned policy, and so on, up to one final “base” learning algorithm that was hard-coded by the designer.
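The “two agents” reading of actor-critic can also be sketched in a few lines (again my own hypothetical toy, not from the post): the critic’s update consumes the raw reward, while the actor’s update only ever sees the critic’s TD error.

```python
import math

# Toy actor-critic step: critic ("agent" 1) sees the reward,
# actor ("agent" 2) only sees the critic's TD error.

def softmax_probs(prefs):
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(V, prefs, s, a, reward, s_next,
                      alpha_v=0.1, alpha_pi=0.1, gamma=0.9):
    # Critic: learns state values directly from the reward signal.
    td_error = reward + gamma * V[s_next] - V[s]
    V[s] += alpha_v * td_error
    # Actor: adjusts its action preferences using only the TD error,
    # never the raw reward.
    probs = softmax_probs(prefs[s])
    for a2 in range(len(prefs[s])):
        grad = (1.0 if a2 == a else 0.0) - probs[a2]
        prefs[s][a2] += alpha_pi * td_error * grad
    return td_error
```

So even within one “AI system”, whether a given component “sees the reward” depends on where you draw the component boundaries.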
So, in the context of AGI or ASI, I believe “AI system” is the most useful term in this ontology: we cannot be sure what the architecture of an ASI will be, how many “agents” and “policies” on how many “hierarchical levels” it will contain, what their division of labor will be, or how many “models” they will use and adjust in response to observations of the environment.
In summary, since the outermost-level learning algorithm in such an “AI system” will generally see some form of “reward signal”, I believe most statements that are imprecisely phrased in terms of a “model” being “rewarded” can be fixed simply by replacing “model” with “AI system”.