Hmm, I feel like there are a bunch of notions that you use interchangeably here which are very different for me (us being able to describe a learning-based agent’s choices through a dynamic utility function, vs. an agent updating its value estimator via Bellman backups, vs. an agent learning to model the reward signal via predictive learning, vs. using a learned reward model as a reward function to train an agent), so I wanna clear that up first and foremost. The content below may be stuff you already know, but I think it would be helpful if you clarified which of these things you’re intending to make claims about.
Reward in the RL sense is a signal used to provide directional updates to cognition. Value in the RL sense is a forecast about the statistics of future reward signals, used to provide better/faster directional updates to cognition than would be possible with reward signals alone. Utility is a numeric encoding into which certain kinds of preferences, under certain kinds of decision-making procedures, can be translated. A “reward model” makes predictions about the reward signal directly, as opposed to merely making predictions about the reward signal’s overall statistics the way the value function does. Goals are components in plans aimed at future outcomes.
These are not the same[1].
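To make that concrete, here is a minimal tabular sketch (my own toy illustration; the function names and constants are made up): the value function bootstraps off the statistics of future reward, while the reward model is fit to predict the reward signal itself.

```python
# Toy illustration of the value-function vs. reward-model distinction.
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1  # discount factor and learning rate (arbitrary)

value = defaultdict(float)         # V(s): forecast of discounted future reward from state s
reward_model = defaultdict(float)  # R_hat(s, a): prediction of the reward signal itself

def td_update(state, reward, next_state):
    # Value function: bootstrapped estimate of long-run return, used to provide
    # better/faster directional updates than the raw reward signal alone.
    td_error = reward + GAMMA * value[next_state] - value[state]
    value[state] += ALPHA * td_error

def reward_model_update(state, action, reward):
    # Reward model: plain supervised prediction of the reward signal itself.
    prediction = reward_model[(state, action)]
    reward_model[(state, action)] += ALPHA * (reward - prediction)
```

The first update shapes an estimate of long-run return; the second merely tracks the reward signal, and says nothing about what the agent goes on to do with that prediction.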
Reward & value functions are sources of reinforcement that shape the circuits inside the agent. Goals and reward models are mental representations that the agent may use as tools to shape its plans. (For example, the thought “I know that to do ML research I need to learn about ML, and I want to do ML research, so I should read an ML textbook” uses a goal as a tool to achieve an outcome. The thoughts “If I eat this candy, then that will reinforce a bad habit, so I’ll abstain from it” and “If I eat this candy after I study, that’ll encourage me to study more, and I like that so I’ll do it” use model of reward signals as a tool to make more informed choices. Note that an agent can have a reward model and yet use it for something other than pursuing reward.) Depending on how the agent makes decisions, utility functions may or may not be reasonable compressed description of the sorts of tradeoffs that that it ends up making.
From my view:
I am agnostic as to whether the reward functions we use to train AIs will be learned via function approximation or hand-coded in the ordinary way (for a toy contrast, see the sketch just after this list).
I think it is likely that AIs will have something like RL value estimators, or at least that they’ll do something like TD learning.
I think it is likely that AIs will make use of explicitly-represented internal goals and planning procedures sometimes.
I think it is possible but not particularly overdetermined that it will be theoretically possible to nontrivially represent an AI’s decision-making with a utility function.
I think it is possible but not particularly overdetermined that AIs will sometimes use explicitly-represented utility functions to help them make certain local tradeoffs between options.
I think it is unlikely that an AI’s planning procedure will involve calling into an explicitly-represented and globally-defined utility calculator, whether a learned one or a fixed one.
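For the first bullet above, here is a toy contrast between the two kinds of reward function (my own sketch; the names and the network architecture are arbitrary, and the learned version stands in for something like a preference-trained reward model):

```python
# Illustrative contrast: a hand-coded reward function vs. one learned via
# function approximation and then used as the training signal for an agent.
import torch
import torch.nn as nn

def handcoded_reward(state: dict) -> float:
    # "Coded in the ordinary way": a fixed program we wrote down ourselves.
    return 1.0 if state.get("task_done") else 0.0

class LearnedRewardModel(nn.Module):
    # "Learned via function approximation": parameters fit to data
    # (e.g. preference labels), then queried as if it were the reward function.
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)  # scalar predicted reward per observation
```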
It is also helpful to think about these distinctions in map-and-territory terms. The reward signal is a thing in the territory. The agent itself is also a thing in the territory. A value function is a map (either an abstract one or a map in the territory) that compresses information derived from the reward signal in the territory. A reward model is a thing in the agent’s map of the territory, a thing that the agent treats as a map, and that thing attempts to accurately track the reward signal in the territory. A goal is also a thing in the agent’s map of the territory, one that can track pretty much anything. A utility function is a map that exists in our maps, a thing that we treat as a map, and that thing attempts to accurately track the agent’s behavior in the territory.