Is it fair to say that all the things you’re doing with multi-objective RL could also be called “single-objective RL with a more complicated objective”? Like, if you calculate the vector of values V, and then use a scalarization function S, then I could just say to you “Nope, you’re doing normal single-objective RL, using the objective function S(V).” Right?
(Not that there’s anything wrong with that, just want to make sure I understand.)
…this pops out at me because the two reasons I personally like multi-objective RL are not like that. Instead they’re things that I think you genuinely can’t do with one objective function, even a complicated one built out of multiple pieces combined nonlinearly. Namely, (1) transparency/interpretability [because a human can inspect the vector V], and (2) real-time control [because a human can change the scalarization function on the fly]. Incidentally, I think (2) is part of how brains work; an example of the real-time control is that if you’re hungry, entertaining a plan that involves eating gets extra points from the brainstem/hypothalamus (positive coefficient), whereas if you’re nauseous, it loses points (negative coefficient). That’s my model anyway, you can disagree :) As for transparency/interpretability, I’ve suggested that maybe the vector V should have thousands of entries, like one for every word in the dictionary … or even millions of entries, or infinity, I dunno, can’t have too much of a good thing. :-)
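A toy sketch of the real-time-control idea (all numbers, objective names, and the linear form of S are invented for illustration; nothing here is from anyone's actual implementation): the same value vector V can be re-scored on the fly just by swapping the weight vector, with a positive coefficient on "eat" when hungry and a negative one when nauseous.

```python
import numpy as np

# Hypothetical value vector for one candidate plan: each entry scores the
# plan along one objective (names are illustrative).
objectives = ["eat", "socialize", "rest"]
V = np.array([2.0, 1.0, 0.5])

def scalarize(V, weights):
    """Linear scalarization; the weights can be changed at runtime
    without retraining anything that produced V."""
    return float(np.dot(weights, V))

# Hungry: the "eat" objective gets a positive coefficient...
hungry = np.array([1.5, 1.0, 1.0])
# ...nauseous: the same objective gets a negative coefficient.
nauseous = np.array([-1.0, 1.0, 1.0])

print(scalarize(V, hungry))    # the eating-heavy plan scores high
print(scalarize(V, nauseous))  # the same plan now scores low
```

The point is that V itself never changes; only the scalarization does, which is exactly what makes human inspection of V and on-the-fly steering possible.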
You can apply the nonlinear transformation either to the rewards or to the Q-values; in both cases the aggregation can occur only after the transformation. When the transformation is applied to Q-values, the aggregation takes place quite late in the process: as Ben said, during action selection.
Both approaches, transforming the rewards and transforming the Q-values, are valid, but they have different philosophical interpretations and lead to different experimental outcomes in agent behaviour. I think both need more research.
For example, I would say that transforming the rewards rather than the Q-values is more risk-averse, as well as more "fair" towards individual timesteps, since the negative outcomes are not averaged out across time before being exponentiated. But it also results in slower learning by the agent.
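A tiny numerical sketch of that difference (the concave transform and the reward numbers are made up, not taken from the paper): applying the transform per timestep penalizes a single bad timestep much more heavily than applying it to the already-summed return, which has averaged the bad step away.

```python
import math

# One bad timestep among several good ones.
rewards = [1.0, -3.0, 1.0, 1.0]

# A concave (risk-averse) transform; illustrative choice only.
f = lambda x: -math.exp(-x)

# Transform each reward first, then aggregate: the bad step is penalized directly.
transform_then_sum = sum(f(r) for r in rewards)

# Aggregate first, then transform: the bad step has been averaged out.
sum_then_transform = f(sum(rewards))

print(transform_then_sum, sum_then_transform)
# transform_then_sum is far more negative: the per-timestep variant is more risk-averse.
```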
Finally, there is a third approach, which uses lexicographical ordering between objectives or sets of objectives; Vamplew has done work in this direction. This approach is truly multi-objective in the sense that there is no aggregation at all: the value vectors must be compared during action selection without being aggregated. The downside is that lexicographically ordering many objectives (or sets of objectives) becomes unwieldy.
I imagine that the lexicographical approach and our continuous nonlinear-transformation approach are complementary. There could be, for example, two main sets of objectives: one for alignment objectives, the other for performance objectives. Inside each set, nonlinear transformation followed by aggregation would be applied; between the sets, lexicographical ordering would be applied. In other words, there would be a hierarchy of objectives. With only two sets, the lexicographical ordering does not become unwieldy.
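A sketch of that two-set hybrid (the candidate actions, Q-vectors, and the transform are all invented for illustration): within each set the Q-values are nonlinearly transformed and aggregated, and the two aggregated scores are then compared lexicographically, alignment first.

```python
import math

def aggregate(q_values, transform=lambda x: -math.exp(-x)):
    """Nonlinear transform applied per objective, then summed within the set."""
    return sum(transform(q) for q in q_values)

# action -> (alignment Q-vector, performance Q-vector); made-up numbers.
candidates = {
    "a1": ([2.0, 2.0], [0.5]),
    "a2": ([2.0, -5.0], [9.0]),  # high performance, but one alignment objective fails badly
}

def key(action):
    align_q, perf_q = candidates[action]
    # Python compares tuples lexicographically: aggregated alignment score
    # decides first; aggregated performance score only breaks ties.
    return (aggregate(align_q), aggregate(perf_q))

best = max(candidates, key=key)
print(best)  # a1 wins despite a2's higher performance score
```

Because the risk-averse transform makes a2's failing alignment objective dominate its aggregated alignment score, the lexicographic comparison never even looks at a2's superior performance.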
This approach would be somewhat analogous to constraint programming, though more flexible: the safety objectives would act as a constraint on the performance objectives. This kind of constraint is almost absurdly missing from classical naive RL, even though it is essential, widely known, and technically well developed in practical applications of constraint programming. In the hybrid approach proposed above, the difference from classical constraint programming would be that among the safety objectives there would still be flexibility and the ability to make trade-offs (in a risk-averse way).
Finally, "multi-objective" does not refer only to the technical details of the computation. It also stresses the importance of researching, and making more explicit, the inherent presence and even structure of multiple objectives inside any abstract top-level objective: encoding knowledge in a way that constrains incorrect solutions but not correct ones. It likewise acknowledges the potential existence of even more complex, nonlinear interactions between these multiple objectives. We have not focused on nonlinear interactions between the objectives yet, but these interactions may well become relevant in the future.
I totally agree that in a reasonable agent the objectives or target values / set-points do change, as is also exemplified by biological systems.
That’s right. What I mainly have in mind is a vector of Q-learned values V and a scalarization function that combines them in some (probably non-linear) way. Note that in our technical work, the combination occurs during action selection, not during reward assignment and learning.
I guess whether one calls this "multi-objective RL" is a semantic question. Because the objectives are combined during action selection, not during learning itself, I would not call it "single-objective RL with a complicated objective". If the objectives were combined in the reward, then I would call it that.
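To make that distinction concrete, here is a minimal sketch (the Q-values and the form of S are invented; this is a guess at the shape of the scheme, not the actual implementation): learning keeps a vector of per-objective Q-values for each action, and the scalarization S enters only at action-selection time.

```python
import numpy as np

def S(v):
    """An assumed nonlinear scalarization; illustrative only."""
    return -np.sum(np.exp(-v))

# Q[a] is a vector of Q-values, one per objective, for action a (made-up numbers).
Q = {"left": np.array([1.0, 0.2]), "right": np.array([0.4, 0.9])}

# Multi-objective variant: learning stays vector-valued; S is applied only here.
action = max(Q, key=lambda a: S(Q[a]))
print(action)  # selects "right" here

# The "single-objective RL with a complicated objective" reading would instead
# collapse the reward vector with S before learning, so no vector V would
# survive for inspection or runtime re-weighting.
```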
re: your example of real-time control during hunger, I think yours is a pretty reasonable model. I haven't thought about homeostatic processes in this project (though my upcoming paper is all about them!). I'm definitely not suggesting that our particular implementation of "MORL" (if we can call it that) is the only or even the best sort of MORL; I'm just trying to get started on understanding it! I really like the way you put it. It makes me think that perhaps the brain is a multi-objective decision-making system with no single combinatory mechanism at all, except for the emergent winner of whatever kind of output happens in a particular context. That could plausibly differ depending on whether an action is moving limbs, talking, or mentally setting an intention for a long-term plan.
Great post, thanks for writing it!!
The links to http://modem2021.cs.nuigalway.ie/ are down at the moment. Is that temporary, or did the website move or something?
While the Modem website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing