Thanks for sharing negative results!
If I'm understanding you correctly, the structure looks something like this:
We have a toy environment where human preferences are both exactly specified and consequential.
We want to learn how hard it is to discover the human preference function, and whether it is "learned by default" in an RL agent that's operating in the world and just paying attention to consequences.
One possible way to check whether it's "learned by default" is to compare the performance of a predictor trained just on environmental data, a predictor trained just on the RL agent's internal state, and a predictor extracted from the RL agent.
The relative performance of those predictors should give you a sense of whether the environment or the agent's internal state gives you a clearer signal of the human's preferences.
It seems to me like there should be some environments where the human preference function is "too easy" to learn on environmental data (naively, the "too many apples" case should qualify?) and cases where it's "too hard" (like "judge how sublime this haiku is", where the RL agent will also probably be confused), and then there's some goldilocks zone where the environmental predictor struggles to capture the nuance but the RL agent has managed to capture it (and so the human preferences can be easily exported from the RL agent).
Does this frame line up with yours? If so, what are the features of the environments that you investigated that made you think they were in the goldilocks zone? (Or what other features would you look for in other environments if you had to continue this research?)
Hello Matthew,
I'm Mislav, one of the team members who worked on this project. Thank you for your thoughtful comment.
Yes, you understood what we did correctly. We wanted to check whether human preferences are "learned by default" by comparing the performance of a human preference predictor trained just on the environment data and a human preference predictor trained on the RL agent's internal state.
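To make the comparison concrete, here is a minimal sketch of that kind of probe comparison (illustrative only; the linear probes, feature shapes, and placeholder data below are assumptions for the example, not our actual setup):

```python
# Illustrative probe comparison (not the project's actual code).
# Assumes you have logged, for each timestep: the environment observation,
# the RL agent's internal activations, and a ground-truth preference label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def probe_accuracy(features, labels, seed=0):
    """Train a simple linear probe and report held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.25, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, probe.predict(X_test))

# Placeholder data standing in for real logs:
rng = np.random.default_rng(0)
env_obs = rng.normal(size=(1000, 64))        # flattened environment observations
agent_hidden = rng.normal(size=(1000, 128))  # agent's internal state at the same timesteps
pref_labels = rng.integers(0, 2, size=1000)  # ground-truth human preference labels

print("environment-only probe:", probe_accuracy(env_obs, pref_labels))
print("agent-internals probe: ", probe_accuracy(agent_hidden, pref_labels))
```

If the probe on the agent's internal state clearly outperforms the probe on the raw environment data, that gap is the signal that the agent has picked up the preference "by default".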
As for your question related to environments, I agree with you. There are probably some environments (like the gridworld environment we used) where the human preference is too easy to learn, others where it is too hard to learn, and then there's the golden middle in between.
One of our team members (I think it was Riccardo) had the idea of investigating a research question that could be posed as follows: "What kinds of environments are suitable for the agent to learn human preferences by default?" As you stated, it would then be useful to investigate the properties (features) of the environment and draw some conclusions about what characterizes the environments where the RL agent can learn human preferences by default.
This is a research direction that could build on our work here.
As for your question on why and how we chose what the human preference would be in a particular environment: to be honest, I think we were mostly guided by our intuition. Nevan and Riccardo experimented with a lot of different environment setups in VizDoom. Arun and I worked on setting up the PySC2 environment, but since training the agent on PySC2 demanded a lot of resources and was pretty unstable, and the VizDoom results turned out to be negative, we decided not to experiment with other environments further. So to recap: I think we were mostly guided by our intuition about which human preferences would be too easy, too hard, or just right to predict, and we course-corrected based on the experimental results.
Best,
Mislav