Thank you Erik, that was super valuable feedback and gives some food for thought.
It also seems to me that humans being suboptimal planners, and not knowing everything the AI knows, are the hardest (and most informative) problems in IRL. I’m curious what you’d think about this approach for addressing the suboptimal-planner sub-problem: “Include models from cognitive psychology about human decision-making in IRL, to allow IRL to better understand the decision process.” This would give IRL more realistic assumptions about the human planner, and possibly allow it to understand the human’s irrationalities and get to the values which drive behaviour.
Also do you have a pointer for something to read on preference comparisons?
I’m curious what you’d think about this approach for addressing the suboptimal-planner sub-problem: “Include models from cognitive psychology about human decision-making in IRL, to allow IRL to better understand the decision process.”
Yes, this is one of two approaches I’m aware of (the other being trying to somehow jointly learn human biases and values, see e.g. https://arxiv.org/abs/1906.09624). I don’t have very strong opinions on which of these is more promising; they both seem really hard. What I would suggest here is again to think about how to fail fast. The thing to avoid is spending a year on a project that’s trying to use a slightly more realistic model of human planning, and then realizing afterwards that the entire approach is doomed anyway. Sometimes this is hard to avoid, but in this case I think it makes more sense to start by thinking more about the limits of this approach. For example, if our model of human planning is slightly misspecified, how does that affect the learned reward function, and how much regret does that lead to? If slight misspecifications are already catastrophic, then we can probably forget about this approach, since we’ll surely only get a crude approximation of human planning.
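To make that concrete, one cheap thing to try is a toy misspecification experiment. Here’s a minimal sketch (my own construction, with a Boltzmann-rational human on a 3-armed bandit as an illustrative assumption, not anything from the papers above): simulate human choices under a true rationality parameter, then fit the reward by maximum likelihood while assuming the wrong one.

```python
# Toy sketch: maximum-likelihood IRL on a 3-armed bandit with a
# Boltzmann-rational human model, where the assumed rationality parameter
# (beta) differs from the true one. All numbers are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(0)

true_reward = np.array([0.0, 0.5, 1.0])   # the human's actual values for 3 actions
true_beta = 2.0                            # how noisily the human really chooses

def log_choice_probs(reward, beta):
    logits = beta * np.asarray(reward)
    return logits - logsumexp(logits)      # log of a softmax over actions

# Simulate 1000 human choices under the true model.
choices = rng.choice(3, size=1000, p=np.exp(log_choice_probs(true_reward, true_beta)))
counts = np.bincount(choices, minlength=3)

def fit_reward(assumed_beta):
    # Fix reward[0] = 0 to remove the softmax's shift invariance, then fit the
    # remaining rewards by maximizing the likelihood of the observed choices.
    nll = lambda r: -(counts * log_choice_probs(np.concatenate(([0.0], r)), assumed_beta)).sum()
    return np.concatenate(([0.0], minimize(nll, x0=np.zeros(2)).x))

for assumed_beta in (0.5, 2.0, 8.0):       # under-estimate, match, over-estimate rationality
    print(f"assumed beta = {assumed_beta}: inferred reward = {np.round(fit_reward(assumed_beta), 2)}")
# In this simple case getting beta wrong only rescales the inferred rewards, but
# with richer planner models, misspecification can also change their ordering.
```

In this case the planner model is right up to a single parameter, so the damage is mild; the interesting versions of the regret question above are about planner models that are wrong in structure, not just in a parameter.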
Also worth thinking about other obstacles to IRL. One issue is “how do we actually implement this?” Reward model hacking seems like a potentially hard problem to me if we just do a naive setup of reward model + RL agent. Or if you want to do something more like CIRL/assistance games, you need to figure out how to get a (presumably learned) agent to actually reason in a CIRL-like way (Rohin mentions something related in the second-to-last bullet here). Arguably those obstacles feel more like inner alignment, and maybe you’re more interested in outer alignment. But (1) if those turn out to be the bottlenecks, why not focus on them? And (2) if you want your agent to do very specific cognition, such as reasoning in a CIRL-like way, then it seems like you might need to solve a harder inner alignment problem, so even if you’re focused on outer alignment there are important connections.
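To gesture at what I mean by reward model hacking in the naive setup, here’s a deliberately tiny sketch (entirely my own illustrative construction): the reward model is a least-squares fit over two features that happen to be perfectly correlated in the labelled data, and the “agent” just maximizes the learned reward over states where one of the features can be spoofed.

```python
# Toy illustration of reward model hacking in a naive "reward model + RL agent"
# setup. Everything here (features, numbers) is an illustrative assumption.
import numpy as np

# Labelled states: true reward is "task progress", but a spoofable feature
# ("indicator light") is perfectly correlated with progress in the training data.
progress = np.linspace(0.0, 1.0, 50)
indicator = progress.copy()
X_train = np.stack([progress, indicator], axis=1)
y_train = progress                         # human labels reflect true progress

# Least-squares reward model; with perfectly correlated features it has no way
# to tell which one actually matters (the minimum-norm solution splits the weight).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
print("learned reward weights:", np.round(w, 2))    # roughly [0.5, 0.5]

# Two things the agent could do at deployment: actually make progress, or spoof
# the indicator, which is easy to push far outside its training range.
do_the_task  = np.array([1.0, 1.0])        # progress = 1, indicator follows honestly
spoof_sensor = np.array([0.0, 10.0])       # no progress, indicator cranked up

for name, state in [("do the task", do_the_task), ("spoof the sensor", spoof_sensor)]:
    print(f"{name}: predicted reward {state @ w:.1f}, true reward {state[0]:.1f}")
# The learned reward prefers spoofing, even though it fit the labels perfectly.
```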
I think there’s a third big obstacle (in addition to “figuring out a good human model seems hard”, and “implementing the right agent seems hard”), namely that you probably have to solve something like ontology identification even if you have a good model of human planning/knowledge. But I’m not aware of any write-up explicitly about this point. ETA: I’ve now written a more detailed post about this here.
Also do you have a pointer for something to read on preference comparisons?
If you’re completely unfamiliar with preference comparisons for reward learning, then Deep RL from Human Preferences is a good place to start. More recently, people have been using this to fine-tune language models; see e.g. InstructGPT or Learning to summarize from human feedback. People have also combined human demonstrations with preference comparisons (https://arxiv.org/abs/1811.06521), but usually that just means pretraining on demonstrations and then fine-tuning with preference comparisons (I think InstructGPT did this as well). AFAIK there isn’t really a canonical reference comparing IRL and preference comparisons and telling you which one you should use in which cases.
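In case a concrete picture helps, the core of those papers is a Bradley-Terry model over predicted segment returns, trained with a cross-entropy loss against the human’s comparisons. Here’s a minimal sketch; the linear reward model and synthetic preference labels are my illustrative assumptions, not the actual setup from the papers.

```python
# Minimal sketch of preference-comparison reward learning: a Bradley-Terry model
# over predicted segment returns, trained against (synthetic) human preferences.
import numpy as np
from scipy.special import expit   # numerically stable sigmoid

rng = np.random.default_rng(0)
obs_dim, seg_len, n_pairs = 4, 10, 500

w_true = rng.normal(size=obs_dim)          # hidden "human" reward weights
w_hat = np.zeros(obs_dim)                  # reward model parameters to learn

# Pairs of trajectory segments plus a label for which one the human preferred.
seg_a = rng.normal(size=(n_pairs, seg_len, obs_dim))
seg_b = rng.normal(size=(n_pairs, seg_len, obs_dim))
prefers_a = ((seg_a - seg_b) @ w_true).sum(axis=1) > 0

for step in range(1000):
    return_gap = ((seg_a - seg_b) @ w_hat).sum(axis=1)   # predicted return(A) - return(B)
    p_a = expit(return_gap)                              # Bradley-Terry: P(human prefers A)
    # Gradient of the average cross-entropy loss w.r.t. the reward weights.
    grad = ((p_a - prefers_a)[:, None, None] * (seg_a - seg_b)).sum(axis=(0, 1)) / n_pairs
    w_hat -= 0.1 * grad

cos = w_true @ w_hat / (np.linalg.norm(w_true) * np.linalg.norm(w_hat))
print(f"cosine similarity between learned and true reward weights: {cos:.3f}")
```

The same loss is what gets plugged into the language-model setups, just with a neural reward model over text and comparisons collected from real labellers rather than simulated ones.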