Yeah, this is basically CIRL, when the human model is smart enough to do Gricean communication. The important open problems left over after starting with CIRL are basically “how do you make sure that your model of communicating humans infers the right things about human preferences?”, both because of very obvious problems like human irrationality, and because of weirder stuff like the human intuition that we can’t put complete confidence in any single model.
Roughly, yeah, though there are some differences. For example, here the AI has no prior “directly about” values; everything is mediated by the “messages”, which themselves directly inform the intended AI behavior. So we don’t need to assume that “human values” live in the space of utility functions, or that the AI is going to explicitly optimize for something, or anything like that. But most of the things which are hard in CIRL are still hard here; it doesn’t really solve anything on its own.
One way to interpret it: this approach uses a similar game to CIRL, but strips out most of the assumptions about the AI and human being expected utility maximizers. To the extent we’re modelling the human as an optimizer, it’s just an approximation to kick off communication, and can be discarded later on.
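To make that distinction concrete, here is a rough toy sketch of the two kinds of prior, with all names, messages, and distributions invented purely for illustration: a CIRL-style prior directly over a utility-function parameter, versus a prior over what a message is asking the AI to do, where a Boltzmann-rational human model shows up only inside the likelihood used to kick off interpretation.

```python
import numpy as np

# --- CIRL-style: prior directly over a utility-function parameter ------------
# The AI assumes human values are some theta in a fixed space and that the
# human acts noisily-optimally for theta. (Toy numbers throughout.)
thetas = np.array([0.0, 0.5, 1.0])                  # candidate "value" parameters
prior_theta = np.ones(len(thetas)) / len(thetas)

def action_likelihood(action, theta, beta=5.0):
    """Boltzmann-rational human: P(action | theta) proportional to exp(beta * U_theta(action))."""
    utilities = -np.abs(np.array([0.0, 0.5, 1.0]) - theta)   # toy utility over 3 actions
    p = np.exp(beta * utilities)
    return (p / p.sum())[action]

observed_action = 2
posterior_theta = prior_theta * np.array([action_likelihood(observed_action, t) for t in thetas])
posterior_theta /= posterior_theta.sum()

# --- Message-mediated variant: prior over what messages ask the AI to do -----
# No utility-function space at all: the prior is over mappings from messages to
# intended AI behavior, and the "human as noisy optimizer" appears only in the
# likelihood used to bootstrap interpretation of messages.
meanings = {                                         # candidate message -> behavior maps
    "literal":  {"get it": "fetch", "hold on": "wait"},
    "cautious": {"get it": "ask",   "hold on": "wait"},
}
prior_meaning = {"literal": 0.5, "cautious": 0.5}

def message_likelihood(message, wanted_behavior, meaning, eps=0.1):
    """Gricean-ish speaker: usually sends a message whose meaning matches what they want."""
    return 1 - eps if meanings[meaning].get(message) == wanted_behavior else eps

message = "get it"                                   # observed human message
# Bootstrapping guess (only an approximation): the human wanted the AI to "ask".
posterior_meaning = {m: prior_meaning[m] * message_likelihood(message, "ask", m) for m in prior_meaning}
z = sum(posterior_meaning.values())
posterior_meaning = {m: v / z for m, v in posterior_meaning.items()}

print("CIRL-style posterior over theta:", dict(zip(thetas.tolist(), posterior_theta.round(3))))
print("Posterior over message meanings:", {m: round(v, 3) for m, v in posterior_meaning.items()})
```

The point of the second half is purely structural: the optimizer model lives only inside `message_likelihood`, so it can be swapped out or discarded once communication is up and running, without ever committing to a utility-function representation of “human values”.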