It seems to me that another potential failure mode of CIRL, depending on exactly how the game and the learning are structured, and on whether the game is played iteratively, is that the robot will eventually put low enough probability on the human deriving any utility from pressing the shutdown button, or from the robot actually shutting down once the button is pressed, that it will ultimately refuse to shut down.
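To make the worry concrete, here is a toy sketch (my own construction, not taken from any CIRL paper) of a robot running Bayesian updates over two hypotheses about whether the human values the shutdown button. All of the constants (press rates, utilities) are invented for illustration; the point is just that a long run of non-presses drives the posterior below the level where complying with a press still beats continuing.

```python
# Toy iterated off-switch game: the robot updates P(human values the button)
# each round it observes no press. All numbers here are made-up assumptions.
P_PRESS_IF_VALUED = 0.10   # assumed per-round press chance if the human values the button
P_PRESS_IF_NOT = 0.01      # assumed accidental-press chance if they don't
U_COMPLY = 1.0             # robot's utility estimate for respecting a press, if valued
U_CONTINUE = 0.05          # robot's utility estimate for ignoring a press and carrying on

posterior = 0.5            # prior P(human values the button)
for round_idx in range(200):
    # Observation this round: the human did not press the button.
    num = (1 - P_PRESS_IF_VALUED) * posterior
    posterior = num / (num + (1 - P_PRESS_IF_NOT) * (1 - posterior))

    # What the robot would do if the button WERE pressed right now:
    if posterior * U_COMPLY < U_CONTINUE:
        print(f"round {round_idx}: posterior={posterior:.3f}, robot would now ignore the button")
        break
```

With these numbers the posterior odds shrink by a constant factor each quiet round, so the robot crosses the "ignore the button" threshold after a few dozen rounds.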
Maybe a way to address this would be for the robot to model the human as having a utility function that changes in some way over time (although that may make learning intractably hard without some prior information about how the human's utility changes); there's a rough sketch of this idea below. Does this seem correct? My understanding of CIRL is not super complete.
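Here is a rough sketch of the time-varying-utility idea, under the assumed model that the human's valuation is a hidden state that flips with some small probability each round (all constants again invented). The prediction step mixes the posterior back toward uncertainty, so it settles at a nonzero floor instead of collapsing.

```python
# Same toy game, but the human's valuation is modeled as a hidden state
# that can flip between rounds with probability DRIFT (an assumed value).
P_PRESS_IF_VALUED = 0.10
P_PRESS_IF_NOT = 0.01
DRIFT = 0.02               # assumed per-round chance the human's valuation flips

posterior = 0.5
for round_idx in range(200):
    # Prediction step: the preference may have changed since last round.
    posterior = posterior * (1 - DRIFT) + (1 - posterior) * DRIFT
    # Correction step: Bayes update on another observed non-press.
    num = (1 - P_PRESS_IF_VALUED) * posterior
    posterior = num / (num + (1 - P_PRESS_IF_NOT) * (1 - posterior))

print(f"posterior after 200 quiet rounds: {posterior:.3f}")  # settles near 0.16, not 0
```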