You don’t give any citation for that. Is this an actual result, or just what you think really strongly?
Yeah, sorry, I thought there might be an appropriate citation but I didn’t find one. My thinking here is: in model-based RL, the model that best fits the data is one which correctly identifies the reward signal as coming from the reward button (or whatever the actual physical reward source is). The desired model, the one we want the system to learn, is instead one which, while perhaps less predictively accurate, models reward as coming from some kind of values. If you couple RL with process-level feedback, you can directly discourage models that attribute reward to the actual reward system and encourage models that identify it with other things, overcoming the incentive to model it accurately.
Similarly, human manipulation comes from a “mistaken” (but predictively accurate) model which says that the human’s values are precisely whatever-the-human-feedback-says-they-are (i.e., that humans are in some sense the final judge of their own values, so that any “manipulation” still reveals legitimate preferences by definition). Humans can provide feedback against this model, favoring models in which human feedback can be corrupted by various means, including manipulation.
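To make the selection pressure I have in mind concrete, here is a minimal toy sketch (my own illustration, not an existing method; all names and numbers are made up). In each case there are two candidate models: the predictively better one we want to discourage, and a slightly worse one we want to favor. A process-level penalty, supplied by human feedback on the model itself rather than on its predictions, can flip which candidate wins.

```python
# Toy illustration of process-level feedback overriding pure predictive fit.
# Numbers are arbitrary; the point is only the direction of the effect.

cases = {
    "reward modeling": {
        # (predictive log-likelihood, process-level penalty)
        "reward comes from the button": (-10.0, 5.0),
        "reward comes from human values": (-12.0, 0.0),
    },
    "human feedback": {
        "feedback is the final judge of values": (-20.0, 5.0),
        "feedback can be corrupted by manipulation": (-21.0, 0.0),
    },
}

def score(log_likelihood, penalty, penalty_weight=1.0):
    # Higher is better: predictive fit minus the weighted process-level penalty.
    return log_likelihood - penalty_weight * penalty

for case, candidates in cases.items():
    best = max(candidates, key=lambda name: score(*candidates[name]))
    print(f"{case}: selected model = {best!r}")

# With penalty_weight=1.0, both cases select the "values" / "corruptible feedback"
# model despite its slightly worse fit; with penalty_weight=0.0 (pure prediction),
# the "button" / "final judge" model wins instead.
```

The design choice this is meant to highlight: the penalty term is attached to how the model explains the reward/feedback signal, not to its predictions, which is what lets it push against the purely predictive incentive.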