Another aspect / method is, let’s call it, value hysteresis. If there are two functions which ambiguously both agree with the reward, then it’s possible that the agent will come across one interpretation first, and then (motivated by that first goal) resist adopting the other interpretation. Like how drugs and family both give us dopamine, but if we start by caring about our family, we may shutter at the thought of abandoning the family life for drugs, and vice versa! So maybe we need to have the target interpretation salient and simple to learn early in training (for the kind of introspective agent to which this consideration applies)? (Doesn’t seem all that reliable as a safety measure, but maybe worth keeping in mind..)
Another aspect / method is, let’s call it, value hysteresis. If there are two functions which ambiguously both agree with the reward, then it’s possible that the agent will come across one interpretation first, and then (motivated by that first goal) resist adopting the other interpretation. Like how drugs and family both give us dopamine, but if we start by caring about our family, we may shutter at the thought of abandoning the family life for drugs, and vice versa! So maybe we need to have the target interpretation salient and simple to learn early in training (for the kind of introspective agent to which this consideration applies)? (Doesn’t seem all that reliable as a safety measure, but maybe worth keeping in mind..)