This is a good question. I think it is both important and debated, and I don't think it has been adequately discussed and settled.
I think part of what you’re getting at is what I’ve called The alignment stability problem. You can see my thoughts there, including links to related work.
You may be referring more specifically to goal misgeneralization. Searching for that term on LW finds a good deal of discussion. That’s another potential form of alignment instability that I haven’t addressed.
That effect is termed a superstimulus in biology. Modern synthetic foods and pornography are, or at least attempt to be, superstimuli relative to the “training set” of reward-predictive (or perhaps directly rewarding) stimuli from our ancestral past.
Looking at the Google Scholar link in this article, what I'm describing more closely resembles "motivation hacking", except that, in my thought experiment, the agent doesn't modify its own reward system. Instead, it selects arbitrary actions and anticipates whether their reward is coincidentally more satisfying than the base objective. This allows it to perform this attack even while it's in the training environment.
Further, this sort of “attack” may simply be a component of the self-analysis an agent performs in pursuit of the base objective, so at no point does the agent need to exhibit deceptive or antagonistic behavior to exploit this vulnerability. It may be that an agent exploiting this vulnerability is fundamentally the same as an agent pursuing the base objective.
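For concreteness, here's a minimal toy sketch of the selection procedure I have in mind (Python/NumPy; the specific reward model, its miscalibrated region, and all names are hypothetical illustrations, not a claim about how a real system would be built):

```python
import numpy as np

rng = np.random.default_rng(0)

def base_objective_action(state):
    # The action the training process "intends": for this toy, do nothing.
    return np.zeros_like(state)

def learned_reward(state, action):
    # Stand-in for the agent's learned reward model. It mostly tracks the
    # intended objective (penalizing deviation from the intended action),
    # but is coincidentally miscalibrated in one region of action space.
    intended = -np.linalg.norm(action - base_objective_action(state))
    miscalibration = 20.0 if np.all(action > 3.0) else 0.0  # hypothetical quirk
    return intended + miscalibration

def pick_action(state, n_candidates=1000):
    # The procedure described above: start from the base-objective action,
    # then sample arbitrary actions and keep any whose *predicted* reward
    # happens to beat it -- no self-modification, just forward prediction.
    best = base_objective_action(state)
    best_reward = learned_reward(state, best)
    for _ in range(n_candidates):
        candidate = rng.normal(scale=10.0, size=state.shape)
        r = learned_reward(state, candidate)
        if r > best_reward:
            best, best_reward = candidate, r
    return best, best_reward

state = np.zeros(3)
action, reward = pick_action(state)
print("base-objective reward:", learned_reward(state, base_objective_action(state)))
print("chosen action:", action, "predicted reward:", reward)
```

Even in this toy, nothing in the loop looks deceptive or antagonistic; the agent is just optimizing its predicted reward, and the "attack" falls out of ordinary search over actions.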