I can think of two problems:
1. Let’s generously suppose that q is some fixed distribution of questions that we want the AI system to ask humans. A manipulative action may change the answers on q only a little, yet change the consequences of acting on those answers a lot.
2. Consider an AI system that optimizes a utility function that includes this kind of term for regularizing against manipulation. The actions that best fulfill this utility function may be ones that manipulate humans a lot (repurposing their resources for some other end) and then coerce them into answering questions in a “natural” way. That is, maybe impact is more like distance traveled (a path integral) than displacement.
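To make the distance-traveled-vs-displacement distinction concrete, here is a minimal toy sketch (my own illustration, not from the original comment): a trajectory through a state space that loops back to its starting point has zero displacement, so a displacement-style impact penalty sees nothing, while a path-length (discrete path integral) measure registers all the intermediate change.

```python
import math

def displacement(path):
    """Euclidean distance between the first and last state."""
    (x0, y0), (x1, y1) = path[0], path[-1]
    return math.hypot(x1 - x0, y1 - y0)

def distance_traveled(path):
    """Sum of step lengths along the trajectory (a discrete path integral)."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(path, path[1:])
    )

# A "manipulate, then restore appearances" trajectory: the state ends
# exactly where it began, so a displacement penalty reports zero impact.
loop = [(0, 0), (3, 0), (3, 4), (0, 4), (0, 0)]

print(displacement(loop))       # 0.0
print(distance_traveled(loop))  # 14.0
```

The point of the sketch is only that the two measures can disagree arbitrarily: an agent penalized on displacement is free to make large intermediate changes as long as it restores the measured variables afterward.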
Re #1, an obvious set of questions to include in q is questions of approval for various aspects of the AI’s policy. (In particular, if we want the AI to later calculate a human’s HCH and ask it for guidance, then we would like to be sure that HCH’s answer to that question is not manipulated.)

Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.