I can think of two problems:
1. Let’s generously suppose that q is some fixed distribution of questions that we want the AI system to ask humans. A manipulative action may change the answers on q only a little, yet change the consequences of acting on those answers a lot.
2. Consider an AI system that optimizes a utility function that includes this kind of term for regularizing against manipulation. The actions that best fulfill this utility function may be ones that manipulate humans a lot (repurposing their resources for some other end) and then coerce them into answering questions in a “natural” way. That is, maybe impact is more like distance traveled (a path integral) than displacement.
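To make the distance-traveled-vs-displacement distinction concrete, here is a minimal toy sketch (my own illustration, not from the original comment): a trajectory through a state space that loops back to its starting point has zero displacement, so a displacement-style impact penalty sees nothing, while a path-length (discrete path integral) measure registers all the intermediate change.

```python
import math

def displacement(path):
    """Euclidean distance between the first and last state."""
    (x0, y0), (x1, y1) = path[0], path[-1]
    return math.hypot(x1 - x0, y1 - y0)

def distance_traveled(path):
    """Sum of step lengths along the trajectory (a discrete path integral)."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(path, path[1:])
    )

# A "manipulate, then restore appearances" trajectory: the state ends
# exactly where it began, so a displacement penalty reports zero impact.
loop = [(0, 0), (3, 0), (3, 4), (0, 4), (0, 0)]

print(displacement(loop))       # 0.0
print(distance_traveled(loop))  # 14.0
```

The point of the sketch is only that the two measures can disagree arbitrarily: an agent penalized on displacement is free to make large intermediate changes as long as it restores the measured variables afterward.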
Re #1, an obvious set of questions to include in q is questions of approval for various aspects of the AI’s policy. (In particular, if we want the AI to later calculate a human’s HCH and ask it for guidance, then we would like to be sure that HCH’s answer to that question is not manipulated.)

Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.