HCH as a measure of manipulation

orthonormal 11 Mar 2017 3:02 UTC

LW: 1 AF: 1

A half-baked idea that came out of a conversation with Jessica, Ryan, and Tsvi:

We’d like to have a straightforward way to define “manipulation”, so that we could instruct an AI not to manipulate its developers, or construct a low-impact measure that takes manipulation as a particularly important impact.

We could initially define manipulation in terms of a human’s expected actions, or more robustly, in terms of effects on a human’s policy distribution across a wide array of plausible environments. However, we’d like to have our AI still be able to tell us information (in a non-manipulative manner) instead of hiding from us in an effort to avoid all influence!

The title of course spoils the next idea: if the AI can reason about some suitable model of HCH, then we can define the notion of “action a has very low influence on a human, as compared to the null action, apart from conveying information x” as the condition that, over a distribution of questions q,

$\mathrm{HCH}(q) \mid a \approx \mathrm{HCH}(x, q) \mid \mathrm{null}$

where HCH is defined relative to that human; we’re conditioning the distribution on whether the AI takes action a or the null action; and (x, q) is the input consisting of statement x followed by question q.
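
To make the shape of this condition concrete, here is a minimal Python sketch of how one might check it against a toy model. Everything in it is an assumption for illustration and not part of the proposal: `toy_hch` is a hypothetical stand-in for a model of HCH that returns a distribution over answers, “≈” is read as a small total variation distance averaged over sampled questions, and the `tolerance` value is arbitrary.

```python
from typing import Callable, Dict, Iterable, Optional

Dist = Dict[str, float]  # maps each possible answer to its probability


def total_variation(p: Dist, r: Dist) -> float:
    """Total variation distance between two answer distributions."""
    support = set(p) | set(r)
    return 0.5 * sum(abs(p.get(k, 0.0) - r.get(k, 0.0)) for k in support)


def is_low_influence(
    hch_model: Callable[[str, Optional[str]], Dist],
    action: str,
    info: str,
    questions: Iterable[str],
    tolerance: float = 0.05,  # assumed threshold for "approximately equal"
) -> bool:
    """Check HCH(q) | a  ~  HCH(x, q) | null, averaged over sampled questions.

    hch_model(prompt, action) is a hypothetical interface: it returns the
    answer distribution given that the AI took `action` (None = null action).
    """
    gaps = [
        total_variation(
            hch_model(q, action),            # HCH(q) given action a
            hch_model(f"{info}\n{q}", None), # HCH(x, q) given the null action
        )
        for q in questions
    ]
    return sum(gaps) / len(gaps) <= tolerance


def toy_hch(prompt: str, action: Optional[str]) -> Dist:
    """Toy stand-in for HCH: one yes/no answer, shifted by what the human knows."""
    informed = "the bridge is out" in prompt or action == "tell"
    p_yes = 0.9 if informed else 0.2
    if action == "bribe":  # an influence that is not just conveying information
        p_yes = 0.99
    return {"yes": p_yes, "no": 1.0 - p_yes}


questions = ["Should we take the detour?"]
print(is_low_influence(toy_hch, "tell", "the bridge is out", questions))
print(is_low_influence(toy_hch, "bribe", "the bridge is out", questions))
```

In this toy, the “tell” action passes because its whole effect is reproduced by prepending the statement to the question, while “bribe” fails because it moves the answers further than the statement alone would. Averaging over questions is one choice among several; taking the maximum gap instead would give a stricter test.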

This of course does not exclude the use of manipulative statements x, but it at least could allow us to reduce forms of manipulation to those that would happen with the text input to HCH.

I’d prefer to have the AI reason about HCH rather than just (e.g.) the human’s actions in a one-hour simulation, because HCH can in principle capture a human’s long-term and extrapolated preferences, and these are the ones I most want to ensure don’t get manipulated.

Is there an obvious failure of this approach, an obvious improvement to it, or something simpler that it reduces to?

What links here?

Comparing AI Alignment Approaches to Minimize False Positive Risk by Gordon Seidoh Worley (30 Jun 2020 19:34 UTC; 5 points)


Humans Consulting HCH

RyanCarey 11 Mar 2017 21:49 UTC
LW: 1 AF: 1
AF
I can think of two problems:
1. Let’s generously suppose that $q$ is some fixed distribution of questions that we want the AI system to ask humans. Some manipulative action may only change the answers on $q$ by a little bit but may yet change the consequences of acting on those responses by a lot.
2. Consider an AI system that optimizes a utility function that includes this kind of term for regularizing against manipulation. The actions that best fulfill this utility function may be ones that manipulate humans a lot (and repurpose their resources for some other function) and then coerce them into answering questions in a “natural way”. That is, maybe impact is more like distance traveled (i.e. a path integral) than displacement.
- orthonormal 13 Mar 2017 21:41 UTC
  0 points
  AF Parent
  Re #2, I think this is an important objection to low-impact-via-regularization-penalty in general.
- orthonormal 13 Mar 2017 21:29 UTC
  0 points
  AF Parent
  Re #1, an obvious set of questions to include in $q$ is questions of approval for various aspects of the AI’s policy. (In particular, if we want the AI to later calculate a human’s HCH and ask it for guidance, then we would like to be sure that HCH’s answer to that question is not manipulated.)
William_S 17 Mar 2017 0:23 UTC
0 points
Where does $x$ come from in this setup? Is $x$ arbitrary given $a$, so that the condition only guarantees that $a$ can manipulate you only in ways that could also have been achieved by giving you information (if I could persuade you via some argument to do what I want, then it’s okay for me to just pull a gun on you if it puts you into the same state)? Or would you have some additional assumption about the information in $x$ being a “reasonable equivalent” of $a$?
orthonormal 13 Mar 2017 21:27 UTC
0 points
AF
There’s the additional objection of “if you’re doing this, why not just have the AI ask HCH what to do?”

Overall, I’m hoping that it could be easier for an AI to robustly conclude that a certain plan only changes a human’s HCH via certain informational content, than for the AI to reliably calculate the human’s HCH. But I don’t have strong arguments for this intuition.
- jessicata 13 Mar 2017 22:03 UTC
  LW: 1 AF: 1
  AF Parent
  “Having a well-calibrated estimate of HCH” is the condition you want, not “being able to reliably calculate HCH”.
  - orthonormal 14 Mar 2017 1:49 UTC
    LW: 1 AF: 1
    AF Parent
    I should have said “reliably estimate HCH”; I’d also want quite a lot of precision in addition to calibration before I trust it.