Chipmonk comments on Reward Hacking from a Causal Perspective

Chipmonk 12 Aug 2023 15:02 UTC
1 point
0
AF
> One definition of manipulation is intentional and covert influence. Content recommenders can satisfy this definition, as they are typically trained to influence the user by any means, including “covert” ones like appealing to the user’s biases and emotions.
I don’t think that “covert” is a coherent thing an (e.g.) content recommender could optimize against. For example, anything could appeal to the biases and emotions of the wrong person. Anything can be rude/triggering/bias-inducing to the right person. In which case, how do you classify what is covert and what isn’t in a way that isn’t entirely subjective and also isn’t beholden to (arbitrary) social norms?
I still think it’s possible to define manipulation ~objectively, though: in terms of infiltration across human Markov blankets.
- tom4everitt 14 Aug 2023 17:06 UTC
  2 points
  0
  Parent
  The point here isn’t that the content recommender is optimised to use covert means in particular, but that it is not optimised to avoid them. Therefore it may well end up using them, as they might be the easiest path to reward.
  Re Markov blankets, won’t any kind of information penetrate a human’s Markov blanket, as any information received will alter the human’s brain state?
  - Chipmonk 14 Aug 2023 17:43 UTC
    4 points
    0
    Parent
    > The point here isn’t that the content recommender is optimised to use covert means in particular, but that it is not optimised to avoid them. Therefore it may well end up using them, as they might be the easiest path to reward.
    Yes, but I’m not sure that there is such a distinction as “using them” or “not using them”.
    > Re Markov blankets, won’t any kind of information penetrate a human’s Markov blanket, as any information received will alter the human’s brain state?
    No. For example, imagine a bacterium with a membrane. The bacterium has methods of controlling what influence flows in and out, e.g. it has ion channels. So, here I define “irresistible manipulation” as “influence that stabs through the bacterium’s membrane”.
    But influence that the bacterium “willingly” allows through its ion channels/whatever is fine (because if it didn’t “want” the influence it didn’t have to let it in).
    Andrew Critch (in «Boundaries» 3a) defines this as
    “Infiltration” of information from the environment into the active boundary & viscera:
    $\mathrm{Infil}(\phi) := \mathrm{Agg}_{t \geq 0}\, \mathrm{Mut}_{W^{\omega} \sim \phi}\big((V_{t+1}, A_{t+1});\, E_{t} \mid (V_{t}, A_{t}, P_{t})\big)$
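    A minimal sketch of the conditional mutual information term inside that formula, assuming small discrete variables. The function name `conditional_mutual_information` and the two toy distributions are hypothetical, not taken from Critch's post, and the sketch ignores the aggregation over time and the distribution over world histories. Here x stands in for the next inside state, y for the environment, and z for everything being conditioned on.

```python
from collections import defaultdict
from math import log2


def conditional_mutual_information(joint):
    """Return I(X; Y | Z) in bits for a joint pmf given as {(x, y, z): probability}."""
    p_z = defaultdict(float)
    p_xz = defaultdict(float)
    p_yz = defaultdict(float)
    # Marginalise the joint table to get p(z), p(x, z) and p(y, z).
    for (x, y, z), p in joint.items():
        p_z[z] += p
        p_xz[(x, z)] += p
        p_yz[(y, z)] += p
    total = 0.0
    for (x, y, z), p in joint.items():
        if p > 0:
            # p(x, y, z) * p(z) / (p(x, z) * p(y, z)) is the usual ratio inside the log.
            total += p * log2(p * p_z[z] / (p_xz[(x, z)] * p_yz[(y, z)]))
    return total


# Toy case 1 (assumption): the next inside state copies z, so the environment y is screened off.
screened = {(z, y, z): 0.25 for y in (0, 1) for z in (0, 1)}

# Toy case 2 (assumption): the next inside state copies the environment y, whatever z is.
infiltrated = {(y, y, z): 0.25 for y in (0, 1) for z in (0, 1)}

print(conditional_mutual_information(screened))
print(conditional_mutual_information(infiltrated))
```

    In the first toy distribution the environment says nothing extra about the next inside state once z is known, so the term is zero. In the second, the next inside state simply copies the environment, so the term is one bit.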
    Longer explanation, from a draft I’m writing:
    Formalizing (irresistible) aggression
    Markov blankets
    Past work has formalized what I mean here by irresistible manipulation via Markov blankets. In this section, I will explain what Markov blankets mean for this purpose.
    By the end of this section, you will be able to understand this (Pearlian causal) diagram:
    (Note: I will assume that you have basic familiarity with Markov chains.)
    First, I want you to imagine a simple Markov chain that represents the fact that a human influences itself over time:
    Second, I want you to imagine a Markov chain that represents the fact that the environment (~ the complement of the human; the rest of the universe minus the human) influences itself over time:
    Okay. Now, notice that in between the human and the environment there’s some kind of membrane: for example, the human’s skin (physical membrane) and their interpretation/cognition (informational membrane). If this were not a human but instead a bacterium, then the membrane I mean would (mostly) be the bacterium’s literal membrane.
    Third, imagine a Markov chain that represents that membrane influencing itself over time:
    Okay, so we have these three Markov chains running in parallel:
    But they also influence each other, so let’s build that too.
    How does the environment affect a human? Notice that whenever the environment affects a human, it doesn’t influence them directly, but instead it influences their skin or their cognition (their membrane), and then their membrane influences them.
    For example, I shine light in your eyes (part of the environment), it activates your eyes (part of your membrane), and your eyes send information to your brain (part of your insides).
    Which is to say, this is what does not happen:
    (This is called “infiltration”.) The environment does not directly influence the human.
    Instead, the environment influences the membrane which influences them, which looks like this:
    Okay, now let’s do the other direction. How does the human influence the environment? It’s not that a human controls the environment directly:
    (This is called “exfiltration”; this does not happen.)
    but that they take actions (via their membrane), and then their actions affect the environment:
    Okay, putting together both directions, human-influences-environment and environment-influences-human, we get:
    Also, I want you to notice which arrows are conspicuously missing from the diagram above:
    So that’s how we can model the approximate causal separation between an agent and the environment.
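    One way to write the same separation down as an equation, as a sketch under assumptions: the symbols $H_t$ (human), $M_t$ (membrane) and $E_t$ (environment) and the choice to draw every arrow from time $t$ to time $t+1$ are illustrative labels, not notation from the original diagrams. With those assumptions, the one-step transition factorizes so that the human's update never reads the environment directly, and the environment's update never reads the human directly:

    $$
    P(H_{t+1}, M_{t+1}, E_{t+1} \mid H_t, M_t, E_t) = P(H_{t+1} \mid H_t, M_t)\; P(M_{t+1} \mid H_t, M_t, E_t)\; P(E_{t+1} \mid M_t, E_t)
    $$

    The two missing arrows then correspond to two conditional independence statements, the first for infiltration and the second for exfiltration:

    $$
    H_{t+1} \perp E_t \mid (H_t, M_t), \qquad E_{t+1} \perp H_t \mid (M_t, E_t)
    $$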
    With that, now we can define what irresistible manipulation is.
    Irresistible aggression is exactly this:
    Irresistible aggression is infiltration across human Markov blankets.
    Of course, in reality, there’s actually leakage and the real Markov blanket does include those arrows I said were missing, but humans are agents that actively minimize that leakage.
    For example:
    You don’t want to be directly controlled by your environment. (You don’t want infiltration.)
    Instead, you want to take in information and then be able to decide what to do with it. You want to have a say about how things affect you.
    A bacterium wants things to go through its gates and ion channels, and not just stab through its membrane.
    You don’t want the way that you’re influencing the world to be by people mind-reading you. (Exfiltration^[1])
    Instead, you want to be affecting the world intentionally, through your actions.
    If you believed that someone might be able to predict you well or get close to predicting you well and you don’t want that, you would probably take evasive maneuvers.
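    To make the leakage point concrete, here is a toy interventional check, assuming a one-bit membrane and a one-bit environment. The update rule `step_inside`, the `leak` parameter and the function `infiltration_effect` are all hypothetical: with probability `leak` the environment overwrites the inside state directly (infiltration), and otherwise the inside state only sees the membrane.

```python
import random


def step_inside(m, e, leak, rng):
    """Hypothetical update for the next inside state.

    With probability `leak` the environment bit e is written straight into the
    inside state; otherwise the inside state copies the membrane bit m.
    """
    if rng.random() < leak:
        return e
    return m


def infiltration_effect(leak, trials=10_000, seed=0):
    """Fraction of trials where flipping the environment bit changes the next
    inside state while the membrane bit is held fixed."""
    changed = 0
    for i in range(trials):
        m = 0
        # Both arms use an identically seeded generator, so the only
        # difference between them is the environment bit.
        out0 = step_inside(m, 0, leak, random.Random(seed + i))
        out1 = step_inside(m, 1, leak, random.Random(seed + i))
        changed += out0 != out1
    return changed / trials


for leak in (0.0, 0.1, 1.0):
    print(leak, infiltration_effect(leak))
```

    With no leak, changing the environment while holding the membrane fixed never changes the inside state. With some leak, it changes it in roughly that fraction of trials, which is the sense in which a real blanket is only approximately a blanket.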
    [This section is largely based on Andrew Critch’s «Boundaries», Part 3a: Defining boundaries as directed Markov blankets. His post also has more technical details (relating to mutual information).^[2]]
    ^[1] It may also be preferable to avoid exfiltration across human Markov blankets too (which would be arrows directly from H→E), but it’s not clear to me that that can be prevented. Exfiltration is like privacy. Related: 1, 2. I need to think more about this. Let me know if you have thoughts.
    ^[2] Critch also bifurcates the membrane into an action-like component and a perception-like component, but I omitted that detail above.
    - tom4everitt 15 Aug 2023 9:58 UTC
      1 point
      0
      Parent
      I see, thanks for the careful explanation.
      I think the kind of manipulation you have in mind is bypassing the human’s rational deliberation, which is an important one. This is roughly what I have in mind when I say “covert influence”.
      So in response to your first comment: given that the above can be properly defined, there should also be a distinction between using and not using covert influence.
      As for whether manipulation can be defined as penetration of a Markov blanket: it’s possible. I think my main question is how much it adds to the analysis to characterise it in terms of a Markov blanket, because it’s non-trivial to define the membrane variable in such a way that information that “covertly” passes through my eyes and ears bypasses the membrane, while other information is mediated by the membrane.
      The SEP article does a pretty good job of spelling out the many different forms manipulation can take: https://plato.stanford.edu/entries/ethics-manipulation/
