Most of the time when I look at a message, a bunch of automated systems have looked at it first and will inform me about the intended effect of the message, so that I can respond appropriately or decide whether to read it.
This seems like the most important part so I’ll just focus on this for now. I’m having trouble seeing how this can work. Suppose that I, as an attacker, tell my AI assistant, “interact with Paul in my name (possibly over a very long period of time) so as to maximize the chances that Paul eventually ends up believing in religion/ideology/moral theory X and then starts spreading X to his friends” (while implicitly minimizing the chances of these messages/interactions being flagged by your automated systems as adversarial). How would your automation distinguish between me doing this and me trying to have a normal human conversation with you about various topics, including what’s moral/normative? Or if the automation isn’t trying to directly make this judgment, what is it telling you to allow you to make this judgment? Can you give a concrete example of a sentence that it might say to you, upon seeing some element of the series of messages/interactions?
Trying to imagine for myself how an automated filter might work, here’s a possible “solution” I came up with. Perhaps your AI maintains a model / probability distribution of things that an uncompromised Wei might naturally say, and flags anything outside or on the fringes of that distribution as potential evidence that I’ve been compromised by an AI-powered attack and am now trying to attack you. (I’m talking in binary terms of “compromised” and “uncompromised” for simplicity, but of course it will be more complicated than that in reality.)
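To make that guess more concrete, here’s a minimal Python sketch of the kind of filter I’m imagining. Everything in it is a hypothetical stand-in: `log_prob_uncompromised` is whatever model your AI maintains of an uncompromised me, and the threshold would need real calibration.

```python
# Minimal sketch of the guessed filter (not anything Paul has proposed):
# score each incoming message by how plausible it is under a model of
# "things an uncompromised Wei would naturally say", and flag anything on
# the fringes of that distribution as possible evidence of compromise.
from dataclasses import dataclass
from typing import Callable

@dataclass
class FilterVerdict:
    log_likelihood: float
    flagged: bool
    note: str

def screen_message(message: str,
                   log_prob_uncompromised: Callable[[str], float],
                   threshold: float = -50.0) -> FilterVerdict:
    """Flag messages that the sender-model considers unlikely.

    `log_prob_uncompromised` and `threshold` are hypothetical placeholders;
    in reality the model would also have to track natural drift in my views.
    """
    ll = log_prob_uncompromised(message)
    if ll < threshold:
        return FilterVerdict(ll, True,
                             "On the fringes of the modeled distribution; "
                             "possible AI-assisted manipulation.")
    return FilterVerdict(ll, False, "Within the expected distribution for this sender.")
```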
Is this close to what you’re thinking? (If not, apologies for going off on a tangent.) If so, given that I would “naturally” change my mind over time (i.e., based on my own thinking or talking with other uncompromised humans), it seems that your AI has to model that as well. I can imagine that in such a scenario, if I ever changed my mind in an unexpected (by the AI model) direction and wanted to talk to you about that, my own AI might say something like “If you say this to Paul, his AI will become more suspicious that you’ve been compromised by an AI-powered attack, and your risk of getting blocked now or in the future increases by Y. Are you sure you still want to say this to Paul?” So at this point, collective human philosophical/moral progress would be driven more by what AI filters expect and let pass than by what physical human brains actually compute, so we had better get those models really right. But that faces the seemingly difficult problems I mentioned at Replicate the trajectory with ML?, and it doesn’t seem like anyone is working on such problems.
If we fail to get such models good enough early on, that could lock in failure as it becomes impossible to meaningfully collaborate with other humans (or human-AI systems) to try to improve such models, as you can’t distinguish whether they’re genuinely trying to make better models with you, or just trying to change your models as part of an attack.
This isn’t the kind of approach I’m imagining.
This seems like the most important part so I’ll just focus on this for now
I’m not sure if it’s the most important part. If you are including filtering (and not updates about whether people are good to talk to / legal liability / etc.) then I think it’s a minority of the story. But it still seems fine to talk about (and it’s not like the other steps are easier).
Suppose that I, as an attacker, tell my AI assistant, “interact with Paul in my name (possibly over a very long period of time) so as to maximize the chances that Paul eventually ends up believing in religion/ideology/moral theory X and then starts spreading X to his friends”
Suppose your AI chooses some message M which is calculated to lead to Paul making (what Paul would or should regard as) an error. It sounds like your main question is how an AI could recognize M as problematic (i.e. such that Paul ought to expect to be worse off after reading M, such that it can either be filtered or caveated, or such that this information can be provided to reputation systems or arbiters, and so on).
My current view is that the sophistication required to recognize M as problematic is similar to the sophistication required to generate M as a manipulative action. This is clearest if the attacker just generates a lot of messages and then picks M that they think will most successfully manipulate the target—then an equally-sophisticated defender will have the same view about the likely impacts of M.
This is fuzzier if you can’t tell the difference between deliberation and manipulation. If I define idealized deliberation as an individual activity then I can talk about the extent to which M leads to deviation from idealized deliberation, but it’s probably more accurate to think of idealized deliberation as a collective activity. But as far as I can tell the basic story is still intact (e.g. I still have the intuition that “knowing how to manipulate the process is roughly the same as recognizing manipulation,” it’s just fuzzier).
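As a toy sketch of the clearest case above (the attacker generates many candidate messages and picks the most manipulative one), here is what defender/attacker parity looks like if attacker and defender share the same influence model; `predicted_shift` is a hypothetical stand-in for that model, not a real system:

```python
# Toy sketch of defender/attacker parity: if the defender can evaluate the
# same "predicted influence" score the attacker optimized over, it arrives
# at the same view of the message's likely impact.

def attacker_pick(candidates, predicted_shift):
    # Attacker: generate many messages, send the one expected to push the
    # reader hardest toward the target conclusion X.
    return max(candidates, key=predicted_shift)

def defender_assess(message, predicted_shift, caveat_threshold=0.5):
    # Equally-sophisticated defender: scores the incoming message with the
    # same model, so it sees the same expected impact the attacker saw.
    shift = predicted_shift(message)
    return {"expected_shift_toward_X": shift, "caveat": shift > caveat_threshold}

# Example with a placeholder influence model (purely illustrative):
if __name__ == "__main__":
    shift = lambda m: 0.9 if "you must accept x" in m.lower() else 0.1
    candidates = ["Here's an argument to consider.", "Surely you must accept X."]
    m = attacker_pick(candidates, shift)
    print(m, defender_assess(m, shift))
```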
Can you give a concrete example of a sentence that it might say to you, upon seeing some element of the series of messages/interactions?
It’s probably helpful to get more concrete about the kind of attack you are imagining (which is presumably easier than getting concrete about defenses—both depend on future technology but defenses also depend on what the attack is).
If your attack involves convincing me of a false claim, or making a statement from which I will predictably make a false inference, then the ideal remedy would be explaining the possible error; if your attack involves threatening me, then an ideal remedy would be to help me implement my preferred policy with respect to threats. And so on.
I suspect you have none of these examples in mind, but it will be easier to talk about if we zoom in.
This is fuzzier if you can’t tell the difference between deliberation and manipulation. If I define idealized deliberation as an individual activity then I can talk about the extent to which M leads to deviation from idealized deliberation, but it’s probably more accurate to think of idealized deliberation as a collective activity.
How will your AI compute “the extent to which M leads to deviation from idealized deliberation”? (I’m particularly confused because this seems pretty close to what I guessed earlier and seems to face similar problems, but you said that’s not the kind of approach you’re imagining.)
If your attack involves convincing me of a false claim, or making a statement from which I will predictably make a false inference, then the ideal remedy would be explaining the possible error; if your attack involves threatening me, then an ideal remedy would be to help me implement my preferred policy with respect to threats. And so on.
The attack I have in mind is to imitate a normal human conversation about philosophy or about what’s normative (what one should do), but AI-optimized with a goal of convincing you to adopt a particular conclusion. This may well involve convincing you of a false claim, but one of a philosophical nature such that you and your AI can’t detect the error (unless you’ve solved the problem of metaphilosophy and know what kinds of reasoning reliably lead to true or false conclusions about philosophical problems).
I think I misunderstood what kind of attack you were talking about. I thought you were imagining humans being subject to attack while going about their ordinary business (i.e. while trying to satisfy goals other than moral reflection), but it sounds like in the recent comments you are imagining cases where humans are trying to collaboratively answer hard questions (e.g. about what’s right), some of them may sabotage the process, and none of them are able to answer the question on their own and so can’t avoid relying on untrusted data from other humans.
I don’t feel like this is going to overlap too much with the story in the OP, since it takes place over a very small amount of calendar time—we’re not trying to do lots of moral deliberation during the story itself, we’re trying to defer moral deliberation until after the singularity (by decoupling it from rapid physical/technological progress), and so the action you are wondering about would have happened after the story ended happily. There are still kinds of attacks that are important (namely those that prevent humans from surviving through to the singularity).
Similarly, it seems like your description of “go in an info bubble” is not really appropriate for this kind of attack—wouldn’t it be more natural to say “tell your AI not to treat untrusted data as evidence about what is good, and try to rely on carefully chosen data for making novel moral progress”?
So in that light, I basically want to decouple your concern into two parts:
1. Will collaborative moral deliberation actually “freeze” during this scary phase, or will people e.g. keep arguing on the internet and instruct their AI that it shouldn’t protect them from potential manipulation driven by those interactions?
2. Will human communities be able to recover mutual trust after the singularity in this story?
I feel more concerned about #1. I’m not sure where you are at.
(I’m particularly confused because this seems pretty close to what I guessed earlier and seems to face similar problems, but you said that’s not the kind of approach you’re imagining.)
I was saying that I think it’s better to directly look at the effects of what is said rather than trying to model the speaker and estimate whether they are malicious (or have been compromised). Though I did leave a short comment as a placeholder until writing the grandparent. Also, I agree that in the case you seem to have had in mind, my proposal is going to look a lot like what you wrote (see below).
How will your AI compute “the extent to which M leads to deviation from idealized deliberation”?
Here’s a simple case to start with:
My AI cares about some judgment X that I’d reach after some idealized deliberative process.
We may not be able to implement that process, and at any rate I have other preferences, so instead the AI observes the output X’ of some realistic deliberative process embedded in society.
After observing my estimate X’, the AI acts on its best guess X″ about X.
An attacker wants to influence X″, so they send me a message M designed to distort X’ (which they hope will in turn distort X″).
In this case I think it’s true and easy to derive that:
If my AI knows what the attacker knows, then updating on X’ and on the fact that the attacker sent me M can’t push X″ in any direction that’s predictable to the attacker.
Moreover, if my reading M changes X’ in some predictable-to-the-attacker direction, then my AI knows that reading M makes X’ less informative about X.
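As an illustration, here’s a toy simulation with Gaussian details filled in that aren’t part of the setup above: the attacker sees k noisy samples of X and forwards the largest, most X-inducing one as M; a defender who updates as if M were an ordinary sample gets pushed in the attacker’s chosen direction, while a defender who knows the selection rule (i.e. knows what the attacker knows) does not.

```python
# Toy simulation of the "simple case": X is the idealized judgment, the
# attacker sees k noisy samples of it and forwards the largest one as M.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
k, sigma, trials = 8, 1.0, 2000
grid = np.linspace(-6, 6, 1201)          # grid over candidate values of X
prior = norm.pdf(grid)                   # standard normal prior on X

def posterior_mean(likelihood):
    post = prior * likelihood
    return float(np.sum(grid * post) / np.sum(post))

naive_err, informed_err = [], []
for _ in range(trials):
    x_true = rng.normal()                          # the idealized judgment X
    samples = x_true + sigma * rng.normal(size=k)  # attacker's k candidate observations
    m = samples.max()                              # the most X-inducing one becomes M

    # Naive defender: treats M as one ordinary noisy observation of X.
    naive_err.append(posterior_mean(norm.pdf(m, loc=grid, scale=sigma)) - x_true)

    # Informed defender: knows M is the maximum of k samples, so uses
    # p(M | X) = k * pdf(M; X, sigma) * cdf(M; X, sigma)^(k-1).
    lik = k * norm.pdf(m, loc=grid, scale=sigma) * norm.cdf(m, loc=grid, scale=sigma) ** (k - 1)
    informed_err.append(posterior_mean(lik) - x_true)

print("naive defender mean error   : %+.3f" % np.mean(naive_err))     # pushed toward higher X
print("informed defender mean error: %+.3f" % np.mean(informed_err))  # ~0: no predictable push
```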
I’d guess you are on board with the simple case. Some complications in reality:
1. We don’t have any definition of the real idealized deliberative process.
2. The nature of my AI’s deference/corrigibility is quite different than simply regarding my judgment as evidence about X.
3. The behavior of other people also provides evidence about X, and attackers may have information about other people (that the defender lacks).
My best guess had been that you were worried about 1 or 2, but from the recent comments it sounds like you may actually be thinking about 3.
The attack I have in mind is to imitate a normal human conversation about philosophy or about what’s normative (what one should do), but AI-optimized with a goal of convincing you to adopt a particular conclusion
Filling in some details in a simple way: let’s suppose that the attacker just samples a few plausible things for a human to say, then outputs the one that leads me to make the highest estimate for X. We believe that using “natural” samples from the distribution would yield an endorsed outcome, but that if you consistently pick X-inducing samples then the errors will compound and lead to something bad.
Then in my proposal a defender would observe X-inducing samples, and could tell that they are X-inducing (since the attacker could tell that, and I think we’re discussing the case where the defender has parity—if not then I think we need to return to some of the issues we set aside earlier). The defender would not initially know whether these are due to chance or to manipulation. But after a few instances of this they will notice that errors tend to push in a surprisingly X-inducing direction, and that the upcoming X-inducing samples are therefore particularly harmful.
This is basically what you proposed in this comment, though I feel the defender can judge based on the object-level observation that the X-inducing outputs are actually bad, rather than by explicitly flagging corruption.
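A toy version of that “notice after a few instances” step, under assumptions that aren’t in the comments above (each message gets a signed push-toward-X score, and unmanipulated deliberation pushes toward or away from X about equally often), so a simple binomial tail test can flag a surprisingly consistent X-inducing drift:

```python
# Toy sketch of noticing a surprisingly consistent X-inducing drift.
# Assumes each message gets a signed "push toward X" score (e.g. from the
# same influence model the attacker would use) and that unmanipulated
# deliberation pushes toward or away from X with roughly equal probability.
from math import comb

def drift_pvalue(pushes_toward_x: int, total: int) -> float:
    # One-sided binomial tail: chance of at least this many X-inducing
    # pushes under a 50/50 null of unmanipulated deliberation.
    return sum(comb(total, i) for i in range(pushes_toward_x, total + 1)) / 2 ** total

def flag_drift(signed_scores, alpha=0.05):
    toward = sum(1 for s in signed_scores if s > 0)
    p = drift_pvalue(toward, len(signed_scores))
    return p < alpha, p

# Example: 9 of the last 10 messages pushed toward X.
print(flag_drift([+1, +1, +1, -1, +1, +1, +1, +1, +1, +1]))  # (True, p ~ 0.011)
```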
In the context of my argument above, a way to view the concern is that a competitive defender can tell which of two samples is more X-inducing, but can’t tell whether an output is surprisingly X-inducing or whether deliberation is just rationally X-inducing, because (unlike the attacker) they aren’t able to observe the several natural samples from which the attack was sampled.
This kind of thing seems like it can only happen when the right conclusion depends on stuff that other humans know but you and your AI do not (or where for alignment reasons you want to defer to a process that involves the other humans).