Trying to imagine for myself how an automated filter might work, here’s a possible “solution” I came up with. Perhaps your AI maintains a model / probability distribution of things that an uncompromised Wei might naturally say, and flags anything outside or on the fringes of that distribution as potential evidence that I’ve been compromised by an AI-powered attack and am now trying to attack you. (I’m talking in binary terms of “compromised” and “uncompromised” for simplicity, but of course it will be more complicated than that in reality.)
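To make the idea concrete, here is a toy sketch of that kind of filter, purely illustrative and not a serious defense: fit a simple unigram language model on messages known to come from the uncompromised sender, then flag any new message whose average per-word log-likelihood falls below a threshold (i.e., that lies on the fringes of the learned distribution). The class name, the unigram model, and the threshold value are all my own inventions for this example; a real filter would presumably use a far richer model of the person.

```python
import math
from collections import Counter

class StyleFilter:
    """Toy anomaly-detection filter (hypothetical): learns a smoothed
    unigram distribution from a sender's known-good messages and flags
    messages that are improbable under that distribution."""

    def __init__(self, known_messages, threshold=-3.0):
        words = [w for msg in known_messages for w in msg.lower().split()]
        self.counts = Counter(words)
        self.total = len(words)
        self.vocab = len(self.counts)
        self.threshold = threshold

    def logprob(self, word):
        # Laplace smoothing: unseen words get a small but nonzero probability.
        return math.log((self.counts[word] + 1) / (self.total + self.vocab + 1))

    def suspicion(self, message):
        # Average per-word log-likelihood; lower means more out-of-distribution.
        words = message.lower().split()
        if not words:
            return 0.0
        return sum(self.logprob(w) for w in words) / len(words)

    def flag(self, message):
        # Flag anything on the fringes of the learned distribution.
        return self.suspicion(message) < self.threshold
```

The obvious weakness, which is the point of the rest of the comment: "uncompromised me" changes over time, so a fixed distribution will eventually flag genuine changes of mind just as readily as attacks.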
Is this close to what you’re thinking? (If not, apologies for going off on a tangent.) If so, given that I would “naturally” change my mind over time (i.e., based on my own thinking or on talking with other uncompromised humans), it seems that your AI has to model that as well. I can imagine that in such a scenario, if I ever changed my mind in a direction the AI model didn’t expect and wanted to talk to you about it, my own AI might say something like: “If you say this to Paul, his AI will become more suspicious that you’ve been compromised by an AI-powered attack, and your risk of getting blocked now or in the future increases by Y. Are you sure you still want to say this to Paul?” At that point, collective human philosophical/moral progress would be driven more by what the AI filters expect and let pass than by what physical human brains actually compute, so we had better get those models really right. But that faces the seemingly difficult problems I mentioned at Replicate the trajectory with ML?, and it doesn’t seem like anyone is working on them.
If we fail to get such models good enough early on, that could lock in failure as it becomes impossible to meaningfully collaborate with other humans (or human-AI systems) to try to improve such models, as you can’t distinguish whether they’re genuinely trying to make better models with you, or just trying to change your models as part of an attack.
This isn’t the kind of approach I’m imagining.