paulfchristiano comments on Another (outer) alignment failure story

paulfchristiano 25 Apr 2021 19:18 UTC
LW: 2 AF: 2
0
AF
Trying to imagine myself how an automated filter might work, here’s a possible “solution” I came up with. Perhaps your AI maintains a model / probability distribution of things that an uncompromised Wei might naturally say, and flags anything outside or on the fringes of that distribution as potential evidence that I’ve been compromised by an AI-powered attack and is now trying to attack you. (I’m talking in binary terms of “compromised” and “uncompromised” for simplicity but of course it will be more complicated than that in reality.)
This isn’t the kind of approach I’m imagining.