I think I misunderstood what kind of attack you were talking about. I thought you were imagining humans being subject to attack while going about their ordinary business (i.e. while trying to satisfy goals other than moral reflection), but it sounds like in the recent comments you are imagining cases where humans are trying to collaboratively answer hard questions (e.g. about what’s right), some of them may sabotage the process, and none of them are able to answer the question on their own and so can’t avoid relying on untrusted data from other humans.
I don’t feel like this is going to overlap too much with the story in the OP, since it takes place over a very small amount of calendar time—we’re not trying to do lots of moral deliberation during the story itself, we’re trying to defer moral deliberation until after the singularity (by decoupling it from rapid physical/technological progress), and so the action you are wondering about would have happened after the story ended happily. There are still kinds of attacks that are important (namely those that prevent humans from surviving through to the singularity).
Similarly it seems like your description of “go in an info bubble” is not really appropriate for this kind of attack—wouldn’t it be more natural to say “tell your AI not to treat untrusted data as evidence about what is good, and try to rely on carefully chosen data for making novel moral progress”?
So in that light, I basically want to decouple your concern into two parts:
1. Will collaborative moral deliberation actually “freeze” during this scary phase, or will people e.g. keep arguing on the internet and instruct their AI that it shouldn’t protect them from potential manipulation driven by those interactions?
2. Will human communities be able to recover mutual trust after the singularity in this story?
I feel more concerned about #1. I’m not sure where you are at.
(I’m particularly confused because this seems pretty close to what I guessed earlier and seems to face similar problems, but you said that’s not the kind of approach you’re imagining.)
I was saying that I think it’s better to directly look at the effects of what is said rather than trying to model the speaker and estimate whether they are malicious (or have been compromised). I left a short comment as a placeholder, though, until writing the grandparent. Also, I agree that in the case you seem to have had in mind, my proposal is going to look a lot like what you wrote (see below).
How will your AI compute “the extent to which M leads to deviation from idealized deliberation”?
Here’s a simple case to start with:
My AI cares about some judgment X that I’d reach after some idealized deliberative process.
We may not be able to implement that process, and at any rate I have other preferences, so instead the AI observes the output X′ of some realistic deliberative process embedded in society.
After observing my estimate X′, the AI acts on its best guess X″ about X.
An attacker wants to influence X″, so they send me a message M designed to distort X′ (which they hope will in turn distort X″).
In this case I think it’s true and easy to derive that:
If my AI knows what the attacker knows, then updating on X′ and on the fact that the attacker sent me M can’t push X″ in any direction that’s predictable to the attacker.
Moreover, if my reading M changes X′ in some predictable-to-the-attacker direction, then my AI knows that reading M makes X′ less informative about X.
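To make the simple case concrete, here is a minimal Monte-Carlo sketch of one reading of these two claims. The specifics are toy assumptions of my own rather than anything from the discussion: X is drawn from a standard normal, each “plausible thing a human might say” is a noisy reading of X, and the attacker sends the most X-inducing of K candidates.

```python
# Monte-Carlo sketch of the simple case. Toy assumptions (mine, not from
# the discussion): X ~ N(0, 1); each plausible message is a noisy reading
# X + N(0, SIGMA^2); the attacker samples K candidates and sends the most
# X-inducing one.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
SIGMA = 1.0   # noise on each candidate message (illustrative)
K = 5         # attacker sends the max of K candidates (illustrative)
GRID = np.linspace(-6, 6, 2001)   # grid over the latent judgment X
PRIOR = norm.pdf(GRID)            # prior density of X on the grid

def informed_posterior_mean(m):
    """E[X | m is the maximum of K noisy readings of X]."""
    z = (m - GRID) / SIGMA
    lik = K * norm.pdf(z) / SIGMA * norm.cdf(z) ** (K - 1)
    post = PRIOR * lik
    return np.sum(GRID * post) / np.sum(post)

def naive_posterior_mean(m):
    """E[X | m] if m is (wrongly) treated as one natural reading."""
    return m / (1.0 + SIGMA ** 2)

informed, naive = [], []
for _ in range(5000):
    x = rng.normal()                             # true judgment X
    candidates = x + SIGMA * rng.normal(size=K)  # plausible messages
    m = candidates.max()                         # attacker's choice
    informed.append(informed_posterior_mean(m))
    naive.append(naive_posterior_mean(m))

# Averaged over trials, the naive estimate is pushed upward by the attack;
# the informed estimate is not shifted away from the prior.
print("average shift, naive defender:    %+.3f" % np.mean(naive))
print("average shift, informed defender: %+.3f" % np.mean(informed))
```

A defender who treats the attacker’s message as a single natural sample ends up with an estimate that is predictably pushed upward; a defender who conditions on the selection rule is not, on average, moved away from the prior at all.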
I’d guess you are on board with the simple case. Some complications in reality:
1. We don’t have any definition of the real idealized deliberative process.
2. The nature of my AI’s deference/corrigibility is quite different than simply regarding my judgment as evidence about X.
3. The behavior of other people also provides evidence about X, and attackers may have information about other people (that the defender lacks).
My best guess had been that you were worried about 1 or 2, but from the recent comments it sounds like you may actually be thinking about 3.
The attack I have in mind is to imitate a normal human conversation about philosophy or about what’s normative (what one should do), but AI-optimized with a goal of convincing you to adopt a particular conclusion
Filling in some details in a simple way: let’s suppose that the attacker just samples a few plausible things for a human to say, then outputs the one that leads me to make the highest estimate for X. We believe that using “natural” samples from the distribution would yield an endorsed outcome, but that if you consistently pick X-inducing samples then the errors will compound and lead to something bad.
Then in my proposal a defender would observe X-inducing samples, and could tell that they are X-inducing (since the attacker could tell that, and I think we’re discussing the case where the defender has parity—if not then I think we need to return to some of the issues we set aside earlier). They would not initially know whether those samples are due to chance or to manipulation. But after a few instances of this they will notice that errors tend to push in a surprisingly X-inducing direction, and that upcoming X-inducing samples are therefore particularly harmful.
This is basically what you proposed in this comment, though I feel the defender can judge on the object-level grounds that the X-inducing outputs are actually bad, rather than by explicitly flagging corruption.
In the context of my argument above, a way to view the concern is that a competitive defender can tell which of two samples is more X-inducing, but can’t tell whether an output is surprisingly X-inducing or whether deliberation is just rationally X-inducing, because (unlike the attacker) they aren’t able to observe the several natural samples from which the attack was selected.
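Here is a similarly toy sketch of that detection dynamic (Gaussian errors, a best-of-K attacker, and all constants below are illustrative assumptions of mine): the likelihood ratio from any single X-inducing message is weak, which is why the defender can’t initially tell chance from manipulation, but the evidence compounds quickly over a stream of messages.

```python
# Sketch of the detection dynamic. Toy assumptions (mine, not from the
# discussion): the X-inducing "error" of a natural message is N(0, 1);
# the attacker instead sends the max of K natural draws. A single message
# gives only weak evidence of manipulation (likelihood ratio at most K),
# but the evidence accumulates over a stream of messages.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
K = 3   # attacker picks the most X-inducing of K plausible messages

def natural_logpdf(e):
    # log density of a natural message's X-inducing error
    return norm.logpdf(e)

def attacked_logpdf(e):
    # log density of the max of K natural errors
    return np.log(K) + norm.logpdf(e) + (K - 1) * norm.logcdf(e)

# stream of messages actually produced by the attacker
errors = np.max(rng.normal(size=(30, K)), axis=1)

log_odds = 0.0   # log odds of "manipulated" vs "natural", starting at 1:1
for i, e in enumerate(errors, 1):
    log_odds += attacked_logpdf(e) - natural_logpdf(e)
    if i in (1, 5, 10, 30):
        print("after %2d messages, manipulated:natural odds ~ %.1f : 1"
              % (i, np.exp(log_odds)))
```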
This kind of attack seems like it can only happen when the right conclusion depends on stuff that other humans know but you and your AI do not (or where for alignment reasons you want to defer to a process that involves the other humans).