I broadly agree with the things you’re saying; I think it mostly comes down to the actual numbers we’d assign.
This seems really extreme, if I’m not misunderstanding you. (My own number for the alignment-to-persuasion differential is more like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?
Yeah, that’s about right. I’d note that it isn’t totally clear what the absolute risk number is meant to capture. One operationalization is that it is P(an existential catastrophe occurs, and if we had solved AI persuasion but the world were otherwise exactly the same, then no existential catastrophe would occur). I realize I didn’t say exactly this above, but that’s the operationalization under which the risks are mutually exclusive, and the one that determines the expected value of solving the problem.
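To spell that out a little (with notation I’m making up on the spot, so treat it as a sketch): write C for “an existential catastrophe occurs” and C_X for “an existential catastrophe occurs in the counterfactual world that is identical except that problem X has been solved”. Then the risk I’m assigning to a problem X is roughly

$$\mathrm{Risk}(X) \;=\; P\big(C \wedge \neg C_X\big),$$

and since each such event is a subset of C, treating the events for different problems as mutually exclusive means their probabilities sum to at most P(C).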
To justify the absolute number of 1/1000, I’d note that:
1. The case seems pretty speculative + conjunctive—you need people to choose to use AI to be very persuasive (instead of, idk, retiring to live in luxury in small insular subcommunities), you’d need the AI to be better at persuasion than defending against persuasion (or for people to choose not to defend), and you’d need this to be so bad that it leads to an existential catastrophe.
2. I feel like if I talked to lots of other people for about as much time as I’ve talked with you and others about AI persuasion (i.e. not very much, but enough to convey a basic idea), I’d end up with 10-300 other risks of similar magnitude and plausibility. Under the operationalization I gave above, these probabilities would be mutually exclusive, so that places an upper bound of somewhere between 1/300 and 1/10 on any given problem (I spell out the arithmetic just after this list).
3. I don’t expect this bound to be tight. For one thing, if it were tight, that would imply that existential catastrophe is guaranteed. But more importantly, there are lots of worlds in which existential catastrophe is overdetermined because society is terrible at coordinating. If you condition on “existential catastrophe” and “AI persuasion was a big problem”, I update that we were really bad at coordination, and so I also think that there would be lots of other problems such that solving persuasion wouldn’t prevent the existential catastrophe. (Whereas alignment feels much more like a direct technical challenge: while there certainly is an update against societal coordination if we get an existential catastrophe with a failure of alignment, the update seems a lot smaller, and so I’m more optimistic that solving alignment means that the existential catastrophe doesn’t happen at all.)
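To make the arithmetic behind points 2 and 3 explicit (just restating the argument above, with N standing in for the number of comparable problems): if there are N problems whose risks, operationalized as above, are mutually exclusive and all of roughly the same magnitude p, then

$$N \cdot p \;\le\; P(\text{existential catastrophe}) \;\le\; 1 \quad\Longrightarrow\quad p \;\le\; \frac{1}{N},$$

so with N somewhere between 10 and 300, each individual risk is capped at roughly 1/300 to 1/10, and the bound being tight would require P(existential catastrophe) = 1, i.e. a guaranteed catastrophe.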
The 100x differential between alignment and persuasion comes mostly because points (2) and (3) above don’t apply to alignment, and point (1) applies only in part: given my state of knowledge, the case for alignment failure seems much less speculative (though obviously still speculative), though it is still quite conjunctive.
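(For concreteness, using the figures from the exchange above, 10% for intent alignment and 1/1000 for AI persuasion, the implied ratio is

$$\frac{10\%}{0.1\%} \;=\; 100,$$

which is where the 100x comes from.)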
I see, but I think at least part of the problem with threats is that I’m not sure what I care about, which greatly increases my “attack surface”. For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn’t be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).
That’s a fair point; I agree that full knowledge of human values can help avoid potentially significant risks in a way that intent alignment doesn’t.