In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem arises because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being executed. In particular, the problem is not that you don’t know which outcomes are bad.
I see, but I think at least part of the problem with threats is that I’m not sure what I care about, which greatly increases my “attack surface”. For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn’t be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).
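The counterbalancing argument can be put as a toy utility comparison. Everything below is a hypothetical illustration (the function names and all numbers are made up for this sketch, not estimates from the discussion):

```python
# Toy sketch of the total-utilitarian response to a simulation-torture threat.
# All quantities are hypothetical illustrations, not real estimates.

def value_of_refusing(ransom, utility_per_unit, threatened_disutility):
    """Refuse, and spend the demanded resources on creating happy people.

    Net utility = happiness created with the ransom, minus the disutility
    the attacker inflicts by carrying out the threat.
    """
    return ransom * utility_per_unit - threatened_disutility

def value_of_paying():
    """Pay up: the threat is averted but the resources are gone (net 0 here;
    this toy model ignores the incentive effects of rewarding threats)."""
    return 0.0

# Hypothetical numbers: the demanded resources could fund 1e6 units of
# happiness, while the threatened torture destroys 1e5 units.
refuse = value_of_refusing(ransom=1e6, utility_per_unit=1.0,
                           threatened_disutility=1e5)
pay = value_of_paying()
# Under total utilitarianism with these numbers, refusing dominates,
# so the threat gets no leverage.
```

The point is only that once the value function is pinned down, the comparison is mechanical; the vulnerability comes from uncertainty over which value function applies.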
Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between the probabilities of existential catastrophe from each problem, assuming no intervention from longtermists).
This seems really extreme, if I’m not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?
Putting on my “what would I do” hat, I’m imagining that the AI doesn’t know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.
Given that humans are liable to be persuaded by bad counterarguments too, I’d be concerned that the AI will always “know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.” Since it’s not safe to actually look at the counterarguments found by your own AI, it’s not really helping at all. (Or it makes things worse if the user isn’t very cautious and does look at their AI’s counterarguments and gets persuaded by them.)
I totally expect them to ask AI for help with such games. I don’t expect (most of) them to lock in their values such that they can’t change their mind.
I think most people don’t think very long term and aren’t very rational. They’ll see some people within their group do AI-enabled value lock-in, get a lot of status reward for it, and emulate that behavior in order to not fall behind and become low status within the group. (This might be a gradual process resembling “purity spirals” of the past, i.e., people ask AI to do more and more things that have the effect of locking in their values, or a sudden wave of explicit value lock-ins.)
I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.
This seems plausible to me, but I don’t see how one can have enough confidence in this view that one isn’t very worried about the opposite being true and constituting a significant x-risk.
I broadly agree with the things you’re saying; I think it mostly comes down to the actual numbers we’d assign.
This seems really extreme, if I’m not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?
Yeah, that’s about right. I’d note that it isn’t totally clear what the absolute risk number is meant to capture. One operationalization is P(existential catastrophe occurs, and if we had solved AI persuasion but the world was otherwise exactly the same, then no existential catastrophe occurs). I realize I didn’t say exactly this above, but it’s the operationalization that makes the risks mutually exclusive, and the one that determines the expected value of solving the problem.
To justify the absolute number of 1/1000, I’d note that:

1. The case seems pretty speculative + conjunctive: you need people to choose to use AI to be very persuasive (instead of, say, retiring to live in luxury in small insular subcommunities), you’d need the AI to be better at persuasion than at defending against persuasion (or for people to choose not to defend), and you’d need this to be so bad that it leads to an existential catastrophe.

2. I feel like if I talked to lots of people for the amount I’ve talked with you and others about AI persuasion (i.e., not very much, but enough to convey a basic idea), I’d end up with 10-300 other risks of similar magnitude and plausibility. Under the operationalization I gave above, these probabilities would be mutually exclusive. So that places an upper bound of between 1/300 and 1/10 on any given problem.

3. I don’t expect this bound to be tight. For example, if it were tight, that would imply that existential catastrophe is guaranteed. But more importantly, there are lots of worlds in which existential catastrophe is overdetermined because society is terrible at coordinating. If I condition on “existential catastrophe” and “AI persuasion was a big problem”, I update that we were really bad at coordination, and so I also think there would be lots of other problems such that solving persuasion wouldn’t prevent the existential catastrophe. (Whereas alignment feels much more like a direct technical challenge: while there is certainly an update against societal coordination if we get an existential catastrophe with a failure of alignment, the update seems a lot smaller, and so I’m more optimistic that solving alignment means the existential catastrophe doesn’t happen at all.)
The 100x differential between alignment and persuasion comes mostly because points (2) and (3) above don’t apply to alignment, and point (1) applies only in part: given my state of knowledge, the case for alignment failure seems much less speculative (though obviously still speculative), though it is still quite conjunctive.
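The mutual-exclusivity argument above can be illustrated with a toy Monte Carlo. The setup is entirely hypothetical (the number of risks, per-risk probabilities, and independence assumption are all made up for the sketch): each catastrophic world “credits” at most one risk, namely the one whose solution alone would have averted the catastrophe, so the operationalized risk probabilities sum to at most P(catastrophe), and overdetermined worlds are credited to no single risk.

```python
import random

# Toy Monte Carlo (hypothetical setup): each world independently suffers some
# subset of N problems; a catastrophe occurs if at least one problem strikes.
# Risk_i is operationalized as P(catastrophe occurs AND solving problem i
# alone would have prevented it), i.e., problem i was the only one to strike.
random.seed(0)
N = 20           # number of comparably plausible risks (assumption)
p = 0.02         # per-problem chance of striking in a given world (assumption)
trials = 100_000

credit = [0] * N
catastrophes = 0
for _ in range(trials):
    hits = [i for i in range(N) if random.random() < p]
    if hits:
        catastrophes += 1
        if len(hits) == 1:
            # Solving exactly this one problem would have averted catastrophe.
            credit[hits[0]] += 1

risk = [c / trials for c in credit]
p_cat = catastrophes / trials
# The operationalized risks are mutually exclusive, so they sum to at most
# P(catastrophe); overdetermined worlds (len(hits) > 1) are credited to no
# single risk, which is why the bound isn't tight.
```

With N comparably sized risks, each operationalized risk ends up near P(catastrophe)/N, which is the shape of the 1/300 to 1/10 upper-bound argument.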
I see, but I think at least part of the problem with threats is that I’m not sure what I care about, which greatly increases my “attack surface”. For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn’t be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).
That’s a fair point, I agree this is a way in which full knowledge of human values can help avoid potentially significant risks in a way that intent alignment doesn’t.