Wei Dai comments on General alignment plus human values, or alignment via human values?

Wei Dai 8 Jan 2022 7:12 UTC
LW: 15 AF: 9
AF
1. Your AI should tell you that it’s worried about your friend being compromised, make sure you have an understanding of the consequences, and then go with your decision.
I think unless we make sure the AI can distinguish between “correct philosophy” or “well-intentioned philosophy” and “philosophy optimized for persuasion”, each human will become either compromised (if they’re not very cautious and read such messages) or isolated from the rest of humanity with regard to philosophical discussion (if they are cautious and discard such messages). This doesn’t seem like an ok outcome to me. Can you explain more why you aren’t worried?
1. Seems fine. Maybe your AI warns you about the risks before helping.
I can imagine that if you subscribe to a metaethics in which a person can’t be wrong about morality (i.e., some version of anti-realism), then you might think it’s fine to lock in whatever values one currently thinks they ought to have. Is this your reason for “seems fine”, or something else? (If so, I think nobody should be so certain about metaethics at this point.)
1. Seems like an important threat that you (and your AI) should try to resolve.
If the AI isn’t very good at dealing with “exotic philosophical cases” then it’s not going to be of much help with this problem, and a lot of humans (including me) probably aren’t very good at thinking about this either, so we probably end up with a lot of humans succumbing to such acausal attacks.
1. Mostly I would hope that this situation doesn’t arise, because none of the humans can come up with utility functions in this way, and the AIs that are aligned with humans have other ways of cooperating that don’t require eliciting a utility function over universe histories.
Do you have any suggestions for this? Or some other reason to think that AIs that are aligned with different humans will find ways to cooperate (as efficiently as merging utility functions probably will be), without either a full understanding of human values or risking permanent loss of some parts of their complex values?
1. Idk, seems pretty unclear, but I’d hope that these situations can’t come up thanks to laws that prevent people from enforcing such threats.
Agreed that’s a possible good outcome, but seems far from a sure thing. Such laws would have to be more intrusive than anything people are currently used to, since attackers can create simulated suffering within the “privacy” of their own computers or minds. I suppose if such threats become a serious problem that causes a lot of damage, people might agree to trade off their privacy for security. The law might then constitute a risk in itself, as the implementation mechanism might be subverted/misused to create a form of totalitarianism.

Another issue is if there are powerful unaligned AIs or rogue states who think they can use such threats to asymmetrically gain advantage, they won’t agree to such laws.

(4) can be solved through governance (laws, regulations, norms, etc)

I think COVID shows that we often can’t do this even when it’s relatively trivial (or can only do it with a huge time delay). For example COVID could have been solved at very low cost (relative to the actual human and economic damage it inflicted) if governments had stockpiled enough high filtration elastomeric respirators for everyone, mailed them out at the start of the pandemic, and mandated their use. (Some EAs are trying to convince some governments to do this now, in preparation for the next pandemic. I’m not sure how much success they’re having.)
- Rohin Shah 8 Jan 2022 20:35 UTC
  LW: 10 AF: 5
  AF Parent
  To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn’t know human values, rather than cases where the AI knows what the human would want but still a failure occurs. (I didn’t state this explicitly because I was replying to the post, which focuses specifically on the problem of not knowing all of human values.) I think most of your failures are of the latter type, and I wouldn’t make a similar claim for such failures—they seem plausible and worth attention. I do still think they are not as important as intent alignment. Talking about each one individually:
  I think unless we make sure the AI can distinguish between “correct philosophy” or “well-intentioned philosophy” and “philosophy optimized for persuasion”, each human will become either compromised (if they’re not very cautious and read such messages) or isolated from the rest of humanity with regard to philosophical discussion (if they are cautious and discard such messages). This doesn’t seem like an ok outcome to me. Can you explain more why you aren’t worried?
  Mostly I’d hope that AI can tell what philosophy is optimized for persuasion, or at least is capable of presenting counterarguments persuasively as well. If your AI can’t even tell what is persuasive then you’re in trouble, but I’m not sure why we should expect that.
  I can imagine that if you subscribe to a metaethics in which a person can’t be wrong about morality (i.e., some version of anti-realism), then you might think it’s fine to lock in whatever values one currently thinks they ought to have. Is this your reason for “seems fine”, or something else?
  No, it just doesn’t seem hugely terrible for a few people to lock in bad values, as long as the vast majority do not. (And I don’t expect a large number of people to explicitly try to lock in their values.)
  If the AI isn’t very good at dealing with “exotic philosophical cases” then it’s not going to be of much help with this problem, and a lot of humans (including me) probably aren’t very good at thinking about this either, so we probably end up with a lot of humans succumbing to such acausal attacks.
  It seems odd to me that it’s sufficiently competent to successfully reason about simulations enough that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don’t particularly expect this to happen.
  Or some other reason to think that AIs that are aligned with different humans will find ways to cooperate (as efficiently as merging utility functions probably will be), without either a full understanding of human values or risking permanent loss of some parts of their complex values?
  1. I don’t expect AIs to have clean crisp utility functions of the form “maximize paperclips” (at least initially).
  2. If you’re using some form of indirect normativity (which is the general approach I’m excited about), then it seems like you could have an agreement between two AIs about how to aggregate / merge these two definitions of “the good” for different humans (e.g. each human gets “points” in proportion to their current resources, and then they can spend points to have influence over specific choices; presumably you could do better but this seems like a fine start). I expect this to be way less work than the complicated plans that the AI is enacting, so it isn’t a huge competitiveness hit.
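  A minimal sketch of how such a points scheme might work, under assumptions that go beyond the text: the names `allocate_points` and `resolve_choices`, the fixed pool of 100 points, and the rule that each choice goes to the option with the most points spent on it are all hypothetical illustrations, not something specified in this thread.

```python
from collections import defaultdict


def allocate_points(resources, total_points=100.0):
    """Give each human a share of total_points proportional to their resources.

    The pool size of 100 is an arbitrary assumption for illustration.
    """
    total = sum(resources.values())
    return {human: total_points * r / total for human, r in resources.items()}


def resolve_choices(budgets, bids):
    """Settle each choice in favour of the option with the most points behind it.

    bids maps human -> {choice: (option, points_spent)}.
    "Most points wins" is an assumed rule; the text only says that points
    buy influence over specific choices.
    """
    tallies = defaultdict(lambda: defaultdict(float))
    for human, human_bids in bids.items():
        spent = sum(points for _, points in human_bids.values())
        if spent > budgets[human] + 1e-9:
            raise ValueError(f"{human} spent more points than allocated")
        for choice, (option, points) in human_bids.items():
            tallies[choice][option] += points
    return {
        choice: max(options, key=options.get)
        for choice, options in tallies.items()
    }


# Hypothetical example: two humans with resources in a 3:1 ratio.
resources = {"alice": 3.0, "bob": 1.0}
budgets = allocate_points(resources)
bids = {
    "alice": {"explore_space": ("yes", 30.0), "park_design": ("garden", 45.0)},
    "bob": {"explore_space": ("no", 25.0)},
}
outcome = resolve_choices(budgets, bids)
print(budgets)
print(outcome)
```

  In this toy run the first human holds 75 points and the second 25. The second spends everything on one choice and is still outbid there, which hints at why a real scheme would probably need something better than a simple plurality of points.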
  Agreed that’s a possible good outcome, but seems far from a sure thing. Such laws would have to be more intrusive than anything people are currently used to, since attackers can create simulated suffering within the “privacy” of their own computers or minds. I suppose if such threats become a serious problem that causes a lot of damage, people might agree to trade off their privacy for security. The law might then constitute a risk in itself, as the implementation mechanism might be subverted/misused to create a form of totalitarianism.
  I agree it is far from a sure thing.
  - Wei Dai 8 Jan 2022 22:36 UTC
    LW: 6 AF: 5
    AF Parent
    
    To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn’t know human values, rather than cases where the AI knows what the human would want but still a failure occurs.
    
    I’m not sure I understand the distinction that you’re drawing here. (It seems like my scenarios could also be interpreted as failures where AIs don’t know enough human values, or maybe where humans themselves don’t know enough human values.) What are some examples of what your claim was about?
    
    I do still think they are not as important as intent alignment.
    
    As in, the total expected value lost through such scenarios isn’t as large as the expected value lost through the risk of failing to solve intent alignment? Can you give some ballpark figures of how you see each side of this inequality?
    
    Mostly I’d hope that AI can tell what philosophy is optimized for persuasion
    
    How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?
    
    or at least is capable of presenting counterarguments persuasively as well.
    
    You mean every time you hear a philosophical argument, you ask your AI to produce some counterarguments optimized for persuasion? If so, won’t your friends be afraid to send you any arguments they think of, for fear of your AI superhumanly persuading you to the opposite conclusion?
    
    And I don’t expect a large number of people to explicitly try to lock in their values.
    
    A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn’t they ask their AI for help with this? Or do you imagine them asking for something like “more faith”, but AIs understand human values well enough to not interpret that as “lock in values”?
    
    It seems odd to me that it’s sufficiently competent to successfully reason about simulations enough that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don’t particularly expect this to happen.
    
    The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality. For example, how much should I care about my copies in the simulations or my subjective future experience, versus the value that would be lost in the base reality if I were to give in to the simulators’ demands? Should I make a counterthreat? Are there any thoughts I or my AI should avoid having, or computations we should avoid doing?
    
    I don’t expect AIs to have clean crisp utility functions of the form “maximize paperclips” (at least initially).
    
    I expect that AIs (or humans) who are less cautious or who think their values can be easily expressed as utility functions will do this first, and thereby gain an advantage over everyone else and maybe force them to follow.
    
    I expect this to be way less work than the complicated plans that the AI is enacting, so it isn’t a huge competitiveness hit.
    
    I don’t think it’s so much that the coordination involving humans is a lot of work, but rather that we don’t know how to do it in a way that doesn’t cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I’d need to see more details before I update.)
    - Rohin Shah 9 Jan 2022 19:01 UTC
      LW: 2 AF: 2
      AF Parent
      I’m not sure I understand the distinction that you’re drawing here. (It seems like my scenarios could also be interpreted as failures where AIs don’t know enough human values, or maybe where humans themselves don’t know enough human values.) What are some examples of what your claim was about?
      Examples:
      Your AI thinks it’s acceptable to inject you with heroin, because it predicts you will then want more heroin.
      Your AI is uncertain whether you’d prefer to explore space or stay on Earth. It randomly guesses that you want to stay on Earth and takes irreversible actions on your behalf that force you to stay on Earth.
      In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being enforced. In particular, the problem is not that you don’t know which outcomes are bad.
      As in, the total expected value lost through such scenarios isn’t as large as the expected value lost through the risk of failing to solve intent alignment?
      No, the expected value of marginal effort aimed at solving these problems isn’t as large as the expected value of marginal effort on intent alignment.
      (I don’t like talking about “expected value lost” because it’s not always clear what does and doesn’t count as part of that. For example I think it’s nearly inevitable that different people will have different goals and so the future will not be exactly as any one of them desired; should I say that there’s a lot of expected value lost from “coordination problems” for that reason? It seems a bit weird to say that if you think there isn’t a way to regain that “expected value”.)
      Can you give some ballpark figures of how you see each side of this inequality?
      Uh, idk. It’s not something I have numbers on. But I suppose I can try and make up some very fake numbers for, say, AI persuasion. (Before I actually do the exercise, let me note that I could imagine the exercise coming out with numbers that favor persuasion over intent alignment; this probably won’t change my mind and would instead make me distrust the numbers, but I’ll publish them anyway.)
      To change an existentially bad outcome from AI persuasion, I’d imagine first figuring out some solutions, then figuring out how to implement them, and then getting the appropriate people to implement them; seems like you need all of these steps in order to make any difference. (Could be technical or governance though.) It feels especially hard to do so at the moment, given how little we know about future AI capabilities and when each potential capability arrives. Maybe:
      Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between probabilities of existential catastrophe by the given problem assuming no intervention from longtermists).
      A piece of alignment work now is 10x more likely to target the right problem than a similar piece of work for persuasion.
      A piece of alignment work now is currently 10x harder to produce than a similar piece of work for persuasion.
      So I guess overall I’m at ~100x, very very roughly?
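      A minimal sketch of the arithmetic behind the ~100x figure, assuming the three very rough factors combine by multiplying the two that favour alignment work and dividing by the one that counts against it. The variable names and the function `marginal_value_ratio` are hypothetical; only the numbers come from the text.

```python
# Very rough figures from the list above; how they combine is an assumption.
risk_ratio = 100       # alignment is 100x more likely to be an existential problem at all
targeting_ratio = 10   # alignment work is 10x more likely to target the right problem
difficulty_ratio = 10  # alignment work is 10x harder to produce


def marginal_value_ratio(risk, targeting, difficulty):
    """Relative value of marginal alignment effort versus persuasion effort.

    Factors that favour alignment multiply; the extra difficulty of
    producing alignment work divides.
    """
    return risk * targeting / difficulty


overall = marginal_value_ratio(risk_ratio, targeting_ratio, difficulty_ratio)
print(f"~{overall:.0f}x")  # prints "~100x"
```

      The targeting and difficulty factors cancel, so under this reading the overall figure is driven entirely by the 100x risk ratio.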
      How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?
      Putting on my “what would I do” hat, I’m imagining that the AI doesn’t know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments. Or it says that there are other counterfactual letters you could have received, such that after you read them you’d be convinced of the opposite position, and then it asks whether you still want to read the letter.
      If your AI doesn’t know about the counterarguments, and the letter is persuasive even to the AI, then it seems like you’re hosed, but I’m not sure why to expect that.
      A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn’t they ask their AI for help with this? Or do you imagine them asking for something like “more faith”, but AIs understand human values well enough to not interpret that as “lock in values”?
      I totally expect them to ask AI for help with such games. I don’t expect (most of) them to lock in their values such that they can’t change their mind.
      I’m not entirely sure why you do expect this. Maybe you’re viewing them as consciously optimizing for winning status games + a claim that the optimal policy is to lock in your values? But what if the values rewarded by the status games change (as they already seem to, e.g. moving from atheism to anti-racism)? In that case it seems like you don’t want to lock in your values, to better play the status game in the future.
      The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality.
      If you’ve determined a set of universes to care about, then shouldn’t at least decision theory reduce to mathematical / empirical matters about which decision procedure gets the most value across the set of universes? I do agree that moral questions are still not mathematical / empirical, but I don’t find that all that persuasive. I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.
      I’m guessing you feel better about mathematical / empirical reasoning because there’s a ground truth that says when that reasoning is done well. I don’t particularly find the existence of a ground truth to be all that big a deal—it probably helps but doesn’t seem tremendously important.
      I expect that AIs (or humans) who are less cautious or who think their values can be easily expressed as utility functions will do this first, and thereby gain an advantage over everyone else and maybe force them to follow.
      Fair enough—I agree this is plausible (though only plausible; it doesn’t seem like we’ve sacrificed everything to Moloch yet).
      I don’t think it’s so much that the coordination involving humans is a lot of work, but rather that we don’t know how to do it in a way that doesn’t cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I’d need to see more details before I update.)
      If we’re imagining coordination between a billion AIs, that seems less obviously doable. I still think that, if you’ve solved intent alignment, it seems much easier. Democratic elections are not just about coordination—they’re also about alignment, since politicians have to optimize for being re-elected. It seems like you could do a lot better if you didn’t have to worry about the alignment part.
      - Wei Dai 16 Jan 2022 2:21 UTC
        LW: 4 AF: 4
        AF Parent
        
        In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being enforced. In particular, the problem is not that you don’t know which outcomes are bad.
        
        I see, but I think at least part of the problem with threats is that I’m not sure what I care about, which greatly increases my “attack surface”. For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn’t be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).
        
        Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between probabilities of existential catastrophe by the given problem assuming no intervention from longtermists).
        
        This seems really extreme, if I’m not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?
        
        Putting on my “what would I do” hat, I’m imagining that the AI doesn’t know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.
        
        Given that humans are liable to be persuaded by bad counterarguments too, I’d be concerned that the AI will always “know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.” Since it’s not safe to actually look at the counterarguments found by your own AI, it’s not really helping at all. (Or it makes things worse if the user isn’t very cautious and does look at their AI’s counterarguments and gets persuaded by them.)
        
        I totally expect them to ask AI for help with such games. I don’t expect (most of) them to lock in their values such that they can’t change their mind.
        
        I think most people don’t think very long term and aren’t very rational. They’ll see some people within their group do AI-enabled value lock-in, get a lot of status reward for it, and emulate that behavior in order to not fall behind and become low status within the group. (This might be a gradual process resembling “purity spirals” of the past, i.e., people ask AI to do more and more things that have the effect of locking in their values, or a sudden wave of explicit value lock-ins.)
        
        I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.
        
        This seems plausible to me, but I don’t see how one can have enough confidence in this view that one isn’t very worried about the opposite being true and constituting a significant x-risk.
        Rohin Shah 17 Jan 2022 11:44 UTC
        LW: 3 AF: 3
        AF Parent
        I broadly agree with the things you’re saying; I think it mostly comes down to the actual numbers we’d assign.
        This seems really extreme, if I’m not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?
        Yeah, that’s about right. I’d note that it isn’t totally clear what the absolute risk number is meant to capture. One operationalization is P(existential catastrophe occurs, and if we had solved AI persuasion but the world was otherwise exactly the same, then no existential catastrophe occurs). I realize I didn’t say exactly this above, but that’s the one that is mutually exclusive across risks, and the one that determines the expected value of solving the problem.
        To justify the absolute number of 1/1000, I’d note that:
        1. The case seems pretty speculative + conjunctive: you need people to choose to use AI to be very persuasive (instead of, idk, retiring to live in luxury in small insular subcommunities), you’d need the AI to be better at persuasion than defending against persuasion (or for people to choose not to defend), and you’d need this to be so bad that it leads to an existential catastrophe.
        2. I feel like if I talked to lots of people the amount I’ve talked with you / others about AI persuasion (i.e. not very much, but enough to convey a basic idea) I’d end up having 10-300 other risks of similar magnitude and plausibility. Under the operationalization I gave above, these probabilities would be mutually exclusive. So that places an upper bound of somewhere between 1/300 and 1/10 on any given problem.
        3. I don’t expect this bound to be tight. For example, if it were tight, that would imply that existential catastrophe is guaranteed. But more importantly, there are lots of worlds in which existential catastrophe is overdetermined because society is terrible at coordinating. If you condition on “existential catastrophe” and “AI persuasion was a big problem”, I update that we were really bad at coordination and so I also think that there would be lots of other problems such that solving persuasion wouldn’t prevent the existential catastrophe. (Whereas alignment feels much more like a direct technical challenge; while there certainly is an update against societal coordination if we get an existential catastrophe with a failure of alignment, the update seems a lot smaller, and so I’m more optimistic that solving alignment means that the existential catastrophe doesn’t happen at all.)
        The 100x differential between alignment and persuasion comes mostly because points (2) and (3) above don’t apply to alignment, point (1) applies only in part—given my state of knowledge, the case for alignment failure seems much less speculative (though obviously still speculative), though it is still quite conjunctive.
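        The numbers in this exchange can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the 10% alignment risk is the hypothetical figure from the question, the helper `per_risk_upper_bound` is an illustrative name, and the bound treats the risks as mutually exclusive with roughly equal probability, so that their sum cannot exceed 1.

```python
from fractions import Fraction

alignment_risk = Fraction(1, 10)  # the 10% assumed in the question, not a measured value
risk_ratio = 100                  # the "100x" differential between alignment and persuasion

# Dividing the assumed alignment risk by the ratio gives the absolute persuasion risk.
persuasion_risk = alignment_risk / risk_ratio
print(persuasion_risk)  # prints 1/1000


def per_risk_upper_bound(n_risks):
    """Upper bound on each of n_risks mutually exclusive, similarly sized risks.

    Their probabilities sum to at most 1, so each is at most 1/n_risks.
    """
    return Fraction(1, n_risks)


# The comment's range of 10 to 300 comparable risks.
for n in (10, 300):
    print(n, per_risk_upper_bound(n))

# The 1/1000 estimate sits below even the tighter of the two bounds.
print(persuasion_risk < per_risk_upper_bound(300))  # prints True
```

        The bound only caps the estimate; the gap between 1/300 and 1/1000 has to come from the other considerations, such as how conjunctive the scenario is.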
        I see, but I think at least part of the problem with threats is that I’m not sure what I care about, which greatly increases my “attack surface”. For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn’t be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).
        That’s a fair point, I agree this is a way in which full knowledge of human values can help avoid potentially significant risks in a way that intent alignment doesn’t.