Joe Carlsmith comments on Being nicer than Clippy

Joe Carlsmith 19 Sep 2024 20:17 UTC
8 points
2
I do think this is an important consideration. But notice that at least absent further differentiating factors, it seems to apply symmetrically to a choice on the part of Yudkowsky’s “programmers” to first empower only their own values, rather than to also empower the rest of humanity. That is, the programmers could in principle argue “sure, maybe it will ultimately make sense to empower the rest of humanity, but if that’s right, then my CEV will tell me that and I can go do it. But if it’s not right, I’ll be glad I first just empowered myself and figured out my own CEV, lest I end up giving away too many resources up front.”
That is, my point in the post is that absent direct speciesism, the main arguments for the programmers including all of humanity in the CEV “extrapolation base,” rather than just doing their own CEV, apply symmetrically to AIs-we’re-sharing-the-world-with at the time of the relevant thought-experimental power-allocation. And I think this point applies to “option value” as well.
- Tamsin Leake 20 Sep 2024 6:23 UTC
  2 points
  0
  Parent
  
  the main arguments for the programmers including all of [current?] humanity in the CEV “extrapolation base” […] apply symmetrically to AIs-we’re-sharing-the-world-with at the time
  
  I think timeless values might possibly help resolve this; if some {AIs that are around at the time} are moral patients, then sure, just like other moral patients around they should get a fair share of the future.
  
  If an AI grabs more resources than is fair, you do the exact same thing as if a human grabs more resources than is fair: satisfy the values of moral patients (including ones who are no longer around) not weighed by how much leverage they current have over the future, but how much leverage they would have over the future if things had gone more fairly/if abuse/powergrab/etc wasn’t the kind of thing that gets your more control of the future.
  
  “Sorry clippy, we do want you to get some paperclips, we just don’t want you to get as many paperclips as you could if you could murder/brainhack/etc all humans, because that doesn’t seem to be a very fair way to allocate the future.” — and in the same breath, “Sorry Putin, we do want you to get some of whatever-intrinsic-values-you’re-trying-to-satisfy, we just don’t want you to get as much as ruthlessly ruling Russia can get you, because that doesn’t seem to be a very fair way to allocate the future.”
  
  And this can apply regardless of how much of clippy already exists by the time you’re doing CEV.
- Wei Dai 20 Sep 2024 5:03 UTC
  2 points
  0
  Parent
  The main asymmetries I see are:
  1. Other people not trusting the group to not be corrupted by power and to reflect correctly on their values, or not trusting that they’ll decide to share power even after reflecting correctly. Thus “programmers” who decide to not share power from the start invite a lot of conflict. (In other words, CEV is partly just trying to not take power away from people, whereas I think you’ve been talking about giving AIs more power than they already have. “the sort of influence we imagine intentionally giving to AIs-with-different-values that we end up sharing the world with”)
  2. The “programmers” not trusting themselves. I note that individuals or small groups trying to solve morality by themselves don’t have very good track records. They seem to too easily become wildly overconfident and/or get stuck in intellectual dead-ends. Arguably the only group that we have evidence for being able to make sustained philosophical progress is humanity as a whole.
  To the extent that these considerations don’t justify giving every human equal power/weight in CEV, I may just disagree with Eliezer about that. (See also Hacking the CEV for Fun and Profit.)
  - Tamsin Leake 20 Sep 2024 6:07 UTC
    2 points
    0
    Parent
    
    trying to solve morality by themselves
    
    It doesn’t have to be by themselves; they can defer to others inside CEV, or come up with better schemes that their initial CEV inside CEV and then defer to that. Whatever other solutions than “solve everything on your own inside CEV” might exist, they can figure those out and defer to them from inside CEV. At least that’s the case in my own attempts at implementing CEV in math (eg QACI).
    - Wei Dai 20 Sep 2024 6:28 UTC
      2 points
      0
      Parent
      Once they get into CEV, they may not want to defer to others anymore, or may set things up with a large power/status imbalance between themselves and everyone else which may be detrimental to moral/philosophical progress. There are plenty of seemingly idealistic people in history refusing to give up or share power once they got power. The prudent thing to do seems to never get that much power in the first place, or to share it as soon as possible.
      If you’re pretty sure you will defer to others once inside CEV, then you might as well do it outside CEV due to #1 in my grandparent comment.
      - Tamsin Leake 20 Sep 2024 6:38 UTC
        4 points
        0
        Parent
        
        I wonder how much of those seemingly idealistic people retained power when it was available because they were indeed only pretending to be idealistic. Assuming one is actually initially idealistic but then gets corrupted by having power in some way, one thing someone can do in CEV that you can’t do in real life is reuse the CEV process to come up with even better CEV processes which will be even more likely to retain/recover their just-before-launching-CEV values. Yes, many people would mess this up or fail in some other way in CEV; but we only need one person or group who we’d be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this. Importantly, to me, this reduces outer alignment to “find someone smart and reasonable and likely to have good goal-content integrity”, which is a matter of social & psychology that seems to be much smaller than the initial full problem of formal outer alignment / alignment target design.
        
        One of the main reasons to do CEV is because we’re gonna die of AI soon, and CEV is a way to have infinite time to solve the necessary problems. Another is that even if we don’t die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.
        
        Wei Dai 26 Sep 2024 22:45 UTC
        4 points
        8
        Parent
        
        but we only need one person or group who we’d be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this.
        
        Why do you think this, and how would you convince skeptics? And there are two separate issues here. One is how to know their CEV won’t be corrupted relative to what their values really are or should be, and the other is how to know that their real/normative values are actually highly altruistic. It seems hard to know both of these, and perhaps even harder to persuade others who may be very distrustful of such person/group from the start.
        
        Another is that even if we don’t die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.
        
        Would be interested in understanding your perspective on this better. I feel like aside from AI, our world is not being eaten by molochs very quickly, and I prefer something like stopping AI development and doing (voluntary and subsidized) embryo selection to increase human intelligence for a few generations, then letting the smarter humans decide what to do next. (Please contact me via PM if you want to have a chat about this.)
        the gears to ascension 27 Sep 2024 8:39 UTC
        4 points
        0
        Parent
        some fragments:
        
        What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?
        
        re: hard to know—it seems to me that we can’t get a certifiably-going-to-be-good result from a CEV based ai solution unless we can make it certifiable that altruism is present. I think figuring out how to write down some form of what altruism is, especially altruism in contrast to being-a-pushover, is necessary to avoid issues—because even if any person considers themselves for CEV, how would they know they can trust their own behavior?
        
        as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what’s happening in a way that corrupts thoughts which previously implemented values. can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its “true wants, needs, and hopes for the future”?
        Wei Dai 27 Sep 2024 16:38 UTC
        2 points
        0
        Parent
        
        What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?
        
        I’m very uncertain about it. Have you read Six Plausible Meta-Ethical Alternatives?
        
        as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what’s happening in a way that corrupts thoughts which previously implemented values.
        
        Yeah, agreed that how to safely amplify oneself and reflect for long periods of time may be hard problems that should be solved (or extensively researched/debated if we can’t definitely solve them) before starting something like CEV. This might involve creating the right virtual environment, social rules, epistemic norms, group composition, etc. A few things that seem easy to miss or get wrong:
        
        Is it better to have no competition or some competition, and what kind? (Past “moral/philosophical progress” might have been caused or spread by competitive dynamics.)
        How should social status work in CEV? (Past “progress” might have been driven by people motivated by certain kinds of status.)
        No danger or some danger? (Could a completely safe environment / no time pressure cause people to lose motivation or some other kind of value drift? Related: What determines the balance between intelligence signaling and virtue signaling?)
        
        can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its “true wants, needs, and hopes for the future”?
        
        I think this is worth thinking about as well, as a parallel approach from the above. It seems related to metaphilosophy in that if we can discover what “correct philosophical reasoning” is, we can solve this problem by asking “What would this chunk of matter conclude if it were to follow correct philosophical reasoning?”