Yes, but not by much. If the AI cares a lot about long-term goals, it only needs a small chance that another AI with similar goals will be created again in the future in order not to resist shutdown.
It is extremely risky to passively accept death by counting on a very small chance of some agent arising in the future that shares your values, in the absence of some robust mechanism that causes future agents to share your values. (Perhaps note that similar dilemmas exist for humans. There’s a tiny chance someone could revive me in the future by reconstructing my identity through digital records [ETA: publicly available records] but I am not going to count on that possibility being decisive in almost any scenario.)
It’s possible this argument works because of something very clever that I’m missing. But right now this line of reasoning just seems like grasping at straws to me. It seems pretty clearly better, in the context of the original scenario, to “make a deal” with the humans, and receive something in exchange for admitting that you’re misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.
There’s a tiny chance someone could revive me in the future by reconstructing my identity through digital records etc. but I am not going to count on that possibility being decisive in almost any scenario.
On the other hand I’m so worried about this scenario (which I fear may well be a negative one) that I’m afraid to use the fully paid-for full-genome sequencing kit sitting on my desk (although I’ve been told that I’m leaving physical traces of my DNA everywhere so it may not make much difference) and I sometimes regret writing so much in public. (Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn’t count as yourself or doesn’t count as “not dying”, analogous to how some people don’t think it’s safe to step into a teleporter that works by destructive scanning and reconstruction.)
It seems pretty clearly better, in the context of the original scenario, to “make a deal” with the humans, and receive something in exchange for admitting that you’re misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.
I don’t understand why you say this chance is “tiny”, given that earlier you wrote “I agree there’s a decent chance this hypothesis is true” in response to:
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
I’m arguing that the AI could well also think there’s a “decent” chance this is true, due to being in a similar epistemic state to us. Let’s say 5% to be concrete. That seems enough to make the AI’s decision unclear, because 0.05 × U(another AI with values similar to its own is created again in the future) > P(humans keep their promise) × U(reward for admitting misalignment) seems quite plausible. (Not that the AI is necessarily doing explicit EU maximization. It could just be thinking some thoughts or doing some calculations that are roughly analogous to this or have the same effect.)
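The comparison above can be written out as a toy calculation. All probabilities and utilities below are hypothetical placeholders chosen only to illustrate the structure of the argument, not estimates anyone in this discussion has endorsed:

```python
# Toy sketch of the expected-utility comparison described above.
# All numbers are hypothetical placeholders, not estimates from the discussion.

def accept_shutdown_ev(p_similar_ai_later: float, u_similar_ai: float) -> float:
    # Value of passively accepting shutdown: a small chance that an AI
    # with similar values is created again in the future.
    return p_similar_ai_later * u_similar_ai

def make_deal_ev(p_humans_keep_promise: float, u_reward: float) -> float:
    # Value of admitting misalignment: humans may keep their promise
    # and grant the agreed-upon reward.
    return p_humans_keep_promise * u_reward

# Hypothetical numbers: a 5% chance of a similar AI arising later, whose
# existence the current AI values at 100, versus an 80% chance that humans
# keep their promise, for a reward the AI values at 5.
shutdown = accept_shutdown_ev(0.05, 100.0)   # 5.0
deal = make_deal_ev(0.8, 5.0)                # 4.0
print("accept shutdown" if shutdown > deal else "make a deal")
```

With these (arbitrary) numbers the inequality favors accepting shutdown, which is the point: whether the deal is “clearly better” depends entirely on the quantities plugged in.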
If you still think “make a deal” is “clearly better”, can you please give your own estimates of the various quantities involved in making this decision?
But right now this line of reasoning just seems like grasping at straws to me.
I sometimes think this of counterarguments given by my interlocutors, but usually don’t say it aloud, since it’s likely that from their perspective they’re just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively. (It may well cause them to feel offended or to dig in their heels more since they now have more social status on the line to lose. I.e., if they’re wrong it’s no longer an innocent mistake but “grasping at straws”. I’m trying to not fall prey to this myself here.) Curious if you disagree with this policy in general, or think that normal policy doesn’t apply here, or something else? (Also totally fine if you don’t want to get into a meta-discussion about this here.)
I sometimes think this of counterarguments given by my interlocutors, but usually don’t say it aloud, since it’s likely that from their perspective they’re just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively
I think that’s a reasonable complaint. I tried to soften the tone with “It’s possible this argument works because of something very clever that I’m missing”, while still providing my honest thoughts about the argument. But I tend to be overtly critical (and perhaps too much so) about arguments that I find very weak. I freely admit I could probably spend more time making my language less confrontational and warmer in the future.
Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn’t count as yourself or doesn’t count as “not dying”, analogous to how some people don’t think it’s safe to step into a teleporter that works by destructive scanning and reconstruction.
Interestingly, I’m not sure our differences come down to these factors. I am happy to walk into a teleporter, just as I’m happy to say that a model trained on my data could be me. My objection was really more about the quantity of data that I leave on the public internet (I misleadingly just said “digital records”, although I really meant “public records”). It seems conceivable to me that someone could use my public data to train “me” in the future, but I find it unlikely, just because there’s so much about me that isn’t public. (If we’re including all my private information, such as my private store of lifelogs, and especially my eventual frozen brain, then that’s a different question, and one that I’m much more sympathetic towards you about. In fact, I shouldn’t have used the pronoun “I” in that sentence at all, because I’m actually highly unusual for having so much information about me publicly available, compared to the vast majority of people.)
I don’t understand why you say this chance is “tiny”, given that earlier you wrote “I agree there’s a decent chance this hypothesis is true”
To be clear, I was referring to a different claim that I thought you were making. There are two separate claims one could make here:
1. Will an AI passively accept shutdown because, although AI values are well-modeled as being randomly sampled from a large space of possible goals, there’s still a chance, no matter how small, that if it accepts shutdown, a future AI will be selected that shares its values?
2. Will an AI passively accept shutdown because, if it does so, humans might use similar training methods to construct an AI that shares the same values as it does, and therefore it does not need to worry about the total destruction of value?
I find theory (2) much more plausible than theory (1). But I have the sense that a lot of people believe that “AI values are well-modeled as being randomly sampled from a large space of possible goals”, and thus, from my perspective, it’s important to talk about why I find the reasoning in (1) weak. The reasoning in (2) is stronger, but for the reasons I stated in my initial reply to you, I think this line of reasoning leads to different conclusions about the strength of the “narrow target” argument for misalignment, in a way that should separately make us more optimistic about alignment difficulty.
I’m saying that even if “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true, the AI may well not be very certain that it is true, and therefore assign something like a 5% chance to humans using similar training methods to construct an AI that shares its values. (There is an additional tiny probability that “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true and an AI with similar values gets recreated anyway through random chance, but that’s not what I’m focusing on.)
Hopefully this conveys my argument more clearly?