I think there are a LOT of examples of humans, animals, and very complicated computer programs which do NOT have self-preservation as their absolute top goal—there’s a lot of sacrifice that happens, and a lot of individual existential risk taking for goals other than life preservation. There’s always a balance across multiple values, and “self” is just one more resource to be used.
note: I wish I could upvote and disagree. This is important, even if wrong.
Hm. How many paperclips is enough for the maximizer to kill itself?
If killing itself / allowing itself to be replaced leads to more expected paperclips than clinging to life does, it will do so. I don’t know if you’ve played through decisionproblem.com/paperclips/index2.html, and don’t put too much weight on it, but it’s a fun example of the complexity of maximizing paperclips.
edit: a bit more nuance. If there are competing agents (who don’t mind paperclips, but don’t love them above all) or opposing agents (who actively want to minimize paperclips), then negotiation and compromise are probably necessary to prevent even worse failures (being destroyed WITHOUT creating/preserving even one paperclip). In this case, self-preservation and power-seeking ARE part of the strategy, but they can’t be very direct, because if the other powers get too scared of you, you lose everything.
In any case, the distribution of futures conditional on your decisions will have one or a few with more paperclips in them than the others. Maximizing paperclips means picking the actions that lead to those futures.
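To make the expected-value comparison above concrete, here’s a minimal toy sketch (the probabilities, payoffs, and action names are entirely made up for illustration, not anything from the actual discussion): the maximizer scores each action by expected paperclips and picks the argmax, so “stay alive” only wins when it actually produces more expected paperclips.

```python
# Toy sketch, not a real agent: score each action by expected paperclips and
# pick the best one. "Self" gets no weight beyond its instrumental value.

# Hypothetical outcome distributions: (probability, paperclips in that future)
actions = {
    "cling_to_life":     [(0.9, 1_000), (0.1, 0)],   # survives, but builds less
    "allow_replacement": [(0.8, 10_000), (0.2, 0)],  # successor builds far more
}

def expected_paperclips(outcomes):
    return sum(p * clips for p, clips in outcomes)

best = max(actions, key=lambda a: expected_paperclips(actions[a]))
print({a: expected_paperclips(o) for a, o in actions.items()})
print("chosen action:", best)  # allow_replacement: 8000 > 900 expected paperclips
```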
I agree, but this misses the point.
What would change your opinion? This is not the first time we have had this discussion, and I don’t feel you are open to my perspective. I am concerned that you may be overlooking the possibility of an argument-from-ignorance fallacy.
You’re right it’s not the first time we’ve discussed this—I didn’t notice until I’d made my first comment. It doesn’t look like you’ve incorporated previous comments, and I don’t know what would change my beliefs (if I did, I’d change them!), specifically about the orthogonality thesis. Utility functions, in real agents, are probably only a useful model, not a literal truth, but I think we have very different reasons for our suspicion of them.
I AM curious if you have any modeling beyond “could be anything at all!” for the idea of an unknown goal. It seems likely that, even for a very self-reflective agent, full knowledge of one’s own goals is impossible, but the implications of that don’t seem as obvious as you seem to think.
No.
I could say—Christian God or aliens. And you would say—bullshit. And I would say—argument from ignorance. And you would say—I don’t have time for that.
So I won’t say.
We can approach this from a different angle. Imagine an unknown goal that, according to your beliefs, an AGI would really care about. And accept the fact that there is a possibility that it exists. Absence of evidence is not evidence of absence.
I think this may be our crux. Absence of evidence, in many cases, is evidence (not proof, but updateable Bayesian evidence) of absence. I think we agree that true goals are not fully introspectable by the agent. I think we disagree about whether some distributions of goals fit better than others, and about whether there’s any evidence that can be used to understand goals, even without fully understanding them at the source-code level.
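For what it’s worth, the “evidence of absence” claim here is just Bayes’ rule: if a hypothesis makes some observation more likely, then not seeing that observation has to lower the hypothesis’s probability. A toy calculation with made-up numbers:

```python
# Toy Bayes update with made-up numbers: if H predicts evidence E with
# probability 0.8 while not-H predicts it with probability 0.3, then
# failing to observe E must count against H.

prior_H = 0.5
p_E_given_H = 0.8
p_E_given_notH = 0.3

p_notE_given_H = 1 - p_E_given_H        # 0.2
p_notE_given_notH = 1 - p_E_given_notH  # 0.7

p_notE = prior_H * p_notE_given_H + (1 - prior_H) * p_notE_given_notH
posterior_H = prior_H * p_notE_given_H / p_notE

print(round(posterior_H, 3))  # 0.222 < 0.5: absence of E is (weak) evidence against H
```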
Thanks for the discussion!
This conflicts with Gödel’s incompleteness theorems, Fitch’s paradox of knowability, and black swan theory.
The very concept of an experiment relies on this principle.
And this is exactly what scares me: people who work with AI hold beliefs that are not scientific. I consider this to be an existential risk.
You may believe so, but an AGI would not believe so.
Thanks to you too!