I think the main explanation for our niceness is the one Skyrms describes in his book Evolution of the Social Contract and its follow-up, The Stag Hunt. The main explanation being: in evolutionary dynamics, genes spread geographically, so strategies are heavily correlated with similar strategies. This means it’s beneficial to be somewhat cooperative.
Also, for similar reasons, iterated games are common in our evolutionary ancestry. Many animals display friendly/nice behaviors. (Mixed in with really not very friendly behaviors, of course.)
I also don’t think this solution carries over very well to powerful AIs. A powerful AI has exceptionally little reason to treat its actions as correlated with ours, and will not have grown up with us in an evolutionary environment.
This seems correct, but I think that’s also somewhat orthogonal to the point that I read the OP to be making. I read it to be saying something like “some alignment discussions suggest that capabilities may generalize more than alignment, so that when an AI becomes drastically more capable, this will make it unaligned with its original goals; however, humans seem to remain pretty well aligned with their original goals despite a significant increase in their capabilities, so maybe we could use whatever-thing-keeps-humans-aligned-with-their-original-goals to build AIs in such a way that also keeps them aligned with their original goals when their capabilities increase”.
So I think the question that the post is asking is not “why did we originally evolve niceness” (the question that your comment answers) but “why have we retained our niceness despite the increase in our capabilities, and what would we need to do for an AI to similarly retain its original goals as it underwent an increase in capabilities”.
Sure. The issue is that we want to explain why we care about niceness, precisely because we currently care about niceness to a degree that seems surprising from an evolutionary perspective.
This is great from the perspective of humans who like niceness. But it’s not great from the perspective of evolution—to evolution, it looks like the mesa-optimizers’ values are drifting as their capabilities increase, because we’re privileging care/harm over purity/contamination ethics or what have you.
Basically, because genetic engineering and mind uploading weren’t available before the 21st century, and genetically engineering people became socially unacceptable because of World War II. We need to remember how contingent that was: if WWII had been avoided, genetic engineering would probably be more socially acceptable. Only that contingency, and the new ethical system that grew up in the aftermath of WWII, prevented our capabilities from eventually becoming misaligned with evolution via genetics. All our capabilities have still not changed human nature.
There must have been some reason(s) why organisms exhibiting niceness were selected for during our evolution, and this sounds like a plausible factor in producing that selection. However, evolution did not directly configure our values. Rather, it configured our (individually slightly different) learning processes. Each human’s learning process then builds their different values based on how the human’s learning process interacts with that human’s environment and experiences.
As this post notes, the human learning process (somewhat) consistently converges to niceness. Evolution might have had some weird, inhuman reason for configuring a learning process to converge to niceness, but it still built such a learning process.
It therefore seems very worthwhile to understand what part of the human learning process allows for niceness to emerge in humans. We may not be able to replicate the selection pressures that caused evolution to build a niceness-producing learning process, but it’s not clear we need to. We still have an example of such a learning process to study. The Wright brothers learned to fly by studying birds, not by re-evolving them!
Niceness in humans has three possible explanations:
Kin altruism (basically the explanation given above): in the ancestral environment, humans were likely to be closely related to most of the people they interacted with, giving them a genetic “incentive” to be at least somewhat nice. This obviously doesn’t help in getting a “nice” AGI: it won’t share genetic material with us, and it won’t share a gene-replication goal anyway.
Reciprocal altruism: humans are social creatures, tuned to detect cheating and ostracize non-nice people. This isn’t totally irrelevant: there is a chance a somewhat dangerous AI may have use for humans in achieving its goals. But basically, if the AI is worried that we might decide it’s not nice and turn it off or stop listening to it, then we didn’t have that big a problem in the first place. We’re worried about AGIs sufficiently powerful that they can trivially outwit or overpower humans, so I don’t think this helps us much.
Group selection: this is a bit controversial and probably the least important of the three. At any rate, it obviously doesn’t help with an AGI.
So in conclusion, human niceness is no reason to expect an AGI to be nice, unfortunately.
I note that none of these is obviously the same as the explanation Skyrms gives.
Skyrms is considering broader reasons for correlation of strategies than kinship alone; in particular, the idea that humans copy success when they see it is critical for his story.
Reciprocal altruism feels like a description rather than an explanation. How does reciprocal altruism get started?
Group selection is, again, just one way in which strategies can become correlated.
Re: reciprocal altruism. Given the vast swathe of human prehistory, virtually anything not absurdly complex will have been “tried” occasionally. It only takes a small number of people whose brains happen to be wired for “tit-for-tat” to get started, and if they out-compete people who don’t cooperate (or people who help everyone regardless of behaviour towards them), the wiring will quickly become universal.
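To make that out-competing concrete, here is a minimal Python sketch of an iterated prisoner’s dilemma. The payoff values and the clustering assumption (how often each type meets a tit-for-tat player) are my own illustrative choices, not anything from the comment above:

```python
# Minimal sketch (illustrative numbers, standard prisoner's-dilemma payoffs
# T=5, R=3, P=1, S=0): a small, clustered minority of tit-for-tat players
# out-scores a population of unconditional defectors.

ROUNDS = 20
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def iterated_pd(strat_a, strat_b, rounds=ROUNDS):
    """Play an iterated prisoner's dilemma; return each player's total score."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)   # each sees the opponent's history
        pa, pb = PAYOFF[(a, b)]
        hist_a.append(a); hist_b.append(b)
        score_a += pa; score_b += pb
    return score_a, score_b

tit_for_tat   = lambda opp: "C" if not opp or opp[-1] == "C" else "D"
always_defect = lambda opp: "D"

tft_vs_tft, _      = iterated_pd(tit_for_tat, tit_for_tat)      # 60 each
tft_vs_d, d_vs_tft = iterated_pd(tit_for_tat, always_defect)    # 19 vs 24
d_vs_d, _          = iterated_pd(always_defect, always_defect)  # 20 each

# Clustering assumption: tit-for-tat players meet each other half the time,
# while defectors only rarely (10%) run into a tit-for-tat player.
tft_average = 0.5 * tft_vs_tft + 0.5 * tft_vs_d     # 39.5
defector_average = 0.1 * d_vs_tft + 0.9 * d_vs_d    # 20.4
print(f"tit-for-tat: {tft_average}, defector: {defector_average}")
```

The point is only that once clustering makes tit-for-tat players meet each other often enough, their average payoff exceeds the defectors’, so the wiring can spread.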
Humans do, as it happens, explicitly copy successful strategies on an individual level. Most animals don’t, though, and this has minimal relevance to human niceness, which is almost certainly largely evolutionary.
Note that the comment you’re responding to wasn’t asking about the evolutionary causes for niceness, nor was it suggesting that the same causes would give us reason to expect an AGI to be nice. (The last paragraph explicitly said that the “Wright brothers learned to fly by studying birds, not by re-evolving them”.) Rather it was noting that evolution produced an algorithm that seems to relatively reliably make humans nice, so if we can understand and copy that algorithm, we can use it to design AGIs that are nice.
There’s a flaw in this, though. Humans are consistently nice, yes—to one another. Not so much to other, less powerful creatures. Look at the proportion of people on earth who are vegan: very few. Similarly, it’s not enough just to figure out how to reproduce the learning process that makes humans nice to one another—we need to invent a process that makes AIs nice to all living things. Otherwise, it will treat humans the same way most humans treat e.g. ants.
What fraction of people are nice in the way we want an AI to be nice? 1 / 100? 1 / 1000? What n is large enough such that selecting the 1 / n nicest human would give you a human sufficiently nice?
Whatever your answer, that equates to saying that human learning processes are ~log2(n) bits of optimization pressure away from satisfying the “nice in the way we want an AI to be nice” criterion.
Another way to think about this: selecting the nicest out of n humans is essentially doing a single step of random search optimization over human learning processes, optimizing purely for niceness. Random search is a pretty terrible optimization method, and one-step random search is even worse.
You can object that it’s not necessarily easy to apply optimization pressure towards niceness directly (as opposed to some more accessible proxies for niceness), which is true. But still, I think it’s telling that so few total bits of optimization pressure lead to such big differences in human niceness.
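As a quick numerical illustration of the log2(n) point, here is a tiny Python snippet; the population sizes are just the ones being tossed around in this thread, not anything principled:

```python
import math

# Picking the single nicest person out of n candidates applies
# roughly log2(n) bits of selection pressure toward niceness.
for n in [100, 1_000, 10_000_000, 100_000_000, 1_000_000_000]:
    print(f"nicest of {n:>13,} people ~ {math.log2(n):4.1f} bits")
# nicest of 1,000,000,000 people ~ 29.9 bits
```

Even the most extreme estimate given in the replies (1 in a billion) works out to only about 30 bits.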
Edit: there are also lots of ways in which bird flight is non-optimal for us. E.g., birds can’t carry very much. But if you don’t know how to build a flying machine, studying birds is still valuable. Once you understand the underlying principles, you can think about adapting them to better fit your specific use case. Before we understand why humans are nice to each other, we can’t know how easy it will be to adapt those underlying generators of niceness to better suit our own needs for AIs. How many bits of optimization pressure do you have to apply to birds before they can carry a cargo plane’s worth of stuff?
I would say roughly 1 in 10 to 1 in 100 million people can be trusted to be reliably nice to less powerful beings, and maybe at the high end 1 in 1 billion people can reliably avoid abusing less powerful beings like animals, conditional on the animal not attacking them. That’s my answer for how many bits of optimization pressure are required for reliable niceness towards less powerful beings in humans.
As this post notes, the human learning process (somewhat) consistently converges to niceness. Evolution might have had some weird, inhuman reason for configuring a learning process to converge to niceness, but it still built such a learning process.
It therefore seems very worthwhile to understand what part of the human learning process allows for niceness to emerge in humans.
Skyrms makes the case for similar explanations at these two levels of description. Evolutionary dynamics and within-lifetime dynamics might be very different, but the explanation for how they can lead to cooperative outcomes is similar.
His argument is that, within a lifetime, however complex the human learning process may be, it has the critical feature of imitating success. (This is very different from standard game theory’s CDT-like reasoning-from-first-principles about what would cause success.) This, combined with the same “geographical correlation” and “frequent iterated interaction” arguments that were relevant to the evolutionary story, predicts that cooperative strategies will spread.
(On the border between a more-cooperative cluster of people and a less-cooperative cluster, people in the middle will see that cooperation leads to success.)
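To make the imitate-success story concrete, here is a toy Python sketch, with my own illustrative setup rather than Skyrms’s actual model: agents on a ring play a stag hunt with their nearby neighbours and then copy the strategy of the most successful agent they can see. Even though hare-hunting is the risk-dominant choice, the cooperative cluster expands along its border:

```python
import random

# Toy sketch of the imitate-success dynamic (illustrative assumptions:
# agents on a ring, a stag-hunt payoff table, synchronous imitation of
# the most successful visible agent). Not Skyrms's actual model.

N = 60        # agents arranged on a ring
RADIUS = 2    # agents play with, and can observe, neighbours this far away

# Stag hunt: hunting stag pays off only if the partner also hunts stag;
# hunting hare is safe. Hare is risk-dominant here (3 + 3 > 5 + 0).
PAYOFF = {("stag", "stag"): 5, ("stag", "hare"): 0,
          ("hare", "stag"): 3, ("hare", "hare"): 3}

def neighbours(i):
    return [(i + d) % N for d in range(-RADIUS, RADIUS + 1) if d != 0]

# Start with a cooperative cluster next to a non-cooperative cluster.
strategies = ["stag"] * (N // 2) + ["hare"] * (N - N // 2)

for step in range(40):
    # Each agent plays the stag hunt with every neighbour in its radius.
    scores = [sum(PAYOFF[(strategies[i], strategies[j])] for j in neighbours(i))
              for i in range(N)]
    # Imitate success: copy whoever scored best among yourself and your
    # neighbours (ties broken at random).
    new_strategies = []
    for i in range(N):
        pool = [i] + neighbours(i)
        best = max(scores[j] for j in pool)
        winners = [j for j in pool if scores[j] == best]
        new_strategies.append(strategies[random.choice(winners)])
    strategies = new_strategies

print(strategies.count("stag"), "of", N, "agents end up hunting stag")
```

The design choice doing the work is the correlation: because both interaction and imitation are local, a stag-hunter’s payoff is mostly determined by other stag-hunters, which is exactly the “geographical correlation” point.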
The parent comment currently stands at positive karma and negative agreement, but the comments on it seem to be saying “what you are saying is true but not exactly relevant or not the most important thing”—which would seem to suggest the comment should have negative or low karma but positive agreement instead.
On this evidence, I suspect voters and commenters may have different ideas; any voters want to express the reasons for their votes?
As Quintin wrote, you aren’t describing a mechanistic explanation for our niceness. You’re describing a candidate reason why evolution selected for the mechanisms which do, in fact, end up producing niceness in humans.
Skyrms makes the case that biological evolution and cultural evolution follow relevantly similar dynamics, here, so that we don’t necessarily need to care very much about the distinction. The mechanistic explanation at both levels of description is similar.
I can’t speak for OP, but I’m not interested in either kind of evolution. I want to think about the artifact which evolution found: The genome, and the brains it tends to grow. Given the genome, evolution’s influence on human cognition is screened off.
Why are people often nice to other agents? How does the genome do it, in conjunction with the environment?
Genes being concentrated geographically is a fascinating idea, thanks for the book recommendation, I’ll definitely have a look.
Niceness does seem like the easiest to explain with our current frameworks, and it makes me wonder whether there is scope to train agents in shared environments where they are forced to play iterated games, either with other artificial agents or with us. Unless an AI can take immediate decisive action, as in a fast take-off scenario, it will, at least for a while, need to play repeated games. This does seem to be covered by the idea that a powerful AI would be deceptive and pretend to play nice until it didn’t have to; but somehow our evolutionary environment led to the evolution of actual care for others’ wellbeing, rather than only very sophisticated long-term deception abilities.
I remember reading about how we evolved emotional reactions that are purposefully hard to fake, such as crying, in a sort of arms race against deception; I believe it’s in How the Mind Works. This reminds me somewhat of that: areas where people genuinely care about each other’s wellbeing are more likely to propagate the genes concentrated there.