Charlie Steiner comments on Goal Alignment Is Robust To the Sharp Left Turn

Charlie Steiner 13 Jul 2022 22:55 UTC
41 points
22
I think for the last month for some reason, people are going around overstating how aligned humans are with past humans.
If you put people from 500 years ago in charge of the galaxy, they’d have screwed it up according to my standards. Bigotry, war, cruelty to animals, religious nonsense, lack of imagination and so on. And conversely, I’d screw up the galaxy according to their standards. And this isn’t just some quirky fact about 500 years ago, all of history and pre-history is like this, we haven’t magically circled back around to wanting to arrange the galaxy the same way humans from a million years ago would.
I think when people talk about how we are aligned with past humans, they are not thinking about how humans from 500 years ago used to burn cats alive for entertainment. They are thinking about how humans feel love, and laugh at jokes, and like the look of healthy trees and symmetrical faces.
But the thing is, those things seems like human values, not “what they would do if put in charge of the galaxy,” precisely because they’re the things that generalize well even to humans of other eras. Defining alignment as those things being preserved is painting on the target after the bullet has been fired.
Now, these past humans would probably drift towards modern human norms if put in a modern environment, especially if they start out young. (They might identify this as value drift and put in safeguards against it—the Amish come to mind—but they might not. I would certainly like to put in safeguards against value drift that might be induced by putting humans in weird future environments.) But if the original “humans are aligned with the past” point was supposed to be that humans’ genetic code unfolds into optimizers that want the same things even across changes of environment, this is not a reassurance.
- Daniel Kokotajlo 14 Jul 2022 20:17 UTC
  20 points
  10
  Parent
  I came here to make this comment, but since you’ve already made it, I will instead say a small note in the opposite direction, which is that even despite all the things you’ve said it still seems like past humans and present humans are mostly aligned. In that the CEV of past humans is probably OK by the standards of the CEV of present humans. Yes, a lot of work here is being done by the “CE” part—I’m claiming that after reflection the people in the past would probably be happy with fake cats rather than real cats, if they still wanted to torture cats at all.
  - Vladimir_Nesov 15 Jul 2022 13:48 UTC
    10 points
    4
    Parent
    The hypothesis is that CEV of past humans is fine from the point of view of CEV of modern humans. This is similar-to/predicted-by the generic value hypothesis I’ve been mulling over for the last month, which says that there is a convergent CEV for many agents with ostensibly different current volitions.
    
    This is plausible for agents that are not mature optimizers, so that the process of extrapolating their volition does more work in selecting the resulting preference than their initial attitudes. Extrapolation of the long reflection vibe could be largely insensitive to the initial attitudes/volition depending on how volition extrapolation works and what kind of thing are the values it primarily produces (something that traditionally isn’t a topic of meaningful discussion). If generic value hypothesis holds, it might put the CEV of a wide variety of AGIs (including those only very loosely aligned) close enough to CEV of humanity to prefer a valuable future. It’s more likely to hold for AGIs that have less legible preferences (don’t hold some proxy values as a strongly reflectively-endorsed optimization target, leaving less influence for volition extrapolation), and for larger coalitions of AGIs of different make, canceling out idiosyncrasies in their individual initial attitudes.
    
    I think this is unlikely to hold in the strong sense where cosmic endowment is used by probable AGIs in a way that’s seen as highly valuable by CEV of humanity. But I’m guessing it’s somewhat likely to hold in the weak sense where probable AGIs end up giving humanity a bit of computational welfare greater than literally nothing.
    What links here?
    Vladimir_Nesov's comment on I missed the crux of the alignment problem the whole time by zeshen (13 Aug 2022 12:08 UTC; 6 points)
- PeterMcCluskey 15 Jul 2022 21:42 UTC
  10 points
  1
  Parent
  I disagree.
  
  Part of what I’ve been trying to do in book reviews such as The Geography of Thought, WEIRDest People, and The Amish has been to illuminate how much of what we think of as value differences are mostly different strategies for achieving some widely shared underlying values such as prosperity, safety, happiness, and life satisfaction.
  
  If a human from a million years ago evaluated us by our policies, then I agree they’d be disappointed. But if they evaluated us by more direct measures of our quality of life, I’d expect them to be rather satisfied. The latter is most of what matters to me.
  
  I don’t like cat burning or religion. But opinions on those topics seem mostly unrelated to what I hope to see in 500 years.
  - Charlie Steiner 16 Jul 2022 6:45 UTC
    5 points
    0
    Parent
    This is a very good point. I’d sorta defend myself by claiming that “what would you do with the galaxy” (and how you rate that) is unusually determined by memetics compared to what you eat for breakfast (and how you rate that). What you eat for breakfast currently has a way bigger impact on your QOL, but it’s more closely tied to supervisory signals shared across humans.
    
    On the one hand, this means I’m picking on a special case, on the other hand, I think that special case is a pretty good analogy for building AI that becomes way more powerful after training.
- Shmi 13 Jul 2022 23:45 UTC
  2 points
  0
  Parent
  I think the point is that slow overall value drift is fine and normal, but a sudden change in values and forcing them on (the rest of) humanity is not so much. The parable of Murder Gandhi is not a terrible development and not something one ought to safeguard from. Instead, a sharp sudden change is the problematic development. Removing guardrails tends to lead to spectacular calamities, as we see throughout human history, so that is what we should hope an AGI would keep.
  - Yitz 15 Jul 2022 2:17 UTC
    4 points
    7
    Parent
    I would consider a gradual murder Gandhi to very much be a tragedy, personally speaking.
- Charlie Steiner 13 Jul 2022 22:58 UTC
  2 points
  0
  Parent
  If AIs are only human-level good at staying aligned, they might undergo value shifts that seem obvious to them the same way our shifts relative to humans 500 years ago seem obvious to us now, in hindsight, but that end with them being similarly misaligned. This would of course still represent significant progress over where we are now, but isn’t what I’d like to shoot for.
  And of course a major reason humans are “human-level good at staying aligned” is because we can’t edit our own source code or add extra grey matter. This is not going to be true for AGI, so “just copy a human design into silicon” probably fails.
- Thane Ruthenis 13 Jul 2022 23:28 UTC
  −1 points
  −3
  Parent
  That’s what I’m talking about when I speak of human object-level behavior differing quite a lot in the past compared to the present, and about a mesa-objective-aligned AI still potentially messing everything up because it’s being driven by biases and broken heuristics.
  If you put people from 500 years ago in charge of the galaxy, they’d have screwed it up according to my standards
  Even if they were given a billion subjective years to try to reason our their “true” robust values, and were warned that they currently might be biased and wrong in all sorts of ways? I dunno, it seems plausible to me that they’d still be able to converge towards something like this.
  And of course, an AGI should be in a somewhat better position than this anyway, inasmuch as it’d be more likely to have a concrete mesa-objective.
  - Noosphere89 14 Jul 2022 0:56 UTC
    1 point
    0
    Parent
    My answer is no, not because finding the one true morality is difficult, but because there is no objective morality and values, and values and morality can’t be derived from facts. Or, as the computing power and technology goes to infinity, morality is divergent, not convergent.