This isn’t addressing straw-Ngo/Shah’s objection? Yes, evolution optimized for fitness, and got adaptation-executors that invent birth control because they care about things that correlated with fitness in the environment of evolutionary adaptedness, and don’t care about fitness itself. The generalization from evolution’s “loss function” alone, to modern human behavior, is terrible and looks like all kinds of white noise.
But the generalization from behavior in the environment of evolutionary adaptedness, to modern human behavior, is … actually pretty good? Humans in the EEA told stories, made friends, ate food, &c., and modern humans do those things, too. There are a lot of quirks (like limited wireheading in the form of drugs, candy, and pornography), but it’s far from white noise. AI designers aren’t in the position of “evolution” “trying” to build fitness-maximizers, because they also get to choose the training data or “EEA”—and in that context, the analogy to evolution makes it look like some degree of “correct” goal generalization outside of the training environment is a thing?
Obviously, the conclusion here is not, “And therefore everything will be fine and we have nothing to worry about.” Some nonzero amount of goal generalization doesn’t mean the humans survive or that the outcome is good, because there are still lots of ways for things to go off the rails. (A toy not-even-model: if you keep 0.95 of your goals with each “round” of recursive self-improvement, and you need 100 rounds to discover the correct theory of alignment, you actually only keep 0.95^100 ≈ 0.006 of your goals.) We would definitely prefer not to bet the universe on “Train it, while being aware of inner alignment issues, and hope for the best”!! But it seems to me that the well-rehearsed “birth control, therefore paperclips” argument is missing a lot of steps?!
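(For concreteness, a minimal sketch of that toy not-even-model in Python; the 0.95 per-round retention and the 100 rounds are just the illustrative figures above, not estimates of anything real:)

```python
# Sketch of the toy not-even-model: if a fraction r of your goals survives
# each "round" of self-improvement, the fraction surviving n rounds is r**n.
def goals_retained(r: float, n: int) -> float:
    return r ** n

print(goals_retained(0.95, 100))  # ~0.0059, i.e. roughly 0.006
```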
I strongly agree. I think there are vastly better sources of evidence on how inner goals relate to outer selection criteria than “evolution → human values”, and that most of those better sources of evidence paint a much more optimistic picture of how things are likely to go with AI. I think there are lots of reasons to be skeptical of using “evolution → human values” as an informative reference class for AIs, some of which I’ve described in my reply to Ben’s comment.