Disagreements related to what we value seem to explain maybe 10% of the disagreements over AI safety. This post will try to explain how I think about which values I care about perpetuating to the distant future.
Robin Hanson helped to clarify the choices in Which Of Your Origins Are You?:
The key hard question here is this: what aspects of the causal influences that lead to you do you now embrace, and which do you instead reject as “random” errors that you want to cut out? Consider two extremes.
At one extreme, one could endorse absolutely every random element that contributed to any prior choice or intuition.
...
At the other extreme, you might see yourself as primarily the result of natural selection, both of genes and of memes, and see your core non-random value as that of doing the best you can to continue to “win” at that game. … In this view, everything about you that won’t help your descendants be selected in the long run is a random error that you want to detect and reject.
In other words, the more unique criteria we have about what we want to preserve into the distant future, the less we should expect to succeed.
Do Humans Have Many Terminal Values?
The Elephant in the Brain shows that we portray ourselves as having a wide variety of noble values, a large fraction of which are better explained by simple selfishness.
Self-oriented goals related to wealth, prestige, dominance, safety, and health can be largely explained as results of us valuing our own empowerment. The post Empowerment is (almost) All We Need expands on this idea.
Shard Theory tells us that most human values are context-dependent.
Those considerations lead me to conclude that most human values are better explained as instrumental values (i.e. subgoals) than as terminal values.
But before I go overboard in that direction, I need to remind myself that evolution produces messy results that can’t be classified as simply as I’d like. Take sex, for example. Evolution intended sex to be an instrumental value that helps to spread genes.
Evolution implemented sex drives in a way that caused humans to often treat sex more as a terminal goal. Is sex a genuinely terminal goal? Or is it a goal that’s contingent on certain hormones and sensory organs? As with most questions about my values, the answer seems context-dependent. Transhumanist culture seems to guide me toward treating sex as an instrumental value, whereas if I lived in a lower-tech sex-positive culture, I’d treat sex drives as sufficiently inevitable that I couldn’t distinguish sex from a terminal goal.
Human values haven’t evolved to fit neatly into a framework in which instrumental and terminal values are cleanly distinct categories.
So there is some complexity to human terminal values. But I don’t see how those values differ much from those of simpler animals. I claim that virtually all values that differ from those of other species should be classified as instrumental values.
Identity
Robin says in AIs Will Be Our Mind Children:
future human-level AIs are not co-existing competing aliens; they are instead literally our descendants. So if your evolved instincts tell you to fight your descendants due to their strangeness, that is a huge evolutionary mistake.
I’m bothered by the binary language that leads both Robin and the AI doomers to sound as if AIs will be either 100% or 0% our descendants.
I care how many of my genes (loosely defined so as to include culture) are embodied in our AI successors.
If I believed Eliezer’s model of how AI will function, I’d be almost as pessimistic as he is about AIs sharing my “genes”. My model of AI is a bit closer to Robin’s model. My best guess is that at least some AIs will be sufficiently human-like that they preserve some of what I value.
But in addition to that factual disagreement about how much AIs will resemble me, there’s an important value difference about how we should feel about AIs that perpetuate modest fractions of our “genes”.
Robin is expressing values that roughly correspond to what we should expect from evolved minds. But he sometimes sounds like a more altruistic utilitarian than is consistent with evolutionary values. (He sounds more worried about alien values when talking about the risk that the world will be dominated by Amish and Orthodox Jews.)
Eliezer seems to be expressing values that look more like the result of goal displacement (subgoal stomp), in which preservation of identity / values (presumably originating as a subgoal of empowerment or selective fitness) becomes a terminal goal.
That doesn’t mean Eliezer’s values are wrong. But they are harder to satisfy than Robin’s. So to the extent my preferences about descendants are subgoals that are amenable to change, I’ll prefer to nudge them a bit in Robin’s direction. In particular, I’ll try to be a bit more accepting of uploads that imperfectly preserve my identity.
I’m not too clear about the extent to which my values about descendants are subgoals or terminal goals. But since I see a pervasive human bias to overstate the extent to which our goals are terminal goals, I’m going to try to think about my values a bit more like Robin thinks, and a bit less like Eliezer thinks.
Uncertainty
Here are two analogies for why I’m concerned about aligning AI:
Mothers are quite selective about whose genes fertilize their eggs.
Fathers are often concerned about the paternity of the children that they’re supporting.
Robin sometimes sounds as if he’s rejecting these instincts.
Current AI efforts are on track to produce a somewhat arbitrary mix of whatever values are easiest to implement, and the values of the most influential AI developers.
To the extent AI values are chosen by ease of implementation, I reject the notion that those AIs are our descendants.
To the extent AI values replicate the culture of influential people, I feel a more normal mix of approval that people like me are reproducing, and concern about whether those descendants embody too much of other people’s values and not enough of mine.
I’d prefer an Age of Em to whatever it is we’re currently on track for. Alas, I have no clear plan for reaching a future in which ems are more important than more artificial AIs.
There are two different senses of partiality in preserving values: irreversible change, and loss of influence. Decisions about irreversible change benefit from reflection; deciding right now is predictably stupid. And since the quality of reflection doesn’t benefit from any irreversible change you can point at right now, it’s prudent to prevent all change in order to carefully develop the ability to choose it judiciously. On the other hand, lack of change is a competitive disadvantage that leads to loss of influence and inefficiency of reflection. But at least you don’t lose yourself in a predictably stupid way.
The synthesis is to address both: that’s the problem of creating a smarter-than-yourself assistant that’s aligned enough to give good advice on the process of reflection and to help with not losing the future in the meantime. The assistant is change, but putting your values in control makes it reversible.