This is not a bad point, despite the downvotes; as a question it would definitely belong in the recent AI FAQ thread. It’s not obvious from context that when we talk about being “aligned with human values” in AI safety, we tend to mean “aligned with our values directly”, rather than “has a morality system similar to a human’s”. In computer-science terms, “human values” here is a direct pointer to our values, rather than an instance of “the system of morality humans have.”
Imagine two people, Alice and Bob. Both of them have human values, but their values differ: each wants to help people, yet Alice values Alice more, and Bob values Bob more. Each of them has their own instance of “human values”.
Now let’s say Alice makes an AI, CAROL. For CAROL to be properly aligned, we wouldn’t want it to value itself more than Alice; we would want CAROL to have Alice’s values directly, not “the values Alice would have if Alice were CAROL.” If CAROL had its own instance of “human values”, it would want to help people but would value CAROL’s existence above anyone else’s. What we actually want is for CAROL to hold a pointer to a combination of Alice’s values and Bob’s values, and for this to extend across all humans.
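To make the pointer-versus-instance distinction concrete, here’s a rough Python sketch. The class, the utility numbers, and the way the humans’ values get combined are all just illustrative assumptions on my part, not anyone’s actual proposal:

```python
class HumanValues:
    """An *instance* of human values: help people, but weight the owner highest."""

    def __init__(self, owner):
        self.owner = owner

    def utility(self, person):
        # Everyone counts for something; the owner counts for more.
        return 2.0 if person == self.owner else 1.0


# Alice and Bob each run their own instance of "human values".
alice_values = HumanValues(owner="Alice")
bob_values = HumanValues(owner="Bob")

# Misaligned CAROL: she gets her own instance, so she weights CAROL
# above everyone else, exactly as Alice weights Alice.
carol_misaligned = HumanValues(owner="CAROL")

# Aligned CAROL: she holds a pointer to the humans' values (here, a crude
# sum over Alice's and Bob's instances), not a fresh copy of the machinery
# with the "owner" slot filled in by herself.
def carol_aligned_utility(person):
    return alice_values.utility(person) + bob_values.utility(person)

# Misaligned CAROL ranks herself above Alice: 2.0 vs 1.0
print(carol_misaligned.utility("CAROL"), carol_misaligned.utility("Alice"))

# Aligned CAROL ranks Alice above herself: 2.0 vs 3.0
print(carol_aligned_utility("CAROL"), carol_aligned_utility("Alice"))
```

The only point of the sketch is that the aligned version never gives CAROL’s own existence any special weight; whatever weight CAROL gets comes entirely from how the humans happen to value her.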
Thus, while you’re right that implanting an AI with “human values” in the sense of “the AI has a morality similar to ours” could cause it to treat us the way we treat animals, the approach I’ve heard advocated is to give the AI our specific morality system, which includes a strong preference for humans because we’re humans, even if that preference is arbitrary.