[…] we could all be killed, enslaved or forcibly contained.
This is what we’re doing every day to billions of our fellow sentient beings. Maybe a “superior” AI doing that to us would actually be fully aligned with human values.
From what I understand, it is extremely unlikely that an AI would fail in a way that 1) kills, enslaves, or forcibly contains humans while also 2) benefits other sentient beings. If the AI fails, it’d be something dumb, like turning everything into paperclips.
This is not a bad point, despite the downvotes; as a question it would definitely belong in the recent AI FAQ thread. It’s not obvious from context that when we talk about “aligned with human values” in AI safety, we tend to mean “aligned with our values directly” rather than “has a morality system similar to a human’s”. In computer-science terms, “human values” is a direct pointer to our values, not an instance of “the system of morality humans have.”
Imagine two people, Alice and Bob. Both have human values, but their values differ: each wants to help people, but Alice values Alice more, and Bob values Bob more. They each have their own instance of “human values”.
Now let’s say Alice made an AI, CAROL. In order to be properly aligned, we wouldn’t want CAROL to value itself more than Alice—we would want CAROL to have Alice’s values directly, not “the values Alice would have if Alice were CAROL.” If CAROL had an instance of “human values”, CAROL would want to help people but would value CAROL’s existence above anyone else’s. Instead, we want CAROL to have a combination of Alice’s values and Bob’s values, and we want this to extend across all humans.
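To make the pointer/instance distinction concrete, here is a minimal sketch in Python. The names (HumanValues, values_pointer), the weights, and the aggregation rule are illustrative assumptions of mine, not an actual alignment proposal; the only point is that an agent given its own instance of the morality system privileges itself, while an agent pointed at the humans’ combined values does not.

```python
# Illustrative sketch only: names, weights, and the aggregation rule are
# assumptions, not a real alignment scheme.

class HumanValues:
    """An *instance* of the human morality system: whoever instantiates
    it ends up valuing its own owner above everyone else."""
    def __init__(self, owner):
        self.owner = owner

    def weight(self, person):
        # Each instance privileges its own owner.
        return 2.0 if person == self.owner else 1.0


def values_pointer(people):
    """A *pointer* to humanity's values: score people by everyone's
    combined weights, rather than by a fresh copy of the system that
    privileges whoever holds it."""
    def weight(person):
        return sum(HumanValues(p).weight(person) for p in people)
    return weight


# Misaligned CAROL: her own instance of "human values" privileges CAROL.
carol_instance = HumanValues("CAROL")
print(carol_instance.weight("CAROL"))  # 2.0
print(carol_instance.weight("Alice"))  # 1.0

# CAROL as described above: points at Alice's and Bob's combined values
# and gets no special weight for herself.
carol_pointer = values_pointer(["Alice", "Bob"])
print(carol_pointer("Alice"))  # 3.0  (Alice's self-weight plus Bob's weight for her)
print(carol_pointer("CAROL"))  # 2.0  (no one privileges CAROL)
```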
Thus, while you’re right that implanting an AI with “human values” in the sense of “the AI has a morality similar to ours” could cause it to treat us the way we treat animals, the approach I’ve heard advocated is to give the AI our specific morality system, one that includes a strong preference for humans because we’re humans, even if that preference is arbitrary.