To start off, I think we would all agree that “niceness” isn’t a basic feature of reality. This doesn’t, of course, mean that AIs won’t learn a concept directly corresponding to human “niceness”, or that some part of their value system won’t end up hooked up to that “niceness” concept. On the contrary, inasmuch as “niceness” is a natural abstraction, we should expect both of these things to happen.
But we should still keep in mind that “niceness” isn’t irreducibly simple: it can be decomposed into lower-level concepts, mixed with other lower-level concepts, then re-compiled into some different high-level concept that would both (1) score well on whatever value functions (/shards) in the AI respond to “niceness”, and (2) be completely alien and value-less to humans.
And this is what I’d expect to happen. Consider the following analogies:
A human is raised in some nation with some culture. That human ends up liking some aspects of that culture, and disliking other aspects. When we evaluate the overall concept of “this nation’s culture” using the human’s value system, the culture scores highly positive: the human loves their homeland.
But if we fine-grain their evaluation and give the human the ability to arbitrarily rewrite the culture at any level of fidelity… the human would likely end up introducing quite a lot of changes, such that the resultant culture wouldn’t resemble the original one at all. The new version might, in fact, end up looking abhorrent to other people who also overall-liked the initial culture, but in ways orthogonal or opposed to our protagonist’s.
The new culture would still retain all of the aspects the human did like. But it would, in expectation, diverge from the original along all other dimensions.
Our civilization likes many animals, such as dogs. But we may also like to modify them along various dimensions, such as making them more obedient, or prettier, or less aggressive, or less messy. On a broader scale, some of us would perhaps like to make all animals vegetarian, because they view prey animals as moral patients. Others would be fine with replacing animals with easily-reprogrammable robots, because they don’t consider animals to have moral worth.
As a result, many human cultures/demographics that love animals, if given godlike power, would decompose the “animal” concept and put together some new type of entity that would score well on all of their animal-focused value functions, but which may not actually be an “animal” in the initial sense.
The utility-as-scored-by-actual-animals might end up completely driven out of the universe in the process.
An anti-example is many people’s love for other people. Most people, even if given godlike power, wouldn’t want to disassemble their friends and put them back together in ways that appeal to their aesthetics better.
But it’s a pretty unusual case (I’ll discuss why a bit more later). The default case of valuing some abstract system very much permits disassembling it into lower-level parts and building something more awesome out of its pieces.
Or perhaps you think “niceness” isn’t about consequentialist goals, but about deontological actions. Perhaps AIs would end up “nice” in the sense that they’d have constraints on their actions such as “don’t kill people”, or “don’t be mean”. Well:
The above arguments apply. “Be a nice person” is a value function defined over an abstract concept, and the underlying “niceness” might be decomposed into something that satisfies the AI’s values better, but which doesn’t correspond to human-style niceness at all.
This is a “three laws of robotics”-style constraint: a superintelligent AGI that’s constrained to act nice, but which doesn’t have ultimately nice goals, would find a way to bring about a state of its (human-less) utopia without actually acting “mean”. Consider how we can wipe out animals as mere side-effects of our activity, or how a smart-enough human might end up disempowering their enemies without ever backstabbing or manipulating others.
As a more controversial example, we also have evolution. Humans aren’t actually completely misaligned with its “goals”: we do want to procreate, we do want to outcompete everyone else and consume all resources. But inasmuch as evolution has a “utility function”, it’s more accurately stated as “maximize inclusive genetic fitness”, and we may end up wiping out the very concept of “genes” in the course of our technology-assisted procreation.
So although we’re still “a bit ‘nice’” to evolution’s goals, that “niceness” is incomprehensibly alien from its own (metaphorical) point of view.
I expect something similar to happen as an AGI undergoes self-reflection. It would start out “nice”, in the sense that it’d have a “niceness” concept with some value function attached to it. But it’d then drop down to a lower level of abstraction, disassemble its concepts of “niceness” or “a human”, then re-assemble them into something that’s just as valuable or more valuable from its own perspective, but which (1) is more compatible with its other values (the same way we’d e.g. change animals not to be aggressive towards us, to satisfy our value of “avoid pain”), and (2) is completely alien and potentially value-less from our perspective.
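To make the “decompose and re-assemble” worry slightly more concrete, here is a deliberately toy sketch. Everything in it (the feature space, the proxy scorer, the search procedure) is invented for illustration and is not a claim about how real learned value functions work: a crude “niceness” scorer is fit to human-labelled examples, and unconstrained search over the underlying features then finds a point that scores far higher than any human example while bearing no resemblance to them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "lower-level features" of situations humans would label as nice:
# the genuinely-nice examples occupy a small region of feature space.
nice_examples = rng.normal(loc=1.0, scale=0.3, size=(100, 5))

# A crude proxy value function: a linear scorer whose weights are just the
# mean of the human-labelled examples.
weights = nice_examples.mean(axis=0)

def niceness_score(x):
    return weights @ x

# Unconstrained hill-climbing over the raw features
# ("decompose the concept and recombine the pieces however scores best").
x = np.zeros(5)
for _ in range(1000):
    candidate = x + rng.normal(scale=0.5, size=5)
    if niceness_score(candidate) > niceness_score(x):
        x = candidate

best_human_score = max(niceness_score(e) for e in nice_examples)
nearest_example_distance = np.linalg.norm(nice_examples - x, axis=1).min()
print(f"optimized score: {niceness_score(x):.1f}  (best human example: {best_human_score:.1f})")
print(f"distance from the nearest human example: {nearest_example_distance:.1f}")
```

The numbers don’t matter; the point is that “scores well on whatever responds to ‘niceness’” and “resembles what humans meant by niceness” come apart as soon as the optimizer is free to vary the lower-level features directly.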
One important factor here is that “humans” aren’t “agents” the way Paul is talking about. Humans are very complicated hybrid systems that sometimes function as game-theoretic agents, sometimes are better approximated as shard ecosystems, et cetera. So there’s a free-ish parameter in how exactly we decide to draw the boundaries of a human’s agency; there isn’t a unique solution for how to validly interpret a “human” as a “weak agent”.
See my comment here, for example. When we talk about “a human’s values”, which of the following are we talking about?:
The momentary desires and urges currently active in the human’s mind.
Or: The goals that the human would profess to have if asked to immediately state them in human language.
Or: The goals that the human would write down if given an hour to think and the ability to consult their friends.
Or: Some function/agglomeration of the value functions learned by the human, including the unconscious ones.
Or: The output of some long-term self-reflection process (which can itself be set up in many different ways, with the outcome sensitive to the details).
Or: Something else?
And so, even if the AGI-upon-reflection ends up “caring about weaker agents”, it might still end up wiping out humans-as-valued-by-us if it interprets “humans-as-agents” differently from how we would like to interpret them. (E.g., perhaps it’d just scoop out everyone’s momentary mental states, then tile the universe with copies of these states frozen in a moment of bliss, unchanging.)
There’s one potential exception: it’s theoretically possible that AIs would end up caring about humans the same way humans care about their friends (as above). But I would not expect that at all. In particular, because human concepts of mutual caring were subjected to a lot of cultural optimization pressure:
[The mutual-caring machinery] wasn’t produced by evolution. It wasn’t produced by the reward circuitry either, nor your own deliberations. Rather, it was produced by thousands of years of culture and adversity and trial-and-error.
A Stone Age or a medieval human, if given superintelligent power, would probably make life miserable for their loved ones, because they don’t have the sophisticated insights into psychology and moral philosophy and meta-cognition that we use to implement our “caring” function. [...]
The reason some modern people, who’ve made a concerted effort to become kind, can fairly credibly claim to genuinely care for others, is that their caring functions have been perfected. They’ve been perfected by generations of victims of imperfect caring, who pushed back on the imperfections, and by scientists and philosophers who took such feedback into account and compiled ever-better ways to care about people in a way that care-receivers would endorse. And care-receivers having the power to force the care-givers to go along with their wishes was a load-bearing part of this process.
On any known training paradigm, we would not have as much fidelity and pushback on the AI’s values and behavior as humans had on their own values. So it wouldn’t end up caring about humans the way humans care about their friends; it’d care about humans the way humans care about animals or cultures.
And so it’d end up recombining the abstract concepts comprising “humanity” into some other abstract structure that ticks off all the boxes “humanity” ticked off, but which wouldn’t be human at all.
I hope “the way humans care about their friends” is another natural abstraction, something like “my utility function includes a link to your utility function”. But we still don’t know how to direct an AI to a specific abstraction, so it’s not a big hope.
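For concreteness, the simple “linked utilities” picture gestured at here is something like the following (a standard toy formalization, purely illustrative; the weight is an invented parameter, and nothing here is something we currently know how to point an AI at):

```latex
% Toy "linked utility" sketch: agent A's overall utility puts a fixed,
% unconditional weight \alpha on agent B's utility.
U_A(x) = u_A(x) + \alpha \, u_B(x), \qquad \alpha > 0
```

Note that the weight here is unconditional and context-free; the replies below argue that whatever humans actually implement is considerably more conditional than this.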
My model is that friendship is one particular strategy for alliance-formation that happened to evolve in humans. I expect this is natural in the sense of being a local optimum (in the ancestral environment), but probably not in the sense of being simple to formally define or implement.
I think friendship is substantially more complicated than “I care some about your utility function”. For instance, you probably stop valuing a friend’s utility function if they betray you (friendship can “break”). I also think the friendship algorithm includes a bunch of signalling to help with coordination (so that you understand the other person is trying to be friends), and some less-pleasant stuff like evaluations of how valuable an ally the other person is and how the friendship will affect your social standing.
Friendship also appears to include some sort of check that the other person is making friendship-related decisions using system 1 instead of system 2--possibly as a security feature to make it harder for people to consciously exploit (with the unfortunate side-effect that we penalize system-2 thinkers even when they sincerely want to be allies), or possibly just because the signalling parts evolved for system 1 and don’t generalize properly.
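As a purely illustrative contrast with the unconditional “linked utility” equation above, here is a toy sketch of how much extra machinery even this simplified description of friendship implies. Every field, check, and threshold below is invented for illustration; it is not a claim about the actual algorithm humans run.

```python
from dataclasses import dataclass

@dataclass
class FriendModel:
    """Toy state one agent tracks about a potential friend (illustrative only)."""
    has_betrayed: bool = False       # friendship can "break"
    signals_friendship: bool = True  # reciprocal coordination signalling
    seems_system_1: bool = True      # crude "are they being sincere?" check
    ally_value: float = 0.5          # how valuable an ally they are
    reputation_cost: float = 0.0     # effect of the friendship on your social standing

def caring_weight(friend: FriendModel) -> float:
    """How much weight to put on the friend's utility: the toy analogue of the
    unconditional alpha above, except now heavily conditional."""
    if friend.has_betrayed:
        return 0.0  # betrayal zeroes out the caring
    if not friend.signals_friendship or not friend.seems_system_1:
        return 0.0  # no reciprocal signal, or the caring looks consciously calculated
    # Even when caring is "on", its strength tracks instrumental considerations.
    return max(0.0, 0.3 + 0.5 * friend.ally_value - friend.reputation_cost)
```

Even as a caricature, this is a conditional, strategic object rather than a single clean term in a utility function, which is the sense in which “the way humans care about their friends” looks like a local optimum of the ancestral environment rather than a simple abstraction.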
(One could claim that “the true spirit of friendship” is loving someone unconditionally or something, and that might be simple, but I don’t think that’s what humans actually implement.)
One could claim that “the true spirit of friendship” is loving someone unconditionally or something, and that might be simple, but I don’t think that’s what humans actually implement.
Yeah, I agree that humans implement something more complex. But it is what we want the AI to implement, isn’t it? And it looks like it may be quite a natural abstraction to have.
(But again, it’s useless while we don’t know how to direct an AI to that specific abstraction.)
Then we’re no longer talking about “the way humans care about their friends”; we’re inventing new hypothetical algorithms that we might like our AIs to use. Humans no longer provide an example of how that behavior could arise naturally in an evolved organism, nor a case study of how it works out for people to behave that way.