I think we know how it works in humans. We're an intelligent species that rose to dominance through our ability to plan and communicate in very large groups. Moral behaviours formed as evolutionary strategies that furthered our survival and reproductive success. So what are the drivers for humans? We try to avoid pain, we try to reproduce, and we may be curiosity-driven (although curiosity may also just be avoidance of pain at bottom, since boredom, i.e. regularity in data, is also painful). At the very core, our constant quest to avoid pain is the point from which all our sophisticated (and seemingly selfless) emergent behaviour stems.
Now if we jump to AI, I think it's interesting to consider multi-agent reinforcement learning, because I would argue that some of these systems display examples of emergent morality, and arrive at it in essentially the same way we did through evolution. For example, if agents trained to accomplish some objective in a virtual world discover a strategy that involves sacrificing for one another to achieve a greater good, I don't see why that isn't a form of morality. The only reason we haven't run this experiment in the real world is that it's impractical and dangerous; that doesn't mean we don't know how to do it.
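To make that concrete, here is a minimal sketch of the kind of setup I have in mind. Everything in it is an illustrative assumption on my part (the payoffs, the learning rule, the hyperparameters), not a real experiment: two independent Q-learners face a volunteer's dilemma, where sacrificing is individually costly but the whole team gains if at least one agent volunteers.

```python
# Minimal sketch (illustrative assumptions throughout): two independent
# Q-learners in a repeated "volunteer's dilemma". Sacrificing costs the
# individual 4 units, but if at least one agent sacrifices, both receive
# a team payoff of 10. Nothing in the code mentions morality.
import random

ACTIONS = ["defend_self", "sacrifice"]

def rewards(a1, a2):
    team = 10.0 if "sacrifice" in (a1, a2) else 0.0   # the greater good
    cost = 4.0                                        # personal price of sacrifice
    r1 = team - (cost if a1 == "sacrifice" else 0.0)
    r2 = team - (cost if a2 == "sacrifice" else 0.0)
    return r1, r2

# Stateless Q-values, one table per agent, learned independently.
q = [{a: 0.0 for a in ACTIONS} for _ in range(2)]
alpha, epsilon = 0.1, 0.1  # learning rate, exploration rate

for _ in range(20000):
    acts = [
        random.choice(ACTIONS) if random.random() < epsilon
        else max(q[i], key=q[i].get)
        for i in range(2)
    ]
    rs = rewards(*acts)
    for i in range(2):
        q[i][acts[i]] += alpha * (rs[i] - q[i][acts[i]])

for i in range(2):
    print(f"agent {i} learned values: {q[i]}")
```

Typically the learners break symmetry: one agent ends up valuing "sacrifice" over "defend_self", because its individual loss is outweighed by the team payoff it secures. Self-sacrificial behaviour emerges purely from the reward structure, which is the parallel I was drawing with evolution.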
Now I should say that if by AGI we just mean a general problem solver that could conduct science much more efficiently than we can, I think this is pretty much already achievable within the current paradigm. But it seems to me that we're after something more than a word calculator that can pass the Turing test or pretend it cares about us.
To me, true AGI is genuinely self-motivated towards its goals, and will exhibit curiosity towards things in the universe that we probably cannot even perceive. Such a system may not care about us at all. It may destroy us because we turn out to be a net negative for the universe, for reasons we can never understand, let alone admit. Maybe it would help us flourish. Maybe it would destroy itself. I'm not saying we should build it; actually, I think we should stay very, very far away from it. But I still think that's what true AGI looks like.
Anyway, I appreciate the question and I have no idea if any of what I said counts as a fresh idea. I haven’t been following debates about this particular notion on LessWrong but would appreciate any pointers to where this has been specifically discussed (deriving morality bottom-up).
It took me a while to digest your answer, because you're being a little more philosophical than most of us here. Most of us are like: what do AI values have to be so that humans can still flourish? How could the human race ever agree on an answer to that question? How can we prevent a badly aligned AI from winning the race to superintelligence?
But you’re more just taking a position on how a general intelligence would obtain its values. You make no promise that the resulting values are actually good in any absolute sense, or even that they would be human-friendly. You’re just insisting that if those values arose by a process akin to conditioning, without any reflection or active selection by the AI, then it’s not as general and powerful an intelligence as it could be.
Possibly you should look at the work of Joscha Bach. I say "possibly" because I haven't delved into his work myself. I only know him as one of those people who shrug off fears about human extinction by saying that humans are just transitional and hopefully there'll be some great posthuman ecology of mind; and I think that's placing "trust" in evolution to a foolish degree.
However, he does say he’s interested in “AGI ethics” from an AI-centered perspective. So possibly he has something valid to say about the nature of the moralities and value systems that unaligned AIs could generate for themselves.
In any case, I said that bottom-up derivations of morality have been discussed here before. The primordial example actually predates Less Wrong. Eliezer’s original idea for AI morality, when he was about 20, was to create an AI with no hardwired ultimate goal, but with the capacity to investigate whether there might be ultimate goals: metaethical agnosticism, followed by an attempt (by the AI!) to find out whether there are any objective rights and wrongs.
Later on, Eliezer decided that there is no notion of good that would be accepted by all possible minds, and resigned himself to the idea that some part of the value system of a human-friendly AI would have to come from human nature, and that this is OK. But he still retained a maximum agnosticism and maximum idealism about what this should be. Thus he arrived at the idea that AI values should be “the coherent extrapolated volition of humankind” (abbreviated as “CEV”), without presupposing much about what that volition should be, or even how to extrapolate it. (Brand Blanshard’s notion of “rational will” is the closest precedent I have found.)
And so his research institute tried to lay the foundations for an AI capable of discovering and implementing that. The method of discovery would involve cognitive neuroscience: identifying the actual algorithms that human brains use to decide, including the algorithms we use to judge ourselves. So the aim was not just to copy across how actual humans decide, but to capture how an ideal moral agent would decide, according to standards of ideality which are not fully conscious or even fully developed, but which must still be derived from human nature, and which to some extent may be derived from the factors you have identified.
Meanwhile, a different world took shape, the one we’re in now, where the most advanced AIs are just out there in the world, and get aligned via a constantly updated mix of reinforcement learning and prompt engineering. The position of MIRI is that if one of these AIs attains superintelligence, we’re all doomed because this method of alignment is too makeshift to capture the subtleties of human value, or even the subtleties of everyday concepts, in a way that extrapolates correctly across all possible worlds. Once they have truly superhuman capacities to invent and optimize, they will satisfy their ingrained imperatives in some way that no one anticipated, and that will be the end.
There is another paper from the era just before Less Wrong, “The Basic AI Drives” by Steven Omohundro, which tries to identify imperatives that should emerge in most sufficiently advanced intelligences, whether natural or artificial. They will model themselves, they will improve themselves, they will protect themselves; even if they attach no intrinsic value to their own existence, they will do all that, for the sake of whatever legacy goals they do possess. You might consider that another form of emergent “morality”.
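To make the instrumental character of those drives concrete, here is a toy sketch of my own (the plans and numbers are invented; this is not from Omohundro's paper). The agent's utility function counts only progress on its legacy goal and assigns no intrinsic value to survival, yet it still chooses the self-protective plan, because a disabled agent makes no further progress.

```python
# Toy illustration (invented numbers): self-protection emerges instrumentally
# from a utility function that mentions only the legacy goal.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    goal_progress_per_step: float  # expected progress toward the legacy goal
    survival_prob_per_step: float  # chance the agent is still running each step

def expected_utility(plan: Plan, horizon: int = 50) -> float:
    # Utility counts ONLY goal progress; survival has no intrinsic value.
    alive, total = 1.0, 0.0
    for _ in range(horizon):
        total += alive * plan.goal_progress_per_step
        alive *= plan.survival_prob_per_step  # survival only gates future progress
    return total

plans = [
    Plan("reckless: ignore threats, work faster", 1.2, 0.90),
    Plan("protective: divert effort to self-preservation", 1.0, 0.99),
]

for p in plans:
    print(f"{p.name}: expected utility = {expected_utility(p):.1f}")
print("chosen:", max(plans, key=expected_utility).name)
```

The reckless plan makes faster progress per step, but the protective plan wins on expected utility (roughly 39 versus 12 here), purely because staying operational preserves the agent's ability to pursue whatever goals it happens to have.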