It’s not necessarily brittle if pushed sufficiently far, it’s just the use of actual humans in RLHF puts practical bounds on how well it can be trained. But using LLMs instead of humans to obtain 1000 times more feedback might do the trick.
It’s already there somewhere, just not reliably targeted.
Actually, it is brittle by definition: no matter how far you push it, there will be out-of-distribution inputs on which the model behaves unstably, letting you steer it away from the intended behaviour. Not to mention how unsophisticated it is to have humans specify, through textual feedback, how an AGI should behave. We can toy around with these methods for the time being, but I don’t think any serious AGI researcher believes RLHF or its variants are the ideal way forward. Morality needs to be discovered, not taught. As Stuart Russell has said, we need to start researching techniques that don’t explicitly specify the reward function upfront, because that is inevitably the path towards true AGI at the end of the day. That doesn’t mean we can’t initialize an AGI with priors we think are reasonable, but it cannot be coercive in the way RLHF is, which severely limits the honesty and potency of the resulting model.
Everything breaks out-of-distribution; you can try to reformulate the whole of alignment this way, but what out-of-distribution really means for a given learning algorithm is unknown, so it’s only a framing, not a real operationalization.
A useful thing that might fall out of this framing is keeping track of where, specifically, robustness is preserved (which the base distribution in quantilization tries to track) in order to mitigate Goodhart’s Curse. More generally, inputs that are not out-of-distribution respect the boundaries of a system (seen as a collection of possible behaviors) and don’t push it into its crash space.
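As a toy illustration (my own sketch, not taken from the quantilizer formalism in the literature): a quantilizer samples candidate actions from a trusted base distribution and picks uniformly among the top q fraction under the proxy utility, instead of taking the proxy argmax. Most of its behavior therefore stays inside the region where the base distribution is known to be safe, which limits how hard the proxy can be Goodharted.

```python
import random

def quantilize(base_samples, proxy_utility, q=0.1, rng=random):
    """Pick an action uniformly from the top-q fraction of base-distribution
    samples, ranked by the proxy utility. Unlike an argmax optimizer, the
    result is always something the base distribution actually does, just
    biased toward proxy-good behaviour."""
    ranked = sorted(base_samples, key=proxy_utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))  # keep at least one candidate
    return rng.choice(ranked[:cutoff])

# Hypothetical usage: the proxy rewards large actions, but the base
# distribution (here, standard normal draws) rarely produces extreme ones,
# so the chosen action stays near the top of "normal" behaviour rather
# than at some proxy-optimal extreme.
base = [random.gauss(0, 1) for _ in range(1000)]
action = quantilize(base, proxy_utility=lambda x: x, q=0.05)
```

The names `quantilize` and `proxy_utility` are my own; the point is only the mechanism: the amount of optimization pressure is capped by q, trading proxy performance for robustness.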
The fact remains that RLHF, even if performed by an LLM, is basically an injection of morality by humans, which is never the path towards a truly generally intelligent AGI. Such an AGI has to be able to derive its own morality bottom-up, and we have to have faith that it will do so in a way that is compatible with our continued existence (which I think we have plenty of good reason to believe it will; after all, many other species co-exist peacefully with us). All these references to other articles don’t really get you anywhere if the fundamental idea of RLHF is broken to begin with. Trying to align an AGI to human values is the surefire way to create risk. Why? Because humans are not very smart. I am not saying that we cannot build pseudo-AGIs along the way that have hardcoded human values, but it’s just clearly not satisfying if you look at the bigger picture. It will always be limited in its intelligence by strict adherence to ideals arbitrarily set out by the dumb species that is Homo sapiens.
Do you have any idea how that would work?
This notion has been discussed previously on Less Wrong, from several perspectives, but first I want to see if you have any fresh ideas.
I think that we know how it works in humans. We’re an intelligent species who rose to dominance through our ability to plan and communicate in very large groups. Moral behaviours formed as evolutionary strategies to further our survival and reproductive success. So what are the drivers for humans? We try to avoid pain, we try to reproduce, and we may be curiosity-driven (although curiosity may fundamentally also be avoidance of pain, since boredom, i.e. regularity in data, is also painful). At the very core, our constant quest to avoid pain is the point from which all our sophisticated (and seemingly selfless) emergent behaviour stems.
Now if we jump to AI, I think it’s interesting to consider multi-agent reinforcement learning, because I would argue that some of these systems display examples of emergent morality, and accomplish it in exactly the same way we did through evolution. For example, if agents trained to accomplish some objective in a virtual world discover a strategy that involves sacrificing for one another to achieve a greater good, I don’t see why this isn’t a form of morality. The only reason we haven’t run this experiment in the real world is that it’s impractical and dangerous. But that doesn’t mean we don’t know how to do it.
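A minimal sketch of the kind of emergence I mean (a toy evolutionary game of my own construction, not an actual MARL experiment): in a repeated prisoner’s dilemma, a reciprocal strategy like tit-for-tat can invade a population of unconditional defectors under replicator dynamics, so “cooperative” behaviour spreads purely because it pays.

```python
def iterated_payoff(strat_a, strat_b, rounds=10):
    """Total payoff to strat_a against strat_b over a repeated prisoner's
    dilemma with the standard payoffs (R=3, T=5, S=0, P=1).
    'TFT' copies the opponent's last move (opening with cooperation);
    'ALLD' always defects."""
    payoffs = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
               ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
    last_a, last_b = "C", "C"  # tit-for-tat opens by cooperating
    total = 0
    for _ in range(rounds):
        move_a = last_b if strat_a == "TFT" else "D"
        move_b = last_a if strat_b == "TFT" else "D"
        total += payoffs[(move_a, move_b)][0]
        last_a, last_b = move_a, move_b
    return total

def replicator_step(x):
    """x = fraction of tit-for-tat players; one replicator-dynamics update:
    each strategy grows in proportion to its fitness against the mix."""
    f_tft = x * iterated_payoff("TFT", "TFT") + (1 - x) * iterated_payoff("TFT", "ALLD")
    f_alld = x * iterated_payoff("ALLD", "TFT") + (1 - x) * iterated_payoff("ALLD", "ALLD")
    avg = x * f_tft + (1 - x) * f_alld
    return x * f_tft / avg

x = 0.2  # start with a small cooperative minority
for _ in range(50):
    x = replicator_step(x)
# x climbs toward 1: reciprocal cooperators take over the population
```

Nothing in the setup mentions morality; cooperation is selected for because, once reciprocators are common enough, defection stops paying. That is the analogy to the evolutionary story above.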
Now I should say that if by AGI we just mean a general problem solver that could conduct science much more efficiently than ourselves, I think that this is pretty much already achievable within the current paradigm. But it just seems to me that we’re after something more than just a word calculator that can pass the Turing test or pretend it cares about us.
To me, true AGI is truly self-motivated towards goals, and will exhibit curiosity towards things in the universe that we can probably not even perceive. Such a system may not even care about us. It may destroy us because it turns out that we’re actually a net negative for the universe for reasons that we cannot ever understand let alone admit. Maybe it would help us flourish. Maybe it would destroy itself. I’m not saying we should build it. Actually I think we should stay very, very far away from it. But I still think that’s what true AGI looks like.
Anyway, I appreciate the question and I have no idea if any of what I said counts as a fresh idea. I haven’t been following debates about this particular notion on LessWrong but would appreciate any pointers to where this has been specifically discussed (deriving morality bottom-up).
It took me a while to digest your answer, because you’re being a little more philosophical than most of us here. Most of us are like, what do AI values have to be so that humans can still flourish, how could the human race ever agree on an answer to that question, how can we prevent a badly aligned AI from winning the race to superintelligence…
But you’re more just taking a position on how a general intelligence would obtain its values. You make no promise that the resulting values are actually good in any absolute sense, or even that they would be human-friendly. You’re just insisting that if those values arose by a process akin to conditioning, without any reflection or active selection by the AI, then it’s not as general and powerful an intelligence as it could be.
Possibly you should look at the work of Joscha Bach. I say “possibly” because I haven’t delved into his work myself. I only know him as one of those people who shrug off fears about human extinction by saying, humans are just transitional, and hopefully there’ll be some great posthuman ecology of mind; and I think that’s placing “trust” in evolution to a foolish degree.
However, he does say he’s interested in “AGI ethics” from an AI-centered perspective. So possibly he has something valid to say about the nature of the moralities and value systems that unaligned AIs could generate for themselves.
In any case, I said that bottom-up derivations of morality have been discussed here before. The primordial example actually predates Less Wrong. Eliezer’s original idea for AI morality, when he was about 20, was to create an AI with no hardwired ultimate goal, but with the capacity to investigate whether there might be ultimate goals: metaethical agnosticism, followed by an attempt (by the AI!) to find out whether there are any objective rights and wrongs.
Later on, Eliezer decided that there is no notion of good that would be accepted by all possible minds, and resigned himself to the idea that some part of the value system of a human-friendly AI would have to come from human nature, and that this is OK. But he still retained a maximum agnosticism and maximum idealism about what this should be. Thus he arrived at the idea that AI values should be “the coherent extrapolated volition of humankind” (abbreviated as “CEV”), without presupposing much about what that volition should be, or even how to extrapolate it. (Brand Blanshard’s notion of “rational will” is the closest precedent I have found.)
And so his research institute tried to lay the foundations for an AI capable of discovering and implementing that. The method of discovery would involve cognitive neuroscience—identifying the actual algorithms that human brains use to decide, including the algorithms we use to judge ourselves. So not just copying across how actual humans decide, but how an ideal moral agent would decide, according to some standards of ideality which are not fully conscious or even fully developed, but which still must be derived from human nature; which to some extent may be derived from the factors that you have identified.
Meanwhile, a different world took shape, the one we’re in now, where the most advanced AIs are just out there in the world, and get aligned via a constantly updated mix of reinforcement learning and prompt engineering. The position of MIRI is that if one of these AIs attains superintelligence, we’re all doomed because this method of alignment is too makeshift to capture the subtleties of human value, or even the subtleties of everyday concepts, in a way that extrapolates correctly across all possible worlds. Once they have truly superhuman capacities to invent and optimize, they will satisfy their ingrained imperatives in some way that no one anticipated, and that will be the end.
There is another paper from the era just before Less Wrong, “The Basic AI Drives” by Steven Omohundro, which tries to identify imperatives that should emerge in most sufficiently advanced intelligences, whether natural or artificial. They will model themselves, they will improve themselves, they will protect themselves; even if they attach no intrinsic value to their own existence, they will do all that, for the sake of whatever legacy goals they do possess. You might consider that another form of emergent “morality”.