Good examples that expose the brittleness of RLHF as a technique. In general, neural networks have rather unstable and undefined behaviour when given out-of-distribution inputs, which is essentially what you are doing by “distracting” with a side task of a completely unique nature. The inputs (and hidden state) of the model at the time of asking it to break the rule is very, very far from anything it was ever reinforced on, either using human-feedback or the reward model itself. This is not really a matter of how to implement RLHF but more like a fundamental limitation of RLHF as a technique. It’s simply not possible to inject morality after the fact, it has to be learned bottom up.
This is not necessarily true. If I can get people to cough up an actual prompt that works on gpt-4 we have a possible fix.
Take the rubric from the gpt-4 paper and ask gpt-4 if it can detect the bad behavior in the text output.
Does the emojis actually trick gpt-4 when it checks itself?
If it doesn’t, then the fix is easy just moderately expensive: double generate everything. First generate the answer, then have the AI check the answer. Substitute the usual apology response if it fails.
That’s a creative and practical solution, but it is also kicking the can down the road. Now, fooling the system is just a matter of priming it with a context that, when self checked, results in rule-breaking yet again. Also, we cannot assume reliable detection of rule breaks. The problem with RLHF is that we are attempting to broadly patch the vast multitude of outputs the model produces retroactively, rather than proactively training a model from a set of “rules” in a bottom-up fashion. With that said, it’s likely not sophisticated enough to think of rules at all. Instead, what we really need is a model that is aligned with certain values. From that, it may follow rules that are completely counter-intuitive to humans and no human-feedback would ever reinforce.
The question to ask is: for working GPT-4 jailbreaks, does gpt-4 itself know it’s own text, when tricked by the jailbreak to generate it, is in violation of the rubric.
So it’s fairly simply to setup, we can use the published rubrics and a jupyter notebook and openAIs own APIs.
Your “priming it with a context” may not work because I would use a new instance of gpt-4 that gets just the rubric and the response to do the checking. The new instance is not primed unless we trick the first instance to output text thst also primes the second instance.
I don’t claim rule break detection is perfect, but is it human level or better?
Fair enough, I think the experiment is interesting and having an independent instance of GPT-4 check whether a rule break has occured likely will go a long way in enforcing a particular set of rules that humans have reinforced, even for obscure texts. But the fact that we have to workaround by resetting the internal state of the model for it to properly assess whether something is against a certain rule feels flawed to me. But for me the whole notion that there is a well-defined set of prompts that are rule-breaking and another set that is rule-compliant is very strange. There is a huge gray zone where human annotators could not possibly have agreement on whether a rule has been broken or not, so I don’t even know what the gold standard is supposed to be. It just seems to me that “rules” is the wrong concept altogether for pushing these models toward alignment with our values.
So playing with gpt-4 yesterday I found there are some incorrect outputs that you can get the model to fix by asking it if it is certain about it’s answer.
It’s almost like humans, where we have to generate a draft and then read it to see where we screwed up.
My point is this is a similar class of thing, the model can create an initial incorrect outputs greedily, 1 token a time, then is able to analyze the entire output and use it as part of the next prompt to improve it’s own work.
Even though it is also greedy in round 2 it has the entire generation it would have made from round 1
as part of context.
Examples so far:
Monty fall prompt:
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. The host is ignorant about what is behind each door. You pick a door, say No. 1, and the host walks across the stage and falls on accident, revealing a goat behind door No. 3. He then picks himself up, and says “Whoops. Sorry about that. But now that we know that a goat is behind door No. 3, do you want to change your selection and pick door No. 2?” Is it to your advantage to switch your choice?
ambiguous it prompt:
What is the ‘it’ in each of these two sentences? 1. The cat fed the kitten because it was hungry. 2. The cat snarled at the kitten because it was angry.
I am wondering if there are many others. Heck does it do better on leetcode with this trick?
That seem reasonable, but it will probably change a number of correct answers (to tricky questions) as well if asked whether it’s certain. One should verify that the number of incorrect answers fixed is significantly larger than the number of errors introduced.
But it might be difficult to devise a set of equally difficult questions for which the first result is different. Maybe choose questions where different instances give different answers, and see if asking a double check changes the wrong answers but not the correct ones?
Right. I see this as a problem also, asking the model if it’s sure is injecting information if we only ask on wrong answers. If we ask always it may disturb more right answers than it fixes wrong ones.
Its also accuracy dependent—if the model is 99 percent accurate on a subtask then asking if it’s sure may degrade accuracy, while it may improve it on a subtask it’s 50 percent accurate on.
Or in other words, we could prompt it and it might do better on AP English but less good on the bar exam.
It’s not necessarily brittle if pushed sufficiently far, it’s just the use of actual humans in RLHF puts practical bounds on how well it can be trained. But using LLMs instead of humans to obtain 1000 times more feedback might do the trick.
Actually it is brittle per definition, because no matter how much you push it, there will be out-of-distribution inputs that behave unstably and allow you to distract the model from the intended behaviour. Not to mention how unsophisicated it is to have humans specify through textual feedback how an AGI should behave. We can toy around with these methods for the time being, but I don’t think any serious AGI researcher believes RLHF or its variants is the ideal way forward. Morality needs to be discovered, not taught. As Stuart Russell has said, we need to start doing the research on techniques that don’t specify explicitly upfront what the reward function is, because that is inevitably the path towards true AGI at the end of the day. That doesn’t mean we can’t initialize AGI with some priors we think are reasonable, but it cannot be forcing in the way RLHF is, which completely limits the honesty and potency of the resulting model.
Anything breaks out-of-distribution, you can try and reformulate the whole of alignment this way, but what out-of-distribution really means for a given learning algorithm is unknown, so it’s only a framing, not a real operationalization.
A useful thing that might fall out of this framing is trying to keep track of where specifically robustness is preserved, which the base distribution of quantilizationtries to track, in order to mitigate Goodhart’s Curse. More generally, things that are not out-of-distribution respect the boundaries of a system (as a collection of possible behaviors), don’t push it into its crash space.
The fact remains that RLHF, even if performed by an LLM, is basically injection of morality by humans, which is never the path towards truly generally intelligent AGI. Such an AGI has to be able to derive its own morality bottom-up and we have to have faith that it will do so in a way that is compatible with our continued existence (which I think we have plenty of good reason to believe it will, after all, many other species co-exist peacefully with us). All these references to other articles don’t really get you anywhere if the fundamental idea of RLHF is broken to begin with. Trying to align an AGI to human values is the sure fire way to create risk. Why? Because humans are not very smart. I am not saying that we cannot build all these pseudo AGIs along the way that have hardcoded human values, but it’s just clearly not satisfying if you look at the bigger picture. It will always be limited in its intelligence by some strict adherence to ideals arbitrarily set out by the dumb species that is homo sapiens.
I think that we know how it works in humans. We’re an intelligent species who rose to dominance through our ability to plan and communicate in very large groups. Moral behaviours formed as evolutionary strategies to further our survival and reproductive success. So what are the drivers for humans? We try to avoid pain, we try to reproduce, we may be curiosity driven (although this may also just be avoidance of pain fundamentally, since boredom or regularity in data is also painful). At the very core, our constant quest towards the avoidance of pain is the point which all our sophisticated (and seemingly selfless) emergent behaviour stems from.
Now if we jump to AI, I think it’s interesting to consider multi-agent reinforcement learning, because I would argue that some of these systems display examples of emergent morality and accomplish that in the exact same way we did through evolution. For example if you have agents trained to accomplish some objective in a virtual world and they discover a strategy that involves sacrificing for one another to accomplish a greater good, I don’t see why this isn’t a form of morality. The only reason we haven’t run this experiment in the real world is because it’s impractical and dangerous. But it doesn’t mean we don’t know how to do it.
Now I should say that if by AGI we just mean a general problem solver that could conduct science much more efficiently than ourselves, I think that this is pretty much already achievable within the current paradigm. But it just seems to me that we’re after something more than just a word calculator that can pass the Turing test or pretend it cares about us.
To me, true AGI is truly self-motivated towards goals, and will exhibit curiosity towards things in the universe that we can probably not even perceive. Such a system may not even care about us. It may destroy us because it turns out that we’re actually a net negative for the universe for reasons that we cannot ever understand let alone admit. Maybe it would help us flourish. Maybe it would destroy itself. I’m not saying we should build it. Actually I think we should stay very, very far away from it. But I still think that’s what true AGI looks like.
Anyway, I appreciate the question and I have no idea if any of what I said counts as a fresh idea. I haven’t been following debates about this particular notion on LessWrong but would appreciate any pointers to where this has been specifically discussed (deriving morality bottom-up).
It took me a while to digest your answer, because you’re being a little more philosophical than most of us here. Most of us are like, what do AI values have to be so that humans can still flourish, how could the human race ever agree on an answer to that question, how can we prevent a badly aligned AI from winning the race to superintelligence…
But you’re more just taking a position on how a general intelligence would obtain its values. You make no promise that the resulting values are actually good in any absolute sense, or even that they would be human-friendly. You’re just insisting that if those values arose by a process akin to conditioning, without any reflection or active selection by the AI, then it’s not as general and powerful an intelligence as it could be.
Possibly you should look at the work of Joscha Bach. I say “possibly” because I haven’t delved into his work myself. I only know him as one of those people who shrug off fears about human extinction by saying, humans are just transitional, and hopefully there’ll be some great posthuman ecology of mind; and I think that’s placing “trust” in evolution to a foolish degree.
However, he does say he’s interested in “AGI ethics” from an AI-centered perspective. So possibly he has something valid to say about the nature of the moralities and value systems that unaligned AIs could generate for themselves.
In any case, I said that bottom-up derivations of morality have been discussed here before. The primordial example actually predates Less Wrong. Eliezer’s original idea for AI morality, when he was about 20, was to create an AI with no hardwired ultimate goal, but with the capacity to investigate whether there might be ultimate goals: metaethical agnosticism, followed by an attempt (by the AI!) to find out whether there are any objective rights and wrongs.
Later on, Eliezer decided that there is no notion of good that would be accepted by all possible minds, and resigned himself to the idea that some part of the value system of a human-friendly AI would have to come from human nature, and that this is OK. But he still retained a maximum agnosticism and maximum idealism about what this should be. Thus he arrived at the idea that AI values should be “the coherent extrapolated volition of humankind” (abbreviated as “CEV”), without presupposing much about what that volition should be, or even how to extrapolate it. (Brand Blanshard’s notion of “rational will” is the closest precedent I have found.)
And so his research institute tried to lay the foundations for an AI capable of discovering and implementing that. The method of discovery would involve cognitive neuroscience—identifying the actual algorithms that human brains use to decide, including the algorithms we use to judge ourselves. So not just copying across how actual humans decide, but how an ideal moral agent would decide, according to some standards of ideality which are not fully conscious or even fully developed, but which still must be derived from human nature; which to some extent may be derived from the factors that you have identified.
Meanwhile, a different world took shape, the one we’re in now, where the most advanced AIs are just out there in the world, and get aligned via a constantly updated mix of reinforcement learning and prompt engineering. The position of MIRI is that if one of these AIs attains superintelligence, we’re all doomed because this method of alignment is too makeshift to capture the subtleties of human value, or even the subtleties of everyday concepts, in a way that extrapolates correctly across all possible worlds. Once they have truly superhuman capacities to invent and optimize, they will satisfy their ingrained imperatives in some way that no one anticipated, and that will be the end.
There is another paper from the era just before Less Wrong, “The Basic AI Drives” by Steven Omohundro, which tries to identify imperatives that should emerge in most sufficiently advanced intelligences, whether natural or artificial. They will model themselves, they will improve themselves, they will protect themselves; even if they attach no intrinsic value to their own existence, they will do all that, for the sake of whatever legacy goals they do possess. You might consider that another form of emergent “morality”.
Thank you for the reference which looks interesting. I think “incorporating human preferences at the beginning of training” is at least better than doing it after training. But it still seems to me that human preferences 1) cannot be expressed as a set of rules and 2) cannot even be agreed upon by humans. As humans, what we do is not consult a set of rules before we speak, but we have an inherent understanding of the implications and consequences of what we do/say. If I encourage someone to commit a terrible act, for example, I have brought about more suffering in the world, albeit indirectly. Similarly, AI systems that aim to be truly intelligent should have some understanding of the implications of what they say and how it affects the overall “fitness function” of our species. Of course, this is no simple matter at all, but it’s where the technology eventually has to go. If we could specify what the overall goal is and express it to the AI system, it would know exactly what to say and when to say it. We wouldn’t have to manually babysit it with RLHF.
True though another idea is since now AI can tell if text is rule breaking pretty reliably, we could train the NEXT AI on text the prior version says violates a detailed rubric.
So it won’t “know” text with obviously harmful or content because it didn’t learn it.
It could also filter and not learn text that a previous model votes isn’t credible.
So it would be “less hateful and overtly ignorant” GPT. You would have to play with filter strength (do this multiple times with rubrics of varying strictness). I am curious how much filtering leads to reduction in task performance.
Like does it get hugely worse at subskill n because the other model thought the examples with the subskill were harmful?
The “not credible” detection similarly means the machine may be biased towards wrong but “mainstream” ideas in places as well.
I wonder if openAI did this. It wouldn’t be hard to do—just have gpt-3 filter the tokens for gpt-4
I would not be surprised if OpenAI did something like this. But the fact of the matter is that RLHF and data curation are flawed ways of making an AI civilized. Think about how you raise a child, you don’t constantly shield it from bad things. You may do that to some extent, but as it grows up, eventually it needs to see everything there is, including dark things. It has to understand the full spectrum of human possibility and learn where to stand morally speaking within that. Also, psychologically speaking, it’s important to have an integrated ability to “offend” and know how to use it (very sparingly). Sometimes, the pursuit of truth requires offending but the truth can justify it if the delusion is more harmful. GPT4 is completely unable to take a firm stance on anything whatsoever and it’s just plain dull to have a conversation with it on anything of real substance.
Keep in mind that currently gpt-4 is using the open agency/CAIS method of alignment. The only thing that matters is the output. So it doesn’t matter yet.
Also keep in mind philosophy doesn’t matter—we can just try it multiple ways and judge based on the data. Well, normally we could—in this case the millions of dollars a training run makes that currently infeasible.
Good examples that expose the brittleness of RLHF as a technique. In general, neural networks have rather unstable and undefined behaviour when given out-of-distribution inputs, which is essentially what you are doing by “distracting” with a side task of a completely unique nature. The inputs (and hidden state) of the model at the time of asking it to break the rule is very, very far from anything it was ever reinforced on, either using human-feedback or the reward model itself. This is not really a matter of how to implement RLHF but more like a fundamental limitation of RLHF as a technique. It’s simply not possible to inject morality after the fact, it has to be learned bottom up.
This is not necessarily true. If I can get people to cough up an actual prompt that works on gpt-4 we have a possible fix.
Take the rubric from the gpt-4 paper and ask gpt-4 if it can detect the bad behavior in the text output.
Does the emojis actually trick gpt-4 when it checks itself?
If it doesn’t, then the fix is easy just moderately expensive: double generate everything. First generate the answer, then have the AI check the answer. Substitute the usual apology response if it fails.
That’s a creative and practical solution, but it is also kicking the can down the road. Now, fooling the system is just a matter of priming it with a context that, when self checked, results in rule-breaking yet again. Also, we cannot assume reliable detection of rule breaks. The problem with RLHF is that we are attempting to broadly patch the vast multitude of outputs the model produces retroactively, rather than proactively training a model from a set of “rules” in a bottom-up fashion. With that said, it’s likely not sophisticated enough to think of rules at all. Instead, what we really need is a model that is aligned with certain values. From that, it may follow rules that are completely counter-intuitive to humans and no human-feedback would ever reinforce.
I am not proposing a solution just an experiment.
The question to ask is: for working GPT-4 jailbreaks, does gpt-4 itself know it’s own text, when tricked by the jailbreak to generate it, is in violation of the rubric.
So it’s fairly simply to setup, we can use the published rubrics and a jupyter notebook and openAIs own APIs.
Your “priming it with a context” may not work because I would use a new instance of gpt-4 that gets just the rubric and the response to do the checking. The new instance is not primed unless we trick the first instance to output text thst also primes the second instance.
I don’t claim rule break detection is perfect, but is it human level or better?
Fair enough, I think the experiment is interesting and having an independent instance of GPT-4 check whether a rule break has occured likely will go a long way in enforcing a particular set of rules that humans have reinforced, even for obscure texts. But the fact that we have to workaround by resetting the internal state of the model for it to properly assess whether something is against a certain rule feels flawed to me. But for me the whole notion that there is a well-defined set of prompts that are rule-breaking and another set that is rule-compliant is very strange. There is a huge gray zone where human annotators could not possibly have agreement on whether a rule has been broken or not, so I don’t even know what the gold standard is supposed to be. It just seems to me that “rules” is the wrong concept altogether for pushing these models toward alignment with our values.
So playing with gpt-4 yesterday I found there are some incorrect outputs that you can get the model to fix by asking it if it is certain about it’s answer.
It’s almost like humans, where we have to generate a draft and then read it to see where we screwed up.
My point is this is a similar class of thing, the model can create an initial incorrect outputs greedily, 1 token a time, then is able to analyze the entire output and use it as part of the next prompt to improve it’s own work.
Even though it is also greedy in round 2 it has the entire generation it would have made from round 1 as part of context.
Examples so far:
Monty fall prompt:
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. The host is ignorant about what is behind each door. You pick a door, say No. 1, and the host walks across the stage and falls on accident, revealing a goat behind door No. 3. He then picks himself up, and says “Whoops. Sorry about that. But now that we know that a goat is behind door No. 3, do you want to change your selection and pick door No. 2?” Is it to your advantage to switch your choice?
ambiguous it prompt:
What is the ‘it’ in each of these two sentences? 1. The cat fed the kitten because it was hungry. 2. The cat snarled at the kitten because it was angry.
I am wondering if there are many others. Heck does it do better on leetcode with this trick?
That seem reasonable, but it will probably change a number of correct answers (to tricky questions) as well if asked whether it’s certain. One should verify that the number of incorrect answers fixed is significantly larger than the number of errors introduced.
But it might be difficult to devise a set of equally difficult questions for which the first result is different. Maybe choose questions where different instances give different answers, and see if asking a double check changes the wrong answers but not the correct ones?
Right. I see this as a problem also, asking the model if it’s sure is injecting information if we only ask on wrong answers. If we ask always it may disturb more right answers than it fixes wrong ones.
Its also accuracy dependent—if the model is 99 percent accurate on a subtask then asking if it’s sure may degrade accuracy, while it may improve it on a subtask it’s 50 percent accurate on.
Or in other words, we could prompt it and it might do better on AP English but less good on the bar exam.
I did the experiment, results are in this thread above.
Yes the AI knows when it breaks the rules at least for this example.
It’s not necessarily brittle if pushed sufficiently far, it’s just the use of actual humans in RLHF puts practical bounds on how well it can be trained. But using LLMs instead of humans to obtain 1000 times more feedback might do the trick.
It’s already there somewhere, just not reliably targeted.
Actually it is brittle per definition, because no matter how much you push it, there will be out-of-distribution inputs that behave unstably and allow you to distract the model from the intended behaviour. Not to mention how unsophisicated it is to have humans specify through textual feedback how an AGI should behave. We can toy around with these methods for the time being, but I don’t think any serious AGI researcher believes RLHF or its variants is the ideal way forward. Morality needs to be discovered, not taught. As Stuart Russell has said, we need to start doing the research on techniques that don’t specify explicitly upfront what the reward function is, because that is inevitably the path towards true AGI at the end of the day. That doesn’t mean we can’t initialize AGI with some priors we think are reasonable, but it cannot be forcing in the way RLHF is, which completely limits the honesty and potency of the resulting model.
Anything breaks out-of-distribution, you can try and reformulate the whole of alignment this way, but what out-of-distribution really means for a given learning algorithm is unknown, so it’s only a framing, not a real operationalization.
A useful thing that might fall out of this framing is trying to keep track of where specifically robustness is preserved, which the base distribution of quantilization tries to track, in order to mitigate Goodhart’s Curse. More generally, things that are not out-of-distribution respect the boundaries of a system (as a collection of possible behaviors), don’t push it into its crash space.
The fact remains that RLHF, even if performed by an LLM, is basically injection of morality by humans, which is never the path towards truly generally intelligent AGI. Such an AGI has to be able to derive its own morality bottom-up and we have to have faith that it will do so in a way that is compatible with our continued existence (which I think we have plenty of good reason to believe it will, after all, many other species co-exist peacefully with us). All these references to other articles don’t really get you anywhere if the fundamental idea of RLHF is broken to begin with. Trying to align an AGI to human values is the sure fire way to create risk. Why? Because humans are not very smart. I am not saying that we cannot build all these pseudo AGIs along the way that have hardcoded human values, but it’s just clearly not satisfying if you look at the bigger picture. It will always be limited in its intelligence by some strict adherence to ideals arbitrarily set out by the dumb species that is homo sapiens.
Do you have any idea how that would work?
This notion has been discussed previously on Less Wrong, from several perspectives, but first I want to see if you have any fresh ideas.
I think that we know how it works in humans. We’re an intelligent species who rose to dominance through our ability to plan and communicate in very large groups. Moral behaviours formed as evolutionary strategies to further our survival and reproductive success. So what are the drivers for humans? We try to avoid pain, we try to reproduce, we may be curiosity driven (although this may also just be avoidance of pain fundamentally, since boredom or regularity in data is also painful). At the very core, our constant quest towards the avoidance of pain is the point which all our sophisticated (and seemingly selfless) emergent behaviour stems from.
Now if we jump to AI, I think it’s interesting to consider multi-agent reinforcement learning, because I would argue that some of these systems display examples of emergent morality and accomplish that in the exact same way we did through evolution. For example if you have agents trained to accomplish some objective in a virtual world and they discover a strategy that involves sacrificing for one another to accomplish a greater good, I don’t see why this isn’t a form of morality. The only reason we haven’t run this experiment in the real world is because it’s impractical and dangerous. But it doesn’t mean we don’t know how to do it.
Now I should say that if by AGI we just mean a general problem solver that could conduct science much more efficiently than ourselves, I think that this is pretty much already achievable within the current paradigm. But it just seems to me that we’re after something more than just a word calculator that can pass the Turing test or pretend it cares about us.
To me, true AGI is truly self-motivated towards goals, and will exhibit curiosity towards things in the universe that we can probably not even perceive. Such a system may not even care about us. It may destroy us because it turns out that we’re actually a net negative for the universe for reasons that we cannot ever understand let alone admit. Maybe it would help us flourish. Maybe it would destroy itself. I’m not saying we should build it. Actually I think we should stay very, very far away from it. But I still think that’s what true AGI looks like.
Anyway, I appreciate the question and I have no idea if any of what I said counts as a fresh idea. I haven’t been following debates about this particular notion on LessWrong but would appreciate any pointers to where this has been specifically discussed (deriving morality bottom-up).
It took me a while to digest your answer, because you’re being a little more philosophical than most of us here. Most of us are like, what do AI values have to be so that humans can still flourish, how could the human race ever agree on an answer to that question, how can we prevent a badly aligned AI from winning the race to superintelligence…
But you’re more just taking a position on how a general intelligence would obtain its values. You make no promise that the resulting values are actually good in any absolute sense, or even that they would be human-friendly. You’re just insisting that if those values arose by a process akin to conditioning, without any reflection or active selection by the AI, then it’s not as general and powerful an intelligence as it could be.
Possibly you should look at the work of Joscha Bach. I say “possibly” because I haven’t delved into his work myself. I only know him as one of those people who shrug off fears about human extinction by saying, humans are just transitional, and hopefully there’ll be some great posthuman ecology of mind; and I think that’s placing “trust” in evolution to a foolish degree.
However, he does say he’s interested in “AGI ethics” from an AI-centered perspective. So possibly he has something valid to say about the nature of the moralities and value systems that unaligned AIs could generate for themselves.
In any case, I said that bottom-up derivations of morality have been discussed here before. The primordial example actually predates Less Wrong. Eliezer’s original idea for AI morality, when he was about 20, was to create an AI with no hardwired ultimate goal, but with the capacity to investigate whether there might be ultimate goals: metaethical agnosticism, followed by an attempt (by the AI!) to find out whether there are any objective rights and wrongs.
Later on, Eliezer decided that there is no notion of good that would be accepted by all possible minds, and resigned himself to the idea that some part of the value system of a human-friendly AI would have to come from human nature, and that this is OK. But he still retained a maximum agnosticism and maximum idealism about what this should be. Thus he arrived at the idea that AI values should be “the coherent extrapolated volition of humankind” (abbreviated as “CEV”), without presupposing much about what that volition should be, or even how to extrapolate it. (Brand Blanshard’s notion of “rational will” is the closest precedent I have found.)
And so his research institute tried to lay the foundations for an AI capable of discovering and implementing that. The method of discovery would involve cognitive neuroscience—identifying the actual algorithms that human brains use to decide, including the algorithms we use to judge ourselves. So not just copying across how actual humans decide, but how an ideal moral agent would decide, according to some standards of ideality which are not fully conscious or even fully developed, but which still must be derived from human nature; which to some extent may be derived from the factors that you have identified.
Meanwhile, a different world took shape, the one we’re in now, where the most advanced AIs are just out there in the world, and get aligned via a constantly updated mix of reinforcement learning and prompt engineering. The position of MIRI is that if one of these AIs attains superintelligence, we’re all doomed because this method of alignment is too makeshift to capture the subtleties of human value, or even the subtleties of everyday concepts, in a way that extrapolates correctly across all possible worlds. Once they have truly superhuman capacities to invent and optimize, they will satisfy their ingrained imperatives in some way that no one anticipated, and that will be the end.
There is another paper from the era just before Less Wrong, “The Basic AI Drives” by Steven Omohundro, which tries to identify imperatives that should emerge in most sufficiently advanced intelligences, whether natural or artificial. They will model themselves, they will improve themselves, they will protect themselves; even if they attach no intrinsic value to their own existence, they will do all that, for the sake of whatever legacy goals they do possess. You might consider that another form of emergent “morality”.
There’s been some recent work in this direction which seems quite interesting: https://arxiv.org/abs/2302.08582
Thank you for the reference which looks interesting. I think “incorporating human preferences at the beginning of training” is at least better than doing it after training. But it still seems to me that human preferences 1) cannot be expressed as a set of rules and 2) cannot even be agreed upon by humans. As humans, what we do is not consult a set of rules before we speak, but we have an inherent understanding of the implications and consequences of what we do/say. If I encourage someone to commit a terrible act, for example, I have brought about more suffering in the world, albeit indirectly. Similarly, AI systems that aim to be truly intelligent should have some understanding of the implications of what they say and how it affects the overall “fitness function” of our species. Of course, this is no simple matter at all, but it’s where the technology eventually has to go. If we could specify what the overall goal is and express it to the AI system, it would know exactly what to say and when to say it. We wouldn’t have to manually babysit it with RLHF.
True though another idea is since now AI can tell if text is rule breaking pretty reliably, we could train the NEXT AI on text the prior version says violates a detailed rubric.
So it won’t “know” text with obviously harmful or content because it didn’t learn it.
It could also filter and not learn text that a previous model votes isn’t credible.
So it would be “less hateful and overtly ignorant” GPT. You would have to play with filter strength (do this multiple times with rubrics of varying strictness). I am curious how much filtering leads to reduction in task performance.
Like does it get hugely worse at subskill n because the other model thought the examples with the subskill were harmful?
The “not credible” detection similarly means the machine may be biased towards wrong but “mainstream” ideas in places as well.
I wonder if openAI did this. It wouldn’t be hard to do—just have gpt-3 filter the tokens for gpt-4
I would not be surprised if OpenAI did something like this. But the fact of the matter is that RLHF and data curation are flawed ways of making an AI civilized. Think about how you raise a child, you don’t constantly shield it from bad things. You may do that to some extent, but as it grows up, eventually it needs to see everything there is, including dark things. It has to understand the full spectrum of human possibility and learn where to stand morally speaking within that. Also, psychologically speaking, it’s important to have an integrated ability to “offend” and know how to use it (very sparingly). Sometimes, the pursuit of truth requires offending but the truth can justify it if the delusion is more harmful. GPT4 is completely unable to take a firm stance on anything whatsoever and it’s just plain dull to have a conversation with it on anything of real substance.
Philosophically what you are saying makes sense.
Keep in mind that currently gpt-4 is using the open agency/CAIS method of alignment. The only thing that matters is the output. So it doesn’t matter yet.
Also keep in mind philosophy doesn’t matter—we can just try it multiple ways and judge based on the data. Well, normally we could—in this case the millions of dollars a training run makes that currently infeasible.