There’s been some recent work in this direction which seems quite interesting: https://arxiv.org/abs/2302.08582
Thank you for the reference, which looks interesting. I think “incorporating human preferences at the beginning of training” is at least better than doing it after training. But it still seems to me that human preferences 1) cannot be expressed as a set of rules and 2) cannot even be agreed upon by humans. As humans, we do not consult a set of rules before we speak; we have an inherent understanding of the implications and consequences of what we do and say. If I encourage someone to commit a terrible act, for example, I have brought about more suffering in the world, albeit indirectly. Similarly, AI systems that aim to be truly intelligent should have some understanding of the implications of what they say and how it affects the overall “fitness function” of our species. Of course, this is no simple matter at all, but it’s where the technology eventually has to go. If we could specify the overall goal and express it to the AI system, it would know exactly what to say and when to say it. We wouldn’t have to manually babysit it with RLHF.
True, though another idea: since AI can now tell pretty reliably whether text breaks rules, we could train the NEXT AI only on text that the prior version says does not violate a detailed rubric.
So it won’t “know” obviously harmful content, because it never learned it.
It could also filter out, and thus not learn from, text that a previous model judges not credible.
So it would be a “less hateful and overtly ignorant” GPT. You would have to play with the filter strength (do this multiple times with rubrics of varying strictness). I am curious how much the filtering reduces task performance.
Like, does it get hugely worse at subskill n because the filtering model thought the examples exhibiting that subskill were harmful?
The “not credible” filtering similarly means the machine may end up biased towards wrong but “mainstream” ideas in places as well.
I wonder if OpenAI did this. It wouldn’t be hard to do: just have GPT-3 filter the training tokens for GPT-4.
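Roughly, the filtering pass could look something like this. A minimal sketch only: the judge interface, the threshold values, and the toy scoring function are hypothetical placeholders, not anything OpenAI is known to use.

```python
# Hypothetical sketch of the idea above: a prior "judge" model scores each
# document against a rubric, and only documents it considers unlikely to
# violate the rubric make it into the next model's training corpus.
# The judge is passed in as a callable so the sketch stays agnostic about
# which model/API actually produces the violation score.
from typing import Callable, Dict, Iterable, List, Tuple


def filter_corpus(
    documents: Iterable[str],
    judge: Callable[[str], float],  # returns estimated P(violates rubric), in [0, 1]
    strictness: float = 0.5,        # lower threshold = more aggressive filtering
) -> List[str]:
    """Keep only documents the judge scores below the violation threshold."""
    return [doc for doc in documents if judge(doc) < strictness]


def strictness_sweep(
    documents: List[str],
    judge: Callable[[str], float],
    levels: Tuple[float, ...] = (0.9, 0.5, 0.1),
) -> Dict[float, List[str]]:
    """Build one filtered corpus per strictness level, for the sweep mentioned above."""
    return {level: filter_corpus(documents, judge, level) for level in levels}


# Toy usage with a trivial stand-in judge (a real one would be something like
# GPT-3 scoring each document against a detailed rubric):
if __name__ == "__main__":
    corpus = ["a benign sentence", "an obviously hateful sentence"]
    toy_judge = lambda text: 0.95 if "hateful" in text else 0.05
    print(strictness_sweep(corpus, toy_judge))
```

Training one model per filtered corpus and comparing benchmark scores would then give a direct read on how much the filtering costs in task performance.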
I would not be surprised if OpenAI did something like this. But the fact of the matter is that RLHF and data curation are flawed ways of making an AI civilized. Think about how you raise a child: you don’t constantly shield it from bad things. You may do that to some extent, but as it grows up, eventually it needs to see everything there is, including dark things. It has to understand the full spectrum of human possibility and learn where to stand, morally speaking, within that. Also, psychologically speaking, it’s important to have an integrated ability to “offend” and to know how to use it (very sparingly). Sometimes the pursuit of truth requires offending, but the truth can justify it if the delusion is more harmful. GPT-4 is completely unable to take a firm stance on anything whatsoever, and it’s just plain dull to have a conversation with it on anything of real substance.
Philosophically, what you are saying makes sense.
Keep in mind that GPT-4 is currently using the open agency/CAIS method of alignment. The only thing that matters is the output, so it doesn’t matter yet.
Also keep in mind that the philosophy doesn’t matter: we can just try it multiple ways and judge based on the data. Well, normally we could, but in this case the millions of dollars a training run costs make that currently infeasible.