True, though another idea is that since AI can now tell pretty reliably whether text breaks a rule, we could hold back from the NEXT AI's training data any text the prior version says violates a detailed rubric.
So it won't “know” obviously harmful content, because it never learned it.
It could also filter out, and so never learn from, text that a previous model votes isn't credible.
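Roughly, the pipeline is the sketch below. Here `violation_score` is a stand-in for whatever rubric classifier the prior model would expose (an assumption for illustration, not a real API), and the same loop would cover the credibility filter with a different rubric.

```python
# Minimal sketch of rubric-based pre-training data filtering, assuming a
# hypothetical violation_score(text, rubric) call backed by the prior model.
# Nothing here is a real OpenAI API; it only shows the shape of the pipeline.
from typing import Callable, Iterable, Iterator

def filter_pretraining_corpus(
    chunks: Iterable[str],
    violation_score: Callable[[str, str], float],  # (text, rubric) -> score in [0, 1]
    rubric: str,
    strictness: float = 0.5,  # lower threshold = more aggressive filtering
) -> Iterator[str]:
    """Yield only the chunks the prior model does NOT flag as rubric violations."""
    for chunk in chunks:
        if violation_score(chunk, rubric) < strictness:
            yield chunk
```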
So it would be a “less hateful and overtly ignorant” GPT. You would have to play with the filter strength (i.e., do this multiple times with rubrics of varying strictness). I am curious how much of a reduction in task performance the filtering causes.
Like, does it get hugely worse at subskill n because the filtering model thought the examples exercising that subskill were harmful?
The “not credible” filter similarly means the model may end up biased toward wrong-but-“mainstream” ideas in places as well.
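The ablation that would answer both worries is straightforward to describe, if expensive to run: train one model per filter setting, then score each on a per-subskill benchmark and see where performance falls off. A rough sketch, with `evaluate` and the benchmark layout as assumed placeholders:

```python
# Sketch of the per-subskill ablation: compare models trained on corpora
# filtered at different strictness levels. `evaluate` and the benchmark
# structure are assumptions for illustration, not an existing harness.
from typing import Callable, Dict, List

def subskill_scores(
    models: Dict[float, object],                      # strictness -> trained model
    benchmark: Dict[str, List[dict]],                 # subskill name -> eval examples
    evaluate: Callable[[object, List[dict]], float],  # -> accuracy in [0, 1]
) -> Dict[str, Dict[float, float]]:
    """Accuracy per subskill at each filter strictness, for side-by-side comparison."""
    return {
        skill: {strictness: evaluate(model, examples)
                for strictness, model in models.items()}
        for skill, examples in benchmark.items()
    }
```

A sharp drop in one cell of that table, at one strictness level but not the next, would point to exactly the kind of subskill loss described above.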
I wonder if OpenAI did this. It wouldn't be hard to do: just have GPT-3 filter the training tokens for GPT-4.
I would not be surprised if OpenAI did something like this. But the fact of the matter is that RLHF and data curation are flawed ways of making an AI civilized. Think about how you raise a child: you don't constantly shield it from bad things. You may do that to some extent, but as it grows up, eventually it needs to see everything there is, including the dark things. It has to understand the full spectrum of human possibility and learn where to stand, morally speaking, within it. Also, psychologically speaking, it's important to have an integrated ability to “offend” and to know how to use it (very sparingly). Sometimes the pursuit of truth requires giving offense, but the truth can justify it if the delusion it dispels is more harmful. GPT-4 is completely unable to take a firm stance on anything whatsoever, and it's just plain dull to converse with on anything of real substance.
Philosophically, what you are saying makes sense.
Keep in mind that GPT-4 is currently aligned with the open agency/CAIS approach: the only thing that matters is the output, so the model's internal stance doesn't matter yet.
Also keep in mind that the philosophy doesn't matter: we could just try it multiple ways and judge based on the data. Well, normally we could; in this case the millions of dollars a training run costs make that currently infeasible.