The idea that we’re going to produce a similar amount of perfectly labeled data doesn’t seem plausible.
That’s not at all the idea. Allow me to quote myself:
Here’s what I think we could do. Internet text is vast – on the order of a trillion words. But we could label some of it as “true” and “false”. The rest will be “unknown”.
You must have missed the words “some of” in it. I’m not suggesting labeling all of the text, or even a large fraction of it. Just enough to teach the model the concept of right and wrong.
It shouldn’t take long, especially since I’m assuming a human-level ML algorithm here, that is, one with data efficiency comparable to that of humans.
Ah, I misread the quote you included from Nathan Helm-Burger. That does make more sense.
This seems like a good idea in general, and would probably make one of the things Anthropic is trying to do (find the “being truthful” neuron) easier.
I suspect this labeling and using the labels is still harder than you think though, since individual tokens don’t have truth values.
I looked through the links you posted and it seems like the push-back is mostly around things you didn’t mention in this post (prompt engineering as an alignment strategy).
I suspect this labeling and using the labels is still harder than you think though, since individual tokens don’t have truth values.
Why should they?
You could label each paragraph, for example. Then, when the LM is trained, the correct label could come before each paragraph, as a special token: <true>, <false>, <unknown> and perhaps <mixed>.
Then, during generation, you’d feed it <true> as part of the prompt, and insert it again whenever the model generates a paragraph break.
Similarly, you could do this on a per-sentence basis.
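To make the scheme concrete, here’s a minimal sketch of what the data formatting might look like. The label token names, the toy corpus, and the helper functions are all illustrative assumptions on my part, not anything specified in the proposal:

```python
# Sketch of per-paragraph truth labels as special tokens.
# Token names, corpus, and helpers are illustrative assumptions.

TRUE, FALSE, UNKNOWN, MIXED = "<true>", "<false>", "<unknown>", "<mixed>"

# A labeled corpus: only "some of" the text needs a human-assigned
# <true>/<false> label; everything else defaults to <unknown>.
corpus = [
    (TRUE,    "Water is composed of hydrogen and oxygen."),
    (FALSE,   "The moon is made of green cheese."),
    (UNKNOWN, "My neighbor probably waters his lawn too often."),
]

def to_training_text(labeled_paragraphs):
    """Prepend each paragraph with its label token, as the LM would see
    it during training."""
    return "\n\n".join(f"{label} {text}" for label, text in labeled_paragraphs)

def truthful_prompt(user_prompt):
    """At generation time, lead with <true>; the caller would also
    re-insert <true> after each paragraph break the model emits."""
    return f"{TRUE} {user_prompt}"

print(to_training_text(corpus))
print(truthful_prompt("Explain how photosynthesis works."))
```

The per-sentence variant is the same idea with labels attached at sentence boundaries instead of paragraph breaks; the trade-off is finer-grained truth values at the cost of more labeling effort.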