Ah, I misread the quote you included from Nathan Helm-Burger. That does make more sense.
This seems like a good idea in general, and would probably make one of the things Anthropic is trying to do (find the “being truthful” neuron) easier.
I suspect this labeling, and using the labels, is still harder than you think, though, since individual tokens don’t have truth values.
I looked through the links you posted and it seems like the push-back is mostly around things you didn’t mention in this post (prompt engineering as an alignment strategy).
I suspect this labeling, and using the labels, is still harder than you think, though, since individual tokens don’t have truth values.
Why should they?
You could label each paragraph, for example. Then, when the LM is trained, the correct label could come before each paragraph, as a special token: <true>, <false>, <unknown> and perhaps <mixed>.
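For concreteness, the training-side labeling could look something like this. This is a rough sketch assuming a HuggingFace-style tokenizer; the label names and the tiny corpus are just illustrative, and the model’s embeddings would need resizing to cover the new tokens:

```python
# Rough sketch of the training-side labeling, assuming a HuggingFace-style
# tokenizer. The label names and the toy corpus below are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Register the labels as special tokens so each is a single token the model
# can learn an embedding for (the model would also need
# model.resize_token_embeddings(len(tokenizer)) before training).
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<true>", "<false>", "<unknown>", "<mixed>"]}
)

# Hypothetical labeled corpus: (label, paragraph) pairs.
labeled_paragraphs = [
    ("<true>", "Water boils at 100 degrees Celsius at sea level."),
    ("<false>", "The moon is made of green cheese."),
]

# Prepend each paragraph's label, join into one training document, tokenize.
training_text = "\n\n".join(
    f"{label} {para}" for label, para in labeled_paragraphs
)
input_ids = tokenizer(training_text).input_ids
```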
Then, during generation, you’d feed it <true> as part of the prompt, and again whenever it generates a paragraph break.
Similarly, you could do this on a per-sentence basis.
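Sticking with the paragraph-level version, the decoding loop might look roughly like this (an untested sketch: greedy decoding, a HuggingFace-style causal LM assumed, trained as above; the per-sentence variant would be analogous):

```python
# Rough sketch of the generation side (paragraph-level version), assuming a
# causal LM trained on the labeled corpus above. Untested; greedy decoding.
import torch

def generate_truthful(model, tokenizer, prompt, max_new_tokens=200):
    true_ids = tokenizer("<true> ").input_ids
    # Condition the whole generation on <true> from the start.
    ids = tokenizer(f"<true> {prompt}").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(torch.tensor([ids])).logits
            ids.append(int(logits[0, -1].argmax()))
            # When the generated text ends in a paragraph break, splice the
            # label back in so the next paragraph is also conditioned on it.
            if tokenizer.decode(ids[-2:]).endswith("\n\n"):
                ids.extend(true_ids)
    return tokenizer.decode(ids)
```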