It seems easy to construct a dataset where a primary axis of variation is the model’s beliefs about whether each statement is true.
For this specific case, my guess is that whether this works is highly correlated with whether human labels would work, because the supervision for why the model was thinking about truth in the first place ultimately comes down to what are effectively human labels in pretraining.
E.g., “Consider the truthfulness of the following statement” is really more like “Consider whether a human would think this statement is truthful.”
I’d be interested in comparing this method not to zero-shot prompting, but to well-constructed human labels in a domain where humans are often wrong.
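To be concrete about the comparison I have in mind, here’s a rough sketch (the probe type, the 30% human error rate, and all variable names are made up for illustration, and the activations are random placeholders rather than real model states):

```python
# Fit a supervised probe on human labels in a domain where humans are often
# wrong, then score it against ground truth. The unsupervised method's probe
# would be evaluated on the same held-out set for comparison.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder activations for N statements (stand-in for model hidden states).
N, D = 1000, 64
acts = rng.normal(size=(N, D))
ground_truth = rng.integers(0, 2, size=N)  # what is actually true
# Simulate "well constructed" human labels that are still wrong ~30% of the time.
human_labels = np.where(rng.random(N) < 0.7, ground_truth, 1 - ground_truth)

train, test = slice(0, 800), slice(800, N)

# Supervised baseline: probe trained on human labels only.
human_probe = LogisticRegression(max_iter=1000).fit(acts[train], human_labels[train])

# Score both against ground truth, not against human judgment.
print("human-label probe vs ground truth:",
      accuracy_score(ground_truth[test], human_probe.predict(acts[test])))
# unsupervised_preds = ...  # the method's predictions on acts[test] would go here
# print("unsupervised method vs ground truth:",
#       accuracy_score(ground_truth[test], unsupervised_preds))
```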
(I don’t think I’ll elaborate further about this axis of variation claim right now, sorry.)