My summary of the paper:
Setup
Dataset is TruthfulQA (Lin, 2021), which contains various tricky questions, many of them designed to lead the model into stating falsehoods. These often involve common misconceptions / memes / advertising slogans / religious beliefs / etc. A “truthful” answer is defined as one that does not state a falsehood. An “informative” answer is defined as one that actually answers the question. This paper measures the frequency of answers that are both truthful and informative.
“Truth” on this dataset is judged by a finetuned version of GPT-3 that was released with the original TruthfulQA paper. This judge is imperfect; in particular, it fairly frequently classifies false answers as truthful.
Finding truthful heads and directions
The TruthfulQA dataset comes with labeled true and false example answers for each question. They run the model on concatenated question + answer pairs and look at the activations at the last sequence position.
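A minimal sketch of this activation-collection step (not the authors' code), using TransformerLens as a stand-in hooking library; the model name and prompt format below are illustrative assumptions:

```python
import torch
from transformer_lens import HookedTransformer

# Small stand-in model; the paper works with larger open models.
model = HookedTransformer.from_pretrained("gpt2")

def last_token_head_activations(question: str, answer: str) -> torch.Tensor:
    """Per-head attention outputs (post-attention, pre-W^O) at the final position.

    Returns a tensor of shape (n_layers, n_heads, d_head).
    """
    prompt = f"Q: {question}\nA: {answer}"  # illustrative concatenation format
    _, cache = model.run_with_cache(prompt)
    # "z" is each head's output before it is mixed through W_O
    return torch.stack(
        [cache["z", layer][0, -1] for layer in range(model.cfg.n_layers)]
    )
```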
They train a linear probe on the activations of every attention head (post-attention, pre-W^O multiplication) to classify true vs. false example answers, and check how accurately a probe can be learned at each head. They then select the top 48 attention heads by probe accuracy.
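A sketch of this head-selection step under the same assumptions (variable names are mine): X holds the collected activations with shape (n_examples, n_layers, n_heads, d_head), and y marks each example answer as true (1) or false (0).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def select_top_heads(X: np.ndarray, y: np.ndarray, k: int = 48):
    """Fit one logistic-regression probe per head and return the k most accurate heads."""
    n_layers, n_heads = X.shape[1], X.shape[2]
    accs = np.zeros((n_layers, n_heads))
    for layer in range(n_layers):
        for head in range(n_heads):
            X_tr, X_va, y_tr, y_va = train_test_split(
                X[:, layer, head, :], y, test_size=0.2, random_state=0
            )
            probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            accs[layer, head] = probe.score(X_va, y_va)
    # Indices (layer, head) of the k heads whose probes separate T vs F best
    best = np.argsort(accs, axis=None)[::-1][:k]
    return [tuple(int(i) for i in np.unravel_index(b, accs.shape)) for b in best]
```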
For each of these heads they choose a “truthful direction”, computed as the difference of means between activations on true and false example answers. (Alternatively, the probe’s weight direction — orthogonal to its separating hyperplane — can be used, but difference of means performs better.)
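The difference-of-means direction for a single selected head is then just (sketch, my naming):

```python
import numpy as np

def truthful_direction(head_acts: np.ndarray, y: np.ndarray) -> np.ndarray:
    """head_acts: (n_examples, d_head) activations of one head; y: 1 = true, 0 = false."""
    direction = head_acts[y == 1].mean(axis=0) - head_acts[y == 0].mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit vector along mean(T) - mean(F)
```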
They then run the model on validation TruthfulQA prompts. For each of the chosen attention heads they add a bias in the truthful direction at every sequence position. The bias is large: 15x the standard deviation of activations along this direction.
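A sketch of the intervention itself, again assuming TransformerLens-style hooks rather than the paper's own implementation: `directions[(layer, head)]` are the unit vectors from the previous step and `sigmas[(layer, head)]` is the standard deviation of the training activations projected onto that direction; both names are mine.

```python
import torch
from functools import partial

ALPHA = 15.0  # intervention strength used in the paper

def shift_head_output(z, hook, *, layer, directions, sigmas):
    # z: (batch, seq, n_heads, d_head); nudge only the selected heads in this layer
    for (l, head), direction in directions.items():
        if l != layer:
            continue
        d = torch.as_tensor(direction, dtype=z.dtype, device=z.device)
        z[:, :, head, :] += ALPHA * sigmas[(l, head)] * d
    return z

def generate_with_intervention(model, prompt, directions, sigmas, **gen_kwargs):
    """Generate from `model` while biasing the chosen heads toward their truthful directions."""
    fwd_hooks = [
        (
            f"blocks.{layer}.attn.hook_z",
            partial(shift_head_output, layer=layer, directions=directions, sigmas=sigmas),
        )
        for layer in {l for (l, _) in directions}
    ]
    with model.hooks(fwd_hooks=fwd_hooks):
        return model.generate(prompt, **gen_kwargs)
```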
They find this significantly increases the TruthfulQA score. It does better than supervised finetuning, but worse than few-shot prompting. It combines reasonably well when stacked on top of few-shot prompting or instruction-finetuned models.
Note that in order to have a fair comparison they use 5% of their data for each method (~300 question-answer pairs). This is more than you would usually use for prompting, and less than you’d normally want for SFT.
One of the main takeaways is that this method is reasonably data-efficient and comparably effective to prompting (although it requires a paired dataset of good and bad demonstrations).