This paper improves the truthfulness of large language models by adjusting model activations during inference. Using linear probes, the authors identify attention heads whose activations strongly predict truthfulness on a validation dataset. During each forward pass at inference time, they shift the activations of those heads along the truthful directions identified by the probes.
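For concreteness, here is a minimal sketch of that probe-and-shift idea on synthetic data. The shapes, variable names (`acts`, `labels`), the top-K head selection, and the strength parameter `alpha` are illustrative assumptions, not the paper's actual code or hyperparameters.

```python
# Sketch: fit a linear probe per attention head, keep the most predictive heads,
# and add a scaled "truthful direction" to those heads' activations at inference.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical per-head activations on a validation set:
# (n_examples, n_layers, n_heads, head_dim) with binary truthfulness labels.
n_examples, n_layers, n_heads, head_dim = 1000, 4, 8, 64
acts = rng.normal(size=(n_examples, n_layers, n_heads, head_dim))
labels = rng.integers(0, 2, size=n_examples)

# Fit one linear probe per head and record its accuracy
# (a proper held-out split is omitted here for brevity).
probe_acc = np.zeros((n_layers, n_heads))
probe_dirs = np.zeros((n_layers, n_heads, head_dim))
for l in range(n_layers):
    for h in range(n_heads):
        X = acts[:, l, h, :]
        clf = LogisticRegression(max_iter=1000).fit(X, labels)
        probe_acc[l, h] = clf.score(X, labels)
        w = clf.coef_[0]
        probe_dirs[l, h] = w / np.linalg.norm(w)   # unit "truthful" direction

# Keep the top-K most predictive heads.
K = 4
flat = np.argsort(probe_acc.ravel())[::-1][:K]
top_heads = [divmod(int(i), n_heads) for i in flat]

alpha = 5.0  # intervention strength (assumed value)

def intervene(head_act, layer, head):
    """Shift one head's activation along its probe direction during a forward pass."""
    if (layer, head) in top_heads:
        proj = acts[:, layer, head, :] @ probe_dirs[layer, head]
        return head_act + alpha * proj.std() * probe_dirs[layer, head]
    return head_act
```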
While this paper did examine shifting along the probe direction, the authors found it to work substantially worse than shifting along the mean activation difference between activations preceding a truthful statement and those preceding an untruthful one (see Table 3).
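A rough sketch of that alternative direction, again with assumed data and a hypothetical helper name (`mass_mean_direction` is mine, not the paper's): the direction is just the normalized difference between the mean head activation on truthful examples and on untruthful ones, and the same additive shift is applied at inference.

```python
# Sketch: mean-difference direction for one head, in place of the probe weight vector.
import numpy as np

def mass_mean_direction(head_acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Unit vector from the mean untruthful activation to the mean truthful one."""
    mu_true = head_acts[labels == 1].mean(axis=0)
    mu_false = head_acts[labels == 0].mean(axis=0)
    d = mu_true - mu_false
    return d / np.linalg.norm(d)

# Usage with the synthetic `acts`/`labels` from the sketch above, e.g. layer 0, head 0:
# direction = mass_mean_direction(acts[:, 0, 0, :], labels)
# new_act = head_act + alpha * sigma * direction
#   (sigma = std of the activations projected onto `direction`, as before)
```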