Toronto AI safety meetup: Latent Knowledge and Contrast-Consistent Search
We’ll do a presentation and discussion based on the following paper:
https://arxiv.org/abs/2212.03827
Language models sometimes emit false information. There can be many reasons for this, including:
hallucinations
repeating falsehoods from the training data
RLHF biasing the model towards what human evaluators expect to see rather than what is factually accurate
In each of these cases, the model may effectively be trained to emit falsehoods while still holding the correct knowledge internally. Might it be possible to extract that knowledge from the model’s hidden states?
The paper introduces a technique, Contrast-Consistent Search, that begins to address this challenge. At the meetup we’ll try to wrap our heads around what’s going on, and have a broader discussion about falsehood and deception in large language models.
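
For a concrete flavour of the method, here is a minimal, illustrative sketch of the Contrast-Consistent Search objective as described in the paper: an unsupervised linear probe is fit on the hidden states of a statement phrased as true and as false, so that its two outputs are consistent (they sum to roughly one) and confident (neither sits at 0.5). The variable names, dimensions, and training loop below are our own assumptions rather than the authors’ code, and the paper’s normalization step is omitted for brevity.

import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    # Linear probe mapping a hidden state to a probability that the statement is true.
    def __init__(self, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.linear(h))

def ccs_loss(probe, h_pos, h_neg):
    # h_pos / h_neg: hidden states for a statement phrased as true vs. as false.
    p_pos = probe(h_pos)
    p_neg = probe(h_neg)
    consistency = (p_pos - (1 - p_neg)) ** 2       # the two outputs should negate each other
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage the degenerate p = 0.5 answer
    return (consistency + confidence).mean()

# Illustrative training loop with stand-in activations; real ones would come
# from a language model's hidden layer for contrast pairs like "X? Yes" / "X? No".
hidden_dim = 768
probe = CCSProbe(hidden_dim)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
h_pos = torch.randn(64, hidden_dim)
h_neg = torch.randn(64, hidden_dim)
for _ in range(100):
    optimizer.zero_grad()
    loss = ccs_loss(probe, h_pos, h_neg)
    loss.backward()
    optimizer.step()

# At inference time, the paper scores a statement as true by averaging
# p(x+) and 1 - p(x-) for its contrast pair.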