You mention prompting for calibration. I’ve been experimenting with prompting models to report probabilities over the answer set of a multiple-choice question, so that I can compute a Brier score. This is just vague speculation, but I wonder whether a training regime that rewards well-calibrated reported probabilities could push the model toward a clearer, more generalized representation of truth, one that would be easier to detect.
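Concretely, the scoring step looks something like this (a minimal sketch; the probability values and correct answer below are invented for illustration):

```python
import numpy as np

def brier_score(reported_probs, true_index):
    """Multi-class Brier score: mean squared error between the reported
    probability vector and the one-hot vector of the correct answer.
    Lower is better; perfect confidence in the right answer scores 0."""
    probs = np.asarray(reported_probs, dtype=float)
    target = np.zeros_like(probs)
    target[true_index] = 1.0
    return float(np.mean((probs - target) ** 2))

# Hypothetical example: the model reports probabilities for options A-D,
# and the correct answer is option B (index 1).
print(brier_score([0.1, 0.7, 0.15, 0.05], true_index=1))  # 0.03125
```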
That would certainly be an interesting experiment. A related one I’d like to try is the same idea, but experimenting with the prompt format instead of fine-tuning: if you ask a model to be calibrated in its output, and perhaps give some few-shot examples, does that improve the truth probes?
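A minimal sketch of what such a prompt format might look like (the instruction wording and the few-shot example are invented here, not taken from any actual experiment):

```python
# Hypothetical prompt template asking the model to report calibrated
# probabilities, with one invented few-shot example.
CALIBRATED_PROMPT = """\
For each question, report a probability for every option. Your
probabilities should be well calibrated: of the claims you assign
70% probability to, roughly 70% should be true.

Q: The Great Wall of China is visible from low Earth orbit.
Options: (A) True (B) False
Probabilities: A: 0.15, B: 0.85

Q: {question}
Options: {options}
Probabilities:"""
```

You could then compare truth-probe accuracy on activations gathered with and without this preamble.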
I’m now curious what would happen if you did an ensemble probe. Ensembles of different techniques for measuring the same thing tend to work better than any individual technique. What if you trained some sort of decision model (e.g. XGBoost) on the outputs of the probes? I bet it’d do better than any probe alone.
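Something like this stacking setup (a sketch with placeholder data; `probe_scores` would really come from running each probe over a labeled statement set):

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
probe_scores = rng.normal(size=(1000, 5))  # placeholder: one column per probe
labels = rng.integers(0, 2, size=1000)     # placeholder true/false labels

X_train, X_test, y_train, y_test = train_test_split(
    probe_scores, labels, test_size=0.2, random_state=0
)

# The ensemble learns how to weigh (and combine nonlinearly)
# the individual probes' judgments.
ensemble = XGBClassifier(n_estimators=100, max_depth=3)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```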
Yes! An obvious thing to try is a two-layer MLP probe, which should allow some kind of decision process while keeping the solution relatively interpretable. More generally, I’m excited about using RepEng to craft slightly more complex but still interpretable approaches to model interp / control.
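A minimal sketch of such a probe in PyTorch (the hidden width, model dimension, and training data below are placeholders; in practice you’d train on activations extracted from the model at a chosen layer):

```python
import torch
import torch.nn as nn

class MLPProbe(nn.Module):
    """Two-layer MLP probe: one hidden layer gives it a simple
    'decision process' while staying small enough to inspect."""
    def __init__(self, d_model: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # logit for "statement is true"
        )

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        return self.net(activations).squeeze(-1)

# Training-loop sketch with placeholder data.
d_model = 4096                                 # assumed residual-stream width
probe = MLPProbe(d_model)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

acts = torch.randn(256, d_model)               # placeholder activations
labels = torch.randint(0, 2, (256,)).float()   # placeholder truth labels

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(acts), labels)
    loss.backward()
    opt.step()
```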