mishajw
Sabotage Evaluations for Frontier Models
You mention prompting for calibration. I’ve been experimenting with prompting models to give their probabilities for each answer on a multiple-choice question, so that I can calculate a Brier score. This is just vague speculation, but I wonder if there’s a training regime whose data pushes the model to be well calibrated in its reported probabilities, which could lead to the model having a clearer, more generalized representation of truth that would be easier to detect.
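For concreteness, here’s a minimal sketch of how the elicited probabilities could be scored. The elicitation prompt itself is omitted, and the option names and numbers are purely illustrative.

```python
from typing import Dict

def brier_score(reported: Dict[str, float], correct: str) -> float:
    """Mean squared difference between the reported probabilities and the
    one-hot vector for the correct answer (0 is perfect, lower is better)."""
    return sum(
        (p - (1.0 if option == correct else 0.0)) ** 2
        for option, p in reported.items()
    ) / len(reported)

# e.g. the model reports these probabilities for a four-option question:
reported = {"A": 0.1, "B": 0.7, "C": 0.1, "D": 0.1}
print(brier_score(reported, correct="B"))  # 0.03
```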
That would certainly be an interesting experiment. A related experiment I’d like to try is to do this, but by changing the prompt format rather than fine-tuning. For example, if you ask a model to be calibrated in its output, and perhaps give some few-shot examples, does this improve the truth probes?
I’m now curious what would happen if you did an ensemble probe. Ensembles of different techniques for measuring the same thing tend to work better than individual techniques. What if you train some sort of decision model (e.g. XGBoost) on the outputs of the probes? I bet it’d do better than any probe alone.
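A rough sketch of the ensemble idea: treat each probe’s score as a feature and fit a small decision model on top. GradientBoostingClassifier stands in for XGBoost here, and probe_scores is an assumed (n_statements, n_probes) array of per-probe truth scores, not anything from the report.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def fit_ensemble(probe_scores: np.ndarray, labels: np.ndarray):
    """probe_scores: (n_statements, n_probes) array of per-probe truth scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        probe_scores, labels, test_size=0.2, random_state=0
    )
    ensemble = GradientBoostingClassifier().fit(X_train, y_train)
    print("ensemble accuracy:", ensemble.score(X_test, y_test))
    # Baseline: each probe on its own, thresholded at its training median.
    for i in range(X_test.shape[1]):
        preds = X_test[:, i] > np.median(X_train[:, i])
        print(f"probe {i} accuracy:", (preds == y_test).mean())
    return ensemble
```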
Yes! An obvious thing to try is a two-layer MLP probe, which should allow some kind of decision process while keeping the solution relatively interpretable. More generally, I’m excited about using RepEng to craft slightly more complex but still interpretable approaches to model interp / control.
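Something like the following, where the shapes and hyperparameters are illustrative rather than a concrete proposal:

```python
import torch
import torch.nn as nn

class MLPProbe(nn.Module):
    """Two-layer MLP probe on residual-stream activations."""
    def __init__(self, d_model: int, d_hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),  # single logit: P(statement is true)
        )

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        return self.net(activations).squeeze(-1)

probe = MLPProbe(d_model=8192)  # e.g. LLaMA-2-70B hidden size
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
# Training loop over (activations, truth_labels) batches omitted.
```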
Cool to see the generalisation results for Llama-2 7/13/70B! I originally ran some of these experiments on 7B and got very different results; that PCA plot of 7B looks familiar (and bizarre). Excited to read the paper in its entirety. The first GoT paper was very good.
One approach here is to use a dataset in which the truth and likelihood of inputs are uncorrelated (or negatively correlated), as you kinda did with TruthfulQA. For that, I like to use the “neg_” versions of the datasets from GoT, containing negated statements like “The city of Beijing is not in China.” For these datasets, the correlation between truth value and likelihood (operationalized as LLaMA-2-70B’s log probability of the full statement) is strong and negative (-0.63 for neg_cities and -0.89 for neg_sp_en_trans). But truth probes still often generalize well to these negated datasets. Here are results for LLaMA-2-70B (the horizontal axis shows the train set, and the vertical axis shows the test set).
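For concreteness, here’s roughly how that truth/likelihood correlation could be computed; the model loading and log-prob helper below are an illustrative sketch, not the exact code behind the numbers above.

```python
import torch
from scipy.stats import pearsonr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def statement_log_prob(text: str) -> float:
    """Total log probability of the statement under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    logits = model(ids).logits[:, :-1]          # prediction for each next token
    log_probs = torch.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:, None]).squeeze(-1)
    return token_lp.sum().item()

def truth_likelihood_correlation(statements):
    """statements: list of (text, truth_label) pairs, e.g. from neg_cities."""
    lps = [statement_log_prob(text) for text, _ in statements]
    labels = [float(label) for _, label in statements]
    return pearsonr(labels, lps)[0]
```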
This is an interesting approach! I suppose there are two things we want to separate: “truth” from likely statements, and “truth” from what humans think (under some kind of simulacra framing). I think this approach would allow you to do the former, but not the latter. And to be honest, I’m not confident in TruthfulQA’s ability to do the latter either.
P.S. I realised an important note got removed while editing this post—added back, but FYI:
We differ slightly from the original GoT paper in naming, and use got_cities to refer to both the cities and neg_cities datasets. The same is true for sp_en_trans and larger_than. We don’t do this for cities_cities_{conj,disj} and leave them unpaired.
That’s right, thanks for pointing that out! Added a footnote:
For unsupervised methods, we do technically use the labels in two places. First, we select the sign of the probe based on the labels. Second, for some datasets we only want one true and one false answer per question, while there may be many; we use the labels to limit it to one of each.
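To make the footnote concrete, here’s an illustrative sketch of those two uses of labels; the median thresholding and the deduplication logic are assumptions rather than the exact implementation.

```python
import numpy as np

def orient_probe(direction: np.ndarray, acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Use the labels only to pick the probe's sign: flip it if higher
    scores currently correspond to 'false' statements."""
    scores = acts @ direction
    acc = ((scores > np.median(scores)) == labels).mean()
    return direction if acc >= 0.5 else -direction

def one_true_one_false(rows):
    """Keep a single true and a single false answer per question,
    using the labels to pick one of each."""
    kept, seen = [], set()
    for question_id, text, label in rows:
        if (question_id, label) not in seen:
            seen.add((question_id, label))
            kept.append((question_id, text, label))
    return kept
```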
How well do truth probes generalise?
Jailbreaking GPT-4 with the tool API
One perspective is that representation engineering allows us to do “single-bit edits” to the network’s behaviour. Pre-training changes a lot of bits; fine-tuning changes slightly less; LoRA even less; adding a single vector to a residual stream should flip a single flag in the program implemented by the network.
(This of course is predicated on us being able to create monosemantic directions, and predicated on monosemanticity being a good way to think about this at all.)
This is beneficial from a safety point of view, as instead of saying “we trained the model, it learnt some unknown circuit that fits the data” we can say “no new circuits were learnt, we just flipped a flag and this fits the data”.
In the world where this works, and works well enough to replace RLHF (or some other major part of the training process), we should end up with more controlled network edits.
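To make the “single vector” picture concrete, here’s a minimal sketch of adding a fixed steering vector to the residual stream via a forward hook; the layer path, vector, and scale are placeholders, not a specific recipe.

```python
import torch

def add_steering_hook(layer_module, steering_vector: torch.Tensor, scale: float = 1.0):
    """Register a forward hook that adds a fixed vector to the layer's output
    (the residual stream) on every forward pass."""
    def hook(module, inputs, output):
        # Many decoder layers return a tuple whose first element is the hidden
        # states; adjust for the architecture you're using.
        if isinstance(output, tuple):
            return (output[0] + scale * steering_vector,) + output[1:]
        return output + scale * steering_vector
    return layer_module.register_forward_hook(hook)

# Usage (paths are illustrative):
#   handle = add_steering_hook(model.model.layers[20], vec)
#   ... generate ...
#   handle.remove()  # the edit is fully reversible, unlike fine-tuning
```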
Distilled Representations Research Agenda
I like this idea; it matches quite closely how I naturally work. I had some spare time this weekend, so I made a quick prototype site: https://rationalbreaks.vercel.app
(Apologies, been on holiday.)
For recovered accuracy, we select a single threshold per dataset, taking the best value across all probe algorithms and datasets. So a random probe would be compared to the best probe algorithm on that dataset, and likely perform poorly.
I did check the thresholds used for recovered accuracy, and they seemed sensible, but I didn’t include this in the report. I’ll try to find time next week to add it to the appendix.