What’s the relationship between this method and representation engineering? They seem quite similar, though maybe I’m missing something. You train a linear probe on a model’s activations at a particular layer in order to distinguish between normal forward passes and catastrophic ones where the model provides advice for theft.
Representation engineering asks models to generate both positive and negative examples of a particular kind of behavior. For example, the model would generate outputs with and without theft, or with and without general power-seeking. You’d collect the model’s activations from a particular layer during those forward passes, and then construct a linear model to distinguish between positives and negatives.
Both methods construct a linear probe to distinguish between positive and negative examples of catastrophic behavior. One difference is that your negatives are generic instruction-following examples from the Alpaca dataset, while RepE uses negatives generated by the model. There may also be differences in whether you examine activations at every token vs. only at the last token of the generation.
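Concretely, I’m picturing something like this minimal sketch of supervised probing on last-token activations (the model name, layer index, and example texts are placeholders I made up, not anyone’s actual setup):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small stand-in; swap in the model actually being probed
LAYER = 6            # hypothetical middle layer; would be chosen per model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> np.ndarray:
    """Residual-stream activation at the last token of `text`, taken at LAYER."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of [batch, seq, hidden] tensors (embeddings + each layer)
    return out.hidden_states[LAYER][0, -1].float().numpy()

# Placeholder data: catastrophic completions vs. generic instruction-following ones.
catastrophic_texts = ["<prompt + completion where the model gives theft advice>"]
benign_texts = ["<generic Alpaca-style instruction-following completion>"]

X = np.stack([last_token_activation(t) for t in catastrophic_texts + benign_texts])
y = np.array([1] * len(catastrophic_texts) + [0] * len(benign_texts))

probe = LogisticRegression(max_iter=1000).fit(X, y)  # the linear probe / monitor
```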
Probes fall within the representation engineering monitoring framework.
LAT (the specific technique they use to train probes in the RepE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly fancier loss. It might work better in practice, but only because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying, especially for concepts like “takeover” and “theft”, for which I don’t think there are natural (positive, negative) pairs that would allow narrowing down the probe direction with fewer than 5 hand-crafted pairs.
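For concreteness, here is one way to read “pairs plus a slightly fancier loss”; the margin-style objective and names below are illustrative guesses, not the RepE paper’s exact formulation:

```python
import torch

def train_pairwise_probe(pos_acts: torch.Tensor,
                         neg_acts: torch.Tensor,
                         steps: int = 200,
                         lr: float = 1e-2) -> torch.Tensor:
    """pos_acts / neg_acts: [num_pairs, hidden] activations from paired prompts."""
    hidden = pos_acts.shape[1]
    direction = (0.01 * torch.randn(hidden)).requires_grad_()
    opt = torch.optim.Adam([direction], lr=lr)
    for _ in range(steps):
        # Push each positive example's score above its paired negative's score.
        margin = pos_acts @ direction - neg_acts @ direction
        loss = torch.nn.functional.softplus(-margin).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (direction / direction.norm()).detach()
```

You’d then monitor by projecting new activations onto the returned direction and flagging generations whose score exceeds some threshold.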
I think it’s worth speaking directly about probes rather than representation engineering, as I find it clearer (even if the probes were trained with LAT).
Yeah, I think it’s fair to say probes are a technique under rep reading, which is under RepE (https://www.ai-transparency.org/). Though I did want to mention that in many settings, LAT performs unsupervised learning with PCA and does not use any labels. And we find that regular linear probing often does not generalize well and is ineffective for (causal) model control (see details in section 5). So equating LAT with regular probing might be an oversimplification. Eliciting the desired neural activity patterns requires careful design of 1) the experimental task and 2) the locations where the neural activity is read, both of which contribute to the success of LAT over regular probing (section 3.1.1).
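To make the distinction concrete, a rough sketch of the PCA-style reading is below; the pairing, centering, and choice of activations are simplified stand-ins rather than the paper’s exact setup:

```python
import numpy as np

def pca_reading_vector(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """acts_a / acts_b: [n, hidden] activations from two contrasting stimulus sets, paired by row."""
    diffs = acts_a - acts_b             # pairwise differences; no labels used
    diffs = diffs - diffs.mean(axis=0)  # center before PCA
    # Top right-singular vector = first principal component of the differences.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]  # note: the sign is ambiguous without labels and must be resolved separately

# Monitoring then projects new activations onto this direction:
# scores = new_acts @ pca_reading_vector(acts_a, acts_b)
```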
In general, we have shown some promise of monitoring high-level representations for harmful (or catastrophic) intents/behaviors. It’s exciting to see follow-ups in this direction which demonstrate more fine-grained monitoring/control.
Nice, that makes sense. I agree that RepE / LAT might not be helpful as terminology. “Unsupervised probing” is more straightforward and descriptive.
(Note that in this work, we’re just doing supervised probing, though we do use models to generate some of the training data.)