What’s the relationship between this method and representation engineering? They seem quite similar, though maybe I’m missing something. You train a linear probe on a model’s activations at a particular layer in order to distinguish between normal forward passes and catastrophic ones where the model provides advice for theft.
Representation engineering asks models to generate both positive and negative examples of a particular kind of behavior. For example, the model would generate outputs with and without theft, or with and without general power-seeking. You’d collect the model’s activations from a particular layer during those forward passes, and then construct a linear model to distinguish between positives and negatives.
Both methods construct a linear probe to distinguish between positive and negative examples of catastrophic behavior. One difference is that your negatives are generic instruction-following examples from the Alpaca dataset, while RepE uses negatives generated by the model. There may also be differences in whether you examine activations at every token vs. only at the last token of the generation.
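Concretely, I’m picturing something like this minimal sketch of supervised probing on last-token activations (the model name, layer index, and example texts are placeholders I made up, not anyone’s actual setup):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small stand-in; swap in the model actually being probed
LAYER = 6            # hypothetical middle layer; would be chosen per model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> np.ndarray:
    """Residual-stream activation at the last token of `text`, taken at LAYER."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of [batch, seq, hidden] tensors (embeddings + each layer)
    return out.hidden_states[LAYER][0, -1].float().numpy()

# Placeholder data: catastrophic completions vs. generic instruction-following ones.
catastrophic_texts = ["<prompt + completion where the model gives theft advice>"]
benign_texts = ["<generic Alpaca-style instruction-following completion>"]

X = np.stack([last_token_activation(t) for t in catastrophic_texts + benign_texts])
y = np.array([1] * len(catastrophic_texts) + [0] * len(benign_texts))

probe = LogisticRegression(max_iter=1000).fit(X, y)  # the linear probe / monitor
```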
Probes fall within the representation engineering monitoring framework.
LAT (the specific technique they use to train probes in the RepE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly fancier loss. It might work better in practice, but only because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying, especially for concepts like “takeover” and “theft”, for which I don’t think there are natural (positive, negative) pairs that would allow narrowing down the probe direction with fewer than 5 hand-crafted pairs.
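For concreteness, here is one way to read “pairs plus a slightly fancier loss”; the margin-style objective and names below are illustrative guesses, not the RepE paper’s exact formulation:

```python
import torch

def train_pairwise_probe(pos_acts: torch.Tensor,
                         neg_acts: torch.Tensor,
                         steps: int = 200,
                         lr: float = 1e-2) -> torch.Tensor:
    """pos_acts / neg_acts: [num_pairs, hidden] activations from paired prompts."""
    hidden = pos_acts.shape[1]
    direction = (0.01 * torch.randn(hidden)).requires_grad_()
    opt = torch.optim.Adam([direction], lr=lr)
    for _ in range(steps):
        # Push each positive example's score above its paired negative's score.
        margin = pos_acts @ direction - neg_acts @ direction
        loss = torch.nn.functional.softplus(-margin).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (direction / direction.norm()).detach()
```

You’d then monitor by projecting new activations onto the returned direction and flagging generations whose score exceeds some threshold.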
I think it’s worth speaking directly about probes rather than representation engineering, as I find it clearer (even if the probes were trained with LAT).
Yeah, I think it’s fair to say probes are a technique under rep reading, which is under RepE (https://www.ai-transparency.org/). Though I did want to mention that in many settings, LAT performs unsupervised learning with PCA and does not use any labels. And we find that regular linear probing often does not generalize well and is ineffective for (causal) model control (see details in section 5). So equating LAT with regular probing might be an oversimplification. Eliciting the desired neural activity patterns requires careful design of 1) the experimental task and 2) the locations where the neural activity is read, both of which contribute to the success of LAT over regular probing (section 3.1.1).
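To make the distinction concrete, a rough sketch of the PCA-style reading is below; the pairing, centering, and choice of activations are simplified stand-ins rather than the paper’s exact setup:

```python
import numpy as np

def pca_reading_vector(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """acts_a / acts_b: [n, hidden] activations from two contrasting stimulus sets, paired by row."""
    diffs = acts_a - acts_b             # pairwise differences; no labels used
    diffs = diffs - diffs.mean(axis=0)  # center before PCA
    # Top right-singular vector = first principal component of the differences.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]  # note: the sign is ambiguous without labels and must be resolved separately

# Monitoring then projects new activations onto this direction:
# scores = new_acts @ pca_reading_vector(acts_a, acts_b)
```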
In general, we have shown some promise of monitoring high-level representations for harmful (or catastrophic) intents/behaviors. It’s exciting to see follow-ups in this direction which demonstrate more fine-grained monitoring/control.
Nice, that makes sense. I agree that RepE / LAT might not be helpful as terminology. “Unsupervised probing” is more straightforward and descriptive.
(Note that in this work, we’re just doing supervised probing, though we do use models to generate some of the training data.)