Probes fall within the representation engineering monitoring framework.
LAT (the specific technique the RePE paper uses to train probes) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly fancier loss. It might work better in practice, but only because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying, especially for concepts like “takeover” and “theft”, for which I don’t think there are natural (positive, negative) pairs that would let you narrow down the probe direction with fewer than 5 hand-crafted pairs.
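To be concrete about what I mean by “regular probe training with a fancier loss”, here is a rough sketch (all names are made up, activations are assumed pre-extracted, and the hinge-style pair loss is just one illustrative pairwise objective, not the exact RePE loss):

```python
# Minimal sketch: "LAT-style" training as ordinary linear probing,
# except fed (positive, negative) activation pairs with a pairwise loss.
import torch

def fit_pair_probe(pos_acts, neg_acts, epochs=200, lr=1e-2):
    """Fit a direction w so that w . (pos - neg) > 0 for each pair.

    pos_acts, neg_acts: (n_pairs, d_model) activations from paired prompts.
    """
    d = pos_acts.shape[1]
    w = torch.zeros(d, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # Margin loss on pair differences: the "slightly fancier loss"
        # relative to per-example logistic regression.
        margins = (pos_acts - neg_acts) @ w
        loss = torch.relu(1.0 - margins).mean()
        loss.backward()
        opt.step()
    return w.detach() / w.detach().norm()
```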
I think it’s worth speaking directly about probes rather than representation engineering, as I find it clearer (even if the probes were trained with LAT).
Yeah, I think it’s fair to say probes are a technique under rep reading, which is under RepE (https://www.ai-transparency.org/). Though I did want to mention that in many settings, LAT performs unsupervised learning with PCA and does not use any labels. And we find that regular linear probing often does not generalize well and is ineffective for (causal) model control (details in section 5). So equating LAT with regular probing might be an oversimplification. Eliciting the desired neural activity patterns requires careful design of 1) the experimental task and 2) the locations at which to read the neural activity, both of which contribute to the success of LAT over regular probing (section 3.1.1).
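To illustrate the unsupervised version: roughly, you collect activations for paired contrastive stimuli, take differences, and read off the top principal component, with no labels involved. A minimal sketch (variable names are illustrative, and this omits the task and reading-location design choices that matter in practice):

```python
# Rough sketch of label-free reading-vector extraction: PCA over
# differences of activations from contrastive stimulus pairs.
import numpy as np

def lat_reading_vector(pos_acts, neg_acts):
    """pos_acts, neg_acts: (n_pairs, d_model) activations, read at a
    chosen layer/token position."""
    diffs = pos_acts - neg_acts          # contrast the paired stimuli
    diffs = diffs - diffs.mean(axis=0)   # center before PCA
    # Top principal component of the differences = candidate direction.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]                         # unit-norm reading vector
```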
In general, we have shown that monitoring high-level representations for harmful (or catastrophic) intents/behaviors holds some promise. It’s exciting to see follow-ups in this direction that demonstrate more fine-grained monitoring/control.
Nice, that makes sense. I agree that RepE / LAT might not be helpful as terminology. “Unsupervised probing” is more straightforward and descriptive.
(Note that in this work, we’re just doing supervised probing, though we do use models to generate some of the training data.)
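For concreteness, a toy sketch of what supervised probing means here: a plain logistic-regression probe on labeled activations. Everything below (shapes, data, names) is illustrative, not our actual pipeline:

```python
# Toy supervised probe: logistic regression on labeled activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_supervised_probe(acts, labels):
    """acts: (n, d_model) activations; labels: (n,) 0/1 harmfulness labels."""
    return LogisticRegression(max_iter=1000).fit(acts, labels)

# Usage with random stand-in data, just to show the shapes.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))
labels = (acts[:, 0] > 0).astype(int)   # stand-in labels
probe = fit_supervised_probe(acts, labels)
scores = probe.predict_proba(acts)[:, 1]  # per-example harm score
```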