aogara comments on Coup probes: Catching catastrophes with probes trained off-policy

aogara 19 Nov 2023 0:21 UTC
LW: 2 AF: 1
This is a comment from Andy Zou, who led the RepE paper but doesn’t have a LW account:

“Yea I think it’s fair to say probes is a technique under rep reading which is under RepE (https://www.ai-transparency.org/). Though I did want to mention, in many settings, LAT is performing unsupervised learning with PCA and does not use any labels. And we find regular linear probing often does not generalize well and is ineffective for (causal) model control (e.g., details in section 5). So equating LAT to regular probing might be an oversimplification. How to best elicit the desired neural activity patterns requires careful design of 1) the experimental task and 2) locations to read the neural activity, which contribute to the success of LAT over regular probing (section 3.1.1).

In general, we have shown some promise of monitoring high-level representations for harmful (or catastrophic) intents/behaviors. It’s exciting to see follow-ups in this direction which demonstrate more fine-grained monitoring/control.”
- Fabien Roger 19 Nov 2023 15:54 UTC
  LW: 5 AF: 2
  I agree that not using labels is interesting from a data generation perspective, but I expect this to be useful mostly if you have clean pairs of concepts for which it is hard to get labels, and I think this will not be the case for takeover-attempt datasets.
  About the performance of LAT: for monitoring, we mostly care about correlation rather than causation, so what matters is that LAT is worse IID, and it’s unclear whether LAT is better OOD. If causality does lead to better generalization properties, then LAT is dominated by mean difference probing (see the screenshot of Zou’s paper below), which is just regular probing with high enough L2 regularization (as shown in the first Appendix of this post).
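
The claim that mean difference probing is regular probing with high enough L2 regularization can be checked numerically on toy data. The sketch below is only an illustration under stated assumptions: the Gaussian "activations", the function names, and the plain gradient-descent fit are all hypothetical stand-ins, not the setup from the post or from the RepE paper. It fits a logistic-regression probe with penalty `(l2 / 2) * ||w||^2` and compares its direction with the difference of class means.

```python
import numpy as np


def make_toy_activations(n_per_class=400, dim=8, seed=0):
    """Hypothetical stand-in for activations: anisotropic Gaussian noise plus a class shift."""
    rng = np.random.default_rng(seed)
    scales = np.linspace(0.5, 3.0, dim)       # per-feature noise scale (assumed)
    y = np.repeat([0.0, 1.0], n_per_class)    # balanced binary labels
    X = rng.normal(size=(2 * n_per_class, dim)) * scales + y[:, None] * np.ones(dim)
    return X, y


def mean_difference_direction(X, y):
    """Unit vector pointing from the class-0 mean to the class-1 mean."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return d / np.linalg.norm(d)


def l2_logistic_direction(X, y, l2, steps=2000):
    """Unit-normalized weights of a logistic probe with penalty (l2 / 2) * ||w||^2.

    Fit by plain gradient descent; the bias term is left unpenalized.
    """
    n, dim = X.shape
    w = np.zeros(dim)
    b = 0.0
    # Step size from an upper bound on the curvature of the objective.
    lr = 1.0 / (l2 + 0.25 * (np.mean(np.sum(X ** 2, axis=1)) + 1.0))
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        grad_w = X.T @ (p - y) / n + l2 * w      # mean log-loss gradient + L2 term
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w / np.linalg.norm(w)


X, y = make_toy_activations()
mean_diff = mean_difference_direction(X, y)

# Cosine similarity between each probe direction and the mean-difference direction.
cos_high_l2 = float(mean_diff @ l2_logistic_direction(X, y, l2=100.0))
cos_low_l2 = float(mean_diff @ l2_logistic_direction(X, y, l2=1e-3))

print(f"cosine with mean difference, strong L2: {cos_high_l2:.4f}")
print(f"cosine with mean difference, weak L2:   {cos_low_l2:.4f}")
```

The intuition is that with a strong penalty the weights stay close to zero, where the gradient of the logistic loss is proportional to the difference of class means, so the regularized probe points almost exactly along that difference. With a weak penalty the probe also accounts for the feature covariance and can point elsewhere.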