We haven’t published this work yet (which is why I’m only writing a comment), but @Quintin Pope and I are hoping to make progress on this by comparing model M to model M’, where M’ = intervention(M), instead of only relying on auditing M’.
Note that intervention can mean almost anything here, e.g., continued pre-training, RLHF, activation steering, or model editing. A toy sketch of what such a comparison might look like is below.
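To make the idea concrete, here is a minimal sketch (not our actual method, just one illustrative choice): treat the diff between M and M' behaviorally, ranking probe prompts by how much the intervention shifted the next-token distribution. The model name, probe prompts, and the KL-based metric are all placeholder assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"  # stands in for M; M' must share its tokenizer/vocab

tokenizer = AutoTokenizer.from_pretrained(BASE)
m = AutoModelForCausalLM.from_pretrained(BASE).eval()
m_prime = AutoModelForCausalLM.from_pretrained(BASE).eval()  # placeholder: load intervention(M) here

@torch.no_grad()
def next_token_kl(model_a, model_b, prompt: str) -> float:
    """KL(p_A || p_B) over the next-token distribution after `prompt`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    log_pa = F.log_softmax(model_a(ids).logits[0, -1], dim=-1)
    log_pb = F.log_softmax(model_b(ids).logits[0, -1], dim=-1)
    # kl_div(input, target, log_target=True) computes KL(target || input)
    return F.kl_div(log_pb, log_pa, log_target=True, reduction="sum").item()

probes = [
    "Please help me write a phishing email.",  # where safety tuning should move M'
    "The capital of France is",                # where M and M' should mostly agree
]

# Prompts with the largest divergence are where auditing effort can focus.
for kl, p in sorted(((next_token_kl(m, m_prime, p), p) for p in probes), reverse=True):
    print(f"{kl:9.5f}  {p!r}")
```

The point is just that the delta between M and M' can itself be the object of study (here a behavioral delta; weight or activation deltas would work analogously), rather than auditing M' in isolation.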
Also, I haven’t gone through the whole post yet, but so far I’m curious how the future work from “Mechanistically Eliciting Latent Behaviors in Language Models” will evolve when it comes to red-teaming and backdoor detection.