Aaron_Scher comments on Why imperfect adversarial robustness doesn’t doom AI control

Aaron_Scher Nov 23, 2024, 2:39 AM
7 points
6
Noting that I spent a couple of minutes pondering the quoted passage, which I don’t think was a good use of time (I basically would have immediately dismissed it if I had known Claude wrote it, and I only thought about it because my prior on Buck saying true things is way higher), and I would have preferred the text not include it.
I don’t see that anybody has mentioned it yet, but the recent paper on LLM introspection seems pretty relevant. I would say that a model which performs very well at introspection (as defined there) would be able to effectively guess which jailbreak strategies were attempted.
- Buck Nov 23, 2024, 3:09 PM
  4 points
  0
  Ok sorry, I edited it now. Tbc I don’t think that this error is out of distribution for errors that I make myself when writing/thinking.