Separately for: “But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems.”
I expect that these numbers weren’t against monitoring ensembles in the sense I described earlier and the red team had additional affordances beyond just understanding the high level description of the monitoring setup? E.g., the red team was able to iterate?
This is correct about the paper I cited, but others have achieved similar attack success rates against models like Claude which use an ensemble of defenses. AFAIK Claude does not ban users who attempt misuse, so that element of your plan has never been tested and would likely help a lot.
This is correct about the paper I cited, but others have achieved similar attack success rates against models like Claude which use an ensemble of defenses. AFAIK Claude does not ban users who attempt misuse, so that element of your plan has never been tested and would likely help a lot.