Rohin Shah comments on Alignment Newsletter #24

Rohin Shah Sep 17, 2018, 9:58 PM
6 points
The organizers also have a held-out evaluation set on which the defenses must get an accuracy of at least 80%. It’s possible that by submitting a large number of defenses, you could get enough information about the held-out evaluation set to classify the eligibility and evaluation sets correctly while classifying everything else as ambiguous. There’s a sentence in there somewhere that says the organizers reserve the right to evaluate whether or not this is happening and reject defenses on those grounds. (EDIT: The comment below identified the exact line I was referring to.)
You can imagine that defenses might try to notice when something adversarial has been done (e.g., perhaps you notice that an input is close to a decision boundary) and label those inputs as ambiguous. That’s a perfectly fair solution, and I think both the organizers and I would be excited to see a defense that won using such a method.
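
To make the "label it as ambiguous when you're near a decision boundary" idea concrete, here is a minimal sketch. It is only an illustration under my own assumptions, not anything from the challenge's code: the function names, the `AMBIGUOUS` sentinel, and the `margin_threshold` value are all hypothetical, and the gap between the top two softmax probabilities is just one crude proxy for distance to the boundary.

```python
import math

# Hypothetical sentinel meaning "the defense abstains on this input".
AMBIGUOUS = -1


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_abstain(logits, margin_threshold=0.2):
    """Return the predicted class index, or AMBIGUOUS when the gap
    between the two most likely classes is below margin_threshold.

    A small gap is treated here as "close to a decision boundary".
    The threshold of 0.2 is an arbitrary illustrative choice.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    margin = probs[top] - probs[runner_up]
    return top if margin >= margin_threshold else AMBIGUOUS


# A confident input gets a label; a borderline one is marked ambiguous.
confident = predict_or_abstain([4.0, 0.0])
borderline = predict_or_abstain([0.1, 0.0])
```

A defense like this still has to stay above the 80% accuracy requirement on the held-out set, so the threshold can't be set so high that it abstains on clean inputs. Also, adversarial inputs can be classified with high confidence, so a real defense would probably need a better signal than the softmax margin.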