Ofer comments on Automating Auditing: An ambitious concrete technical research proposal

Ofer 12 Aug 2021 17:15 UTC

In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.

What does such a scenario (in which "automating auditing fails") look like? The alignment researchers who work on this will always be able to say: "Our current ML models are just not capable enough to implement such an auditor. But if we use 10x the training compute, or a better architecture, etc., we may succeed."
- evhub 12 Aug 2021 20:29 UTC
  Sure, but presumably they’ll also say what particular attacks are so hard that current ML models aren’t capable of solving them—and I think that’s a valuable piece of information to have.