In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover. That is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on and a counterexample/impossibility result against the general possibility of making current systems safely auditable.
What does such a scenario (in which “automating auditing fails”) look like? The alignment researchers who work on this will always be able to say: “Our current ML models are just not capable enough to implement such an auditor. But if we use 10x the training compute, a better architecture, etc., we may succeed.”
Sure, but presumably they’ll also say which particular attacks are so hard that current ML models aren’t capable of discovering them, and I think that’s a valuable piece of information to have.