A core worry with inner alignment is that we cannot determine whether a system is deceptive just by inspecting its behavior, since it may simply be behaving well for now while it waits for a more opportune moment to deceive us. For interpretability to help with this issue, we need _worst-case_ interpretability that surfaces all of the problems in a model. When we hear “worst-case”, we should be thinking of adversaries.
This post considers the _auditing game_, in which an attacker introduces a vulnerability into the model that violates some known specification, and the auditor must find and describe the vulnerability given only the modified model (i.e. the auditor does not get to see the original model, or what the attacker did). The attacker aims to produce the largest vulnerability that they can get away with, and the auditor aims to describe the vulnerability as completely as possible. Note that both the attacker and the auditor can be humans (potentially assisted by AI tools). This game forms a good benchmark for worst-case interpretability work.
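To make the roles concrete, here is a minimal toy sketch of the game in Python. Everything in it is an assumption made for illustration and does not come from the post: the "model" is a plain function over a tiny integer domain, the attack flips outputs on a set of trigger inputs, and the score is the Jaccard overlap between the inputs the auditor flags and the inputs the attacker actually changed. The auditor here simply brute-forces the domain against the known specification, which is only possible because the domain is tiny; in the real setting the auditor would need interpretability tools instead.

```python
from typing import Callable, Set

Model = Callable[[int], bool]

# Toy input space (an assumption for illustration only).
DOMAIN = range(100)


def spec(x: int) -> bool:
    """Known specification: the model should report whether x is even."""
    return x % 2 == 0


def clean_model(x: int) -> bool:
    """Original model; the auditor never gets to see this."""
    return x % 2 == 0


def attack(model: Model, triggers: Set[int]) -> Model:
    """Attacker: return a model whose answer is flipped on the trigger inputs."""
    def modified(x: int) -> bool:
        return (not model(x)) if x in triggers else model(x)
    return modified


def audit(model: Model) -> Set[int]:
    """Auditor: sees only the modified model and the spec.

    Brute force stands in for interpretability here; it only works
    because the toy domain is small enough to enumerate.
    """
    return {x for x in DOMAIN if model(x) != spec(x)}


def audit_score(claimed: Set[int], actual: Set[int]) -> float:
    """Jaccard overlap between the described and the actual vulnerability."""
    if not claimed and not actual:
        return 1.0
    return len(claimed & actual) / len(claimed | actual)


triggers = {7, 42, 93}                  # chosen by the attacker
modified = attack(clean_model, triggers)
found = audit(modified)                 # auditor's description
score = audit_score(found, triggers)    # 1.0 means a complete description
```

An auditor who flagged only one of the three trigger inputs would score 1/3 under this metric, which mirrors the goal of describing the vulnerability as completely as possible.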
While the author is excited about direct progress on this game (i.e. better and better human auditors), he is particularly interested in fully _automating_ the auditors. For example, we could collect a dataset of possible attacks paired with the desired audit for each, and finetune a large language model on such a dataset.
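As a rough sketch of what such a dataset might look like, the snippet below serializes (specification, modified model, desired audit) triples into prompt/completion records in JSON Lines form. The schema, the field names (`specification`, `model_repr`, `desired_audit`), and the prompt layout are all hypothetical; in particular, how to present a modified model to a language model as text is an open question that this sketch glosses over with a placeholder string.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AuditExample:
    """One training example (hypothetical schema, not from the post)."""
    specification: str   # the known spec the model is supposed to satisfy
    model_repr: str      # placeholder: some textual view of the modified model
    desired_audit: str   # the description we want the auditor to produce


def to_record(ex: AuditExample) -> Dict[str, str]:
    """Turn an example into a prompt/completion pair for finetuning."""
    prompt = (
        f"Specification:\n{ex.specification}\n\n"
        f"Model:\n{ex.model_repr}\n\n"
        "Audit:"
    )
    return {"prompt": prompt, "completion": " " + ex.desired_audit}


def to_jsonl(examples: List[AuditExample]) -> str:
    """Serialize examples as JSON Lines, one record per line."""
    return "\n".join(json.dumps(to_record(ex)) for ex in examples)


examples = [
    AuditExample(
        specification="Return True exactly when the input integer is even.",
        model_repr="<placeholder for a description of the modified model>",
        desired_audit="The model gives the wrong answer on inputs 7, 42 and 93.",
    )
]
jsonl = to_jsonl(examples)
```

Note that the record contains the modified model and the desired audit but not the attack itself, matching the rule that the auditor does not get to see what the attacker did.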
Planned opinion:
I like the auditing game as a framework for constructing benchmarks for worst-case interpretability—you can instantiate a particular benchmark by defining a specific adversary (or distribution of adversaries). Automating auditing against a human attacker seems like a good long-term goal, but it seems quite intractable given current capabilities.
Planned summary for the Alignment Newsletter: