Has anyone tried to work on this experimentally?
Mechanistic Anomaly Detection (MAD) is a version of the auditing game with a few differences:
- The auditing game puts the onus on the red team to design a backdoor subtle enough that giving the blue team a description of the backdoor is not sufficient for them to generate inputs that trigger it.
- In the auditing game, the blue team is given the ideal behavior specification by the judge.
- In MAD, the blue team is given a backdoored model, but not necessarily a description of what the model is supposed to be doing (from the judge) or of the backdoor behavior (from the red team). Instead, the blue team gets a model that does something, plus a clean dataset, and on each test input it must determine whether the model is using the same mechanism it uses on the clean set or some other mechanism (presumably the backdoor).
Redwood Research has been doing experimental work on MAD in toy settings. We have some techniques we're happy with that do quite well on small problems, but that have theoretical issues with solving the downstream deceptive alignment/ELK cases we're interested in.
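For concreteness, here is a minimal sketch of one generic anomaly-detection baseline for this kind of setup (not necessarily one of the techniques mentioned above, and assuming activations have already been extracted from the model): fit a Gaussian to the model's hidden activations on the clean set, then score each test input by the Mahalanobis distance of its activations from that distribution. The function names and NumPy usage are just illustrative.

```python
import numpy as np

def fit_clean_profile(clean_acts: np.ndarray, eps: float = 1e-3):
    """Fit a Gaussian over hidden activations collected on the trusted clean set.

    clean_acts has shape (n_clean, d): one activation vector per clean input.
    Returns the mean and a regularized inverse covariance matrix.
    """
    mean = clean_acts.mean(axis=0)
    cov = np.cov(clean_acts, rowvar=False) + eps * np.eye(clean_acts.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_scores(test_acts: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each test activation from the clean profile.

    Higher scores suggest the model may be using a different internal mechanism
    on that input than it does on the clean set (e.g., a backdoor has fired).
    """
    centered = test_acts - mean
    return np.einsum("nd,de,ne->n", centered, inv_cov, centered)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.normal(size=(500, 32))   # stand-in for clean-set activations
    test = rng.normal(size=(20, 32))
    test[:5] += 4.0                      # a few synthetic "anomalous" inputs
    mean, inv_cov = fit_clean_profile(clean)
    # Calibrate a threshold on the clean set, then flag test inputs above it.
    threshold = np.quantile(anomaly_scores(clean, mean, inv_cov), 0.99)
    flags = anomaly_scores(test, mean, inv_cov) > threshold
    print(flags.astype(int))             # 1 = flagged as mechanistically anomalous
```

This kind of activation-distribution baseline is exactly the sort of thing that can do well on small problems while having no principled story for the deceptive alignment/ELK cases, since a deceptive model's activations on trigger inputs need not look statistically unusual.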