Has anyone tried to work on this experimentally?
Mechanistic Anomaly Detection (MAD) is a version of the auditing game with a few differences:
- The auditing game puts the onus on the red team to design a backdoor subtle enough that giving the blue team a description of the backdoor is not sufficient for them to generate inputs that trigger it.
- In the auditing game, the blue team is given the ideal behavior specification by the judge.
- In MAD, the blue team is given a backdoored model, but not necessarily a description of what the model is supposed to be doing (from the judge) or of the backdoor behavior (from the red team). Instead, the blue team gets a model that does something, plus a clean dataset, and on each test input it must determine whether the model is using the same mechanism it uses on the clean set or some other mechanism (presumably the backdoor).
Redwood Research has been doing experimental work on MAD in toy settings. We have some techniques we're happy with that do quite well on small problems, but that have theoretical issues with solving the downstream deceptive alignment/ELK cases we're interested in.
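For concreteness, here is a minimal sketch of one generic anomaly-detection baseline for this kind of setup (not necessarily one of the techniques mentioned above, and assuming activations have already been extracted from the model): fit a Gaussian to the model's hidden activations on the clean set, then score each test input by the Mahalanobis distance of its activations from that distribution. The function names and NumPy usage are just illustrative.

```python
import numpy as np

def fit_clean_profile(clean_acts: np.ndarray, eps: float = 1e-3):
    """Fit a Gaussian over hidden activations collected on the trusted clean set.

    clean_acts has shape (n_clean, d): one activation vector per clean input.
    Returns the mean and a regularized inverse covariance matrix.
    """
    mean = clean_acts.mean(axis=0)
    cov = np.cov(clean_acts, rowvar=False) + eps * np.eye(clean_acts.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_scores(test_acts: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each test activation from the clean profile.

    Higher scores suggest the model may be using a different internal mechanism
    on that input than it does on the clean set (e.g., a backdoor has fired).
    """
    centered = test_acts - mean
    return np.einsum("nd,de,ne->n", centered, inv_cov, centered)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.normal(size=(500, 32))   # stand-in for clean-set activations
    test = rng.normal(size=(20, 32))
    test[:5] += 4.0                      # a few synthetic "anomalous" inputs
    mean, inv_cov = fit_clean_profile(clean)
    # Calibrate a threshold on the clean set, then flag test inputs above it.
    threshold = np.quantile(anomaly_scores(clean, mean, inv_cov), 0.99)
    flags = anomaly_scores(test, mean, inv_cov) > threshold
    print(flags.astype(int))             # 1 = flagged as mechanistically anomalous
```

This kind of activation-distribution baseline is exactly the sort of thing that can do well on small problems while having no principled story for the deceptive alignment/ELK cases, since a deceptive model's activations on trigger inputs need not look statistically unusual.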