Yeah, I agree that if you give a deceptive model the chance to emerge, a lot more risks arise for interpretability, and it could become much more difficult. "Circumventing interpretability: How to defeat mind-readers" goes through that gauntlet, but one workaround Lee lays out there which I haven't seen anyone shoot down yet (aside from it seeming terribly expensive) is to run the interpretability tools continuously, or near continuously, from the beginning of training. That would give us the opportunity to examine a mesa-optimizer's goals as soon as they emerge, before it has a chance to do any kind of obfuscation.