I thought this was a great post. One thing which I think this post misses, however, is the extent to which “post-hoc” approaches can be turned into “intrinsic” approaches by using the post-hoc approach as a training objective—in the language of my transparency trichotomy, that lets you turn inspection transparency into training transparency. Relaxed adversarial training is an example of this in which you directly train the model on an overseer’s evaluation of some transparency condition given access to inspection transparency tools.
I thought this was a great post. One thing which I think this post misses, however, is the extent to which “post-hoc” approaches can be turned into “intrinsic” approaches by using the post-hoc approach as a training objective—in the language of my transparency trichotomy, that lets you turn inspection transparency into training transparency. Relaxed adversarial training is an example of this in which you directly train the model on an overseer’s evaluation of some transparency condition given access to inspection transparency tools.