I thought about your objection longer and realized that there are circumstances where we can expect the model to adversarially optimize against us. I haven't so much changed my mind as clarified when I think these tools are useful. In the process, I also discovered that Chris Olah and Evan Hubinger seem to agree: naively using transparency tools can break down in the deception case.