Perhaps more importantly though, we shouldn’t expect by default that, in the X% of cases where our guess is bad, the neural network is adversarially optimizing against us.
I’m confused because the post you made one day later from this comment seems to argue the opposite of this. Did you change your mind in between, or am I missing something?
I thought about your objection longer and realized that there are circumstances where we can expect the model to adversarially optimize against us. I think I’ve less changed my mind, and more clarified when I think these tools are useful. In the process, I also discovered that Chris Olah and Evan Hubinger seem to agree: naively using transparency tools can break down in the deception case.