Matthew Barnett comments on AI Alignment Open Thread October 2019

Matthew Barnett 5 Nov 2019 4:29 UTC
I thought about your objection some more and realized that there are circumstances where we can expect the model to adversarially optimize against us. I haven’t so much changed my mind as clarified when I think these tools are useful. In the process, I also discovered that Chris Olah and Evan Hubinger seem to agree: naively using transparency tools can break down in the deception case.