Lee Sharkey comments on Circumventing interpretability: How to defeat mind-readers

Lee Sharkey 23 Jul 2022 11:36 UTC
LW: 5 AF: 4
2
AF
This sounds really reasonable. I had only been thinking of a naive version of interpretability tools in the loss function that doesn’t attempt to interpret the gradient descent process. I’d be genuinely enthusiastic about the strong version you outlined. I expect to think a lot about it in the near future.