Have you read the Redwood post on causal scrubbing? To me, it’s an excellent example of evaluating interpretability using something other than intuition.
Thanks. I’ll discuss causal scrubbing in some depth in two of the upcoming posts, which narrow the discussion down specifically to AI safety work. I think it’s a highly valuable way of measuring how well a hypothesis explains a network, but it has some pitfalls to be aware of.