Can you expand on how model-agnostic methods capture global phenomena better than white-box methods like circuits?
It seems like with LIME or saliency mapping, the importance of regions of the image is conditioned on the rest of the instance. For example, we may have a dolphin classifier that uses the background colour when the animal is in water, but uses the shape of the animal when it's not on a blue background. If you had no idea about the second possibility, you might determine from black-box interpretability tools that you have a dolphin classifier. On the other hand, if you have identified a circuit that activates on your water dolphin images, you could use some formal verification approach to find the conditions under which that circuit is excited (and thereby have learned something global about the model).
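To make the "conditioned on the rest of the instance" point concrete, here is a minimal sketch using the `lime` Python package. Everything in it is a hypothetical stand-in (the `dolphin_classifier` and `synthetic_image` functions are invented for illustration, not taken from any real model): a toy "dolphin" model that keys on blue background when water is present and on a crude shape proxy otherwise. Each LIME explanation is fit to perturbations of a single image, so the regions it flags depend on that image's background.

```python
import numpy as np
from lime import lime_image

def dolphin_classifier(images):
    """classifier_fn for LIME: returns [P(not dolphin), P(dolphin)] per image.
    Hypothetical two-mode model: colour-based when the background is blue,
    shape-based (dark silhouette pixels) otherwise."""
    probs = []
    for img in images:
        blue_fraction = (img[..., 2] > 180).mean()
        if blue_fraction > 0.5:
            # Water background: rely on colour.
            p = blue_fraction
        else:
            # No blue background: rely on a crude shape proxy.
            p = (img.mean(axis=-1) < 80).mean() * 6
        p = float(min(p, 1.0))
        probs.append([1 - p, p])
    return np.array(probs)

def synthetic_image(blue_background):
    """Placeholder 64x64 image: a grey 'dolphin' blob on a blue or sandy background."""
    img = np.zeros((64, 64, 3), dtype=np.uint8)
    img[:] = (40, 90, 220) if blue_background else (200, 180, 140)
    img[24:40, 16:48] = (50, 50, 60)  # crude dolphin silhouette
    return img

explainer = lime_image.LimeImageExplainer()
for blue in (True, False):
    img = synthetic_image(blue)
    # The explanation below is local: it is fit to perturbations of *this* image,
    # so the important regions it reports are conditioned on this background.
    exp = explainer.explain_instance(img, dolphin_classifier, top_labels=1,
                                     hide_color=255, num_samples=500)
    _, mask = exp.get_image_and_mask(exp.top_labels[0], positive_only=True,
                                     num_features=3, hide_rest=False)
    print(f"blue background={blue}: {int(mask.sum())} pixels marked important")
```

Running this on the water image tends to flag background superpixels, while the non-blue image tends to flag the silhouette, and neither explanation on its own reveals that the other mode exists.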