I think that interpretability research isn’t going to be able to produce very faithful explanations of what’s going on in non-toy models (e.g. I think that no such explanation has ever been produced). Since I think faithful explanations are infeasible, measures of the faithfulness of explanations don’t seem very important to me now.
By “explanations” you mean labeled high-level causal graphs right? Do you also think it’s infeasible to identify sparse, unlabeled circuits as “the part of the model that’s doing the task”, like in ACDC, in a way that gets good performance on some downstream task?
By explanations, I think Buck means fully human understandable explanations.
Do you also think it’s infeasible to identify sparse, unlabeled circuits as “the part of the model that’s doing the task”, like in ACDC, in a way that gets good performance on some downstream task?
Personally, I don’t have a strong opinion, and this will probably depend on the exact architecture and on how much sparsity we demand. This seems related to other views I have on the difficulties of interp (ETA: so people who are more optimistic about interp in general will probably be more optimistic here too), but it’s at least partially orthogonal.
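For readers unfamiliar with ACDC (Automatic Circuit Discovery), the rough idea is to prune edges of the model’s computational graph, keeping only those edges whose ablation noticeably hurts performance on the task; what remains is the “sparse, unlabeled circuit”. Below is a minimal illustrative sketch on a toy DAG, not the real algorithm: the toy node/edge model, the mean-ablation patching scheme, the MSE metric, and the threshold `tau` are all simplifying assumptions for the example (the actual ACDC operates on transformer computational graphs and typically uses KL divergence on output logits).

```python
# Illustrative, ACDC-style greedy edge pruning on a toy DAG.
# All specifics here are assumptions made for a self-contained example.
import numpy as np

rng = np.random.default_rng(0)

# Toy DAG: each node is tanh of a weighted sum of its parents; "x" is the input.
PARENTS = {"h1": ["x"], "h2": ["x"], "h3": ["h1", "h2"], "out": ["h1", "h3"]}
WEIGHTS = {("x", "h1"): 1.0, ("x", "h2"): 0.1,
           ("h1", "h3"): 0.9, ("h2", "h3"): 0.05,
           ("h1", "out"): 0.02, ("h3", "out"): 1.2}

def forward(x, ablated=frozenset(), patch=None):
    """Run the toy model on a batch `x`. Edges in `ablated` carry a fixed
    patch value (the parent's mean activation on corrupted inputs) instead
    of the live parent activation."""
    acts = {"x": x}
    for node, parents in PARENTS.items():
        total = np.zeros_like(x)
        for p in parents:
            contrib = patch[p] if (p, node) in ablated else acts[p]
            total = total + WEIGHTS[(p, node)] * contrib
        acts[node] = np.tanh(total)
    return acts

clean, corrupted = rng.normal(size=500), rng.normal(size=500)

# Patch values: mean activation of every node on the corrupted distribution.
patch = {k: v.mean() for k, v in forward(corrupted).items()}
baseline = forward(clean)["out"]

def metric(ablated):
    """How far the pruned circuit's output drifts from the full model on clean data."""
    return np.mean((forward(clean, ablated, patch)["out"] - baseline) ** 2)

# Greedy loop: walk the edges (reverse topological order in the real algorithm),
# tentatively ablate each one, and keep it ablated iff the metric barely moves.
tau = 1e-3
ablated = set()
for edge in reversed(list(WEIGHTS)):
    trial = ablated | {edge}
    if metric(trial) - metric(ablated) < tau:
        ablated = trial  # edge doesn't matter for the task: prune it

circuit = [e for e in WEIGHTS if e not in ablated]
print("recovered circuit edges:", circuit)
print("metric of pruned circuit:", metric(ablated))
```

The downstream question in this thread is whether a procedure like this scales: whether the surviving edges are sparse enough, and faithful enough to the task behavior, to be useful even without human-readable labels on the nodes.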