By explanations, I think Buck means fully human-understandable explanations.
Do you also think it’s infeasible to identify sparse, unlabeled circuits as “the part of the model that’s doing the task”, like in ACDC, in a way that gets good performance on some downstream task?
Personally, I don’t have a strong opinion, and this will probably depend on the exact architecture and the extent of sparsity we demand. This seems related to other views I have on difficulties in interp (ETA: so I’m probably more pessimistic here than people who are more optimistic about interp), but at least partially orthogonal.