ryan_greenblatt comments on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

ryan_greenblatt 7 Jan 2024 20:07 UTC
By explanations, I think Buck means fully human-understandable explanations.

Do you also think it’s infeasible to identify sparse, unlabeled circuits as “the part of the model that’s doing the task”, like in ACDC, in a way that gets good performance on some downstream task?

Personally, I don’t have a strong opinion; the answer will probably depend on the exact architecture and how much sparsity we demand. This seems related to my other views on difficulties in interp (ETA: so I’m probably more pessimistic here than people who are more optimistic about interp), but it is at least partially orthogonal.