StefanHex comments on Why I’m bearish on mechanistic interpretability: the shards are not in the network

StefanHex 14 Sep 2024 13:46 UTC
2 points

[…] no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.

I find myself really confused by this argument. Shards (or anything) do not need to be “concentrated in one spot” for studying them to make sense?

As Neel and Lucius say, you might study SAE latents or abstractions built on the weights; no one requires (or assumes) that things are concentrated in one spot.

Or to make another analogy, one can study neuroscience even though things are not concentrated in individual cells or atoms.

If we still disagree, it’d help me if you clarified how the “So […]” part of your argument follows.

Edit: The “the real thinking happens in the scaffolding” is a reasonable argument (and current mech interp doesn’t address this) but that’s a different argument (and just means we understand individual forward passes with mech interp).
- tailcalled 14 Sep 2024 13:58 UTC
  2 points
  I don’t doubt you can find many facts about SAE latents, I just don’t think they will be relevant for anything that matters.
  I’m by-default bearish on neuroscience too, though it’s more nuanced there.
  Edit: The “the real thinking happens in the scaffolding” is a reasonable argument (and current mech interp doesn’t address this) but that’s a different argument (and just means we understand individual forward passes with mech interp).
  Feeding the output into the input isn’t much thinking. It just allows the thinking to occur in a very diffuse way.