Logan Riggs comments on Why I’m bearish on mechanistic interpretability: the shards are not in the network

Logan Riggs 13 Sep 2024 20:34 UTC
2 points
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.
This paragraph sounded like you’re claiming LLMs do have concepts, but they’re not in specific activations or weights, but distributed across them instead.
But from your comment, you mean that LLMs themselves don’t learn the true simple-compressed features of reality, but a mere shadow of them.
This interpretation also matches the title better!
But are you saying the “true features” are in the dataset + network? Because SAEs are trained on a dataset! (ignoring the problem pointed out in footnote 1).
Possibly clustering the data points by their network gradients would be a way to put some order into this mess?
Eric Michaud did cluster datapoints by their gradients here. From the abstract:
...Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta).
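To make "clustering datapoints by their gradients" concrete, here is a minimal toy sketch. It is not claimed to match the paper's pipeline: the linear model, the squared loss, the synthetic data, and the choice of spectral clustering on the absolute cosine similarity between per-example gradients are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def per_example_gradients(w, X, y):
    """Gradient of 0.5 * (w @ x - y)**2 w.r.t. w, one row per example."""
    residual = X @ w - y              # shape (n,)
    return residual[:, None] * X      # shape (n, d)

def cluster_by_gradient(grads, k):
    """Spectral clustering on |cosine similarity| of per-example gradients."""
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    affinity = np.abs(unit @ unit.T)  # ignore the sign of the residual
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    normalized = d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(normalized)   # eigenvalues come back ascending
    emb = vecs[:, -k:]                     # top-k eigenvectors as an embedding
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    # Small k-means with deterministic farthest-point initialisation.
    centers = [emb[0]]
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(emb - c, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(20):
        dists = np.linalg.norm(emb[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        centers = np.array([
            emb[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels

# Toy data (an assumption): two "skills" that touch disjoint weights.
rng = np.random.default_rng(0)
n, d = 20, 8
X = np.zeros((2 * n, d))
X[:n, :4] = rng.normal(size=(n, 4))   # skill A only uses weights 0..3
X[n:, 4:] = rng.normal(size=(n, 4))   # skill B only uses weights 4..7
w_true = rng.normal(size=d)
y = X @ w_true
w = np.zeros(d)                       # untrained model, so residuals are nonzero

grads = per_example_gradients(w, X, y)
labels = cluster_by_gradient(grads, k=2)
print(labels)
```

In this toy the two groups separate cleanly because their gradients share no parameters at all. In a real LLM the gradients would overlap heavily, so whether the clusters correspond to anything meaningful is exactly the open question.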
- tailcalled 13 Sep 2024 20:46 UTC
  2 points
  This paragraph sounded like you’re claiming LLMs do have concepts, but they’re not in specific activations or weights, but distributed across them instead.
  But from your comment, you mean that LLMs themselves don’t learn the true simple-compressed features of reality, but a mere shadow of them.
  This interpretation also matches the title better!
  A true feature of reality gets diminished into many small fragments. These fragments bifurcate into multiple groups, of which we will consider two groups, A and B. Group A gets collected and analysed by humans into human knowledge, which then again gets diminished into many small fragments, which we will call group C.
  Group B and group C make impacts on the network. Each fragment in group B and group C produces a shadow in the network, leading to there being many shadows distributed across activation space and weight space. These many shadows form a channel which is highly reflective of the true feature of reality.
  That allows there to be simple useful ways to connect the LLM to the true feature of reality. However, the simplicity of the feature and its connection is not reflected in a simple representation of the feature within the network; instead the concept works as a result of the many independent shadows making way for it.
  But are you saying the “true features” are in the dataset + network? Because SAEs are trained on a dataset! (ignoring the problem pointed out in footnote 1).
  The true features branch off from the sun (and the earth). Why would you ignore the problem pointed out in footnote 1? It’s a pretty important problem.