Interesting discussion; thanks for posting!
I’m curious about what elementary units in NNs could be.
I tend to model NNs as computational graphs where activation spaces/layers are the nodes and weights/tensors are the edges of the graph. Under this framing, my initial intuition is that elementary units are going to be contained in either the activation spaces or the weights.
There does seem to be empirical evidence that features of the dataset are represented as linear directions in activation space.
I’d be interested in any thoughts regarding what other forms elementary units in NNs could take. In particular, I’d be surprised if they aren’t represented in subspaces of activation spaces.
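To make the “linear directions” picture concrete, here is a minimal numpy sketch (a toy construction of mine, not taken from any particular paper): it fabricates activations for inputs with and without a binary feature, estimates the feature direction as the difference of the two group means, and then reads the feature back out by projecting held-out activations onto that direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 500   # activation-space dimension, examples per group

# Hypothetical setup: activations are noise, plus a fixed offset along a
# hidden "feature direction" whenever the feature is present.
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)
acts_without = rng.normal(size=(n, d))                       # feature absent
acts_with = rng.normal(size=(n, d)) + 3.0 * true_direction   # feature present

# Difference-of-means estimate of the feature direction.
direction = acts_with.mean(axis=0) - acts_without.mean(axis=0)
direction /= np.linalg.norm(direction)
print("cosine with true direction:", round(float(direction @ true_direction), 3))

# "Reading out" the feature: project held-out activations onto the direction
# and threshold halfway up the offset.
test = np.vstack([rng.normal(size=(200, d)) + 3.0 * true_direction,
                  rng.normal(size=(200, d))])
labels = np.array([1] * 200 + [0] * 200)
preds = (test @ direction > 1.5).astype(int)
print("probe accuracy:", float((preds == labels).mean()))
```

In a real model the activations would come from a forward pass at some chosen layer (one node of the computational graph above) rather than being synthesized, but the feature-as-direction operations are the same.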
I think you could imagine many different types of elementary units wrapped in different ontologies:
- Information may be encoded linearly in NNs, with superposition or composition, locally or in a highly distributed way (see the figure below from Distributed Representations: Composition & Superposition); a toy sketch of the superposition case follows this list.
- Maybe a good way to understand NNs is polytope theory (also sketched after the list)?
- Maybe some form of memory is encoded as key-value pairs in the MLPs of transformers (sketched below as well)?
- Or maybe you could think of NNs as Bayesian causal graphs.
- Or maybe you should instead think in terms of algorithms inside transformers (induction heads, the modular addition algorithm, etc.), and then it’s not that meaningful to think about linear directions.
- Or, most likely, a mixture of everything.
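On the superposition option in the first bullet: here is a toy numpy sketch (my own construction, loosely in the spirit of the toy-models-of-superposition setup, with all numbers made up). More features than dimensions are stored as nearly orthogonal random directions, and a sparse set of active features can still be read back with dot products, at the cost of some interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 200, 50   # more features than dimensions -> superposition

# Each feature gets a random unit direction; with d this large the directions
# are nearly (but not exactly) orthogonal.
directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only a handful of features are active at once.
active = rng.choice(n_features, size=5, replace=False)
x = np.zeros(n_features)
x[active] = 1.0

# Encode: the activation vector is the sum of the active features' directions.
activation = directions.T @ x          # shape (d,)

# Decode: dot products recover the active features, up to interference noise.
readout = directions @ activation      # shape (n_features,)
recovered = np.argsort(readout)[-5:]
print("active:   ", sorted(active.tolist()))
print("recovered:", sorted(recovered.tolist()))
print("worst interference on an inactive feature:",
      round(float(np.max(np.delete(readout, active))), 3))
```

Composition, by contrast, would give each feature its own (roughly) dedicated dimensions, which caps the number of representable features at d.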
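On the polytope view: a ReLU network partitions its input space into polytopes, inside each of which it computes a single affine function, and the activation pattern (which ReLUs fire) labels the polytope. A minimal sketch with a made-up two-layer ReLU net on 2-D inputs:

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny two-layer ReLU net on 2-D inputs; all weights are random placeholders.
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)

def activation_pattern(x):
    """Which ReLUs fire for input x; this pattern identifies x's polytope."""
    h1 = np.maximum(W1 @ x + b1, 0)
    h2 = np.maximum(W2 @ h1 + b2, 0)
    return tuple((h1 > 0).astype(int)) + tuple((h2 > 0).astype(int))

# Sample a grid of inputs and count the distinct polytopes they fall into;
# within each polytope the whole network is a single affine map.
grid = np.stack(np.meshgrid(np.linspace(-3, 3, 100),
                            np.linspace(-3, 3, 100)), axis=-1).reshape(-1, 2)
patterns = {activation_pattern(x) for x in grid}
print("distinct polytopes found on the grid:", len(patterns))
```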
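And on the key-value reading of transformer MLPs (in the spirit of the “feed-forward layers are key-value memories” idea, though the code is only my own toy illustration with random weights): the rows of the first weight matrix act as keys matched against the incoming residual-stream vector, and each key’s post-nonlinearity activation decides how much of the corresponding value vector from the second matrix is added to the output.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_mlp = 32, 128

# Toy MLP block: out = W_out.T @ relu(W_in @ x)
W_in = rng.normal(size=(d_mlp, d_model)) / np.sqrt(d_model)   # rows ~ "keys"
W_out = rng.normal(size=(d_mlp, d_model)) / np.sqrt(d_mlp)    # rows ~ "values"

x = rng.normal(size=d_model)            # a residual-stream vector

key_scores = np.maximum(W_in @ x, 0)    # how strongly each key matches x
out = W_out.T @ key_scores              # weighted sum of the value vectors

# The same output, written explicitly as a key-value lookup:
out_kv = sum(score * value for score, value in zip(key_scores, W_out))
print("memory slots consulted:", int((key_scores > 0).sum()), "of", d_mlp)
print("outputs match:", bool(np.allclose(out, out_kv)))
```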
Thanks, that’s the kind of answer I was looking for