I think you could imagine many different types of elementary units wrapped in different ontologies:
Information may be encoded linearly in a NN, with superposition or composition, locally or in a highly distributed way (see the figure below from Distributed Representations: Composition & Superposition); there's a toy sketch of superposition after this list.
Maybe a good way to understand NNs is polytope theory?
Maybe some form of memory is encoded as key-value pairs in the MLPs of transformers? (See the second sketch after this list.)
Or maybe you could think of NNs as Bayesian causal graphs.
Or maybe you should instead think in terms of algorithms inside transformers (induction heads, the modular addition algorithm, etc.), and it's not that meaningful to think in terms of linear directions.
Or most likely a mixture of everything.
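To make the superposition point a bit more concrete, here is a minimal numpy sketch (the sizes, the random directions, and the sparsity pattern are all made up for illustration): more sparse features than there are dimensions can be stored as roughly orthogonal linear directions and read back off by projection, at the cost of some interference noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy superposition: store more sparse "features" than dimensions by giving
# each feature a random, nearly orthogonal direction (hypothetical sizes).
n_features, d_model = 50, 20
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only a few features are active at once.
feature_vals = np.zeros(n_features)
feature_vals[[3, 17, 42]] = 1.0

# "Encode" by summing the active directions (a linear representation).
activation = feature_vals @ directions      # shape (d_model,)

# "Decode" by projecting back onto each feature direction.
readout = directions @ activation           # shape (n_features,)
# Should mostly recover {3, 17, 42}; interference between directions adds noise.
print(np.argsort(readout)[-3:])
```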
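And for the key-value view of MLPs (in the spirit of the "feed-forward layers are key-value memories" line of work), an equally simplified sketch with hypothetical sizes and no biases or layer norm: each hidden neuron's input weights act as a key and its output weights as a value, so the layer measures how well the input matches each key and writes the corresponding values back out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 16, 64                     # hypothetical sizes

# One transformer MLP layer viewed as a key-value memory:
# each hidden neuron has a "key" (a row of W_in) and a "value" (a row of W_out).
W_in = rng.normal(size=(d_mlp, d_model))    # keys
W_out = rng.normal(size=(d_mlp, d_model))   # values

def mlp(x):
    # How strongly the input matches each key, gated by the nonlinearity...
    match = np.maximum(W_in @ x, 0.0)       # ReLU(key . x)
    # ...determines how much of each value is written to the output.
    return match @ W_out

x = rng.normal(size=d_model)
print(mlp(x).shape)                          # (d_model,)
```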
Thanks, that’s the kind of answer I was looking for