ChosunOne comments on Transformers Represent Belief State Geometry in their Residual Stream

ChosunOne 17 Apr 2024 16:46 UTC
6 points
0
I’m curious how much space is left after learning the MSP in the network. Does representing the MSP take up the full bandwidth of the model (even if it is represented inefficiently)? Could you maintain performance of the model by subtracting out the contributions of anything else that isn’t part of the MSP?
- Adam Shai 17 Apr 2024 18:24 UTC
  1 point
  0
  Parent
  Cool question. This is one of the things we’d like to explore more going forward. We are pretty sure this is pretty nuanced and has to do with the relationship between the (minimal) state of the generative model, the token vocab size, and the residual stream dimensionality.
  One your last question, I believe so but one would have to do the experiment! It totally should be done. check out the Hackathon if you are interested ;)