This is a great approach imo. I’ve tried something similar in transformers, using the right singular vectors of the embedding matrix (which form a d_model x d_model orthogonal matrix) to rotate the weight matrices that read from and write to the residual stream. This seemed to induce sparsity in the weights close to the first layer, with the effect decreasing deeper into the model. I tried this with CLIP ViT-B and GPT-J, and the effect was a lot weaker in GPT-J. Also, some of the singular vectors of the embeddings were easily interpretable: the top component was related to raw token frequency, GPT-J had interesting directions like religion vs. technology and positive vs. negative valence, and the top components of CLIP looked like color and frequency filters.
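For concreteness, the kind of basis rotation described above might look roughly like this in PyTorch (a minimal sketch, not the exact code used here; the module names in the commented example are assumptions about a GPT-J-style checkpoint):

```python
import torch

def residual_rotation(embed_weight: torch.Tensor) -> torch.Tensor:
    """Right singular vectors of the (vocab_size, d_model) embedding matrix.
    These form a d_model x d_model orthogonal matrix, i.e. a rotation of the
    residual-stream basis."""
    _, _, Vh = torch.linalg.svd(embed_weight.float(), full_matrices=False)
    return Vh.T  # (d_model, d_model), columns are singular directions

def rotate_reader(W: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """Rotate a matrix that reads from the residual stream.
    With the nn.Linear convention y = W x and W of shape (d_out, d_model),
    substituting x = R x' gives the rotated weight W R."""
    return W @ R

def rotate_writer(W: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """Rotate a matrix that writes to the residual stream.
    W has shape (d_model, d_in); its output re-expressed in the rotated
    basis is R^T W."""
    return R.T @ W

def sparsity(W: torch.Tensor, rel_tol: float = 1e-3) -> float:
    """Fraction of entries that are near zero relative to the largest entry."""
    return (W.abs() < rel_tol * W.abs().max()).float().mean().item()

# Hypothetical usage with a HuggingFace GPT-J-style model (attribute names
# are illustrative and may differ between checkpoints):
#   R = residual_rotation(model.transformer.wte.weight)
#   for block in model.transformer.h:
#       q_rot = rotate_reader(block.attn.q_proj.weight, R)
#       print(sparsity(block.attn.q_proj.weight), sparsity(q_rot))
```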
This is interesting, as I’ve (preliminarily) found the opposite with my methods. In my MNIST model, the first and last layers can’t really be made any sparser than they already are, but the middle layer undergoes a drastic change.