I agree with pretty much all these points. This problem has motivated some of the work I have been doing, and it has been pretty relevant to think carefully about and test, so I made some toy models of the situation. This is a minimal proof-of-concept example I had lying around; it is not sufficient to prove this will happen in larger models, but it definitely shows that composite features are a possible outcome, and it validates what you're saying:
Here, it has not learned a single atomic feature. All the true features are orthogonal, which makes the cosine-similarity heatmap easier to read. μ and σ are the mean and standard deviation across all features.
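(For concreteness, here is a minimal sketch of how a heatmap like this can be computed; the function and variable names are my own illustration, not the exact code used. It assumes the ground-truth features and the learned decoder directions are stored column-wise.)

```python
import torch

def cosine_similarity_heatmap(decoder_weights: torch.Tensor,
                              true_features: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each learned decoder direction and each
    ground-truth feature.

    decoder_weights: (d_model, n_learned) -- one learned direction per column
    true_features:   (d_model, n_true)    -- one ground-truth feature per column
    Returns an (n_learned, n_true) matrix suitable for plotting as a heatmap.
    """
    learned = decoder_weights / decoder_weights.norm(dim=0, keepdim=True)
    truth = true_features / true_features.norm(dim=0, keepdim=True)
    return learned.T @ truth
```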
Note on “bias only” on the y-axis: the “bias only” entry in the learned features is the cosine similarity of the decoder bias with the ground-truth features. It's all zeros because I disabled the bias for this run to make the decoder weights more directly interpretable. Otherwise, in such a small model, it will use the bias to do tricky things, which also makes the graph much less readable. We know the features are centered around the origin in this toy, so zeroing the bias seems fine.
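(To illustrate what “disabled the bias” amounts to, here is a minimal sketch of a toy SAE with the decoder bias removed; the class name and architectural details are my assumptions, not necessarily the exact setup used above.)

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Minimal sparse autoencoder with no decoder bias, so the decoder
    columns can be compared directly against the ground-truth features."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features, bias=True)
        # bias=False: reconstructions are pure combinations of decoder columns.
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        acts = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(acts)          # reconstruction with no bias term
        return recon, acts
```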
Edit: I remembered I have a larger example of the phenomenon. Same setup as above.
This looks interesting. I'm having a difficult time understanding the results, though. It would be great to see a more detailed write-up!