I agree with pretty much all of these points. This problem has motivated some of my own work, and it seemed important to think through and test carefully, so I made some toy models of the situation.
This is a minimal proof-of-concept example I had lying around. It is insufficient to prove this will happen in larger models, but it definitely shows that composite features are a possible outcome, and it validates what you’re saying:
Here, the autoencoder has not learned a single atomic feature.
All the true features are orthogonal, which makes it easier to read the cosine-similarity heatmap.
and are mean and standard deviation of all features.
Note on “bias only” on the y-axis: The “bias only” entry in the learned features is the cosine similarity of the decoder bias with the ground truth features. It’s all zeros, because I disabled the bias for this run to make the decoder weights more directly interpretable. Otherwise, in such a small model, it’ll use the bias to do tricky things, which also makes the graph much less readable. We know the features are centered around the origin in this toy, so zeroing the bias seems fine to do.
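For readers who want to reproduce this kind of plot, here is a minimal sketch of how the heatmap values can be computed: one row per learned decoder direction plus a final "bias only" row, each holding cosine similarities against the ground truth features. The function names and the toy vectors are hypothetical and chosen only for illustration; they are not the values from the run above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; defined here as 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def similarity_rows(learned, true_features, bias):
    """Rows of the heatmap: each learned direction, then a final 'bias only' row."""
    rows = [[cosine(d, t) for t in true_features] for d in learned]
    rows.append([cosine(bias, t) for t in true_features])
    return rows

# Hypothetical toy values: three orthogonal true features, two composite
# learned directions, and a disabled (all-zero) decoder bias.
true_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
learned = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

heatmap = similarity_rows(learned, true_features, bias)
for row in heatmap:
    print([round(value, 3) for value in row])
```

With orthogonal true features, an atomic learned feature would show a single entry near 1 in its row, while a composite one spreads its similarity across several columns. A zeroed bias gives an all-zero last row, matching the note above.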
Edit: Remembered I have a larger example of the phenomenon. Same setup as above.
I want to mention that, in my experience, a factor of 2 difference in L0 makes a pretty huge difference in reconstruction score/L2 norm. IMO, if you want to compare two architectures, you should ideally compare Pareto curves for each one, or get two datapoints that have almost exactly the same L0.
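One rough way to do a matched-L0 comparison when the sweeps for the two architectures don't land on the same L0 values is to interpolate along each curve. The sketch below assumes linear interpolation between measured (L0, loss) points, which is my simplification rather than an established method, and all the numbers are made up for illustration.

```python
def interpolate_at(points, l0):
    """Linearly interpolate loss at a target L0 from (l0, loss) pairs.

    Returns None if the target lies outside the measured range, since
    extrapolating a sparsity/reconstruction curve is not trustworthy.
    """
    pts = sorted(points)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= l0 <= x1:
            if x1 == x0:
                return min(y0, y1)
            t = (l0 - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return None

# Hypothetical sweeps: (L0, reconstruction loss) for two architectures.
arch_a = [(10, 0.30), (20, 0.20), (40, 0.10)]
arch_b = [(15, 0.28), (30, 0.16), (60, 0.08)]

target_l0 = 20
loss_a = interpolate_at(arch_a, target_l0)
loss_b = interpolate_at(arch_b, target_l0)
print(f"at L0={target_l0}: A={loss_a:.3f}, B={loss_b:.3f}")
```

In this made-up data, B's best point (loss 0.08 at L0 = 60) looks better than anything A achieves, but at a matched L0 of 20, A has the lower loss. Interpolating in log-L0 instead might be more appropriate depending on how the curves bend.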