Oh that’s really interesting! Can you clarify what “MCS” means? And can you elaborate a bit on how I’m supposed to interpret these graphs?
Yeah, stands for Max Cosine Similarity. Cosine similarity is a pretty standard measure for how close two vectors are to pointing in the same direction. It’s the cosine of the angle between the two vectors, so +1.0 means the vectors are pointing in exactly the same direction, 0.0 means the vectors are orthogonal, −1.0 means the vectors are pointing in exactly opposite directions.
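For concreteness, it’s just the usual dot-product formula; the snippet below is only the textbook definition in numpy, not anything taken from the notebook:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: a·b / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0]))   # +1.0: same direction
cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))   #  0.0: orthogonal
cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))  # -1.0: opposite directions
```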
To generate this graph, I think he took each of the learned features in the smaller dictionary, calculated the cosine similarity of that small-dictionary feature with every feature in the larger dictionary, and then took the maximum as the MCS for that small-dictionary feature. I have a vague memory of him also doing some fancy linear_sum_assignment() thing (to ensure that each feature in the large dictionary could only be used once, in order to avoid having multiple features in the small dictionary get their MCS from the same feature in the large dictionary), though IIRC it didn’t actually matter.
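Something like the sketch below, though I’m reconstructing from memory rather than quoting the notebook; the names (small_dict, large_dict) and the one-feature-direction-per-row layout are my assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_cosine_similarities(small_dict: np.ndarray, large_dict: np.ndarray,
                            one_to_one: bool = False) -> np.ndarray:
    """MCS of each small-dictionary feature against a larger dictionary.

    Assumes each dictionary is an array with one feature direction per row,
    e.g. shapes (2048, 512) and (4096, 512) for a residual width of 512.
    """
    # Normalise every feature direction so that dot products are cosines.
    small = small_dict / np.linalg.norm(small_dict, axis=1, keepdims=True)
    large = large_dict / np.linalg.norm(large_dict, axis=1, keepdims=True)
    cos = small @ large.T  # (n_small, n_large) cosine-similarity matrix

    if one_to_one:
        # The fancier linear_sum_assignment() version: give each small-dictionary
        # feature its own large-dictionary feature, maximising total similarity,
        # so no large-dictionary feature is matched twice.
        rows, cols = linear_sum_assignment(cos, maximize=True)
        return cos[rows, cols]

    # The simple version: best match per small-dictionary feature, reuse allowed.
    return cos.max(axis=1)
```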
Also, I think the small and large dictionaries were trained using different methods from each other for layer 2, and this was on pythia-70m-deduped, so layer 5 was the final layer immediately before unembedding (so naively I’d expect most of the “features” to just be “the output token will be the” or “the output token will be when” etc).

Edit: In terms of “how to interpret these graphs”, they’re histograms: the horizontal axis is bins of cosine similarity, and the vertical axis is how many small-dictionary features had their max cosine similarity with a large-dictionary feature fall within that bucket. So you can see that at layer 3 it looks like somewhere around half of the small-dictionary features had a cosine similarity of 0.96-1.0 with one of the large-dictionary features, and almost all of them had a cosine similarity of at least 0.8 with their best large-dictionary feature.
Which I read as “large dictionaries find basically the same features as small ones, plus some new ones”.
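If it helps, here’s my guess at how the plot itself was produced, continuing the sketch above (again, my reconstruction rather than the actual plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

# small_dict / large_dict here stand for the learned feature directions
# (e.g. decoder rows) of the two trained dictionaries.
mcs = max_cosine_similarities(small_dict, large_dict)

plt.hist(mcs, bins=np.linspace(0.0, 1.0, 26))  # horizontal axis: cosine-similarity bins
plt.xlabel("MCS with the large dictionary")
plt.ylabel("number of small-dictionary features")
plt.show()
```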
Bear in mind also that these were some fairly small dictionaries. I think these charts were generated with this notebook, so I think smaller_dict was of size 2048 and larger_dict was of size 4096 (with a residual width of 512, so 4x and 8x respectively). Anthropic went all the way to 256x the residual width with their “Towards Monosemanticity” paper later that year, and the behavior might have changed at that scale.