While I think this is important, and I will probably edit the post, I think that even in the unembedding, when computing the logits, the behaviour depends more on direction than on distance.
When I think of distance, I implicitly think Euclidean distance:
$$d(x_1, x_2) = |x_1 - x_2| = \sqrt{\sum_i (x_{1,i} - x_{2,i})^2}$$
But the actual “distance” used for calculating logits looks like this:
$$d(x_1, x_2) = x_1 \cdot x_2 = |x_1||x_2|\cos\theta_{12}$$
Which is a lot more similar to cosine similarity:
$$d(x_1, x_2) = \hat{x}_1 \cdot \hat{x}_2 = \cos\theta_{12}$$
I think that because the metric is so similar to the cosine similarity, it makes more sense to think of size + directions instead of distances and points.
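One way to make the link between the two pictures explicit is the standard expansion of the squared Euclidean distance (a textbook identity, not something taken from the post):
$$|x_1 - x_2|^2 = |x_1|^2 + |x_2|^2 - 2\, x_1 \cdot x_2$$
For a fixed $x_1$, the candidate with the smallest distance and the candidate with the largest dot product are guaranteed to coincide when all candidates $x_2$ have the same norm. When the norms differ, the $|x_2|^2$ term and the dot product pull in different directions, so the two rankings can disagree.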
Yeah, I agree! You 100% should not think about the unembed as looking for “the closest token”, as opposed to looking for the token with the largest dot product (= high cosine similarity + large size).
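To illustrate the difference, here is a minimal NumPy sketch. Everything in it is made up for illustration: `x` stands in for a residual stream vector, and `W_U` is a toy unembedding stored with one row per token (real models usually store it the other way round and work in far more dimensions).

```python
import numpy as np

# Hypothetical toy example: a 2-d "residual" vector and three token vectors.
x = np.array([1.0, 0.0])
W_U = np.array([
    [0.9, 0.1],  # token 0: very close to x, small norm
    [5.0, 1.0],  # token 1: far from x, well aligned, large norm
    [0.0, 1.0],  # token 2: orthogonal to x
])

# Logits are plain dot products between x and each token vector.
logits = W_U @ x

# Euclidean distance from x to each token vector.
dists = np.linalg.norm(W_U - x, axis=1)

# Cosine similarity and token norms: logit = |x| * |w| * cos(theta).
norms = np.linalg.norm(W_U, axis=1)
cos = logits / (norms * np.linalg.norm(x))

closest_token = int(np.argmin(dists))       # 0
largest_logit_token = int(np.argmax(logits))  # 1
most_aligned_token = int(np.argmax(cos))      # 0
```

Token 0 is both the closest and the best aligned, yet token 1 gets the largest logit because its norm is much larger, which is the "high cosine similarity + large size" point above.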
I suspect the piece would be helpful for people with similar confusions, though I think by default most people already think of features as directions (this is an incredible tacit assumption that’s made everywhere in mech interp work), especially since the embed/unembed are linear functions.