Thank you so much for the post! I’m starting to get a sense of induction heads.
This may be an unrelated question: can a single attention head store multiple pieces of orthogonal information? For example, in this post, a layer-0 head may store the information "I follow 'D'". Can it also store information like "I am a noun"?
Or, to put it another way, should an attention head have a single, dedicated functionality?
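To make my question concrete, here is a minimal NumPy sketch of the intuition I have in mind (my own toy example, not from the post): if two features are written along orthogonal directions of a single activation vector, a downstream reader can recover each one independently, so in principle one head's output could carry both.

```python
import numpy as np

d = 64  # hypothetical residual-stream width

# Two orthogonal "feature" directions, standing in for
# "I follow 'D'" and "I am a noun" (illustrative, not real model features).
dir_follows_d = np.zeros(d)
dir_follows_d[0] = 1.0
dir_is_noun = np.zeros(d)
dir_is_noun[1] = 1.0

# A single head's output vector could be a sum of both features at once.
activation = 2.0 * dir_follows_d + (-1.0) * dir_is_noun

# Because the directions are orthogonal, each feature reads out cleanly
# via a dot product, without interference from the other.
read_follows_d = activation @ dir_follows_d  # recovers 2.0
read_is_noun = activation @ dir_is_noun      # recovers -1.0
print(read_follows_d, read_is_noun)
```

So my question is whether real heads actually work this way, or whether something forces each head toward a single dedicated function.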