Nice work! A few questions:
I'm curious whether you have found any multiplicity in the output directions (what you denote as $\vec{u}^*_l$), or if the multiplicity is only in the input directions. I would predict that there is some multiplicity in output directions, but much less than the multiplicity in input directions for the corresponding concept.
Relatedly, how do you think about output directions in general? Do you think they are just upweighting/downweighting tokens? I'd imagine that their level of abstraction depends on how far the layer is from the end of the network, which will ultimately determine how much of their effect is direct, via the unembedding, vs. indirect, through later layers.
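To make the direct-vs-indirect distinction concrete, here's a minimal sketch (my own, not from the post) of what I mean by the "direct" part: projecting an output direction through the unembedding to see which tokens it up/downweights if nothing downstream touches it. All shapes and tensors below are illustrative stand-ins, not the paper's actual model or notation.

```python
import torch

# Stand-in dimensions, unembedding, and output direction (purely illustrative).
d_model, d_vocab = 512, 32000
W_U = torch.randn(d_model, d_vocab) / d_model**0.5  # stand-in unembedding matrix
u_star = torch.randn(d_model)                        # stand-in output direction at layer l

# Direct path: the logit change the direction would cause if it reached the
# unembedding untouched. This is what "just upweighting/downweighting tokens"
# would look like.
direct_logit_effect = u_star @ W_U                   # shape (d_vocab,)

# Tokens most upweighted / downweighted by the direct path alone.
top_up = direct_logit_effect.topk(5).indices
top_down = (-direct_logit_effect).topk(5).indices
print(top_up.tolist(), top_down.tolist())
```

The indirect part would then be whatever layers l+1 onward do with the direction; comparing the norm of this direct contribution against the full (causally measured) logit change from adding the direction at layer l would give a crude estimate of how much of its effect is mediated downstream.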