One easy way to decompose the OV map would be to train two SAEs on the residual stream, one before and one after the attention layer, and then build a matrix of maps between SAE features via the product

$$W^{(2)}_{\text{enc},\,j}\, W_O W_V\, W^{(1)}_{\text{dec},\,i}$$

which gives the strength of the connection between feature $i$ in SAE 1 and feature $j$ in SAE 2.
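As a minimal sketch of what that matrix looks like in code, with random tensors standing in for trained SAE and attention weights (all shapes and names here are illustrative, not from any particular model):

```python
import torch

# Illustrative dimensions: residual stream width, head dimension,
# and the two SAEs' dictionary sizes.
d_model, d_head, n_feat1, n_feat2 = 512, 64, 4096, 4096

W_dec1 = torch.randn(d_model, n_feat1)  # SAE 1 decoder: feature coeffs -> residual stream
W_enc2 = torch.randn(n_feat2, d_model)  # SAE 2 encoder: residual stream -> feature coeffs
W_V = torch.randn(d_head, d_model)      # one head's value projection (stand-in)
W_O = torch.randn(d_model, d_head)      # the same head's output projection (stand-in)

# C[j, i] = W_enc2[j] @ W_O @ W_V @ W_dec1[:, i]: how strongly feature i of
# SAE 1, written through this head's OV circuit, activates feature j of SAE 2.
C = W_enc2 @ W_O @ W_V @ W_dec1         # shape: (n_feat2, n_feat1)

# e.g. the downstream feature most strongly connected to upstream feature 0
print(C[:, 0].argmax().item())
```

Each column of C then summarizes how a single upstream feature gets routed into downstream features by that head, ignoring the attention pattern itself.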
Similarly, you could look at the features in SAE 1 and check how they attend to one another through the QK circuit in the same way (see the sketch after this paragraph). When working with transcoders in attention-free ResNets, I've been able to totally decompose the model into a stack of transcoders and then throw away the original model.
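Here's that QK sketch, under the same illustrative assumptions (random stand-ins for one head's trained query/key projections and the SAE 1 decoder):

```python
import torch

d_model, d_head, n_feat1 = 512, 64, 4096

W_dec1 = torch.randn(d_model, n_feat1)  # SAE 1 decoder directions (stand-in)
W_Q = torch.randn(d_head, d_model)      # query projection (stand-in)
W_K = torch.randn(d_head, d_model)      # key projection (stand-in)

# A[i, k]: the pre-softmax attention score contributed when feature i of
# SAE 1 sits at the query position and feature k at the key position,
# i.e. W_dec1[:, i]^T W_Q^T W_K W_dec1[:, k], with the usual 1/sqrt(d_head) scale.
A = (W_dec1.T @ W_Q.T @ W_K @ W_dec1) / d_head ** 0.5  # shape: (n_feat1, n_feat1)
```

Unlike the OV map, this only gives pairwise score contributions; the softmax over actual token activations still sits between these scores and the realized attention pattern.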
Seems we are on the cusp of being able to totally decompose an entire transformer into sparse features and linear maps between them. This is incredibly impressive work.
Sorry for the delay, and thanks for this! Yeah, I agree: in general the OV circuit seems like it'll be much easier, given that it doesn't have the bilinearity or the softmax issue that the QK circuit does. I think the idea you sketch here sounds like a really promising one, and it's pretty in line with some of the things we're trying at the moment.
I think the tough part will be the next step, which is somehow “stitching together” the QK and OV decompositions to get an end-to-end understanding of what the whole attention layer is doing. That said, the extent to which we should be thinking about the QK and OV circuits as totally independent is still unclear to me.
Interested to hear more about your work, though! Being able to replace the entire model sounds impressive, given how much reconstruction errors seem to compound.