This is further evidence that there's no single layer at which individual outputs are learned; instead, they're smoothly spread across the full set of available layers.
I don’t think this simple experiment is by any means decisive, but to me it makes it more likely that features in real models are in large part refined iteratively layer-by-layer, with (more speculatively) the intermediate parts not having any particularly natural representation.
I’ve also updated more and more in this direction.
I think my favorite explanation/evidence of this in general comes from Appendix C of the tuned lens paper.
This seems like a not-so-small issue for SAEs? If there are lots of half-baked features in the residual stream (or feature updates/computations in the MLPs), then many dictionary elements have to be spent reconstructing something that isn't finalized, and hence are less likely to be meaningful. Are there any ideas on how to fix this?
Huh, I’d never seen that figure, super interesting! I agree it’s a big issue for SAEs and one that I expect to be thinking about a lot. I didn’t have any strong candidate solutions as of writing the post, and wouldn’t even be able to say what my thoughts on the topic are now, sorry. Wish I’d posted this a couple of weeks ago.
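For concreteness, the dictionary-reconstruction setup being discussed looks roughly like this. This is a minimal, untrained sketch in numpy, not anything from the post: the dimensions, sparsity level, and L1 coefficient are all made-up illustrative values. The point is just that every vector the SAE sees — finished feature or half-baked intermediate — gets reconstructed as a sparse combination of dictionary rows, so unfinished computations consume dictionary capacity just like real features do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, chosen only for illustration.
d_model, d_dict, n = 16, 64, 256

# Toy "residual stream": each activation is a sparse combination of
# ground-truth feature directions (~5% of features active per sample).
true_features = rng.normal(size=(d_dict, d_model))
coeffs = rng.random((n, d_dict)) * (rng.random((n, d_dict)) < 0.05)
acts = coeffs @ true_features

# One-layer sparse autoencoder: ReLU encoder, linear decoder.
# The rows of W_dec are the learned dictionary elements.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))

def sae_forward(x):
    z = np.maximum(x @ W_enc + b_enc, 0.0)  # sparse codes (non-negative)
    x_hat = z @ W_dec                        # reconstruction from dictionary
    return z, x_hat

z, x_hat = sae_forward(acts)

# Training would minimize reconstruction error plus an L1 sparsity penalty;
# the penalty pressures each input to use few dictionary elements.
recon_loss = np.mean((acts - x_hat) ** 2)
l1_penalty = np.mean(np.abs(z))
loss = recon_loss + 1e-3 * l1_penalty
```

The concern in the thread maps onto this directly: if `acts` contains partially-computed features (vectors that later layers will still refine), the SAE is forced to spend rows of `W_dec` reconstructing them anyway, even though those rows may not correspond to any interpretable, finalized feature.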