> [word] and [word] can be thought of as "the previous token is ' and'."
I think it's mostly this, but looking at the ablated text, removing the word before ' and' does have a significant effect some of the time. I'm less confident on the specifics of why the previous word matters, or in which contexts.
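To make that ablation concrete, here's a minimal sketch of the kind of comparison I mean, assuming a small HuggingFace model and a stand-in feature direction (the real check would project onto the learned dictionary feature, not a random vector, and `feature_activation` is a hypothetical helper, not code from the post):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical setup: any small causal LM plus a stand-in direction
# for the "' and'" feature. A random vector is used only for illustration.
tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModel.from_pretrained("EleutherAI/pythia-70m")
feature_dir = torch.randn(model.config.hidden_size)  # stand-in feature direction
feature_dir /= feature_dir.norm()

def feature_activation(text: str, layer: int = 2) -> float:
    """Project the given layer's activation at the final token onto the feature."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
    return float(hidden[0, -1] @ feature_dir)  # final token is " and"

# Compare the feature on " and" with and without the preceding word.
base = feature_activation("The cat sat down and")
ablated = feature_activation("The cat sat and")  # previous word removed
print(f"base={base:.3f}  ablated={ablated:.3f}  delta={base - ablated:.3f}")
```

If removing the previous word reliably moved the activation, that would point to the feature tracking more than just "previous token is ' and'".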
> Maybe the reason you found ' and' first is because ' and' is an especially frequent word. If you train on the normal document distribution, you'll find the most frequent features first.
This is a dataset-based method, so I do believe we'd find the features most frequently present in that dataset, plus those most important for reconstruction. An example of the latter: the highest-MCS feature across many layers & model sizes is the "beginning & end of first sentence" feature, which appears to line up with the emergent outlier dimensions from Tim Dettmers' post here, but I do need to do more work to actually show that.
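For reference, a minimal sketch of MCS as I'm using it here: for each feature in one learned dictionary, take the maximum cosine similarity with any feature in a second dictionary (e.g. one trained on another layer or model size). Shapes and names below are illustrative, not the actual code:

```python
import torch

def max_cosine_similarity(dict_a: torch.Tensor, dict_b: torch.Tensor) -> torch.Tensor:
    """For each feature (row) of dict_a, the max cosine sim against any row of dict_b.

    dict_a: [n_features_a, d_model], dict_b: [n_features_b, d_model].
    """
    a = dict_a / dict_a.norm(dim=1, keepdim=True)
    b = dict_b / dict_b.norm(dim=1, keepdim=True)
    sims = a @ b.T                  # [n_features_a, n_features_b] cosine sims
    return sims.max(dim=1).values   # best match for each feature in dict_a

# Illustrative use: compare dictionaries learned on two different layers.
dict_layer2 = torch.randn(512, 256)
dict_layer3 = torch.randn(512, 256)
mcs = max_cosine_similarity(dict_layer2, dict_layer3)
print("highest-MCS feature index:", int(mcs.argmax()), "score:", float(mcs.max()))
```

A feature that keeps its high MCS across layers and model sizes, like the "beginning & end of first sentence" one, is the kind of candidate I'd want to compare against the outlier dimensions.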