I broadly agree with this. This method definitely does not uncover any nonlinear representations in the network, and is not expected to. We are primarily trying to uncover the relatively ‘easy’ information we can get with linear methods first. In further defence of linear methods, I would also argue that ‘most’ of the transformer architecture is pretty linear-looking. The residual stream is linear, and the I/O matrices reading from and writing to the residual stream are also linear (if we ignore the layernorms!). Because of this, I suspect that linear directions of some kind might be the best way to understand representations in the residual stream, as well as writes to it. Obviously, though, the process of computing these writes involves nonlinear token-wise functions for the MLPs and nonlinear mixing across tokens for the attention blocks.
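To make that picture concrete, here is a minimal numpy sketch of one pre-LN MLP block's interaction with the residual stream: the read (`W_in`) and write (`W_out`) are linear maps (up to the layernorm we're ignoring), while the token-wise nonlinearity sits in between. All names, shapes, and the choice of ReLU are hypothetical and purely illustrative, not taken from any particular model.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
d_model, d_mlp = 16, 64
rng = np.random.default_rng(0)
W_in  = rng.normal(size=(d_model, d_mlp))   # linear read from the residual stream
W_out = rng.normal(size=(d_mlp, d_model))   # linear write back into the residual stream

def layernorm(x):
    # The nonlinearity we "ignore" when calling the reads/writes linear.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def mlp_block(resid):
    pre = layernorm(resid) @ W_in        # linear read (up to the layernorm)
    act = np.maximum(pre, 0.0)           # nonlinear token-wise function (ReLU here)
    return resid + act @ W_out           # linear write, added back into the stream

resid = rng.normal(size=(8, d_model))    # 8 token positions
resid = mlp_block(resid)                 # residual stream accumulates linear writes
```

Attention blocks have the same read/write structure on the residual stream, but the nonlinearity there mixes information across token positions rather than acting token-wise.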