Thank you so much for this writeup of your fascinating findings about interpreting the SVD of the weight matrix, Beren and Sid!
Understanding the degree to which transformer representations are linear vs nonlinear, and developing methods that can help us discover, locate, and interpret nonlinear representations will ultimately be necessary for fully solving interpretability of any nonlinear neural network.
Completely agree. For what it’s worth, I expect interpreting nonlinear representations in complex neural nets to be quite difficult. We should expect linear-algebra methods like SVD to uncover useful information about linear representations in a straightforward manner. But we shouldn’t overupdate on the ease with which linear-algebra methods uncover this subset of information, because much of the relevant information is likely to pertain to nonlinear, interconnected representations, and therefore to lie outside this subset.
Analysis of the weights of a given network is therefore a promising type of static analysis for neural networks, analogous to static analysis of source code: it can be run quickly on any given network before the network ever has to be run on live inputs. This could potentially be used for alignment as a first line of defense against harmful behaviour without having to run the network at all. Techniques that analyze the weights are also typically cheaper computationally, since they do not involve running large numbers of forward passes through the network, storing large amounts of activations, or dealing with large datasets.
Conversely, the downside of weight analysis is that it cannot tell us about specific model behaviours on specific tokens. The weights can instead be thought of as encoding the space of potential transformations that can be applied to a given input datapoint, but not any specific transformation. They can probably also be used to derive information about the average behaviour of the network, but not necessarily about extreme behaviour, which might be the most useful for alignment.
I thought this was a really good summary of the pros and cons of the methodology.
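To make the quoted point about weights-only static analysis concrete, here is a minimal sketch of what such an analysis can look like in practice. It assumes the Hugging Face GPT-2 checkpoint (the `transformer.h[i].mlp.c_proj` attribute path is specific to that implementation) and only looks at singular value spectra; no forward passes, stored activations, or datasets are involved:

```python
# Minimal sketch: weights-only "static analysis" of GPT-2's MLP output matrices.
# Assumes the Hugging Face GPT-2 checkpoint; no forward passes are run.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

for layer_idx, block in enumerate(model.transformer.h):
    # c_proj.weight has shape [d_mlp, d_model] = [3072, 768] for GPT-2 small.
    W_out = block.mlp.c_proj.weight.detach()
    # Singular values alone are cheap to compute and already show how
    # concentrated (effectively low-rank) this layer's write to the residual stream is.
    S = torch.linalg.svdvals(W_out)
    effective_rank = int((S > 0.01 * S[0]).sum())
    print(f"layer {layer_idx}: top singular value {S[0].item():.2f}, "
          f"effective rank at 1% cutoff: {effective_rank}/{S.numel()}")
```

The spectra alone already tell you how concentrated each layer’s write to the residual stream is, which is exactly the kind of average, input-independent information the quoted passage describes.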
I broadly agree with this. This method definitely does not uncover any nonlinear representations in the network, and is not expected to. We are primarily trying to uncover the relatively ‘easy’ information we can get with linear methods first. In further defence of linear methods, I would also argue that ‘most’ of the transformer architecture is pretty linear-looking. The residual stream is linear, and the I/O matrices reading from and writing to the residual stream are also linear (if we ignore the layernorms!). I suspect that, because of this, linear directions might be the best way to understand representations in the residual stream, as well as writes to it, even though the process of computing those writes involves nonlinear token-wise functions in the MLPs and nonlinear mixing across tokens in the attention blocks.
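As a concrete illustration of reading such linear directions off the weights: the right singular vectors of an MLP output matrix live in residual-stream space, so they can be pushed through the unembedding to see which token directions each one writes towards. The sketch below is a hedged approximation of that idea, not an exact reproduction of the post’s procedure: it again assumes the Hugging Face GPT-2 checkpoint, uses the tied `lm_head` as the unembedding, and ignores the layernorm before the unembedding (as in the parenthetical above).

```python
# Minimal sketch: interpret the right singular vectors of an MLP write matrix
# by projecting them through the (tied) unembedding. Hugging Face GPT-2 assumed;
# layernorm before the unembedding is ignored here.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

layer = 6                                                      # arbitrary middle layer
W_out = model.transformer.h[layer].mlp.c_proj.weight.detach()  # [d_mlp, d_model]
W_U = model.lm_head.weight.detach()                            # [vocab, d_model]

# Rows of Vh are directions in residual-stream space that this matrix writes along,
# ordered by singular value.
U, S, Vh = torch.linalg.svd(W_out, full_matrices=False)

# For each of the top singular directions, list the tokens whose unembedding
# vectors align most strongly with it.
token_alignments = Vh @ W_U.T                                  # [d_model, vocab]
for i in range(5):
    top = torch.topk(token_alignments[i], k=10).indices.tolist()
    print(f"singular direction {i}: {[tokenizer.decode([t]) for t in top]}")
```

One small caveat worth remembering: singular vectors are only defined up to sign, so it can be worth inspecting the top tokens for both a direction and its negation.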