Has anyone developed a metric for quantifying the level of linearity versus nonlinearity of a model’s representations? A metric like that would let us compare the levels of linearity for models of different sizes, which would help us extrapolate whether interpretability and alignment techniques that rely on approximate linearity will scale to larger models.
A simple way to quantify this: first define a “feature” as some decision boundary over the data domain, then train a linear classifier (a linear probe) to predict that decision boundary from the network’s activations on the same data. Quantify the “linearity” of the feature in the network as the held-out accuracy that the probe achieves.
For example, take text labeled with positive or negative sentiment, pass it through some pretrained LLM (e.g. BERT) whose “feature-linearity” you’re trying to measure, and try to predict the sentiment labels from BERT’s activation vectors with logistic regression. The accuracy of this linear probe tells you how linear the “sentiment” feature is in your LLM.
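Concretely, here is a minimal sketch of that probing setup. The specific choices (the bert-base-uncased checkpoint, SST-2 as the labeled sentiment data, mean-pooled final-layer activations, a logistic-regression probe) are illustrative assumptions, not part of the proposal itself.

```python
# Linear-probe "linearity" score for the sentiment feature in BERT.
import numpy as np
import torch
from datasets import load_dataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# Labeled sentiment data: the labels define the "feature" (decision boundary).
data = load_dataset("glue", "sst2", split="train[:2000]")
texts, labels = data["sentence"], np.array(data["label"])

# Activation vectors: mean-pooled final-layer hidden states, ignoring padding.
feats = []
with torch.no_grad():
    for i in range(0, len(texts), 32):
        batch = tokenizer(texts[i:i + 32], padding=True, truncation=True,
                          return_tensors="pt")
        hidden = model(**batch).last_hidden_state     # (B, T, d)
        mask = batch["attention_mask"].unsqueeze(-1)  # (B, T, 1)
        feats.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
X = np.concatenate(feats)

# Linearity of the feature = held-out accuracy of the linear probe.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"linear-probe accuracy (sentiment): {probe.score(X_te, y_te):.3f}")
```

Running the same script on activations from different layers, or from checkpoints of different sizes, would give exactly the cross-model comparison asked about above.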
I don’t know, but would love to find out.
I asked on Discord and someone told me this:
IMO the most useful version of this would be to get empirical evidence on techniques. E.g. erasing certain concepts using LEACE and seeing whether that inhibits the model’s use of those concepts, including during further training. Otherwise it seems hard to ensure that there is no gap between your definitions and reality.
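For what it’s worth, the linear half of that test is straightforward to run on the probing setup sketched above. Here is a minimal sketch, assuming EleutherAI’s concept-erasure package (which implements LEACE, least-squares concept erasure; Belrose et al. 2023) and reusing the activations X and labels from the sentiment example:

```python
# Fit a LEACE eraser on the sentiment activations, then check that a fresh
# linear probe can no longer recover the concept.
import torch
from concept_erasure import LeaceEraser  # pip install concept-erasure
from sklearn.linear_model import LogisticRegression

X_t = torch.as_tensor(X, dtype=torch.float32)  # (n, d) activation vectors
z_t = torch.as_tensor(labels)                  # (n,) concept labels (0/1)

# LEACE computes the smallest linear edit to X that removes all linearly
# available information about z.
eraser = LeaceEraser.fit(X_t, z_t)
X_erased = eraser(X_t).numpy()

# Even in-sample, a probe on the erased activations should be near chance.
probe = LogisticRegression(max_iter=1000).fit(X_erased, labels)
print(f"linear-probe accuracy after LEACE: {probe.score(X_erased, labels):.3f}")
```

The stronger test the quote points at would go further: patch the eraser into the model’s forward pass, continue training, and see whether the model can still use (or relearn) the erased concept.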