Has anyone developed a metric for quantifying the level of linearity versus nonlinearity of a model’s representations? A metric like that would let us compare the levels of linearity for models of different sizes, which would help us extrapolate whether interpretability and alignment techniques that rely on approximate linearity will scale to larger models.
A simple way to quantify this: first define a “feature” as some decision boundary over the data domain, then train a linear classifier (a linear probe) to predict that decision boundary from the network’s activations on the same data. Quantify the “linearity” of the feature in the network as the held-out accuracy that the probe achieves.
For example, take text labeled with positive or negative sentiment, pass it through some pretrained LLM (e.g. BERT) whose “feature-linearity” you’re trying to measure, and try to predict the sentiment labels from BERT’s activation vectors with logistic regression. The accuracy of this linear probe tells you how linear the “sentiment” feature is in your LLM.
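Concretely, here is a minimal sketch of that probing setup. The specific choices (the bert-base-uncased checkpoint, SST-2 as the labeled sentiment data, mean-pooled final-layer activations, a logistic-regression probe) are illustrative assumptions, not part of the proposal itself.

```python
# Linear-probe "linearity" score for the sentiment feature in BERT.
import numpy as np
import torch
from datasets import load_dataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# Labeled sentiment data: the labels define the "feature" (decision boundary).
data = load_dataset("glue", "sst2", split="train[:2000]")
texts, labels = data["sentence"], np.array(data["label"])

# Activation vectors: mean-pooled final-layer hidden states, ignoring padding.
feats = []
with torch.no_grad():
    for i in range(0, len(texts), 32):
        batch = tokenizer(texts[i:i + 32], padding=True, truncation=True,
                          return_tensors="pt")
        hidden = model(**batch).last_hidden_state     # (B, T, d)
        mask = batch["attention_mask"].unsqueeze(-1)  # (B, T, 1)
        feats.append(((hidden * mask).sum(1) / mask.sum(1)).numpy())
X = np.concatenate(feats)

# Linearity of the feature = held-out accuracy of the linear probe.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"linear-probe accuracy (sentiment): {probe.score(X_te, y_te):.3f}")
```

Running the same script on activations from different layers, or from checkpoints of different sizes, would give exactly the cross-model comparison asked about above.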
I don’t know, but would love to find out.
I asked on Discord and someone told me this:
IMO the most useful version of this would be to get empirical evidence on techniques. E.g. erasing certain concepts using LEACE and seeing whether that inhibits the model’s use of those concepts, including during further training. Otherwise it seems hard to ensure that there is no gap between your definitions and reality.
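For what it’s worth, the linear half of that test is straightforward to run on the probing setup sketched above. Here is a minimal sketch, assuming EleutherAI’s concept-erasure package (which implements LEACE, least-squares concept erasure; Belrose et al. 2023) and reusing the activations X and labels from the sentiment example:

```python
# Fit a LEACE eraser on the sentiment activations, then check that a fresh
# linear probe can no longer recover the concept.
import torch
from concept_erasure import LeaceEraser  # pip install concept-erasure
from sklearn.linear_model import LogisticRegression

X_t = torch.as_tensor(X, dtype=torch.float32)  # (n, d) activation vectors
z_t = torch.as_tensor(labels)                  # (n,) concept labels (0/1)

# LEACE computes the smallest linear edit to X that removes all linearly
# available information about z.
eraser = LeaceEraser.fit(X_t, z_t)
X_erased = eraser(X_t).numpy()

# Even in-sample, a probe on the erased activations should be near chance.
probe = LogisticRegression(max_iter=1000).fit(X_erased, labels)
print(f"linear-probe accuracy after LEACE: {probe.score(X_erased, labels):.3f}")
```

The stronger test the quote points at would go further: patch the eraser into the model’s forward pass, continue training, and see whether the model can still use (or relearn) the erased concept.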