Interesting! This is very cool work, but I’d like to understand your metrics better.
- “So we take the difference in loss for features (ie for a feature, we take linear loss - MLP loss)”. What do you mean here? Is this the difference between the mean MSE loss when the feature is on vs. not on?
- Can you please report the L0s for each of the autoencoders and the linear model, as well as the next-token prediction loss when using the autoencoder/linear model? These are important metrics on which my general excitement hinges (e.g. if those are both great, I’m way more interested in results about specific features). A rough sketch of the measurement I have in mind is below the list.
- I’d be very interested if you could take a specific input, look at the features present, and compare them between the autoencoder and the linear model (second sketch below). This would be especially cool if you pick an example where ablating out the MLP causes an incorrect prediction, so we know it’s representing something important.
- Are you using a holdout dataset of eval tokens when measuring losses? Or how many tokens are you using to measure losses?
- Have you plotted per-token MSE loss vs. L0 for each model (third sketch below)? Do they look similar? Are there any outliers in that relationship?
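
To be concrete about the L0 / next-token loss request, here is a minimal sketch of the numbers I have in mind. Everything in it is a toy stand-in: the random `feature_acts` tensor replaces your encoder’s activations over a held-out batch, and the commented-out splicing code assumes a TransformerLens-style `HookedTransformer` with made-up names (`mlp_out_hook_name`, `autoencoder`, `tokens`); adjust to whatever you’re actually running.

```python
import torch

# Toy stand-in: in practice `feature_acts` would be the encoder's activations
# (autoencoder or linear model) over a batch of held-out tokens.
n_tokens, n_features = 4096, 2048
feature_acts = torch.relu(torch.randn(n_tokens, n_features) - 2.0)

# L0: mean number of active features per token.
l0 = (feature_acts > 0).float().sum(dim=-1).mean()
print(f"mean L0 per token: {l0.item():.1f}")

# Next-token prediction loss when using the reconstruction (assumption: a
# TransformerLens-style model; the names below are hypothetical). Run the model
# once normally and once with the MLP output replaced by the reconstruction,
# then compare the two cross-entropy losses.
# clean_loss   = model(tokens, return_type="loss")
# spliced_loss = model.run_with_hooks(
#     tokens,
#     return_type="loss",
#     fwd_hooks=[(mlp_out_hook_name,
#                 lambda act, hook: autoencoder.decode(autoencoder.encode(act)))],
# )
```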
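
For the specific-input comparison, something like this is what I’m imagining. The encoder weights and ReLU encoders below are hypothetical placeholders for however the autoencoder and the linear model actually map an MLP activation to feature activations:

```python
import torch

# Hypothetical encoders and a toy activation vector for a single token position.
d_mlp, n_features = 512, 2048
mlp_act = torch.randn(d_mlp)
W_enc_sae = torch.randn(n_features, d_mlp)  # stand-in for the autoencoder encoder
W_enc_lin = torch.randn(n_features, d_mlp)  # stand-in for the linear model

sae_feats = torch.relu(W_enc_sae @ mlp_act)
lin_feats = torch.relu(W_enc_lin @ mlp_act)

# Print the top active features for each model so they can be compared by hand,
# e.g. via each feature's max-activating dataset examples.
for name, feats in [("autoencoder", sae_feats), ("linear model", lin_feats)]:
    vals, idxs = torch.topk(feats, k=10)
    print(name, list(zip(idxs.tolist(), [round(v, 2) for v in vals.tolist()])))
```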
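
And for the per-token MSE vs. L0 question, the plot I’m picturing is just a scatter with one point per token, roughly like this (again with toy tensors standing in for the real activations, reconstructions, and feature activations):

```python
import torch
import matplotlib.pyplot as plt

# Toy stand-ins for per-token MLP activations, reconstructions, and feature activations.
n_tokens, d_mlp, n_features = 4096, 512, 2048
acts = torch.randn(n_tokens, d_mlp)
recon = acts + 0.1 * torch.randn_like(acts)
feature_acts = torch.relu(torch.randn(n_tokens, n_features) - 2.0)

per_token_mse = ((recon - acts) ** 2).mean(dim=-1)
per_token_l0 = (feature_acts > 0).float().sum(dim=-1)

plt.scatter(per_token_l0, per_token_mse, s=2, alpha=0.3)
plt.xlabel("L0 (active features per token)")
plt.ylabel("per-token MSE")
plt.title("Reconstruction error vs. sparsity, one point per token")
plt.show()
```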