Interesting! This is very cool work, but I’d like to understand your metrics better.
- “So we take the difference in loss for features (ie for a feature, we take linear loss - MLP loss)”. What do you mean here? Is this the difference between the mean MSE loss when the feature is on vs. not on?
- Can you please report the L0s for each of the autoencoders and the linear model, as well as the next-token prediction loss when using the autoencoder/linear model? These are important metrics on which my general excitement hinges (e.g. if those are both great, I’m way more interested in results about specific features). A rough sketch of the measurement I have in mind is below the list.
- I’d be very interested if you could take a specific input, look at the features present, and compare them between the autoencoder and the linear model (second sketch below). This would be especially cool if you pick an example where ablating out the MLP causes an incorrect prediction, so we know it’s representing something important.
- Are you using a holdout dataset of eval tokens when measuring losses? Or how many tokens are you using to measure losses?
- Have you plotted per-token MSE loss vs. L0 for each model (third sketch below)? Do they look similar? Are there any outliers in that relationship?
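
To be concrete about the L0 / next-token loss request, here is a minimal sketch of the numbers I have in mind. Everything in it is a toy stand-in: the random `feature_acts` tensor replaces your encoder’s activations over a held-out batch, and the commented-out splicing code assumes a TransformerLens-style `HookedTransformer` with made-up names (`mlp_out_hook_name`, `autoencoder`, `tokens`); adjust to whatever you’re actually running.

```python
import torch

# Toy stand-in: in practice `feature_acts` would be the encoder's activations
# (autoencoder or linear model) over a batch of held-out tokens.
n_tokens, n_features = 4096, 2048
feature_acts = torch.relu(torch.randn(n_tokens, n_features) - 2.0)

# L0: mean number of active features per token.
l0 = (feature_acts > 0).float().sum(dim=-1).mean()
print(f"mean L0 per token: {l0.item():.1f}")

# Next-token prediction loss when using the reconstruction (assumption: a
# TransformerLens-style model; the names below are hypothetical). Run the model
# once normally and once with the MLP output replaced by the reconstruction,
# then compare the two cross-entropy losses.
# clean_loss   = model(tokens, return_type="loss")
# spliced_loss = model.run_with_hooks(
#     tokens,
#     return_type="loss",
#     fwd_hooks=[(mlp_out_hook_name,
#                 lambda act, hook: autoencoder.decode(autoencoder.encode(act)))],
# )
```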
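
For the specific-input comparison, something like this is what I’m imagining. The encoder weights and ReLU encoders below are hypothetical placeholders for however the autoencoder and the linear model actually map an MLP activation to feature activations:

```python
import torch

# Hypothetical encoders and a toy activation vector for a single token position.
d_mlp, n_features = 512, 2048
mlp_act = torch.randn(d_mlp)
W_enc_sae = torch.randn(n_features, d_mlp)  # stand-in for the autoencoder encoder
W_enc_lin = torch.randn(n_features, d_mlp)  # stand-in for the linear model

sae_feats = torch.relu(W_enc_sae @ mlp_act)
lin_feats = torch.relu(W_enc_lin @ mlp_act)

# Print the top active features for each model so they can be compared by hand,
# e.g. via each feature's max-activating dataset examples.
for name, feats in [("autoencoder", sae_feats), ("linear model", lin_feats)]:
    vals, idxs = torch.topk(feats, k=10)
    print(name, list(zip(idxs.tolist(), [round(v, 2) for v in vals.tolist()])))
```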
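
And for the per-token MSE vs. L0 question, the plot I’m picturing is just a scatter with one point per token, roughly like this (again with toy tensors standing in for the real activations, reconstructions, and feature activations):

```python
import torch
import matplotlib.pyplot as plt

# Toy stand-ins for per-token MLP activations, reconstructions, and feature activations.
n_tokens, d_mlp, n_features = 4096, 512, 2048
acts = torch.randn(n_tokens, d_mlp)
recon = acts + 0.1 * torch.randn_like(acts)
feature_acts = torch.relu(torch.randn(n_tokens, n_features) - 2.0)

per_token_mse = ((recon - acts) ** 2).mean(dim=-1)
per_token_l0 = (feature_acts > 0).float().sum(dim=-1)

plt.scatter(per_token_l0, per_token_mse, s=2, alpha=0.3)
plt.xlabel("L0 (active features per token)")
plt.ylabel("per-token MSE")
plt.title("Reconstruction error vs. sparsity, one point per token")
plt.show()
```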