How to Better Report Sparse Autoencoder Performance
TL;DR
When presenting data from SAEs, try plotting 1/L0 against 1 − Loss Recovered and fitting a Hill curve.
Long
Sparse autoencoders are hot, and people are experimenting. The typical graph for SAE experimentation looks something like this. I’m using borrowed data here to better illustrate my point, but I have also noticed the same pattern in my own data:
This shows quantitative performance adequately in this case. However, it gets a bit messy when there are 5-6 curves very close to each other (e.g. in an ablation study), and it doesn’t give an easily-interpreted (heh) value to quantify Pareto improvements.
I’ve found it much more helpful to plot Sparsity = 1/L0 on the x-axis and “performance hit” on the y-axis, i.e. 1 − Loss Recovered, i.e. (L_SAE − L_base) / (L_mean − L_base), where “mean” is the loss under mean-ablation and “base” is the base model loss.
I think some people instead calculate the loss when all features are set to zero, rather than strictly the mean-ablation loss, but these are conceptually extremely similar.
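As a concrete sketch, the performance-hit calculation from the three loss values might look like the following; the loss numbers here are made up purely for illustration:

```python
# Hypothetical loss values in nats/token; these numbers are illustrative only.
loss_base = 3.30  # clean base-model loss
loss_mean = 5.10  # loss with the activation replaced by its mean (mean-ablation)
loss_sae = 3.55   # loss with the SAE reconstruction spliced into the model

# Loss recovered, and its complement, the "performance hit".
loss_recovered = (loss_mean - loss_sae) / (loss_mean - loss_base)
performance_hit = (loss_sae - loss_base) / (loss_mean - loss_base)

# The two always sum to 1 by construction.
assert abs(loss_recovered + performance_hit - 1.0) < 1e-12
print(f"loss recovered:  {loss_recovered:.3f}")
print(f"performance hit: {performance_hit:.3f}")
```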
If we re-plot the data from above we get this:
This lets us say something like “In this case, gated SAEs outperform baseline SAEs by a factor of around 2.7 as measured by Performance Hit/Sparsity”.
One might want to use a dimensionless sparsity measure, relative to the dimension of the stream from the base model that we are encoding. I don’t know whether this would actually enable comparisons between wildly different model sizes.
Of course, as Sparsity → ∞, Performance Hit won’t approach infinity; instead we would expect it to approach 1 (and if you run an autoencoder with λ_L1 = 100 you will in fact see this). This could be modelled with the following equation:
$$y = \frac{x}{x + x_{1/2}}$$
where $x_{1/2}$ is the value of $x$ at which $y = 1/2$. Near $x = 0$ it looks like $y \approx x / x_{1/2}$, but it flattens off as $x$ gets larger. In biology this is a Hill curve with Hill coefficient 1. It has just one free parameter, so even for small datasets (such as a Pareto frontier of four SAEs) it’s possible to get a valid fit.
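A one-parameter fit like this is straightforward with e.g. scipy. Here is a minimal sketch; the data points are invented (generated from a curve with $x_{1/2} = 0.25$) just to show the fit recovering the parameter:

```python
import numpy as np
from scipy.optimize import curve_fit

def hill(x, x_half):
    """Hill curve with Hill coefficient 1: y = x / (x + x_half)."""
    return x / (x + x_half)

# A tiny invented Pareto frontier: sparsity = 1/L0, performance hit = 1 - loss recovered.
sparsity = np.array([0.01, 0.02, 0.05, 0.10])
perf_hit = hill(sparsity, 0.25)  # generated from x_half = 0.25 for illustration

# Fit the single free parameter x_{1/2} from the four points.
(x_half,), _ = curve_fit(hill, sparsity, perf_hit, p0=[0.1])
print(f"fitted x_1/2 = {x_half:.3f}")
```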
What about MLP/Attention?
In these cases, we get a better fit by letting the Hill coefficient vary:
$$y = \frac{x^k}{x^k + x_{1/2}^k}$$
Attention:
MLP:
These kind of look like a good fit for the Hill equation with variable Hill coefficients, but in some cases they also just look like linear fits with a non-zero intercept. It’s difficult to tell (they also look a bit like regular power fits of the form $y = ax^k$). I’ll plot the first graph with a Hill curve for completeness:
If we consider the relative values of x1/2 and k for gated vs baseline SAEs, we can start to see a pattern:
| | Baseline $x_{1/2}$ | Baseline $k$ | Gated $x_{1/2}$ | Gated $k$ | $x_{1/2}$ ratio | $k$ ratio |
|---|---|---|---|---|---|---|
| Residual | 1.04 | 1.01 | 6.28 | 0.816 | 6.0 | 1.2 |
| MLP | 1.16 | 0.461 | 7.29 | 0.383 | 6.3 | 1.2 |
| Attention | 0.256 | 0.826 | 1.28 | 0.6 | 5.0 | 1.4 |
So in this case we might want to say “Gated SAEs increase x1/2 by a factor of around 5-6 and decrease k by a factor of around 1.3 across the board, as compared to baseline SAEs”.
What about k>1?
In some of my own data, I’ve noticed that it sometimes looks like we have k>1. These are some results I have from some quick and dirty Residual→Residual transcoders.
Some of these look more like quadratics, and trying to interpret them as linear fits with a non-zero intercept seems wrong here!
Conclusions
So we have three options for fitting Performance/Sparsity graphs:
1. $y = mx + c$: This has two fitted parameters except when $c \approx 0$. This is the “simplest” option in the sense that it’s an obvious first choice. It fails to capture our expectation that the plot passes through the origin, and also fails to capture our expectation that the plot levels off at high sparsity.
2. $y = ax^k$: Two fitted parameters except when $k \approx 1$. This is more unusual than the first option. It always passes through the origin but doesn’t level off at high sparsity.
3. $y = \frac{x^k}{x_{1/2}^k + x^k}$: Two fitted parameters except when $k \approx 1$. This both passes through the origin and levels off, but is a slightly weird function.
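To make the comparison concrete, here is a hedged sketch that fits all three candidate forms to an invented frontier (generated from a Hill curve plus noise, with an arbitrary seed) and compares their squared residuals:

```python
import numpy as np
from scipy.optimize import curve_fit

def linear(x, m, c):     # option 1
    return m * x + c

def power(x, a, k):      # option 2
    return a * x**k

def hill(x, x_half, k):  # option 3, generalised Hill equation
    return x**k / (x_half**k + x**k)

# Invented Pareto frontier generated from a Hill curve (x_half=0.4, k=0.6) plus noise.
rng = np.random.default_rng(0)
x = np.array([0.02, 0.05, 0.10, 0.20, 0.50])
y = hill(x, 0.4, 0.6) + rng.normal(0.0, 0.005, x.size)

# Fit each model and record its sum of squared residuals.
results = {}
for name, f, p0 in [("linear", linear, [1.0, 0.0]),
                    ("power", power, [1.0, 1.0]),
                    ("hill", hill, [0.3, 1.0])]:
    params, _ = curve_fit(f, x, y, p0=p0, maxfev=10_000)
    results[name] = np.sum((f(x, *params) - y) ** 2)
    print(f"{name:6s} SSE = {results[name]:.2e}")
```

On data actually generated from a Hill curve, option 3 should beat the linear fit by a wide margin; on real frontiers the gap is what tells you which form to report.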
I plan to use option 3 (the Hill equation) to report my own data where I can, since the added weirdness seems a worthwhile price for the theoretical advantages, especially since I often get very high-sparsity SAEs when scanning various L1 coefficients, which would break an automated fitting system built on the other options.
I also think that a value of $x_{1/2}$ in the Hill equation is slightly easier to interpret than a value of $a$ from option 2, though I admit neither is as easy to interpret as $m, c$ from a linear fit.