Standard sparse autoencoder training uses an L1 sparsity loss term to induce sparsity in the hidden layer. However, theoretical justifications for this choice are lacking (in my opinion), and there may be better ways to induce sparsity. In this post, I explore other methods of inducing sparsity and experiment with them using Robert_AIZI’s methods and code from this research report, where he trained sparse autoencoders on OthelloGPT. I find several methods that produce significantly better results than L1 sparsity loss, including a leaky top-k activation function.
Introduction
This research builds directly onRobert_AIZI’s work from this research report. While I highly recommend reading his full report, I will briefly summarize the parts of it that are directly relevant to my work.
Although sparse autoencoders trained on language models have been shown to find feature directions that are more interpretable than individual neurons (Bricken et al,Cunningham et al), it remains unclear whether or not a given linearly-represented feature will be found by a sparse autoencoder. This is an important consideration for applications of sparse autoencoders to AI safety; if we believe that all relevant safety information is represented linearly, can we expect sparse autoencoders to bring all of it to light?
Motivated by this question, Robert trained sparse autoencoders on a version of OthelloGPT (based on the work of Li et al), a language model trained on Othello game histories to predict legal moves. Previous research (Nanda, Hazineh et al)had found that linear probes trained on the residual stream of OthelloGPT could classify each position on the board as either empty, containing an enemy piece, or containing an allied piece, with high accuracy. Robert reproduced similar results on a version of OthelloGPT that he trained himself, finding linear probes with 0.9 AUROC or greater for the vast majority of (board position, position state) pairs. He then investigated whether or not sparse autoencoders trained on OthelloGPT’s residual stream would find features that classified board positions with levels of accuracy similar to the linear probes. Out of 180 possible (board position, position state) pair classifiers, Robert’s best autoencoder had 33 features that classified distinct (board position, position state) pairs with at least 0.9 AUROC.
Robert’s autoencoders were trained with the standard L1 sparsity loss used in recent research applying sparse autoencoders to interpreting language models (Sharkey et al, Bricken et al, Cunningham et al, and similar toTempleton et al). However, there are theoretical and empirical reasons to believe that this may not be an ideal method for inducing sparsity. From a theoretical perspective, the L0 norm is the definition of sparsity used in sparse dictionary learning (High-Dimensional Data Analysis with Low-Dimensional Models by Wright and Ma, Section 2.2.3). Minimizing the L1 norm has been proven sufficient to recover sparse solutions in much simpler contexts (Wright and Ma, Section 3.2.2), as Sharkey et alandCunningham et al point out to justify their use of the L1 norm. However, I am not aware of any results demonstrating that minimizing the L1 norm is a theoretically sound way to solve the problem (overcomplete sparse dictionary learning) that sparse autoencoders are designed to solve[1]. Using the L1 norm for sparsity loss has been shown to underestimate the true feature activations in toy data (Wright and Sharkey), a phenomenon known as shrinkage. This is no surprise, since minimizing the L1 norm encourages all feature activations to be closer to zero, including feature activations that should be larger. The L1 norm also apparently leads to too many features being learned (Sharkey).
For these reasons, I wanted to experiment with ways of inducing sparsity in the feature activations of sparse autoencoders that seemed to me more aligned with the theoretical definition of sparsity, in the hope of finding methods that perform better than L1 sparsity loss. I chose to run these experiments by making modifications to Robert’s OthelloGPT code, for a couple of reasons. Firstly, Robert’s work provides a clear and reasonable metric by which to measure the performance of sparse autoencoders on a language model quickly and cheaply: the number of good board position classifiers given by the feature activations of the SAE. While I do have some reservations about this metric (for reasons I’ll mention in the conclusion), I think it is a valuable alternative to methods like using a language model to interpret features found by an SAE that are significantly more computationally expensive. Secondly, I know Robert personally and he offered to provide guidance getting this project up and running. Thanks to his help, I was able to start running experiments after only a week of work on the project.
Methods
I trained several variants of the SAE architecture that had different architectural processes for encouraging sparsity. The following aspects of the architecture were held constant. They contained one hidden layer of neurons—which I’m calling the feature layer--, the encoder and decoder weights were left untied, i.e. they were trained as separate parameters of the model, and a bias term was used in both the encoder and decoder. They were trained on the residual stream of the OthelloGPT model after layer 3 using a training set of 100,000 game histories for four epochs, with the Adam optimizer and a learning rate of 0.001.
I came up with five different methods of inducing sparsity, which I detail below. Three of them use a different kind of sparsity loss, in which case the activation function used on the feature layer is ReLU. The other two use custom activation functions designed to output sparse activations instead of including a sparsity loss term. In these cases, I still applied ReLU to the output of the encoder before applying the custom activation function to ensure that the activations would be non-negative.
Following Robert[2], the main measure of SAE performance that I focused on was the number of (board position, position state) pair classifiers given by the SAE feature activations that had an AUROC over 0.9; from now on I will just call these “good classifiers”.
Each method of inducing sparsity has its own set of hyperparameters, and I did not have the resources to perform a massive sweep of the hyperparameter space to be confident that I had found values for the hyperparameters that roughly maximize the number of good classifiers found. Instead, I generally took some educated guesses about what hyperparameter values might work well and first trained a batch of SAEs—typically somewhere between 8 and 16 of them. Since finding the number of good classifiers for a given SAE is computationally expensive, I did not do this for every SAE trained. I first weeded out weaker candidates based on evaluations done during training. Specifically, at this stage, I considered the average sparsity[3] of the feature layer and the unexplained variance of the reconstruction of the OthelloGPT activations, which were evaluated on the test data set at regular intervals during training. The SAEs that seemed promising based on these metrics were evaluated for good classifiers. Based on this data, I trained another batch of SAEs, repeating this search process until I was reasonably satisfied that I had gotten close to a local maximum of good classifiers in the hyperparameter space.
Since this search process was based on intuition, guessing, and a somewhat arbitrary AUROC threshold of 0.9, I view these experiments as a search for methods that deserve further study. I certainly don’t consider any of my results as strong evidence that L1 sparsity loss should be replaced by one of these methods.
Sparsity loss functions
Controls: L1 sparsity loss and no sparsity loss
I first reproduced Robert’s results with L1 sparsity loss to use as a baseline. Using the sparsity loss coefficient of α=0.077 that he found with a hyperparameter sweep, I trained an SAE with L1 sparsity loss, which found 29 good classifiers, similar to Robert’s result of 33[4]. I also wanted to confirm that training with sparsity loss in fact significantly impacted the number of good classifiers found, so I trained another SAE without sparsity loss; it found only one good classifier.
Smoothed-L0 sparsity loss
Since L0 is what we ideally want to minimize to achieve sparsity, why not try a loss function based on the L0 norm instead of the L1 norm? Because the feature layer of the SAE uses ReLU as its activation function, all the feature activations are non-negative. Taking the L0 norm of a vector with no negative entries is equivalent to applying the unit step function—U(x)=0if ,x≤0U(x)=1otherwise—to each entry, and then summing the results. Unfortunately, the unit step function is not differentiable, so would be difficult to use in a loss function. Therefore, I tried various smoothed versions of the unit step function using the sigmoid function sig(x)=11+e−x as a base.
Call the smoothed unit step function we’re looking to define f. Since all the feature activations are positive, we want the transition from 0 to 1 to roughly “start” at x=0. So choose a small ϵ>0 such that we are satisfied if f(0)=ϵ. We will then also consider the transition from 0 to 1 to be complete once the value of f reaches 1−ϵ. So choose a duration δ>0 for the transition, and require that f(δ)=1−ϵ. These requirements are satisfied by defining fϵ,δ(x):=sig(m(x−12δ)) where m=2δln(1ϵ−1). Then the smoothed-L0 sparsity loss of an n-dimensional feature vector x with respect to a choice of ϵ,δ>0 is given by 1n∑ni=1fϵ,δ(xi). In practice, I chose ϵ=0.01 for all SAEs trained, since I expected δ to have a more interesting impact on the results.
Results of smoothed-L0 sparsity loss experiments
Out of the SAEs trained, the best one found 38 good classifiers, and had on average 32.6% of feature activations greater than 1.0 and 7.9% unexplained variance. It was trained with the hyperparameters δ=10, ϵ=0.01, and sparsity loss coefficient α=0.7.
Freshman’s dream sparsity loss
The erroneous (over the real numbers, at least) equation ap+bp=(a+b)p is known as the “freshman’s dream”. In general, for values xi≥0, we have ∑xpi≤(∑xi)p, but we get closer to equality if a small number of the xi‘s are responsible for a majority of the value of ∑xi, with equality achieved if and only if all but one of the xi’s are 0. This sounds a lot like saying that an activation vector will be more sparse the closer the values of ∑xpi and (∑xi)p. So we will define the freshman’s dream sparsity loss of an n-dimensional feature vector x with respect to a choice of the power p as
1n⋅(∑ni=1xi)p∑ni=1xpi.
I focused on p=2, which has the additional notable property that a k-sparse[5] vector with all equal non-zero activations has a loss of kn. So this loss is in some sense linear in the sparsity of the vector, which I thought might be a nice property; I intuitively like the idea of putting similar amounts of effort into reducing the sparsity of a 100-sparse vector and a 50-sparse vector.
Results of freshman’s dream sparsity loss experiments
Out of the SAEs trained, the best one found 21 good classifiers, and had on average 25.7% of feature activations greater than 0 and 7.7% unexplained variance. It was trained with the hyperparameters p=2 and sparsity loss coefficient α=1.2.
Without-top-k sparsity loss
Suppose we have a value of k such that we would be satisfied if all of our feature vectors were k-sparse; we don’t care about trying to make them sparser than that. Given an n-dimensional feature vector x, we can project it into the space of k-sparse vectors by finding the k largest activations and replacing the rest with zeros; let ^x be the resulting k-sparse vector. Then, for an appropriate choice of p, let 1n||x−^x||p be the without top k sparsity loss of x. Intuitively, this measures how close x is to being k-sparse. I did experiments with p=1 and p=2.
Results of without-top-k sparsity loss experiments
Out of the SAEs trained, the best one found 28 good classifiers, and had on average 12.1% of feature activations greater than 0 and 12.3% unexplained variance. It was trained with the hyperparameters p=1 and sparsity loss coefficient α=0.1.
Using activation functions to enforce sparsity
While thinking about different types of sparsity loss, I also considered other possibilities for inducing sparsity that don’t involve training with a sparsity loss term. Instead, we could take the output from the encoder and map it directly to a more sparse version of itself, and use the sparser version as the feature vector. I’m calling these maps “activation functions” because they output the activations of the feature layer, even though they may not be anything like the activation functions you would typically use to add non-linearity to a neural net. In each of my experiments, I did still apply a ReLU to the output of the encoder before applying the given activation function to ensure that all the activations were positive.
Leaky top-k activation
You might have noticed that we just discussed a way to map a vector to a sparser version of itself in the previous section on without-top-k sparsity loss: pick a value of k, and given a vector x, map it to ^x, the vector with the same largest k entries as x and zeros elsewhere. We will use a generalization of this map for our first activation function.
Pick a value for k and a small ϵ≥0. Then define the activation function Tk,ϵ in the following way. Given a vector x, let b be the value of the kth-largest entry in x. Then define the vector Tk,ϵ(x) by
Tk,ϵ(x)i={xi,if xi≥bϵbxi,otherwise.
This way, every activation other than the largest k activations will be at most ϵ, making the resulting vector within ϵ of being k-sparse (excepting the rare case where there are multiple entries of x with a value of b). I wanted to allow for ϵ>0 so that some information about other entries could be kept to help with the reconstruction.
Note: This method of inducing sparsity (but restricting to ϵ=0) was previously used in autoencoders by Makhzani and Frey and applied to interpreting language models by Gao et al.
I also tried using a version of this activation function where entries smaller than b are multiplied by ϵ instead of ϵb, allowing the reduced activations to be small relative to b. I additionally tried defining smoothed versions of these functions, wondering if that might help the model learn to deal with the activation functions better. However, these variations turned out to yield worse initial results, and I did not explore them further.
Results of leaky top-k activation experiments
Out of the SAEs trained, the best one found 45 good classifiers, and had on average 13.2% unexplained variance. It was trained with the hyperparameters k=70 and ϵ=0.5.
Out of those trained with ϵ=0 as in Makhzani and Frey and Gao et al, the best one found 34 good classifiers. It was trained with k=85 and also had on average 13.2% unexplained variance. Notably, all of the SAEs trained with ϵ>0 had no dead neurons, whereas all of the SAEs trained with ϵ=0 had a significant number of dead neurons; the one trained with k=85 had 15.1% dead neurons.
Dimension reduction activation
For a leaky top-k activation function, we choose a single value of k to use for all inputs. This didn’t fully sit right with me: what if it makes more sense to use a smaller value of k for some inputs, and a larger value for others? For example, it seems like a reasonable choice to map the vector (60,50,40,30,9,8) to a nearly 4-sparse vector, but the vector (60,50,15,14,14,12) seems like it corresponds more naturally to a nearly 2-sparse vector. So I wanted to try out a function that does something very similar to the leaky top-k activation function, but chooses an appropriate k based on the vector input.
As in the definition of a leaky top-k activation function, choose a bound ϵ≥0. Define a dimension reduction activation function Dϵ in the following way. Given an n-dimensional vector x with non-negative entries, let b=min{xi:i∈Sx} and define Dϵ(x) by
Dϵ(x)i={xi,if i∈Sxϵbxi,otherwise,
where Sx⊆[n]:={1,2,…,n} is chosen in the following way. We will start with S0=[n]. Remove any i∈S0 with xi<b1, for some bound b1 depending on x, resulting in a smaller set S1. Continuing in this way, recursivly define Sj={i∈Sj−1:xi≥bj}, where bj is some bound depending on x and Sj−1 that we have yet to define. Eventually, we will reach a value of j where Sj=Sj+1, at which point we will define Sx:=Sj. The bj‘s will be chosen in a way that distinguishes between relatively large and small entries of x and that is invariant under scaling x. The details and motivation behind how I chose appropriate bj’s is long and convoluted, so I will leave them in an appendix.
Results of dimension reduction activation experiments
Out of the SAEs trained, the best one found 43 good classifiers, and had on average 3.8% of feature activations greater than 0.01 and 18.5% unexplained variance. It was trained with the hyperparameters ϵ=0.01, and with the sequence of ak’s chosen as described in the appendix. I think it’s particularly notable that the best SAEs using this method had much lower sparsity and higher unexplained variance than the best SAEs trained with other methods. I don’t know why that is!
Conclusion
Method
Number of good classifiers
L1 sparsity loss
29
Smoothed-L0 sparsity loss
38
Freshman’s dream sparsity loss
21
Without-top-k sparsity loss
28
Leaky top-k activation
45
Dimension reduction activation
43
Of the methods I tried, without-top-k sparsity loss performed on par with L1 sparsity loss, and smoothed-L0 sparsity loss, leaky top-k activation, and dimension reduction activation both performed significantly better. Leaky top-k activation takes the cake with 45 good classifiers, compared to L1 sparsity loss’s 29. I think that all four of these methods are worthy of further study, both in test beds like OthelloGPT, and in real LLMs.
I’m interested in continuing to pursue this research by experimenting with more ways of tweaking the architecture and training of SAEs. I’m particularly interested to try more activation functions similar to leaky top-k activation, and to see if adding more layers to the encoder and/or decoder could improve results, especially when using custom activation functions that may make the reconstruction more challenging for the linear encoder/decoder architecture. As Robert mentioned in his report, I also think it’s important to try these experiments on the fully-trained OthelloGPT created by Li et al, which is both better at predicting legal moves and has much more accurate linear probes than Robert’s version.
Finally, I think it would be useful to get a better understanding of the features found by these SAEs that don’t classify board positions well. Are some of these features providing interpretable information about the board state, but just in different ways? It may be that OthelloGPT “really” represents the board state in a way that doesn’t fully factor into individual board positions, in spite of the fact that linear probes can find good classifiers for individual board positions. For example, Robert noticed that a feature from one of his autoencoders seemed to keep track of lines of adjacent positions that all contain a white piece. I hypothesize that, because the rules of Othello significantly restrict what board states are possible/likely (for example, there tend to be lots of lines of pieces with similar colors), we should not expect SAEs to be able to find as many good classifiers for board positions as if we were working with a game like chess, where more board states are possible. ChessGPT anyone?
Appendix: Details for finding Sx
The algorithm described above for finding Sx was designed as a fast way to compute a different, geometrically-motivated definition of Sx. Here I’ll describe that original definition and motivation, and define the bj’s that allow the sped-up algorithm to find the same Sx given by the original definition.
Given a non-empty S⊆[n], let k be the number of indices in S. If we let zi represent the ith coordinate in Rn, then the equation ∑i∈Szi=ak describes an (n−1)-dimensional hyperplane; call it PS. Note that PS is perpendicular to the vector yS that has a 1 for every entry whose index is in S, and all other entries 0, and that yS points from the origin directly towards PS. In some sense, we will choose Sx such that ySx is pointing in a similar direction as x, and we will use the sequence of ak’s to favor choices of Sx with fewer elements (resulting in a sparser output).
The set of hyperplanes {PS:S⊆[n],S≠∅} cut Rn into a number of components; let Obe the one containing the origin. Consider the ray r in the direction of x. Since x has non-negative entries, r intersects the boundary of O at one (or possibly more) of our hyperplanes. Generically, we should expect there to be a unique hyperplane PSx that contains the intersection of r and the boundary of O, which then uniquely defines Sx. If r happens to intersect the boundary of O where two or more of our hyperplanes meet, let Px be the set of all such hyperplanes, and define Sx=⋃PS∈PxS.
How does this favor choices of Sx with fewer elements? Well, it doesn’t always, not for every non-decreasing sequence of ak’s. But we can design a sequence that does. For a given k∈[n], if ak+1−ak is bigger, then Sx is more likely to have k, rather than k+1, elements. So choosing a sequence where ak+1−ak is bigger for smaller k favors choices of Sx with fewer elements, encouraging sparsity.
Besides that, I added a few constraints to the sequence to guarantee some properties that I thought would be beneficial. First, I wanted to ensure that every non-empty S⊆[n] actually does correspond to some input, i.e., there is some x with Sx=S. This property holds if ak<kk−1ak−1 for all k. Second, in order to speed up the computation of Sx, I wanted to use an algorithm that started with S=[n] and then repeatedly removed dimensions from S until only the dimensions in Sx were left. To ensure that the order in which dimensions were removed in this algorithm did not matter, I needed to add the requirement that ak−ak−1≤ak−1−ak−2 for all k. Adding this constraint guarantees that Sx will be given by the algorithm described in the main section of the report if we let
bj=(1−ak−1ak)∑i∈Sj−1xi,
where k=|Sj−1|.
Then to choose a sequence of ak’s, I arbitrarily started with a1=1. At the kth step, having chosen a1,...,ak−1, we want to choose a value of ak between ak−1 and kk−1ak−1. Initially, I tried choosing a proportion p and letting ak=(1−p)ak−1+pkk−1ak−1; note that for such a sequence, ak+1−ak will increase as k decreases, as desired to induce sparsity. With this method I found that, if the feature vector gets too sparse at some point during training, the training will get into a positive feedback loop and just keep making it sparser and sparser, until it’s way too sparse and isn’t reconstructing well at all (why this could happen is still a complete mystery to me). Similarly, if the feature vector isn’t sparse enough, it keeps getting less and less sparse. It was hard to find a value of p that didn’t fall into one of these two traps, though I did find some where the training just happened to end with reasonable sparsity and unexplained variance. The best SAE trained in this way found 29 good classifiers, and had on average 6.9% feature activations greater than 0.01 and 17.3% unexplained variance. It was trained with hyperparameters p=0.49 and ϵ=0.01. If I had continued training, though, I’m confident the feedback loop would have continued until the reconstruction was ruined.
To find a better sequence of ak‘s, I tried varying the proportion p with respect to k, i.e. coming up with a sequence of proportions (pk)k∈[n] and letting ak=(1−pk)ak−1+pkkk−1ak−1. Specifically, since low values of p resulted in not enough sparsity and high values resulted in too much, I thought to try choosing a low value of p to use for small k, but to start increasing p once k passes some threshold. How fast p could be increased was bounded above by the ak−ak−1≤ak−1−ak−2 restriction, and I found that increasing p as fast as this restriction allowed yielded reasonable results. The best SAE was trained using pk=0.15 for k≤10, with the pk’s increasing as fast as possible past k=10.
Please do not take this as an assertion that no such results exist! I know very little about the field of sparse coding. I am only trying to say that, after reading some of the recent research using sparse autoencoders to interpret language models, I am unsatisfied with the theoretical justifications given for using the L1 norm, and a brief search through the relevant resources that those papers cited did not turn up any other justifications that I find satisfying.
In fact, Robert measured the number of features that are good classifiers, which is slightly different, since he counted two different features separately even if they were both good classifiers for the same (board position, position state) pair.
In most cases, I measured sparsity as the percentage of activations greater than some relatively small bound ϵ≥0. Different values of ϵ seemed more or less informative depending on the method used to induce sparsity. I’ll note the specific measure of sparsity used for each method.
As mentioned in a previous footnote, this number is slightly inflated compared to mine as a result of duplicated features. If I had also included duplicate features, I would have found 35, compared to his 33.
Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.
Abstract
Standard sparse autoencoder training uses an L1 sparsity loss term to induce sparsity in the hidden layer. However, theoretical justifications for this choice are lacking (in my opinion), and there may be better ways to induce sparsity. In this post, I explore other methods of inducing sparsity and experiment with them using Robert_AIZI’s methods and code from this research report, where he trained sparse autoencoders on OthelloGPT. I find several methods that produce significantly better results than L1 sparsity loss, including a leaky top-k activation function.
Introduction
This research builds directly on Robert_AIZI’s work from this research report. While I highly recommend reading his full report, I will briefly summarize the parts of it that are directly relevant to my work.
Although sparse autoencoders trained on language models have been shown to find feature directions that are more interpretable than individual neurons (Bricken et al, Cunningham et al), it remains unclear whether or not a given linearly-represented feature will be found by a sparse autoencoder. This is an important consideration for applications of sparse autoencoders to AI safety; if we believe that all relevant safety information is represented linearly, can we expect sparse autoencoders to bring all of it to light?
Motivated by this question, Robert trained sparse autoencoders on a version of OthelloGPT (based on the work of Li et al), a language model trained on Othello game histories to predict legal moves. Previous research (Nanda, Hazineh et al) had found that linear probes trained on the residual stream of OthelloGPT could classify each position on the board as either empty, containing an enemy piece, or containing an allied piece, with high accuracy. Robert reproduced similar results on a version of OthelloGPT that he trained himself, finding linear probes with 0.9 AUROC or greater for the vast majority of (board position, position state) pairs. He then investigated whether or not sparse autoencoders trained on OthelloGPT’s residual stream would find features that classified board positions with levels of accuracy similar to the linear probes. Out of 180 possible (board position, position state) pair classifiers, Robert’s best autoencoder had 33 features that classified distinct (board position, position state) pairs with at least 0.9 AUROC.
Robert’s autoencoders were trained with the standard L1 sparsity loss used in recent research applying sparse autoencoders to interpreting language models (Sharkey et al, Bricken et al, Cunningham et al, and similar to Templeton et al). However, there are theoretical and empirical reasons to believe that this may not be an ideal method for inducing sparsity. From a theoretical perspective, the L0 norm is the definition of sparsity used in sparse dictionary learning (High-Dimensional Data Analysis with Low-Dimensional Models by Wright and Ma, Section 2.2.3). Minimizing the L1 norm has been proven sufficient to recover sparse solutions in much simpler contexts (Wright and Ma, Section 3.2.2), as Sharkey et al and Cunningham et al point out to justify their use of the L1 norm. However, I am not aware of any results demonstrating that minimizing the L1 norm is a theoretically sound way to solve the problem (overcomplete sparse dictionary learning) that sparse autoencoders are designed to solve[1]. Using the L1 norm for sparsity loss has been shown to underestimate the true feature activations in toy data (Wright and Sharkey), a phenomenon known as shrinkage. This is no surprise, since minimizing the L1 norm encourages all feature activations to be closer to zero, including feature activations that should be larger. The L1 norm also apparently leads to too many features being learned (Sharkey).
For these reasons, I wanted to experiment with ways of inducing sparsity in the feature activations of sparse autoencoders that seemed to me more aligned with the theoretical definition of sparsity, in the hope of finding methods that perform better than L1 sparsity loss. I chose to run these experiments by making modifications to Robert’s OthelloGPT code, for a couple of reasons. Firstly, Robert’s work provides a clear and reasonable metric by which to measure the performance of sparse autoencoders on a language model quickly and cheaply: the number of good board position classifiers given by the feature activations of the SAE. While I do have some reservations about this metric (for reasons I’ll mention in the conclusion), I think it is a valuable alternative to methods like using a language model to interpret features found by an SAE that are significantly more computationally expensive. Secondly, I know Robert personally and he offered to provide guidance getting this project up and running. Thanks to his help, I was able to start running experiments after only a week of work on the project.
Methods
I trained several variants of the SAE architecture that had different architectural processes for encouraging sparsity. The following aspects of the architecture were held constant. They contained one hidden layer of neurons—which I’m calling the feature layer--, the encoder and decoder weights were left untied, i.e. they were trained as separate parameters of the model, and a bias term was used in both the encoder and decoder. They were trained on the residual stream of the OthelloGPT model after layer 3 using a training set of 100,000 game histories for four epochs, with the Adam optimizer and a learning rate of 0.001.
I came up with five different methods of inducing sparsity, which I detail below. Three of them use a different kind of sparsity loss, in which case the activation function used on the feature layer is ReLU. The other two use custom activation functions designed to output sparse activations instead of including a sparsity loss term. In these cases, I still applied ReLU to the output of the encoder before applying the custom activation function to ensure that the activations would be non-negative.
Following Robert[2], the main measure of SAE performance that I focused on was the number of (board position, position state) pair classifiers given by the SAE feature activations that had an AUROC over 0.9; from now on I will just call these “good classifiers”.
Each method of inducing sparsity has its own set of hyperparameters, and I did not have the resources to perform a massive sweep of the hyperparameter space to be confident that I had found values for the hyperparameters that roughly maximize the number of good classifiers found. Instead, I generally took some educated guesses about what hyperparameter values might work well and first trained a batch of SAEs—typically somewhere between 8 and 16 of them. Since finding the number of good classifiers for a given SAE is computationally expensive, I did not do this for every SAE trained. I first weeded out weaker candidates based on evaluations done during training. Specifically, at this stage, I considered the average sparsity[3] of the feature layer and the unexplained variance of the reconstruction of the OthelloGPT activations, which were evaluated on the test data set at regular intervals during training. The SAEs that seemed promising based on these metrics were evaluated for good classifiers. Based on this data, I trained another batch of SAEs, repeating this search process until I was reasonably satisfied that I had gotten close to a local maximum of good classifiers in the hyperparameter space.
Since this search process was based on intuition, guessing, and a somewhat arbitrary AUROC threshold of 0.9, I view these experiments as a search for methods that deserve further study. I certainly don’t consider any of my results as strong evidence that L1 sparsity loss should be replaced by one of these methods.
Sparsity loss functions
Controls: L1 sparsity loss and no sparsity loss
I first reproduced Robert’s results with L1 sparsity loss to use as a baseline. Using the sparsity loss coefficient of α=0.077 that he found with a hyperparameter sweep, I trained an SAE with L1 sparsity loss, which found 29 good classifiers, similar to Robert’s result of 33[4]. I also wanted to confirm that training with sparsity loss in fact significantly impacted the number of good classifiers found, so I trained another SAE without sparsity loss; it found only one good classifier.
Smoothed-L0 sparsity loss
Since L0 is what we ideally want to minimize to achieve sparsity, why not try a loss function based on the L0 norm instead of the L1 norm? Because the feature layer of the SAE uses ReLU as its activation function, all the feature activations are non-negative. Taking the L0 norm of a vector with no negative entries is equivalent to applying the unit step function—U(x)=0if ,x≤0 U(x)=1otherwise—to each entry, and then summing the results. Unfortunately, the unit step function is not differentiable, so would be difficult to use in a loss function. Therefore, I tried various smoothed versions of the unit step function using the sigmoid function sig(x)=11+e−x as a base.
Call the smoothed unit step function we’re looking to define f. Since all the feature activations are positive, we want the transition from 0 to 1 to roughly “start” at x=0. So choose a small ϵ>0 such that we are satisfied if f(0)=ϵ. We will then also consider the transition from 0 to 1 to be complete once the value of f reaches 1−ϵ. So choose a duration δ>0 for the transition, and require that f(δ)=1−ϵ. These requirements are satisfied by defining fϵ,δ(x):=sig(m(x−12δ)) where m=2δln(1ϵ−1). Then the smoothed-L0 sparsity loss of an n-dimensional feature vector x with respect to a choice of ϵ,δ>0 is given by 1n∑ni=1fϵ,δ(xi). In practice, I chose ϵ=0.01 for all SAEs trained, since I expected δ to have a more interesting impact on the results.
Results of smoothed-L0 sparsity loss experiments
Out of the SAEs trained, the best one found 38 good classifiers, and had on average 32.6% of feature activations greater than 1.0 and 7.9% unexplained variance. It was trained with the hyperparameters δ=10, ϵ=0.01, and sparsity loss coefficient α=0.7.
Freshman’s dream sparsity loss
The erroneous (over the real numbers, at least) equation ap+bp=(a+b)p is known as the “freshman’s dream”. In general, for values xi≥0, we have ∑xpi≤(∑xi)p, but we get closer to equality if a small number of the xi‘s are responsible for a majority of the value of ∑xi, with equality achieved if and only if all but one of the xi’s are 0. This sounds a lot like saying that an activation vector will be more sparse the closer the values of ∑xpi and (∑xi)p. So we will define the freshman’s dream sparsity loss of an n-dimensional feature vector x with respect to a choice of the power p as
1n⋅(∑ni=1xi)p∑ni=1xpi.I focused on p=2, which has the additional notable property that a k-sparse[5] vector with all equal non-zero activations has a loss of kn. So this loss is in some sense linear in the sparsity of the vector, which I thought might be a nice property; I intuitively like the idea of putting similar amounts of effort into reducing the sparsity of a 100-sparse vector and a 50-sparse vector.
Results of freshman’s dream sparsity loss experiments
Out of the SAEs trained, the best one found 21 good classifiers, and had on average 25.7% of feature activations greater than 0 and 7.7% unexplained variance. It was trained with the hyperparameters p=2 and sparsity loss coefficient α=1.2.
Without-top-k sparsity loss
Suppose we have a value of k such that we would be satisfied if all of our feature vectors were k-sparse; we don’t care about trying to make them sparser than that. Given an n-dimensional feature vector x, we can project it into the space of k-sparse vectors by finding the k largest activations and replacing the rest with zeros; let ^x be the resulting k-sparse vector. Then, for an appropriate choice of p, let 1n||x−^x||p be the without top k sparsity loss of x. Intuitively, this measures how close x is to being k-sparse. I did experiments with p=1 and p=2.
Results of without-top-k sparsity loss experiments
Out of the SAEs trained, the best one found 28 good classifiers, and had on average 12.1% of feature activations greater than 0 and 12.3% unexplained variance. It was trained with the hyperparameters p=1 and sparsity loss coefficient α=0.1.
Using activation functions to enforce sparsity
While thinking about different types of sparsity loss, I also considered other possibilities for inducing sparsity that don’t involve training with a sparsity loss term. Instead, we could take the output from the encoder and map it directly to a more sparse version of itself, and use the sparser version as the feature vector. I’m calling these maps “activation functions” because they output the activations of the feature layer, even though they may not be anything like the activation functions you would typically use to add non-linearity to a neural net. In each of my experiments, I did still apply a ReLU to the output of the encoder before applying the given activation function to ensure that all the activations were positive.
Leaky top-k activation
You might have noticed that we just discussed a way to map a vector to a sparser version of itself in the previous section on without-top-k sparsity loss: pick a value of k, and given a vector x, map it to ^x, the vector with the same largest k entries as x and zeros elsewhere. We will use a generalization of this map for our first activation function.
Pick a value for k and a small ϵ≥0. Then define the activation function Tk,ϵ in the following way. Given a vector x, let b be the value of the kth-largest entry in x. Then define the vector Tk,ϵ(x) by
Tk,ϵ(x)i={xi,if xi≥bϵbxi,otherwise.This way, every activation other than the largest k activations will be at most ϵ, making the resulting vector within ϵ of being k-sparse (excepting the rare case where there are multiple entries of x with a value of b). I wanted to allow for ϵ>0 so that some information about other entries could be kept to help with the reconstruction.
Note: This method of inducing sparsity (but restricting to ϵ=0) was previously used in autoencoders by Makhzani and Frey and applied to interpreting language models by Gao et al.
I also tried using a version of this activation function where entries smaller than b are multiplied by ϵ instead of ϵb, allowing the reduced activations to be small relative to b. I additionally tried defining smoothed versions of these functions, wondering if that might help the model learn to deal with the activation functions better. However, these variations turned out to yield worse initial results, and I did not explore them further.
Results of leaky top-k activation experiments
Out of the SAEs trained, the best one found 45 good classifiers, and had on average 13.2% unexplained variance. It was trained with the hyperparameters k=70 and ϵ=0.5.
Out of those trained with ϵ=0 as in Makhzani and Frey and Gao et al, the best one found 34 good classifiers. It was trained with k=85 and also had on average 13.2% unexplained variance. Notably, all of the SAEs trained with ϵ>0 had no dead neurons, whereas all of the SAEs trained with ϵ=0 had a significant number of dead neurons; the one trained with k=85 had 15.1% dead neurons.
Dimension reduction activation
For a leaky top-k activation function, we choose a single value of k to use for all inputs. This didn’t fully sit right with me: what if it makes more sense to use a smaller value of k for some inputs, and a larger value for others? For example, it seems like a reasonable choice to map the vector (60,50,40,30,9,8) to a nearly 4-sparse vector, but the vector (60,50,15,14,14,12) seems like it corresponds more naturally to a nearly 2-sparse vector. So I wanted to try out a function that does something very similar to the leaky top-k activation function, but chooses an appropriate k based on the vector input.
As in the definition of a leaky top-k activation function, choose a bound ϵ≥0. Define a dimension reduction activation function Dϵ in the following way. Given an n-dimensional vector x with non-negative entries, let b=min{xi:i∈Sx} and define Dϵ(x) by
Dϵ(x)i={xi,if i∈Sxϵbxi,otherwise,where Sx⊆[n]:={1,2,…,n} is chosen in the following way. We will start with S0=[n]. Remove any i∈S0 with xi<b1, for some bound b1 depending on x, resulting in a smaller set S1. Continuing in this way, recursivly define Sj={i∈Sj−1:xi≥bj}, where bj is some bound depending on x and Sj−1 that we have yet to define. Eventually, we will reach a value of j where Sj=Sj+1, at which point we will define Sx:=Sj. The bj‘s will be chosen in a way that distinguishes between relatively large and small entries of x and that is invariant under scaling x. The details and motivation behind how I chose appropriate bj’s is long and convoluted, so I will leave them in an appendix.
Results of dimension reduction activation experiments
Out of the SAEs trained, the best one found 43 good classifiers, and had on average 3.8% of feature activations greater than 0.01 and 18.5% unexplained variance. It was trained with the hyperparameters ϵ=0.01, and with the sequence of ak’s chosen as described in the appendix. I think it’s particularly notable that the best SAEs using this method had much lower sparsity and higher unexplained variance than the best SAEs trained with other methods. I don’t know why that is!
Conclusion
Of the methods I tried, without-top-k sparsity loss performed on par with L1 sparsity loss, and smoothed-L0 sparsity loss, leaky top-k activation, and dimension reduction activation both performed significantly better. Leaky top-k activation takes the cake with 45 good classifiers, compared to L1 sparsity loss’s 29. I think that all four of these methods are worthy of further study, both in test beds like OthelloGPT, and in real LLMs.
I’m interested in continuing to pursue this research by experimenting with more ways of tweaking the architecture and training of SAEs. I’m particularly interested to try more activation functions similar to leaky top-k activation, and to see if adding more layers to the encoder and/or decoder could improve results, especially when using custom activation functions that may make the reconstruction more challenging for the linear encoder/decoder architecture. As Robert mentioned in his report, I also think it’s important to try these experiments on the fully-trained OthelloGPT created by Li et al, which is both better at predicting legal moves and has much more accurate linear probes than Robert’s version.
Finally, I think it would be useful to get a better understanding of the features found by these SAEs that don’t classify board positions well. Are some of these features providing interpretable information about the board state, but just in different ways? It may be that OthelloGPT “really” represents the board state in a way that doesn’t fully factor into individual board positions, in spite of the fact that linear probes can find good classifiers for individual board positions. For example, Robert noticed that a feature from one of his autoencoders seemed to keep track of lines of adjacent positions that all contain a white piece. I hypothesize that, because the rules of Othello significantly restrict what board states are possible/likely (for example, there tend to be lots of lines of pieces with similar colors), we should not expect SAEs to be able to find as many good classifiers for board positions as if we were working with a game like chess, where more board states are possible. ChessGPT anyone?
Appendix: Details for finding Sx
The algorithm described above for finding Sx was designed as a fast way to compute a different, geometrically-motivated definition of Sx. Here I’ll describe that original definition and motivation, and define the bj’s that allow the sped-up algorithm to find the same Sx given by the original definition.
Given a non-empty S⊆[n], let k be the number of indices in S. If we let zi represent the ith coordinate in Rn, then the equation ∑i∈Szi=ak describes an (n−1)-dimensional hyperplane; call it PS. Note that PS is perpendicular to the vector yS that has a 1 for every entry whose index is in S, and all other entries 0, and that yS points from the origin directly towards PS. In some sense, we will choose Sx such that ySx is pointing in a similar direction as x, and we will use the sequence of ak’s to favor choices of Sx with fewer elements (resulting in a sparser output).
The set of hyperplanes {PS:S⊆[n],S≠∅} cut Rn into a number of components; let Obe the one containing the origin. Consider the ray r in the direction of x. Since x has non-negative entries, r intersects the boundary of O at one (or possibly more) of our hyperplanes. Generically, we should expect there to be a unique hyperplane PSx that contains the intersection of r and the boundary of O, which then uniquely defines Sx. If r happens to intersect the boundary of O where two or more of our hyperplanes meet, let Px be the set of all such hyperplanes, and define Sx=⋃PS∈PxS.
How does this favor choices of Sx with fewer elements? Well, it doesn’t always, not for every non-decreasing sequence of ak’s. But we can design a sequence that does. For a given k∈[n], if ak+1−ak is bigger, then Sx is more likely to have k, rather than k+1, elements. So choosing a sequence where ak+1−ak is bigger for smaller k favors choices of Sx with fewer elements, encouraging sparsity.
Besides that, I added a few constraints to the sequence to guarantee some properties that I thought would be beneficial. First, I wanted to ensure that every non-empty S⊆[n] actually does correspond to some input, i.e., there is some x with Sx=S. This property holds if ak<kk−1ak−1 for all k. Second, in order to speed up the computation of Sx, I wanted to use an algorithm that started with S=[n] and then repeatedly removed dimensions from S until only the dimensions in Sx were left. To ensure that the order in which dimensions were removed in this algorithm did not matter, I needed to add the requirement that ak−ak−1≤ak−1−ak−2 for all k. Adding this constraint guarantees that Sx will be given by the algorithm described in the main section of the report if we let
bj=(1−ak−1ak)∑i∈Sj−1xi,where k=|Sj−1|.
Then to choose a sequence of ak’s, I arbitrarily started with a1=1. At the kth step, having chosen a1,...,ak−1, we want to choose a value of ak between ak−1 and kk−1ak−1. Initially, I tried choosing a proportion p and letting ak=(1−p)ak−1+pkk−1ak−1; note that for such a sequence, ak+1−ak will increase as k decreases, as desired to induce sparsity. With this method I found that, if the feature vector gets too sparse at some point during training, the training will get into a positive feedback loop and just keep making it sparser and sparser, until it’s way too sparse and isn’t reconstructing well at all (why this could happen is still a complete mystery to me). Similarly, if the feature vector isn’t sparse enough, it keeps getting less and less sparse. It was hard to find a value of p that didn’t fall into one of these two traps, though I did find some where the training just happened to end with reasonable sparsity and unexplained variance. The best SAE trained in this way found 29 good classifiers, and had on average 6.9% feature activations greater than 0.01 and 17.3% unexplained variance. It was trained with hyperparameters p=0.49 and ϵ=0.01. If I had continued training, though, I’m confident the feedback loop would have continued until the reconstruction was ruined.
To find a better sequence of ak‘s, I tried varying the proportion p with respect to k, i.e. coming up with a sequence of proportions (pk)k∈[n] and letting ak=(1−pk)ak−1+pkkk−1ak−1. Specifically, since low values of p resulted in not enough sparsity and high values resulted in too much, I thought to try choosing a low value of p to use for small k, but to start increasing p once k passes some threshold. How fast p could be increased was bounded above by the ak−ak−1≤ak−1−ak−2 restriction, and I found that increasing p as fast as this restriction allowed yielded reasonable results. The best SAE was trained using pk=0.15 for k≤10, with the pk’s increasing as fast as possible past k=10.
Please do not take this as an assertion that no such results exist! I know very little about the field of sparse coding. I am only trying to say that, after reading some of the recent research using sparse autoencoders to interpret language models, I am unsatisfied with the theoretical justifications given for using the L1 norm, and a brief search through the relevant resources that those papers cited did not turn up any other justifications that I find satisfying.
In fact, Robert measured the number of features that are good classifiers, which is slightly different, since he counted two different features separately even if they were both good classifiers for the same (board position, position state) pair.
In most cases, I measured sparsity as the percentage of activations greater than some relatively small bound ϵ≥0. Different values of ϵ seemed more or less informative depending on the method used to induce sparsity. I’ll note the specific measure of sparsity used for each method.
As mentioned in a previous footnote, this number is slightly inflated compared to mine as a result of duplicated features. If I had also included duplicate features, I would have found 35, compared to his 33.
A vector x is k-sparse if ||x||0≤k, i.e., if x has at most k non-zero entries.