Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.

Abstract

Standard sparse autoencoder training uses an $L_1$ sparsity loss term to induce sparsity in the hidden layer. However, theoretical justifications for this choice are lacking (in my opinion), and there may be better ways to induce sparsity. In this post, I explore other methods of inducing sparsity and experiment with them using Robert_AIZI’s methods and code from this research report, where he trained sparse autoencoders on OthelloGPT. I find several methods that produce significantly better results than $L_1$ sparsity loss, including a leaky top-$k$ activation function.

Introduction

This research builds directly on Robert_AIZI’s work from this research report. While I highly recommend reading his full report, I will briefly summarize the parts of it that are directly relevant to my work.

Although sparse autoencoders trained on language models have been shown to find feature directions that are more interpretable than individual neurons (Bricken et al, Cunningham et al), it remains unclear whether or not a given linearly-represented feature will be found by a sparse autoencoder. This is an important consideration for applications of sparse autoencoders to AI safety; if we believe that all relevant safety information is represented linearly, can we expect sparse autoencoders to bring all of it to light?

Motivated by this question, Robert trained sparse autoencoders on a version of OthelloGPT (based on the work of Li et al), a language model trained on Othello game histories to predict legal moves. Previous research (Nanda, Hazineh et al) had found that linear probes trained on the residual stream of OthelloGPT could classify each position on the board as either empty, containing an enemy piece, or containing an allied piece, with high accuracy. Robert reproduced similar results on a version of OthelloGPT that he trained himself, finding linear probes with 0.9 AUROC or greater for the vast majority of (board position, position state) pairs. He then investigated whether or not sparse autoencoders trained on OthelloGPT’s residual stream would find features that classified board positions with levels of accuracy similar to the linear probes. Out of 180 possible (board position, position state) pair classifiers, Robert’s best autoencoder had 33 features that classified distinct (board position, position state) pairs with at least 0.9 AUROC.

Robert’s autoencoders were trained with the standard $L_1$ sparsity loss used in recent research applying sparse autoencoders to interpreting language models (Sharkey et al, Bricken et al, Cunningham et al, and similar to Templeton et al). However, there are theoretical and empirical reasons to believe that this may not be an ideal method for inducing sparsity. From a theoretical perspective, the $L_0$ norm is the definition of sparsity used in sparse dictionary learning (High-Dimensional Data Analysis with Low-Dimensional Models by Wright and Ma, Section 2.2.3). Minimizing the $L_1$ norm has been proven sufficient to recover sparse solutions in much simpler contexts (Wright and Ma, Section 3.2.2), as Sharkey et al and Cunningham et al point out to justify their use of the $L_1$ norm. However, I am not aware of any results demonstrating that minimizing the $L_1$ norm is a theoretically sound way to solve the problem (overcomplete sparse dictionary learning) that sparse autoencoders are designed to solve[1]. Using the $L_1$ norm for sparsity loss has been shown to underestimate the true feature activations in toy data (Wright and Sharkey), a phenomenon known as shrinkage. This is no surprise, since minimizing the $L_1$ norm encourages all feature activations to be closer to zero, including feature activations that should be larger. The $L_1$ norm also apparently leads to too many features being learned (Sharkey).

For these reasons, I wanted to experiment with ways of inducing sparsity in the feature activations of sparse autoencoders that seemed to me more aligned with the theoretical definition of sparsity, in the hope of finding methods that perform better than $L_1$ sparsity loss. I chose to run these experiments by making modifications to Robert’s OthelloGPT code, for a couple of reasons. Firstly, Robert’s work provides a clear and reasonable metric by which to measure the performance of sparse autoencoders on a language model quickly and cheaply: the number of good board position classifiers given by the feature activations of the SAE. While I do have some reservations about this metric (for reasons I’ll mention in the conclusion), I think it is a valuable alternative to significantly more computationally expensive methods, such as using a language model to interpret the features found by an SAE. Secondly, I know Robert personally, and he offered to provide guidance in getting this project up and running. Thanks to his help, I was able to start running experiments after only a week of work on the project.

Methods

I trained several variants of the SAE architecture that differed in how they encouraged sparsity. The following aspects of the architecture were held constant: each SAE contained one hidden layer of neurons, which I’m calling the feature layer; the encoder and decoder weights were left untied, i.e., they were trained as separate parameters of the model; and a bias term was used in both the encoder and decoder. The SAEs were trained on the residual stream of the OthelloGPT model after layer 3, using a training set of 100,000 game histories, for four epochs with the Adam optimizer and a learning rate of 0.001.
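
To make the setup concrete, here is a minimal sketch of this architecture in PyTorch. This is not Robert’s actual code, and the input and feature dimensions are illustrative placeholders rather than the true width of his OthelloGPT’s residual stream.

```python
import torch
import torch.nn as nn

class BaselineSAE(nn.Module):
    """One hidden (feature) layer, untied encoder/decoder weights, biases on both."""

    def __init__(self, d_resid=512, n_features=1024):  # placeholder sizes
        super().__init__()
        self.encoder = nn.Linear(d_resid, n_features, bias=True)
        self.decoder = nn.Linear(n_features, d_resid, bias=True)  # untied from the encoder

    def forward(self, resid):
        features = torch.relu(self.encoder(resid))  # feature layer activations
        reconstruction = self.decoder(features)
        return reconstruction, features

# Training sketch: Adam with lr=0.001, reconstruction loss plus (for some methods) a sparsity term.
sae = BaselineSAE()
optimizer = torch.optim.Adam(sae.parameters(), lr=0.001)
```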

I came up with five different methods of inducing sparsity, which I detail below. Three of them use a different kind of sparsity loss, in which case the activation function used on the feature layer is ReLU. The other two use custom activation functions designed to output sparse activations instead of including a sparsity loss term. In these cases, I still applied ReLU to the output of the encoder before applying the custom activation function to ensure that the activations would be non-negative.

Following Robert[2], the main measure of SAE performance that I focused on was the number of (board position, position state) pair classifiers given by the SAE feature activations that had an AUROC over 0.9; from now on I will just call these “good classifiers”.

Each method of inducing sparsity has its own set of hyperparameters, and I did not have the resources to perform a massive sweep of the hyperparameter space to be confident that I had found values for the hyperparameters that roughly maximize the number of good classifiers found. Instead, I generally took some educated guesses about what hyperparameter values might work well and first trained a batch of SAEs—typically somewhere between 8 and 16 of them. Since finding the number of good classifiers for a given SAE is computationally expensive, I did not do this for every SAE trained. I first weeded out weaker candidates based on evaluations done during training. Specifically, at this stage, I considered the average sparsity[3] of the feature layer and the unexplained variance of the reconstruction of the OthelloGPT activations, which were evaluated on the test data set at regular intervals during training. The SAEs that seemed promising based on these metrics were evaluated for good classifiers. Based on this data, I trained another batch of SAEs, repeating this search process until I was reasonably satisfied that I had gotten close to a local maximum of good classifiers in the hyperparameter space.

Since this search process was based on intuition, guessing, and a somewhat arbitrary AUROC threshold of 0.9, I view these experiments as a search for methods that deserve further study. I certainly don’t consider any of my results as strong evidence that sparsity loss should be replaced by one of these methods.

Sparsity loss functions

Controls: $L_1$ sparsity loss and no sparsity loss

I first reproduced Robert’s results with $L_1$ sparsity loss to use as a baseline. Using the sparsity loss coefficient that he found with a hyperparameter sweep, I trained an SAE with $L_1$ sparsity loss, which found 29 good classifiers, similar to Robert’s result of 33[4]. I also wanted to confirm that training with sparsity loss in fact significantly impacted the number of good classifiers found, so I trained another SAE without sparsity loss; it found only one good classifier.

Smoothed-$L_0$ sparsity loss

Since the $L_0$ norm is what we ideally want to minimize to achieve sparsity, why not try a loss function based on the $L_0$ norm instead of the $L_1$ norm? Because the feature layer of the SAE uses ReLU as its activation function, all the feature activations are non-negative. Taking the $L_0$ norm of a vector with no negative entries is equivalent to applying the unit step function ($1$ if $x > 0$, $0$ otherwise) to each entry, and then summing the results. Unfortunately, the unit step function is not differentiable, so it would be difficult to use in a loss function. Therefore, I tried various smoothed versions of the unit step function using the sigmoid function as a base.

Call the smoothed unit step function we’re looking to define $s$. Since all the feature activations are positive, we want the transition from 0 to 1 to roughly “start” at $x = 0$. So choose a small $\epsilon > 0$ such that we are satisfied if $s(0) = \epsilon$. We will then also consider the transition from 0 to 1 to be complete once the value of $s$ reaches $1 - \epsilon$. So choose a duration $d > 0$ for the transition, and require that $s(d) = 1 - \epsilon$. These requirements are satisfied by defining $s(x) = \sigma(a(x - d/2))$, where $\sigma$ is the sigmoid function and $a = \frac{2}{d}\ln\frac{1-\epsilon}{\epsilon}$. Then the smoothed-$L_0$ sparsity loss of an $n$-dimensional feature vector $f$ with respect to a choice of $\epsilon$ and $d$ is given by $\sum_{i=1}^{n} s(f_i)$. In practice, I used the same value of $\epsilon$ for all SAEs trained, since I expected $d$ to have a more interesting impact on the results.
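
To make the loss concrete, here is a minimal sketch in PyTorch following the parameterization of $s$ reconstructed above; the default values of eps and d are illustrative placeholders rather than the values used in the experiments.

```python
import math
import torch

def smoothed_l0_loss(features, eps=0.1, d=1.0):
    """Sum of a sigmoid-based smoothed unit step over the (non-negative) feature
    activations, so that s(0) = eps and s(d) = 1 - eps."""
    a = (2.0 / d) * math.log((1.0 - eps) / eps)  # slope giving a transition of length d
    s = torch.sigmoid(a * (features - d / 2.0))
    return s.sum(dim=-1).mean()  # sum over features, average over the batch

# This term would be added to the reconstruction loss, scaled by a sparsity loss coefficient.
```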

Results of smoothed-$L_0$ sparsity loss experiments

Each data point represents an SAE trained with smoothed-$L_0$ sparsity loss. Sparsity is measured as the percentage of feature activations greater than 1.0. Points in gray were not evaluated for number of good classifiers.

Out of the SAEs trained, the best one found 38 good classifiers, and had on average 32.6% of feature activations greater than 1.0 and 7.9% unexplained variance. It was trained with particular values of $\epsilon$, $d$, and the sparsity loss coefficient.

Freshman’s dream sparsity loss

The erroneous (over the real numbers, at least) equation $(x + y)^p = x^p + y^p$ is known as the “freshman’s dream”. In general, for non-negative values $a_1, \ldots, a_n$ and a power $p > 1$, we have $(a_1 + \cdots + a_n)^p \geq a_1^p + \cdots + a_n^p$, but we get closer to equality if a small number of the $a_i$’s are responsible for a majority of the value of $a_1 + \cdots + a_n$, with equality achieved if and only if all but one of the $a_i$’s are 0. This sounds a lot like saying that an activation vector will be more sparse the closer the values of $(a_1 + \cdots + a_n)^p$ and $a_1^p + \cdots + a_n^p$. So we will define the freshman’s dream sparsity loss of an $n$-dimensional feature vector $f$ with respect to a choice of the power $p$ as

$$\frac{(f_1 + \cdots + f_n)^p}{f_1^p + \cdots + f_n^p}.$$

I focused on $p = 2$, which has the additional notable property that a $k$-sparse[5] vector with all equal non-zero activations has a loss of $k$. So this loss is in some sense linear in the sparsity of the vector, which I thought might be a nice property; I intuitively like the idea of putting similar amounts of effort into reducing the sparsity of a 100-sparse vector and a 50-sparse vector.
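
A minimal sketch of this loss with the default $p = 2$, assuming the ratio form written above:

```python
import torch

def freshmans_dream_loss(features, p=2):
    """(f_1 + ... + f_n)^p / (f_1^p + ... + f_n^p) for non-negative feature activations.
    With p = 2, a k-sparse vector with equal non-zero entries gets a loss of k."""
    numerator = features.sum(dim=-1).pow(p)
    denominator = features.pow(p).sum(dim=-1).clamp_min(1e-8)  # guard against all-zero rows
    return (numerator / denominator).mean()  # average over the batch
```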

Results of freshman’s dream sparsity loss experiments

Each data point represents an SAE trained with freshman’s dream sparsity loss. Sparsity is measured as the percentage of feature activations greater than 0.

Out of the SAEs trained, the best one found 21 good classifiers, and had on average 25.7% of feature activations greater than 0 and 7.7% unexplained variance. It was trained with a particular choice of $p$ and of the sparsity loss coefficient.

Without-top-$k$ sparsity loss

Suppose we have a value of $k$ such that we would be satisfied if all of our feature vectors were $k$-sparse; we don’t care about trying to make them sparser than that. Given an $n$-dimensional feature vector $f$, we can project it into the space of $k$-sparse vectors by finding the $k$ largest activations and replacing the rest with zeros; let $f^{(k)}$ be the resulting $k$-sparse vector. Then, for an appropriate choice of norm, let $\|f - f^{(k)}\|$ be the without-top-$k$ sparsity loss of $f$. Intuitively, this measures how close $f$ is to being $k$-sparse. I did experiments with a couple of different choices of these hyperparameters.
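
A minimal sketch of this loss in PyTorch. The use of the $L_1$ distance between $f$ and $f^{(k)}$ is an illustrative assumption; any reasonable norm fits the definition above.

```python
import torch

def without_top_k_loss(features, k):
    """Distance between each feature vector and its k-sparse truncation
    (everything outside the k largest activations zeroed out)."""
    top_values, top_indices = features.topk(k, dim=-1)
    truncated = torch.zeros_like(features).scatter(-1, top_indices, top_values)
    return (features - truncated).abs().sum(dim=-1).mean()  # L1 distance, batch-averaged
```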

Results of without-top-$k$ sparsity loss experiments

Each data point represents an SAE trained with without-top-$k$ sparsity loss. Sparsity is measured as the percentage of feature activations greater than 0.

Out of the SAEs trained, the best one found 28 good classifiers, and had on average 12.1% of feature activations greater than 0 and 12.3% unexplained variance. It was trained with a particular choice of hyperparameters and sparsity loss coefficient.

Using activation functions to enforce sparsity

While thinking about different types of sparsity loss, I also considered other possibilities for inducing sparsity that don’t involve training with a sparsity loss term. Instead, we could take the output from the encoder and map it directly to a more sparse version of itself, and use the sparser version as the feature vector. I’m calling these maps “activation functions” because they output the activations of the feature layer, even though they may not be anything like the activation functions you would typically use to add non-linearity to a neural net. In each of my experiments, I did still apply a ReLU to the output of the encoder before applying the given activation function to ensure that all the activations were positive.

Leaky top-$k$ activation

You might have noticed that we just discussed a way to map a vector to a sparser version of itself in the previous section on without-top-$k$ sparsity loss: pick a value of $k$, and given a vector $x$, map it to $x^{(k)}$, the vector with the same $k$ largest entries as $x$ and zeros elsewhere. We will use a generalization of this map for our first activation function.

Pick a value for $k$ and a small $\epsilon \geq 0$. Then define the activation function in the following way. Given a vector $x$, let $m$ be the value of the $k$th-largest entry in $x$. Then define the feature vector $f$ by

$$f_i = \begin{cases} x_i & \text{if } x_i \geq m, \\ (\epsilon / m)\, x_i & \text{otherwise.} \end{cases}$$

This way, every activation other than the $k$ largest activations will be at most $\epsilon$, making the resulting vector within $\epsilon$ of being $k$-sparse (excepting the rare case where there are multiple entries of $x$ with a value of $m$). I wanted to allow for $\epsilon > 0$ so that some information about the other entries could be kept to help with the reconstruction.

Note: This method of inducing sparsity (but restricting to $\epsilon = 0$) was previously used in autoencoders by Makhzani and Frey and applied to interpreting language models by Gao et al.

I also tried using a version of this activation function where entries smaller than $m$ are multiplied by $\epsilon$ instead of $\epsilon / m$, allowing the reduced activations to be small relative to $m$. I additionally tried defining smoothed versions of these functions, wondering if that might help the model learn to deal with the activation functions better. However, these variations turned out to yield worse initial results, and I did not explore them further.
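
Here is a minimal sketch of the main version of this activation function, assuming the scale-down-by-$\epsilon/m$ form reconstructed above; with eps=0 it reduces to the plain top-$k$ activation used by Makhzani and Frey and Gao et al.

```python
import torch

def leaky_top_k(x, k, eps):
    """Keep each row's k largest (non-negative) activations unchanged and scale the
    rest by eps / m, where m is that row's k-th largest activation."""
    m = x.topk(k, dim=-1).values[..., -1:]  # k-th largest entry, per row
    scale = eps / m.clamp_min(1e-8)         # avoid dividing by zero when m == 0
    return torch.where(x >= m, x, x * scale)
```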

Results of leaky top-$k$ activation experiments

Each data point represents an SAE trained with a leaky top-$k$ activation function. Sparsity is measured as the choice of the hyperparameter $k$. Points in gray were not evaluated for number of good classifiers.

Out of the SAEs trained, the best one found 45 good classifiers, and had on average 13.2% unexplained variance. It was trained with particular values of the hyperparameters $k$ and $\epsilon$.

Out of those trained with $\epsilon = 0$, as in Makhzani and Frey and Gao et al, the best one found 34 good classifiers; it also had on average 13.2% unexplained variance. Notably, all of the SAEs trained with $\epsilon > 0$ had no dead neurons, whereas all of the SAEs trained with $\epsilon = 0$ had a significant number of dead neurons; the best one just mentioned had 15.1% dead neurons.

Dimension reduction activation

For a leaky top-$k$ activation function, we choose a single value of $k$ to use for all inputs. This didn’t fully sit right with me: what if it makes more sense to use a smaller value of $k$ for some inputs, and a larger value for others? For example, it seems like a reasonable choice to map a vector with four comparably large entries to a nearly 4-sparse vector, but a vector whose value is concentrated in its two largest entries seems like it corresponds more naturally to a nearly 2-sparse vector. So I wanted to try out a function that does something very similar to the leaky top-$k$ activation function, but chooses an appropriate $k$ based on the vector input.

As in the definition of a leaky top-$k$ activation function, choose a bound $\epsilon \geq 0$. Define a dimension reduction activation function in the following way. Given an $n$-dimensional vector $x$ with non-negative entries, choose a set $S$ of indices of $x$ (in a way described below) and define the output exactly as the leaky top-$k$ activation function would if the entries indexed by $S$ were the top $k$: leave those entries unchanged and scale down all the others.

The set $S$ is chosen in the following way. We start with $S_0 = \{1, \ldots, n\}$, the set of all indices. Remove every index $i$ whose entry $x_i$ falls below some bound depending on $S_0$ and $x$, resulting in a smaller set $S_1$. Continuing in this way, recursively define $S_{j+1}$ by removing from $S_j$ every index whose entry falls below a bound depending on $S_j$ and $x$ that we have yet to define. Eventually, we will reach a value of $j$ where $S_{j+1} = S_j$, at which point we will define $S = S_j$. The bounds will be chosen in a way that distinguishes between relatively large and small entries of $x$ and that is invariant under scaling $x$. The details and motivation behind how I chose appropriate bounds are long and convoluted, so I will leave them in an appendix.
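
Below is a sketch of the overall procedure for a single input vector. The bound_fn argument is a stand-in for the bounds derived from the $c_k$’s in the appendix, and scaling the entries outside $S$ by $\epsilon/m$ mirrors the leaky top-$k$ function above; both details are assumptions of this sketch rather than a faithful copy of the original implementation.

```python
import torch

def dimension_reduction_activation(x, eps, bound_fn):
    """x: 1-D non-negative activation vector.
    bound_fn(x, keep) -> scalar threshold for the currently kept indices."""
    keep = torch.ones_like(x, dtype=torch.bool)
    while True:
        threshold = bound_fn(x, keep)
        new_keep = keep & (x >= threshold)
        if torch.equal(new_keep, keep) or not new_keep.any():
            break  # stop once no more indices are removed (or all would be removed)
        keep = new_keep
    m = x[keep].min()                # smallest surviving activation
    scale = eps / m.clamp_min(1e-8)  # scale down everything outside S, as in leaky top-k
    return torch.where(keep, x, x * scale)
```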

Results of dimension reduction activation experiments

Each data point represents an SAE trained with a dimension reduction activation function. Sparsity is measured as the percentage of feature activations greater than 0.01. Points in gray were not evaluated for number of good classifiers.

Out of the SAEs trained, the best one found 43 good classifiers, and had on average 3.8% of feature activations greater than 0.01 and 18.5% unexplained variance. It was trained with a particular value of $\epsilon$, and with the sequence of bounds chosen as described in the appendix. I think it’s particularly notable that the best SAEs using this method had a much lower percentage of active features and higher unexplained variance than the best SAEs trained with other methods. I don’t know why that is!

Conclusion

Method | Number of good classifiers
$L_1$ sparsity loss | 29
Smoothed-$L_0$ sparsity loss | 38
Freshman’s dream sparsity loss | 21
Without-top-$k$ sparsity loss | 28
Leaky top-$k$ activation | 45
Dimension reduction activation | 43

Of the methods I tried, without-top-$k$ sparsity loss performed on par with $L_1$ sparsity loss, and smoothed-$L_0$ sparsity loss, leaky top-$k$ activation, and dimension reduction activation all performed significantly better. Leaky top-$k$ activation takes the cake with 45 good classifiers, compared to $L_1$ sparsity loss’s 29. I think that all four of these methods are worthy of further study, both in test beds like OthelloGPT, and in real LLMs.

I’m interested in continuing to pursue this research by experimenting with more ways of tweaking the architecture and training of SAEs. I’m particularly interested to try more activation functions similar to leaky top-$k$ activation, and to see if adding more layers to the encoder and/or decoder could improve results, especially when using custom activation functions that may make the reconstruction more challenging for the linear encoder/decoder architecture. As Robert mentioned in his report, I also think it’s important to try these experiments on the fully-trained OthelloGPT created by Li et al, which is both better at predicting legal moves and has much more accurate linear probes than Robert’s version.

Finally, I think it would be useful to get a better understanding of the features found by these SAEs that don’t classify board positions well. Are some of these features providing interpretable information about the board state, but just in different ways? It may be that OthelloGPT “really” represents the board state in a way that doesn’t fully factor into individual board positions, in spite of the fact that linear probes can find good classifiers for individual board positions. For example, Robert noticed that a feature from one of his autoencoders seemed to keep track of lines of adjacent positions that all contain a white piece. I hypothesize that, because the rules of Othello significantly restrict what board states are possible/likely (for example, there tend to be lots of lines of pieces with similar colors), we should not expect SAEs to be able to find as many good classifiers for board positions as if we were working with a game like chess, where more board states are possible. ChessGPT anyone?

Appendix: Details for finding $S$

The algorithm described above for finding $S$ was designed as a fast way to compute a different, geometrically motivated definition of $S$. Here I’ll describe that original definition and motivation, and define the sequence of $c_k$’s that allows the sped-up algorithm to find the same $S$ given by the original definition.

Given a non-empty $S \subseteq \{1, \ldots, n\}$, let $k$ be the number of indices in $S$. If we let $x_i$ represent the $i$th coordinate in $\mathbb{R}^n$, then the equation $\sum_{i \in S} x_i = c_k$ describes an $(n-1)$-dimensional hyperplane; call it $H_S$. Note that $H_S$ is perpendicular to the vector $v_S$ that has a 1 for every entry whose index is in $S$, and all other entries 0, and that $v_S$ points from the origin directly towards $H_S$. In some sense, we will choose $S$ such that $v_S$ is pointing in a similar direction as $x$, and we will use the sequence of $c_k$’s to favor choices of $S$ with fewer elements (resulting in a sparser output).

The set of hyperplanes $H_S$ cuts $\mathbb{R}^n$ into a number of components; let $C$ be the one containing the origin. Consider the ray $R$ from the origin in the direction of $x$. Since $x$ has non-negative entries, $R$ intersects the boundary of $C$ at one (or possibly more) of our hyperplanes. Generically, we should expect there to be a unique hyperplane $H_S$ that contains the intersection of $R$ and the boundary of $C$, which then uniquely defines $S$. If $R$ happens to intersect the boundary of $C$ where two or more of our hyperplanes meet, we take all of those hyperplanes and define $S$ by combining their index sets.
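
Under this reconstruction of the hyperplanes, the choice of $S$ has a compact closed form, which may make the geometry easier to follow: the ray $\{tx : t \geq 0\}$ crosses $H_{S'}$ at $t_{S'} = c_{|S'|} / \sum_{i \in S'} x_i$, and it exits $C$ through whichever hyperplane it reaches first, so (ignoring sets whose corresponding entries of $x$ sum to zero, which the ray never reaches)

$$S \;=\; \arg\min_{\emptyset \neq S' \subseteq \{1, \ldots, n\}} \frac{c_{|S'|}}{\sum_{i \in S'} x_i} \;=\; \arg\max_{S'} \frac{\sum_{i \in S'} x_i}{c_{|S'|}}.$$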

How does this favor choices of $S$ with fewer elements? Well, it doesn’t always, not for every non-decreasing sequence of $c_k$’s. But we can design a sequence that does. For a given $x$, if $c_{k+1}$ is bigger, then $S$ is more likely to have $k$, rather than $k+1$, elements. So choosing a sequence in which $c_{k+1}$ is bigger relative to $c_k$ for smaller $k$ favors choices of $S$ with fewer elements, encouraging sparsity.

Besides that, I added a few constraints to the sequence to guarantee some properties that I thought would be beneficial. First, I wanted to ensure that every non-empty $S$ actually does correspond to some input, i.e., there is some $x$ for which the construction above chooses that $S$. This property holds as long as the $c_k$’s do not grow too quickly from one $k$ to the next. Second, in order to speed up the computation of $S$, I wanted to use an algorithm that started with the full index set $\{1, \ldots, n\}$ and then repeatedly removed dimensions until only the dimensions in $S$ were left. To ensure that the order in which dimensions were removed in this algorithm did not matter, I needed to add a second constraint relating consecutive $c_k$’s. Adding this constraint guarantees that $S$ will be given by the algorithm described in the main section of the report if, at each step, we remove from the current set $S_j$ (with $k$ elements) every index $i$ with

$$x_i < \rho_k \sum_{j' \in S_j} x_{j'},$$

where $\rho_k$ is a proportion determined by $c_{k-1}$ and $c_k$.

Then, to choose a sequence of $c_k$’s, I arbitrarily picked a starting value $c_1$. At the $k$th step, having chosen $c_{k-1}$, we want to choose a value of $c_k$ between $c_{k-1}$ and the upper bound allowed by the constraints above. Initially, I tried choosing a proportion $r$ and letting $c_k$ sit the same fraction $r$ of the way between those two values at every step; note that for such a sequence, $c_k/c_{k-1}$ will increase as $k$ decreases, as desired to induce sparsity. With this method I found that, if the feature vector gets too sparse at some point during training, the training will get into a positive feedback loop and just keep making it sparser and sparser, until it’s way too sparse and isn’t reconstructing well at all (why this could happen is still a complete mystery to me). Similarly, if the feature vector isn’t sparse enough, it keeps getting less and less sparse. It was hard to find a value of $r$ that didn’t fall into one of these two traps, though I did find some where the training just happened to end with reasonable sparsity and unexplained variance. The best SAE trained in this way found 29 good classifiers, and had on average 6.9% of feature activations greater than 0.01 and 17.3% unexplained variance. It was trained with particular values of $\epsilon$ and $r$. If I had continued training, though, I’m confident the feedback loop would have continued until the reconstruction was ruined.

To find a better sequence of $c_k$’s, I tried varying the proportion with respect to $k$, i.e. coming up with a sequence of proportions $r_k$ and letting $c_k$ sit the fraction $r_k$ of the way along its allowed range. Specifically, since low values of $r$ resulted in not enough sparsity and high values resulted in too much, I thought to try choosing a low value of $r_k$ to use for small $k$, but to start increasing $r_k$ once $k$ passes some threshold. How fast $r_k$ could be increased was bounded above by the constraints described above, and I found that increasing $r_k$ as fast as they allowed yielded reasonable results. The best SAE was trained using a low constant $r_k$ for small $k$, with the $r_k$’s increasing as fast as possible past the threshold.

  1. ^

    Please do not take this as an assertion that no such results exist! I know very little about the field of sparse coding. I am only trying to say that, after reading some of the recent research using sparse autoencoders to interpret language models, I am unsatisfied with the theoretical justifications given for using the $L_1$ norm, and a brief search through the relevant resources that those papers cited did not turn up any other justifications that I find satisfying.

  2. ^

    In fact, Robert measured the number of features that are good classifiers, which is slightly different, since he counted two different features separately even if they were both good classifiers for the same (board position, position state) pair.

  3. ^

    In most cases, I measured sparsity as the percentage of activations greater than some relatively small bound $b$. Different values of $b$ seemed more or less informative depending on the method used to induce sparsity. I’ll note the specific measure of sparsity used for each method.

  4. ^

    As mentioned in a previous footnote, this number is slightly inflated compared to mine as a result of duplicated features. If I had also included duplicate features, I would have found 35, compared to his 33.

  5. ^

    A vector $v$ is $k$-sparse if $\|v\|_0 \leq k$, i.e., if $v$ has at most $k$ non-zero entries.