Thanks for doing this—could you share your code?
While I put only a medium probability on the current SAE algorithm working to recover all the features, my main concerns with the work are the quality of the model and the possibility that the natural “features” are not board positions.
I’d be interested in running the code on the model used by Li et al, which he’s hosted on Google Drive:
https://drive.google.com/drive/folders/1bpnwJnccpr9W-N_hzXSm59hT7Lij4HxZ
Also, in addition to the future work you list, I’d be interested in running the SAEs with much larger Rs and with alternative hyperparameter selection criteria.
Just uploaded the code here: https://github.com/RobertHuben/othellogpt_sparse_autoencoders/. Apologies in advance, the code is kind of a mess since I’ve been writing it for myself. I’ll take an hour or so to add info to the readme about the files and how to replicate my experiments.
Thanks for the link! I think replicating the experiment with Li et al’s model is a definite next step! Perhaps we can have a friendly competition to see who writes it up first :)
I have mixed feelings about whether the results will be different with the high-accuracy model from Li et al:
On priors, if the features are more “unambiguous”, they should be easier for the sparse autoencoder to find.
But my hacky model was at least trained enough that those features do emerge from linear probes (a sketch of such a probe follows this comment). If sparse autoencoders can’t match linear probes, that’s also worth knowing.
If there is a difference, and sparse autoencoders only work on a model that’s sufficiently trained, would LLMs meet that criterion?
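For readers who haven’t seen the probing setup: below is a minimal, hypothetical sketch of the kind of linear probe being referred to, trained on cached OthelloGPT residual-stream activations to predict the state of a single board square. The dimensions, labels, and function names are illustrative assumptions, not code from either project.

```python
# Hypothetical probe sketch (illustrative only, not the actual probe code).
# Assumes cached residual-stream activations and per-square board-state labels.
import torch
import torch.nn as nn

d_model = 512      # assumed residual-stream width
n_classes = 3      # e.g. empty / current player's piece / opponent's piece

probe = nn.Linear(d_model, n_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_probe(activations, labels, epochs=5):
    """activations: (N, d_model) float tensor; labels: (N,) long tensor for one square."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(activations), labels)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        # training accuracy only; a real probe would evaluate on a held-out split
        accuracy = (probe(activations).argmax(dim=-1) == labels).float().mean()
    return accuracy.item()
```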
Agree that it’s worth experimenting with R, but the only other hyperparameter is the sparsity coefficient alpha, and I found that alpha had to be in a narrow range or the training would collapse to “all variance is unexplained” or “no active features”. (Maybe you mean Adam hyperparameters, which I suppose might also be worth experimenting with.) Here’s the result of my hyperparameter sweep for alpha:
Fwiw, I find it’s much more useful to have (log) active features on the x axis, and (log) unexplained variance on the y axis (a sketch of such a plot appears below this exchange). (If you want, you can then also plot the L1 coefficient above the points, but that seems less important.)
Good thinking, here’s that graph! I also annotated it to show the alpha value I ended up using for the experiment. It’s improved over the Pareto frontier shown on the graph, and I believe that’s because the data in this sweep came from training for 1 epoch, while the real run I used for the SAE was 4 epochs.
In my experiments, log L0 vs log unexplained variance should be a nice straight line. I think your autoencoders might be substantially undertrained (especially given that training longer moves off the frontier a lot). Scaling up the data by 10x or 100x wouldn’t be crazy.
(Also, I think L0 is more meaningful than L0 / d_hidden for comparing across different d_hidden; I assume that’s what “percent active features” is.)
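To make that plotting suggestion concrete: here is a hypothetical matplotlib sketch of the frontier plot being described, with mean L0 on a log x-axis, fraction of unexplained variance on a log y-axis, and the L1 coefficient annotated above each point. The numbers in `runs` are made-up placeholders, not results from either set of experiments.

```python
# Hypothetical frontier plot (placeholder numbers, not real sweep results).
import matplotlib.pyplot as plt

# (alpha, mean L0, fraction of variance unexplained) per trained SAE
runs = [
    (3e-4, 180.0, 0.04),
    (1e-3, 75.0, 0.09),
    (3e-3, 22.0, 0.21),
]

fig, ax = plt.subplots()
for alpha, l0, unexplained in runs:
    ax.scatter(l0, unexplained, color="tab:blue")
    ax.annotate(f"{alpha:g}", (l0, unexplained),
                textcoords="offset points", xytext=(0, 6), ha="center")
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("mean L0 (active features per token)")
ax.set_ylabel("fraction of variance unexplained")
plt.show()
```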
Thanks for uploading your interp and training code!
Could you upload your model and/or datasets somewhere as well, for reproducibility? (i.e. your datasets folder and its contents)
Yeah, the main hyperparameters are the expansion factor and “what optimization algorithm do you use/what hyperparameters do you use for the optimization algorithm”.
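For concreteness, here is a generic sketch of where those knobs sit in a sparse autoencoder’s training objective: the expansion factor R sets the dictionary size, and the sparsity coefficient alpha trades reconstruction error against the L1 penalty. This is an illustrative implementation under standard assumptions, not the code from the repo linked above.

```python
# Generic SAE sketch (illustrative, not the repo's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, R: int):
        super().__init__()
        d_hidden = R * d_model            # dictionary size = expansion factor * input width
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = F.relu(self.encoder(x))       # feature activations (ideally sparse)
        return self.decoder(f), f

def sae_loss(x, x_hat, f, alpha: float):
    recon = F.mse_loss(x_hat, x)           # the "unexplained variance" side of the tradeoff
    sparsity = f.abs().sum(dim=-1).mean()  # L1 penalty; too-large alpha kills all features,
    return recon + alpha * sparsity        # too-small alpha gives dense, hard-to-interpret ones
```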
Here are the datasets, OthelloGPT model (“trained_model_full.pkl”), autoencoders (saes/), probes, and a lot of the cached results (it takes a while to compute AUROC for all position/feature pairs, so I found it easier to save those): https://drive.google.com/drive/folders/1CSzsq_mlNqRwwXNN50UOcK8sfbpU74MV
You should download all of these into the same directory level as the main repo.
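For anyone recomputing the cached results, here is an illustrative sketch of the per-(feature, board position) AUROC computation mentioned above, i.e. how well each feature’s activation separates tokens where a square is in a given state from tokens where it isn’t. The array names and shapes are assumptions, not the repo’s actual functions.

```python
# Illustrative AUROC sweep over (feature, board position) pairs (hypothetical names).
import numpy as np
from sklearn.metrics import roc_auc_score

def feature_position_aurocs(feature_acts, board_labels):
    """
    feature_acts:  (n_tokens, n_features) SAE feature activations
    board_labels:  (n_tokens, 64) booleans, e.g. "this square holds the current player's piece"
    returns:       (n_features, 64) AUROC scores (NaN where a label is constant)
    """
    n_features = feature_acts.shape[1]
    n_positions = board_labels.shape[1]
    aurocs = np.full((n_features, n_positions), np.nan)
    for i in range(n_features):
        for j in range(n_positions):
            y = board_labels[:, j].astype(bool)
            if y.all() or not y.any():     # AUROC is undefined for a single class
                continue
            aurocs[i, j] = roc_auc_score(y, feature_acts[:, i])
    return aurocs
```

The double loop over features and positions is slow for large dictionaries, which is presumably why the cached results above are worth saving.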
@LawrenceC Nanda’s MATS stream played around with this as a group project, with code here: https://github.com/andyrdt/mats_sae_training/tree/othellogpt
Cool! Do you know if they’ve written up results anywhere?
I think we got similar-ish results. @Andy Arditi was going to comment here to share them shortly.
We haven’t written up our results yet... but after seeing this post I don’t think we have to :P
We trained SAEs (with various expansion factors and L1 penalties) on the original Li et al model at layer 6, and found results extremely similar to those presented in this analysis.
It’s very nice to see independent efforts converge to the same findings!
Likewise, I’m glad to hear there was some confirmation from your team!
An option for you if you don’t want to do a full writeup is to make a “diff” or comparison post, just listing where your methods and results were different (or the same). I think there’s demand for that; people liked Comparing Anthropic’s Dictionary Learning to Ours.
Thanks!