I don’t think there is anything on that front other than the paragraphs in the SoLU paper. I alluded to a possible experiment for this on Twitter in response to that paper but haven’t had the time to try it out myself: you could train a tiny autoencoder to reconstruct some artificially generated data, varying attributes such as sparsity, the ratio of input dimensions to bottleneck dimensions, etc. You could then inspect the autoencoder’s weight matrices to figure out how it embeds the features in the bottleneck and which settings, if any, lead to superposition. A rough sketch of that setup follows below.
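For concreteness, here is a minimal sketch of the experiment I have in mind (untested; the tied-weight linear autoencoder setup, hyperparameters, and function names are all just illustrative choices, not a prescribed method):

```python
# Sketch: train a tiny tied-weight linear autoencoder on synthetic sparse data
# and inspect its weights for superposition. All names/settings are illustrative.
import torch

def make_sparse_batch(batch_size, n_features, sparsity):
    """Each feature is active (uniform in [0, 1]) with probability 1 - sparsity."""
    mask = (torch.rand(batch_size, n_features) > sparsity).float()
    return mask * torch.rand(batch_size, n_features)

def train_autoencoder(n_features=20, n_hidden=5, sparsity=0.9, steps=10_000, lr=1e-3):
    # Tied-weight autoencoder: x_hat = relu(W^T W x + b), with W mapping
    # n_features -> n_hidden (the bottleneck).
    W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
    b = torch.nn.Parameter(torch.zeros(n_features))
    opt = torch.optim.Adam([W, b], lr=lr)
    for _ in range(steps):
        x = make_sparse_batch(1024, n_features, sparsity)
        x_hat = torch.relu(x @ W.T @ W + b)
        loss = ((x - x_hat) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return W.detach()

if __name__ == "__main__":
    for sparsity in (0.0, 0.7, 0.99):
        W = train_autoencoder(sparsity=sparsity)
        # Columns of W are the embeddings of the input features in the bottleneck.
        # Off-diagonal entries of W^T W measure how much features interfere with
        # one another; large off-diagonal values suggest the model is packing more
        # features than dimensions into the bottleneck, i.e. superposition.
        interference = (W.T @ W).abs()
        interference.fill_diagonal_(0)
        print(f"sparsity={sparsity:.2f}  max off-diagonal interference="
              f"{interference.max().item():.3f}")
```

The interesting comparison is how the interference pattern changes as you sweep sparsity and the feature-to-bottleneck ratio: with dense data the model should mostly keep features in (near-)orthogonal directions, while with very sparse data it can afford to overlap them.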
I’m not at liberty to share it directly but I am aware that Anthropic have a draft of small toy models with hand-coded synthetic data showing superposition very cleanly. They go as far as saying that searching for an interpretable basis may essentially be mistaken.