Did you try searching for similar ideas to your work in the broader academic literature? There seems to be lots of closely related work that you’d find interesting. For example:
Elite BackProp: Training Sparse Interpretable Neurons. They train CNNs to have “class-wise activation sparsity.” They claim their method achieves “high degrees of activation sparsity with no accuracy loss” and “can assist in understanding the reasoning behind a CNN.”
Accelerating Convolutional Neural Networks via Activation Map Compression. They “propose a three-stage compression and acceleration pipeline that sparsifies, quantizes, and entropy encodes activation maps of Convolutional Neural Networks.” The sparsification step adds an L1 penalty to the activations in the network, which they do at finetuning time. The work just examines accuracy, not interpretability.
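For the sparsification step, my reading is that it amounts to roughly the following during finetuning (a minimal PyTorch sketch of the idea, not their code; the assumption that the model returns its activation maps, and the names finetune_step and l1_coeff, are mine):

```python
import torch.nn.functional as F

def finetune_step(model, images, labels, optimizer, l1_coeff=1e-4):
    # Assumes a model whose forward pass also returns its intermediate activation maps.
    logits, activation_maps = model(images)
    task_loss = F.cross_entropy(logits, labels)
    # L1 penalty on the activation maps pushes many activations towards exactly zero.
    sparsity_loss = sum(a.abs().mean() for a in activation_maps)
    loss = task_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```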
Enhancing Adversarial Defense by k-Winners-Take-All. Proposes the k-Winners-Take-All activation function, which keeps only the k largest activations and sets all other activations to 0. It is used as a drop-in replacement for the standard activation function during neural network training, and they find it improves adversarial robustness in image classification. How Can We Be So Dense? The Benefits of Using Highly Sparse Representations also uses the k-Winners-Take-All activation function, among other sparsification techniques.
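The k-WTA operation itself is simple enough to sketch (my own minimal version, not theirs):

```python
import torch

def kwta(x: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k largest activations along the last dimension, zero everything else.
    # (Ties with the k-th largest value are all kept in this simple version.)
    kth_largest = x.topk(k, dim=-1).values[..., -1:]
    return x * (x >= kth_largest).to(x.dtype)
```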
The Neural LASSO: Local Linear Sparsity for Interpretable Explanations. Adds an L1 penalty to the gradient of the output w.r.t. the input. The intuition is to make the final output have a “sparse local explanation” (where “local explanation” = input gradient).
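If I'm reading it right, the penalty looks roughly like this (a sketch under my own assumptions; penalizing the true-class score's input gradient and the names neural_lasso_loss and l1_coeff are my choices, not from the paper):

```python
import torch
import torch.nn.functional as F

def neural_lasso_loss(model, x, y, l1_coeff=1e-3):
    x = x.requires_grad_(True)
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)
    # Gradient of the true-class scores w.r.t. the input, kept in the graph
    # (create_graph=True) so the L1 penalty on it can itself be backpropagated.
    input_grads = torch.autograd.grad(
        logits.gather(1, y.unsqueeze(1)).sum(), x, create_graph=True
    )[0]
    return task_loss + l1_coeff * input_grads.abs().mean()
```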
Adaptively Sparse Transformers. They replace softmax with α-entmax, “a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight.” They claim “improve[d] interpretability and [attention] head diversity” and also that “at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations.”
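The α = 1.5 case they use needs a sorting- or bisection-based algorithm (I believe the authors released an entmax package), but the α = 2 case reduces to sparsemax, which is easy to sketch and already has the "exactly zero weight" property:

```python
import torch

def sparsemax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Projects scores onto the probability simplex; low-scoring entries get exactly zero weight.
    z_sorted, _ = torch.sort(z, descending=True, dim=dim)
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = k.view(shape)                          # broadcastable positions 1..n along `dim`
    z_cumsum = z_sorted.cumsum(dim)
    support = 1 + k * z_sorted > z_cumsum      # sorted entries that stay in the support
    k_z = support.sum(dim=dim, keepdim=True)   # support size
    tau = (z_cumsum.gather(dim, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)
```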
Interpretable Neural Predictions with Differentiable Binary Variables. They train two neural networks. One “selects a rationale (i.e. a short and informative part of the input text)”, and the other “classifies… from the words in the rationale alone.”
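Structurally that's a selector network feeding a classifier; their differentiable binary variables are the clever part, but the overall shape is roughly this (my sketch substitutes a straight-through hard-sigmoid mask for their machinery, and all the module names are made up):

```python
import torch
import torch.nn as nn

class RationaleModel(nn.Module):
    # Selector picks a (near-)binary mask over tokens; classifier only sees the masked tokens.
    def __init__(self, vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.selector = nn.Linear(embed_dim, 1)          # per-token keep-probability logits
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids: torch.Tensor):
        x = self.embed(token_ids)                        # (batch, seq, dim)
        p = torch.sigmoid(self.selector(x))              # (batch, seq, 1)
        mask = (p > 0.5).float() + p - p.detach()        # hard mask, straight-through gradients
        rationale = x * mask                             # zero out non-rationale tokens
        return self.classifier(rationale.mean(dim=1)), mask
```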
I ask because your paper doesn’t seem to have a related works section, and most of your citations in the intro are from other safety research teams (e.g. Anthropic, OpenAI, CAIS, and Redwood).
Hi Scott, thanks for this!
Yes, I did do a fair bit of literature searching (though maybe not enough, tbf), but it was very focused on sparse coding and approaches to learning decompositions of model activation spaces, rather than approaches to learning models which are monosemantic by default, which I’ve never had much confidence in. Within that area, there doesn’t seem to be a huge amount beyond Yun et al.’s work, at least as far as I’ve seen.
Still, though, I’ve seen almost none of these, which suggests a big hole in my knowledge. In the paper I’ll go through and add a lot more background on attempts to make more interpretable models.