I think it would be valuable to take a set of interesting examples of understood internal structure, and to ask what happens when we train SAEs to try to capture this structure. [...] In other cases, it may seem to us very unnatural to think of the structure we have uncovered in terms of a set of directions (sparse or otherwise) — what does the SAE do in this case?
I’m not sure how SAEs would capture the internal structure of the activations of the pizza model for modular addition, even in theory. In this case, ReLU is used to perform numerical integration, approximating $\int_{-\pi}^{\pi} \left|\cos\!\left(\frac{k}{2}+\phi\right)\right| \cos(2\phi)\, d\phi = \frac{4}{3}\cos k$ (and/or similarly for $\sin$). Each neuron is responsible for one small rectangle under the curve. Its input is the part of the integrand under the absolute value/ReLU, $\cos\!\left(\frac{k}{2}+\phi\right)$ (times a shared scaling coefficient), and the neuron’s coefficient in the Fourier-transformed decoder matrix is the area element $\cos(2\phi)\,d\phi$ (again times a shared scaling coefficient).
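To make the scheme concrete, here is a minimal numerical sketch in Python. This is a hypothetical construction of mine, not the trained pizza model’s weights: `n_neurons`, the midpoint evaluation points, and `integral_estimate` are all illustrative choices. Each “neuron” handles one rectangle of the Riemann sum, with preactivation $\cos(k/2+\phi_i)$ and decoder coefficient equal to the area element $\cos(2\phi_i)\,\Delta\phi$:

```python
import numpy as np

# Toy sketch (hypothetical, not the trained model's weights): each "neuron"
# contributes one rectangle of the Riemann sum approximating
#   integral_{-pi}^{pi} |cos(k/2 + phi)| cos(2*phi) dphi = (4/3) cos(k).

def relu(x):
    return np.maximum(x, 0.0)

n_neurons = 64
edges = np.linspace(-np.pi, np.pi, n_neurons + 1)  # box boundaries
phi = 0.5 * (edges[:-1] + edges[1:])               # evaluation point per box
dphi = np.diff(edges)                              # box widths (uniform here)

def integral_estimate(k):
    pre = np.cos(k / 2 + phi)       # neuron inputs (shared scaling omitted)
    act = relu(pre) + relu(-pre)    # |x| realized with a pair of ReLUs
    dec = np.cos(2 * phi) * dphi    # decoder coefficients = area elements
    return float(np.sum(act * dec))

for k in [0.0, 1.0, 2.5]:
    print(f"k={k}: sum={integral_estimate(k):+.4f}, "
          f"(4/3)cos(k)={4 / 3 * np.cos(k):+.4f}")
```

One detail: I build the absolute value from a pair of ReLUs, since a single half-wave ReLU would pick up only half of the relevant Fourier coefficient and yield $\frac{2}{3}\cos k$ instead.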
Notably, in this scheme, the only fully free parameters are the frequencies of interest, the ordering of the neurons, and the two scaling coefficients. There are also constrained parameters: how evenly the space is divided into boxes, and where the function evaluation point sits within each box. But the geometry of activation space here is effectively fully determined up to permutation of the axes and global scaling factors.
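As a quick check on how loosely those box parameters are constrained, one can jitter the box boundaries and evaluation points and confirm the estimate barely moves. This continues the toy sketch above (same caveats: a hypothetical construction, reusing `relu`, `edges`, `dphi`, and `n_neurons`):

```python
# Continues the sketch above (reuses np, relu, edges, dphi, n_neurons).
rng = np.random.default_rng(0)

# Jitter the interior box boundaries, then pick an arbitrary evaluation
# point inside each box; the Riemann sum should change only slightly.
inner = edges[1:-1] + rng.uniform(-0.4, 0.4, n_neurons - 1) * dphi.mean()
edges_j = np.concatenate(([-np.pi], np.sort(inner), [np.pi]))
phi_j = rng.uniform(edges_j[:-1], edges_j[1:])  # random point per box
dphi_j = np.diff(edges_j)

def integral_estimate_jittered(k):
    pre = np.cos(k / 2 + phi_j)
    act = relu(pre) + relu(-pre)
    return float(np.sum(act * np.cos(2 * phi_j) * dphi_j))

for k in [0.0, 1.0, 2.5]:
    print(f"k={k}: jittered={integral_estimate_jittered(k):+.4f}, "
          f"(4/3)cos(k)={4 / 3 * np.cos(k):+.4f}")
```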
What could SAEs even find in this case?