[Proposal] Do SAEs capture simplicial structure? Investigating SAE representations of known case studies
It’s an open question whether SAEs capture underlying properties of feature geometry. Fortunately, careful research has elucidated a few examples of nonlinear geometry already. It would be useful to think about whether SAEs recover these geometries.
The proposal here is: look at the SAE activations for the tetrahedron, identify a relevant cluster, and then evaluate whether this matches the ground-truth.
[Proposal] Do SAEs capture simplicial structure? Investigating SAE representations of known case studies
It’s an open question whether SAEs capture underlying properties of feature geometry. Fortunately, careful research has elucidated a few examples of nonlinear geometry already. It would be useful to think about whether SAEs recover these geometries.
Simplices in models. Work studying hierarchical structure in feature geometry finds that sets of things are often represented as simplices, which are a specific kind of regular polytope. Simplices are also the structure of belief state geometry.
The proposal here is: look at the SAE activations for the tetrahedron, identify a relevant cluster, and then evaluate whether this matches the ground-truth.