[Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
In sparse coding, one can derive the prior over encoded variables that a given sparsity penalty corresponds to: under MAP inference, the penalty is the negative log of the prior. E.g. an L1 penalty assumes a Laplace prior over feature activations, while a log(1+a^2) penalty assumes a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?
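A quick sketch of that correspondence (the notation here is assumed, not from the original: x is the input activation vector, D the decoder dictionary, a the feature activations, λ the penalty coefficient):

```latex
% MAP view: minimizing reconstruction error plus a sparsity penalty
% S(a) is, up to constants and the noise scale, maximizing the
% posterior under Gaussian noise and a prior p(a) \propto exp(-λ S(a)):
\hat{a} \;=\; \arg\min_a \,\|x - Da\|_2^2 + \lambda S(a)
       \;=\; \arg\max_a \,\log p(x \mid a) + \log p(a).
% Hence S(a) = \sum_i |a_i| corresponds to a Laplace prior on each a_i,
% and S(a) = \sum_i \log(1 + a_i^2) to a Cauchy prior (for λ = 1).
```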
This is very interesting! What prior does log(1+|a|) correspond to? And what about using ∏_i (1+|a_i|) instead of ∑_i log(1+|a_i|)? Does this only hold if we expect feature activations to be independent (rather than, say, mutually exclusive)?
A prior that doesn’t assume independence should give you a sparsity penalty that isn’t a sum of independent penalties for each activation.
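For concreteness, here is a hypothetical PyTorch sketch (the penalty functions, group structure, and encoder/decoder interface are all illustrative assumptions, not anything from the thread) of how different penalties, including a non-factorized one, would slot into an SAE loss:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: a toy SAE loss with swappable sparsity penalties.
# Each factorized penalty S(a) = sum_i s(a_i) corresponds (at MAP, up to
# normalization) to an independent prior p(a_i) ∝ exp(-λ s(a_i)).

def l1_penalty(a):
    # s(a_i) = |a_i|  ->  Laplace prior
    return a.abs().sum(dim=-1)

def cauchy_penalty(a):
    # s(a_i) = log(1 + a_i^2)  ->  Cauchy prior (for λ = 1)
    return torch.log1p(a.pow(2)).sum(dim=-1)

def log1p_abs_penalty(a):
    # s(a_i) = log(1 + |a_i|)  ->  (improper) prior p(a_i) ∝ 1 / (1 + |a_i|)
    return torch.log1p(a.abs()).sum(dim=-1)

def group_penalty(a, groups):
    # Non-factorized example: group sparsity sum_g ||a_g||_2 couples the
    # activations within each group, so the implied prior does not
    # factor into independent per-feature terms.
    return torch.stack(
        [a[..., g].pow(2).sum(dim=-1).sqrt() for g in groups], dim=-1
    ).sum(dim=-1)

def sae_loss(x, encoder, decoder, penalty, lam=1e-3):
    a = F.relu(encoder(x))   # feature activations
    x_hat = decoder(a)       # reconstruction
    recon = (x - x_hat).pow(2).sum(dim=-1)
    return (recon + lam * penalty(a)).mean()
```

Swapping `penalty=l1_penalty` for `penalty=group_penalty` changes the implied prior from independent Laplace marginals to one with within-group dependence, which is the kind of non-factorized penalty the reply above describes. On the product question: since ∑_i log(1+|a_i|) = log ∏_i (1+|a_i|), penalizing with the product is a monotone transform of the sum penalty, so the two only differ once they are combined additively with the reconstruction term.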