chanind comments on [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

chanind 25 Sep 2024 17:13 UTC
5 points
Also worth noting: in the paper we only classify something as “absorption” if the main latent does not fire at all. We also saw cases that I would call “partial absorption”, where the main latent fires, but weakly. In those cases both the absorbing latent and the main latent have positive cosine similarity with the probe direction, and both have an ablation effect on the spelling task.
Another intuition I have is that when the SAE absorbs a dense feature like “starts with S” into a sparse latent like “snake”, it loses the ability to adjust the levels of the component features relative to each other. So maybe the “snake” latent is 80% snakiness and 20% starts with S, but in a real activation the SAE needs to reconstruct 75% snakiness and 25% starts with S. To do this, it might fire a proper “starts with S” latent, but weakly, to make up the difference.
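
To make the 80/20 versus 75/25 numbers concrete, here is a minimal sketch of the arithmetic. It assumes a hypothetical 2-D toy basis in which "snakiness" and "starts with S" are orthogonal axes, and it treats the percentages as raw coordinates along those axes. The vectors and the helper `solve_two_latents` are made up for illustration and are not taken from the paper or from any real SAE.

```python
# Hypothetical 2-D toy basis: axis 0 = "snakiness", axis 1 = "starts with S".
# All numbers here are illustrative assumptions, not measurements.
snake_latent = (0.80, 0.20)  # absorbing latent: 80% snakiness, 20% starts-with-S
s_latent = (0.00, 1.00)      # a clean "starts with S" latent
target = (0.75, 0.25)        # the mix the SAE needs to reconstruct


def solve_two_latents(d1, d2, x):
    """Solve a*d1 + b*d2 = x for the activations (a, b) via Cramer's rule."""
    det = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(det) < 1e-12:
        raise ValueError("decoder directions are parallel; no unique solution")
    a = (x[0] * d2[1] - x[1] * d2[0]) / det
    b = (d1[0] * x[1] - d1[1] * x[0]) / det
    return a, b


snake_act, s_act = solve_two_latents(snake_latent, s_latent, target)
print(f"snake latent activation: {snake_act:.4f}")
print(f"starts-with-S latent activation: {s_act:.4f}")
```

Under these assumptions the "snake" latent fires at about 0.94 and the "starts with S" latent at about 0.06, which is the weak "make up the difference" activation described above. A real SAE also pays an L1 cost for that second latent, so whether it actually fires would depend on the penalty, which this sketch ignores.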

Hopefully this is something we can validate with toy models. I suspect that the specific values of L1 penalty and feature co-occurrence rates / magnitudes will lead to different levels of absorption.