Logan Riggs comments on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Logan Riggs 27 Aug 2024 16:44 UTC
LW: 2 AF: 1
What is the activation name for the resid SAEs? hook_resid_post or hook_resid_pre?

The code at https://github.com/ApolloResearch/e2e_sae/blob/main/e2e_sae/scripts/train_tlens_saes/run_train_tlens_saes.py#L220
suggests _post,
but downloading the SAETransformer from wandb shows:
(saes): ModuleDict(
  (blocks-6-hook_resid_pre): SAE(
    (encoder): Sequential(
      (0):...
which suggests _pre.
- Dan Braun 29 Aug 2024 15:05 UTC
  8 points
  They are indeed all hook_resid_pre. The code you’re looking at just lists a set of positions that we are interested in viewing the reconstruction error of during evaluation. In particular, we want to view the reconstruction error at hook_resid_post of every layer, including the final layer (which you can’t get from hook_resid_pre).
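
To make the pre/post relationship concrete, here is a minimal sketch in plain Python, assuming TransformerLens-style hook names of the form blocks.{i}.hook_resid_pre. The helper names and the 12-layer model size are hypothetical and only for illustration. The key conversion assumes the SAE keys are the hook names with "." swapped for "-" (PyTorch ModuleDict keys cannot contain "."), which matches the printed key above but is not confirmed by the thread.

```python
from typing import List, Optional


def resid_hook_names(n_layers: int, position: str) -> List[str]:
    """List residual-stream hook names for every layer, for 'pre' or 'post'."""
    return [f"blocks.{i}.hook_resid_{position}" for i in range(n_layers)]


def equivalent_resid_pre(post_name: str, n_layers: int) -> Optional[str]:
    """Return the hook_resid_pre name that sees the same activation as post_name.

    The output of block i is the input of block i + 1, so
    blocks.i.hook_resid_post matches blocks.(i + 1).hook_resid_pre.
    The final layer has no following block, so it returns None.
    """
    layer = int(post_name.split(".")[1])
    if layer + 1 < n_layers:
        return f"blocks.{layer + 1}.hook_resid_pre"
    return None


def module_key_to_hook_name(key: str) -> str:
    """Assumption: SAE keys are hook names with '.' replaced by '-'."""
    return key.replace("-", ".")


if __name__ == "__main__":
    n_layers = 12  # hypothetical model size, for illustration only
    print(module_key_to_hook_name("blocks-6-hook_resid_pre"))
    for name in resid_hook_names(n_layers, "post"):
        print(name, "->", equivalent_resid_pre(name, n_layers))
```

Under these assumptions, every hook_resid_post except the last maps to the next layer's hook_resid_pre, and the last one maps to None, which is why the evaluation positions are listed as hook_resid_post even though the SAEs sit at hook_resid_pre.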