If the transcoders are used to predict next tokens, they may lose interpretability.
Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are what pre-training optimizes (as the “full-precision” master model), while the transcoders only need to be kept in sync with the MLPs, whatever they currently are (using the same local objective as before, which doesn’t care at all about token prediction). The search is for MLPs whose transcoders are good predictors, not directly for transcoders that are good predictors.
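Concretely, the mechanics could look something like the minimal PyTorch sketch below for a single MLP block. The straight-through-style substitution is one possible way (my assumption, not a settled design) to let the token-prediction loss update the MLP while the transcoder only ever sees its local reconstruction objective; the dimensions, data, and the omission of a sparsity penalty are all placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_mlp, d_dict = 64, 256, 512  # toy sizes, chosen arbitrarily

# "Full-precision master": the MLP that pre-training actually optimizes.
mlp = nn.Sequential(nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model))
# The transcoder kept in sync with it via a purely local objective (sparsity penalty omitted).
transcoder = nn.Sequential(nn.Linear(d_model, d_dict), nn.ReLU(), nn.Linear(d_dict, d_model))

opt_mlp = torch.optim.Adam(mlp.parameters(), lr=1e-4)        # pre-training optimizer
opt_tc = torch.optim.Adam(transcoder.parameters(), lr=1e-3)  # local-objective optimizer

def substituted_block(x):
    """Forward value comes from the transcoder; the gradient flows to the MLP (straight-through)."""
    mlp_out = mlp(x)
    tc_out = transcoder(x.detach())  # the LM loss never trains the transcoder directly
    return mlp_out + (tc_out - mlp_out).detach()

for step in range(1_000):
    x = torch.randn(32, d_model)       # stand-in for residual-stream inputs to this block
    target = torch.randn(32, d_model)  # stand-in for whatever the token-prediction loss needs

    # 1) Pre-training step: search for MLPs such that *their transcoders* are good predictors.
    lm_loss = (substituted_block(x) - target).pow(2).mean()
    opt_mlp.zero_grad()
    lm_loss.backward()
    opt_mlp.step()

    # 2) Local step: keep the transcoder in sync with whatever the MLP currently is.
    recon_loss = (transcoder(x) - mlp(x).detach()).pow(2).mean()
    opt_tc.zero_grad()
    recon_loss.backward()
    opt_tc.step()
```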
Substituting multiple transcoders at once is possible, but degrades model performance a lot compared to single-transcoder substitutions.
Unclear, given the extreme quantization results: there too, post-training replacement would degrade model performance a lot, yet quantization-aware pre-training somehow doesn’t.
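For comparison, the quantization-aware-training pattern behind those results looks roughly like this (a minimal sketch with ternary weights and a straight-through estimator; the specifics differ across the actual papers): the forward pass uses the heavily quantized weights, while gradients accumulate in the full-precision master weights, so the model adapts to the quantization it will be run with.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Linear):
    """Linear layer trained quantization-aware: ternary weights forward, full-precision backward."""
    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-8)          # per-tensor scale
        w_q = (w / scale).round().clamp(-1, 1) * scale  # extreme (ternary) quantization
        w_ste = w + (w_q - w).detach()                  # straight-through: value w_q, gradient to w
        return F.linear(x, w_ste, self.bias)

# Rounding a finished full-precision layer this aggressively post hoc is very lossy, but a model
# trained with TernaryLinear from the start adapts its master weights to the quantized forward pass.
```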
We don’t really know how transcoders (or SAEs, to the best of my knowledge) behave when they’re being trained to imitate a model component that’s still updating.
This seems to be the main technical hurdle for running the experiment: updating the transcoders both efficiently and correctly as the underlying MLPs gradually change. (I’m guessing some discontinuous jumps in the choice of transcoders might be OK.)
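One possible schedule, building on the toy objects from the earlier sketch (the hyperparameters below are invented placeholders): take a few cheap local reconstruction steps per pre-training step, and occasionally re-fit the transcoder from scratch against the current MLP, which is where the discontinuous jumps would come in.

```python
def sync_transcoder(transcoder, opt_tc, mlp, acts, step,
                    local_steps_per_update=4, refresh_every=10_000):
    """Keep a transcoder tracking a still-training MLP (hypothetical schedule, not tuned).

    `transcoder`, `opt_tc`, `mlp` are the objects from the earlier sketch; `acts` is a batch
    of input activations for this block; `step` is the current pre-training step.
    """
    if step > 0 and step % refresh_every == 0:
        # Discontinuous jump: discard the current transcoder and fit a fresh one to the current
        # MLP (only the re-initialization is shown; a real re-fit would also reset the optimizer
        # state and loop over much more data).
        for module in transcoder.modules():
            if hasattr(module, "reset_parameters"):
                module.reset_parameters()
    # Otherwise, a few cheap local steps per pre-training step to track the slowly moving MLP.
    for _ in range(local_steps_per_update):
        recon_loss = (transcoder(acts) - mlp(acts).detach()).pow(2).mean()
        opt_tc.zero_grad()
        recon_loss.backward()
        opt_tc.step()
```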
Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are what pre-training optimizes (as the “full-precision” master model), while the transcoders only need to be kept in sync with the MLPs
I see. I was in fact misunderstanding this detail of your training setup. In that case, only engineering considerations really remain: incorporating multiple transcoders simultaneously, and modeling shifting MLP behavior with the transcoders. These seem tractable, although probably nontrivial and, because of the LLM pre-training objective, quite computationally expensive. If transcoders catch on, I hope to see someone with the compute budget for it run this experiment!
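For the first of these, substituting every layer’s transcoder at once during the pre-training forward pass could look something like the sketch below, reusing the straight-through substitution assumed earlier; attention, normalization, and everything else a real transformer block contains are omitted, and the names are placeholders.

```python
import torch.nn as nn

class SubstitutedStack(nn.Module):
    """Residual stack where every MLP's output is replaced by its transcoder's in the forward pass."""
    def __init__(self, mlps, transcoders):
        super().__init__()
        self.mlps = nn.ModuleList(mlps)                # the "master" MLPs being pre-trained
        self.transcoders = nn.ModuleList(transcoders)  # their locally-trained transcoders

    def forward(self, x):
        for mlp, tc in zip(self.mlps, self.transcoders):
            mlp_out = mlp(x)
            tc_out = tc(x.detach())                        # LM gradients never reach the transcoders
            x = x + mlp_out + (tc_out - mlp_out).detach()  # value from the transcoder, gradient to the MLP
        return x
```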