Vladimir_Nesov comments on Transcoders enable fine-grained interpretable circuit analysis for language models

Vladimir_Nesov 30 Apr 2024 21:31 UTC
LW: 3 AF: 2

> There is a tradeoff between interpretability and fidelity

I wonder what would happen if something like transcoders were used to guide pre-training, in a way similar to quantization-aware training. There, forward passes are computed under quantization, while gradients and optimizer states are maintained in full precision. For extreme levels of quantization, this produces quantized models that achieve loss much closer to that of a full-precision model than post-training quantization (to the same degree) of a model whose training wasn’t guided this way. With transcoders, the “full precision” part is the MLPs, while “quantization” is the replacement of each MLP by its corresponding transcoder.
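
For readers unfamiliar with quantization-aware training, here is a minimal toy sketch of the mechanism described above: the forward pass uses a quantized weight, while the update is applied to a full-precision master weight. It assumes a straight-through estimator (treating the quantizer as the identity in the backward pass), which is one common way to do this and is not something the comment itself specifies. The function names, the one-weight model, and the grid step are all hypothetical choices made for illustration.

```python
def quantize(w, step=0.5):
    """Round w to the nearest multiple of `step` (a crude stand-in for low-bit quantization)."""
    return round(w / step) * step


def train_quantization_aware(target=1.3, step=0.5, lr=0.05, n_steps=2000):
    """Fit y = target * x with a single weight, running every forward pass quantized."""
    xs = [-1.0, -0.5, 0.5, 1.0]  # tiny fixed dataset, so the run is deterministic
    w = 0.0  # full-precision master weight, the only thing the optimizer updates
    for i in range(n_steps):
        x = xs[i % len(xs)]
        y = target * x
        w_q = quantize(w, step)   # forward pass sees the quantized weight
        err = w_q * x - y
        grad_wq = 2.0 * err * x   # gradient of the squared error w.r.t. w_q
        w -= lr * grad_wq         # straight-through: apply it to w as if dw_q/dw = 1
    return w, quantize(w, step)


if __name__ == "__main__":
    w, w_q = train_quantization_aware()
    print(f"master weight: {w:.3f}, quantized weight: {w_q}")
```

In this toy, the target 1.3 is not on the grid, so the master weight should end up hovering near the rounding boundary between the two neighboring grid values (1.0 and 1.5). In the transcoder analogy, `w` would play the role of the MLPs and `quantize(w)` the role of the transcoders substituted into the forward pass; a real experiment would also need a way to keep the transcoders in sync with the changing MLPs, which this sketch does not model.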
- Philippe Chlenski 1 May 2024 14:25 UTC
  LW: 3 AF: 1
  This sounds like it could work. I can think of a few reasons why this approach could be challenging, however:
  1. We don’t really know how transcoders (or SAEs, to the best of my knowledge) behave when they’re being trained to imitate a model component that’s still updating.
  2. Substituting multiple transcoders at once is possible, but degrades model performance a lot compared to single-transcoder substitutions. Substituting one transcoder at a time would require restarting the forward pass at each layer.
  3. If the transcoders are used to predict next tokens, they may lose interpretability and return to superposition.
  Under a “transcoder-aware” training regime, these would be the first things I would check for.
  Also, you may be interested in Jacob’s comment here for some details on when we tried to co-train SAEs and transcoders to have sparse connections to one another. This is a very different question, of course, but it provides some preliminary evidence that the fidelity-interpretability tradeoff persists across more elaborate training settings.
  - Vladimir_Nesov 1 May 2024 15:55 UTC
    LW: 3 AF: 2
    
    > If the transcoders are used to predict next tokens, they may lose interpretability
    
    Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are the thing that pre-training optimizes (as the “full-precision” master model), while transcoders only need to be maintained to remain in sync with the MLPs, whatever they are (according to the same local objective as before, which doesn’t care at all about token prediction). The search is for MLPs such that their transcoders are good predictors, not directly for transcoders that are good predictors.
    
    > Substituting multiple transcoders at once is possible, but degrades model performance a lot compared to single-transcoder substitutions.
    
    Unclear, given the extreme quantization results: there, post-training replacement similarly degrades model performance a lot, yet quantization-aware pre-training somehow doesn’t.
    
    > We don’t really know how transcoders (or SAEs, to the best of my knowledge) behave when they’re being trained to imitate a model component that’s still updating
    
    This seems to be the main technical hurdle to running the experiment: updating the transcoders both efficiently and correctly as the underlying MLPs gradually change. (I’m guessing some discontinuous jumps in the choice of transcoders might be OK.)
    - Philippe Chlenski 1 May 2024 23:40 UTC
      LW: 3 AF: 2
      > Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are the thing that pre-training optimizes (as the “full-precision” master model), while transcoders only need to be maintained to remain in sync with the MLPs
      I see. I was in fact misunderstanding this detail of your training setup. In this case, only engineering considerations really remain: these boil down to incorporating multiple transcoders simultaneously and modeling shifting MLP behavior with transcoders. These seem tractable, although probably nontrivial and, because of the LLM pretraining objective, quite computationally expensive. If transcoders catch on, I hope to see someone with the compute budget for it run this experiment!