Lucius Bushnaq comments on Activation space interpretability may be doomed

Lucius Bushnaq 9 Jan 2025 13:41 UTC
3 points
0
In my limited experience, attribution-patching style attributions tend to be a pain to optimise for sparsity. Very brittle. I agree it seems like a good thing to keep poking at though.
- Louis Jaburi 9 Jan 2025 14:00 UTC
  1 point
  0
  Did you use something like $L_{SAE}$ as described here? By brittle, do you mean w.r.t. the sparsity penalty (and other hyperparameters)?
  - Lucius Bushnaq 9 Jan 2025 14:04 UTC
    3 points
    0
    The third term in that. Though it was in a somewhat different context related to the weight partitioning project mentioned in the last paragraph, not SAE training.
    
    Yes, brittle in hyperparameters. It was also just very painful to train in general. I wouldn’t straightforwardly extrapolate our experience to a standard SAE setup though; we had a lot of other things going on in that optimisation.
    - Louis Jaburi 9 Jan 2025 14:05 UTC
      1 point
      0
      I see, thanks for sharing!