Charlie Steiner comments on Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Charlie Steiner 22 Sep 2023 9:30 UTC
LW: 4 AF: 2
Did you ever try out independent component analysis? There’s even a scikit-learn implementation. If you haven’t, I’m strongly tempted to throw an undergrad at it (in an RL setting where it makes sense to look for features that are coherent across time).
EDIT: Never mind, it’s in the paper. And also, I guess, in the figure, if I had been paying closer attention :P
- Hoagy 22 Sep 2023 15:01 UTC
  LW: 3 AF: 2
  Hi Charlie, yep, it’s in the paper, but I should say that we did not find a working CUDA-compatible version and used the scikit-learn version you mention. This meant that the data volumes used were somewhat limited: still on the order of a million examples, but 10-50x fewer than went into the autoencoders.
  
  It’s not clear whether the extra data would provide much signal, since ICA can’t learn an overcomplete basis and so has no way of learning rare features. Still, with more data it might be able to outperform our ICA baseline presented here, so if you wanted to give someone the project of making a CUDA-compatible version available, I’d be interested to see it!
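
For readers who want to try the CPU route discussed above, here is a minimal sketch of fitting scikit-learn's `FastICA` to a matrix of activations. It is not the paper's code: the helper name `fit_ica_baseline`, the synthetic Laplace-distributed sources, and the sizes `n_samples` and `d_model` are all assumptions made for illustration, and real model activations would replace the synthetic matrix. The sketch also shows the limitation mentioned in the thread: ICA returns at most `d_model` directions, so it cannot form an overcomplete basis.

```python
import numpy as np
from sklearn.decomposition import FastICA


def fit_ica_baseline(activations, n_components=None, seed=0):
    """Fit FastICA to an (n_samples, d_model) activation matrix.

    Hypothetical helper for illustration; not taken from the paper.
    """
    ica = FastICA(n_components=n_components, max_iter=1000, random_state=seed)
    ica.fit(activations)
    return ica


# Synthetic stand-in for model activations (an assumption): sparse,
# non-Gaussian sources mixed linearly into a d_model-dimensional space.
rng = np.random.default_rng(0)
n_samples, d_model = 5000, 8
sources = rng.laplace(size=(n_samples, d_model))
mixing = rng.normal(size=(d_model, d_model))
activations = sources @ mixing.T

ica = fit_ica_baseline(activations)

# Each row of components_ is one learned unmixing direction. There are at
# most d_model of them, so the learned basis is never overcomplete.
directions = ica.components_
codes = ica.transform(activations)
print(directions.shape, codes.shape)
```

Everything here runs on the CPU in a single fit over the full array, which is part of why data volume becomes the bottleneck at the scale described above.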