I’m finishing up my PhD on tensor network algorithms at the University of Queensland, Australia, under Ian McCulloch. I’ve also proposed a new definition of wavefunction branches using quantum circuit complexity.
Predictably, I’m moving into AI safety work. See my post on graphical tensor notation for interpretability. I also attended the Machine Learning for Alignment Bootcamp in Berkeley in 2022, did a machine learning/neuroscience internship in 2020/2021, and wrote a post exploring the potential counterfactual impact of AI safety work.
My website: https://sites.google.com/view/jordantensor/
Contact me: jordantensor [at] gmail [dot] com. Also see my CV, LinkedIn, or Twitter.
This seems easy to try and a reasonable starting point to iterate from, so you should give it a go. But I worry that U and V will be dense and hard to interpret:
A contains no information about which actual tokens each SAE feature activated on, right? Just the token positions? So activations in completely different contexts, but with the same features active at the same token positions, cannot be distinguished by A?
I’m not sure why you expect A to have low-rank structure. Being low-rank is often in tension with being sparse, and we know that A is a very sparse matrix.
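To make the sparse-versus-low-rank tension concrete, here is a minimal sketch on a random sparse non-negative matrix. This is a hypothetical stand-in for A, not real SAE activations, and the shape and density are assumptions chosen only for illustration. It checks how many singular values are needed to capture 90% of the squared Frobenius norm, and how dense and signed the left singular vectors come out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for A: 200 x 300, about 2% non-negative entries.
n_rows, n_cols, density = 200, 300, 0.02
mask = rng.random((n_rows, n_cols)) < density
A = np.where(mask, rng.random((n_rows, n_cols)), 0.0)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_for_energy(singular_values, frac=0.9):
    """Smallest k such that the top-k singular values hold `frac` of the energy."""
    energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    return int(np.searchsorted(energy, frac) + 1)

k90 = rank_for_energy(s)
dense_frac_U = float(np.mean(np.abs(U) > 1e-6))  # fraction of non-negligible entries in U
neg_frac_U = float(np.mean(U < 0))               # fraction of negative entries in U

print(f"singular values needed for 90% energy: {k90} of {len(s)}")
print(f"fraction of U entries that are non-negligible: {dense_frac_U:.2f}")
print(f"fraction of U entries that are negative: {neg_frac_U:.2f}")
```

A real A could of course have structure that a random matrix lacks, so this only shows that sparsity by itself does not imply a fast-decaying spectrum or sparse singular vectors.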
Perhaps it would be better to utilize the fact that A is a very sparse matrix of positive entries? Maybe permutation matrices or sparse matrices would be more apt than general orthogonal matrices (which can have negative entries)? (Then you might have to settle for something like a block-diagonal central matrix, rather than a diagonal matrix of singular values.)
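One simple version of the permutation idea, sketched under my own assumptions: treat rows and columns of A as the two sides of a bipartite graph, find its connected components, and sort rows and columns by component label. The permuted matrix is then exactly block-diagonal. The helper name `block_diagonalize` and the toy matrix are hypothetical. On real activations the graph may well be one giant component, in which case you would need an approximate clustering rather than exact components.

```python
import numpy as np
from scipy.sparse import bmat, csr_matrix
from scipy.sparse.csgraph import connected_components

def block_diagonalize(A):
    """Find row/column permutations that make A block-diagonal.

    Rows and columns are nodes of a bipartite graph with an edge wherever
    A is nonzero. Each connected component becomes one block.
    All-zero rows or columns end up as their own components.
    """
    A = csr_matrix(A)
    n_rows, _ = A.shape
    # Bipartite adjacency: row nodes first, then column nodes.
    B = bmat([[None, A], [A.T, None]], format="csr")
    n_blocks, labels = connected_components(B, directed=False)
    row_perm = np.argsort(labels[:n_rows], kind="stable")
    col_perm = np.argsort(labels[n_rows:], kind="stable")
    return row_perm, col_perm, n_blocks

# Toy example: two hidden blocks, scrambled by random permutations.
rng = np.random.default_rng(0)
hidden = np.zeros((6, 8))
hidden[:3, :4] = 1.0
hidden[3:, 4:] = 2.0
A = hidden[rng.permutation(6)][:, rng.permutation(8)]

row_perm, col_perm, n_blocks = block_diagonalize(A)
P = A[row_perm][:, col_perm]  # equivalent to P_row @ A @ P_col
print(n_blocks)
print(P)
```

Indexing with `row_perm` and `col_perm` is the same as multiplying by permutation matrices on the left and right, so the factors stay sparse and non-negative, at the cost of a central matrix that is only block-diagonal.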
I’m keen to see stuff in this direction though! I certainly think you could construct some matrix or tensor of SAE activations such that some decomposition of it is interpretable in an interesting way.