I use 10 prefix-losses per batch here; I tried 100 prefixes per batch and the learned latents looked similar at a quick glance, so I wonder whether naively training with block size = 1 would even be qualitatively different. I'm not sure about that, though, and training faster with kernels seems good on its own!
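For concreteness, here is a minimal sketch (PyTorch, with hypothetical tensor names) of the sampled-prefix setup described above: each batch samples a handful of prefix lengths and averages the reconstruction loss over those prefixes of the latent dimension. Block size = 1 would correspond to using every prefix length instead of a sample.

```python
import torch

def matryoshka_prefix_loss(x, z, W_dec, b_dec, n_prefixes=10):
    """x: [batch, d_model] inputs, z: [batch, m] SAE latents (already encoded),
    W_dec: [m, d_model] decoder weights, b_dec: [d_model] decoder bias."""
    m = z.shape[1]
    # Sample prefix lengths for this batch; block size = 1 would use every length 1..m.
    prefix_lens = torch.randint(low=1, high=m + 1, size=(n_prefixes,))
    losses = []
    for k in prefix_lens:
        # Reconstruction using only the first k latents in the matryoshka ordering.
        x_hat_k = z[:, :k] @ W_dec[:k] + b_dec
        losses.append(((x - x_hat_k) ** 2).sum(dim=-1).mean())
    return torch.stack(losses).mean()
```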
Maybe if you had a kernel for training with block size = 1, it would create surface area for figuring out how to work on absorption between latents that sit right next to each other in the matryoshka latent ordering.
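As a hedged sketch of what block size = 1 means computationally: there is a loss term at every prefix length 1..m, and all prefix reconstructions can be expressed as a cumulative sum of per-latent contributions. The naive version below materializes a [batch, m, d_model] tensor, which is the memory cost a dedicated kernel could presumably avoid.

```python
import torch

def all_prefix_losses(x, z, W_dec, b_dec):
    """x: [batch, d_model], z: [batch, m], W_dec: [m, d_model], b_dec: [d_model].
    Returns the per-prefix reconstruction losses for every prefix length 1..m."""
    # Contribution of latent i to the reconstruction: z[:, i, None] * W_dec[i].
    contribs = z.unsqueeze(-1) * W_dec.unsqueeze(0)    # [batch, m, d_model]
    # Prefix-k reconstruction lives at index k-1 after the cumulative sum.
    x_hat_prefix = contribs.cumsum(dim=1) + b_dec      # [batch, m, d_model]
    err = x.unsqueeze(1) - x_hat_prefix                # [batch, m, d_model]
    return (err ** 2).sum(dim=-1).mean(dim=0)          # [m] losses, one per prefix
```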
Yes, follow-up work with bigger LMs seems good!