This is great work! I like that you tested on large models and that your benchmarking is very comprehensive. I also like the BatchTopK architecture.
It’s interesting to me that MSE takes a smaller hit than cross-entropy.
Here are some notes I made:
We suspect that using a fixed group size leads to more stable training and faster convergence.
This seems plausible to me!
Should the smallest sub-SAE get gradients from all losses, or should the losses from larger sub-SAEs be stopped?
When I tried stopping the gradients of the larger sub-SAE losses from flowing back to the smaller sub-SAEs, it made the later latents much less interpretable. I also tried an approach where the early latents received a down-weighted gradient from the larger sub-SAE losses, and that likewise seemed to produce less interpretable late latents. I don’t know what’s going on here.
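In case it helps clarify what I mean by "stopping the gradient", here is a minimal PyTorch sketch of the variant I tried, assuming a Matryoshka-style SAE trained on a sum of nested-prefix reconstruction losses. The names (MatryoshkaSAE, prefix_sizes, stop_grad_to_small) are mine, ReLU stands in for BatchTopK, and this is not the authors' implementation:

```python
import torch
import torch.nn as nn


class MatryoshkaSAE(nn.Module):
    def __init__(self, d_model, n_latents, prefix_sizes, stop_grad_to_small=False):
        super().__init__()
        self.enc = nn.Linear(d_model, n_latents)
        self.dec = nn.Linear(n_latents, d_model)
        self.prefix_sizes = prefix_sizes          # nested group sizes, e.g. [1024, 4096, 16384]
        self.stop_grad_to_small = stop_grad_to_small

    def forward(self, x):
        # ReLU stands in for the BatchTopK sparsity used in the paper.
        z = torch.relu(self.enc(x))               # (batch, n_latents)
        total_loss = 0.0
        for i, k in enumerate(self.prefix_sizes):
            z_k = z[:, :k]
            W_k = self.dec.weight[:, :k]          # (d_model, k)
            if self.stop_grad_to_small and i > 0:
                # Block this larger sub-SAE's loss from updating the latents
                # (and decoder columns) of the previous, smaller prefix.
                prev = self.prefix_sizes[i - 1]
                z_k = torch.cat([z_k[:, :prev].detach(), z_k[:, prev:]], dim=1)
                W_k = torch.cat([W_k[:, :prev].detach(), W_k[:, prev:]], dim=1)
            recon = z_k @ W_k.T + self.dec.bias
            total_loss = total_loss + ((recon - x) ** 2).mean()
        return total_loss
```

With stop_grad_to_small=True, each group of latents is only updated by its own sub-SAE's loss and smaller ones; with it False, the early latents get gradients from every prefix loss.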
What effect does the latent sorting that Noa uses have on the benchmarks?
I tried not ordering the latents, and it did comparably on FVU/L0. I vaguely recall that on mean-max correlation the permuted version did worse on the early latents and better on the middle latents. At a quick glance I weakly preferred the permuted SAE’s latents, but this was very preliminary and I’m not confident in it.
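For reference, here is a quick sketch of the metric definitions I have in mind here (standard FVU and L0; my own sketch, not necessarily the exact evaluation code used in the paper):

```python
import torch

def fvu(x, x_hat):
    # Fraction of variance unexplained: residual power divided by the
    # variance of the original (mean-centered) activations.
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(dim=0)) ** 2).sum()
    return (resid / total).item()

def l0(z):
    # Mean number of nonzero latents per example.
    return (z != 0).float().sum(dim=-1).mean().item()
```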
I’d love to chat more with the authors. I think it’d be fun to compare notes and ideas, and to explore how our beliefs and processes evolved over the course of making the papers.