In my experiments, log L0 vs. log unexplained variance comes out as a nice straight line. I think your autoencoders might be substantially undertrained (especially given that training longer moves the points a long way relative to the frontier). Scaling up the data by 10x or 100x wouldn't be crazy.
(Also, I think raw L0 is more meaningful than L0 / d_hidden when comparing across different values of d_hidden; I assume "percent active features" is the latter.)
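For concreteness, here's a minimal sketch of that frontier check, assuming you have (L0, unexplained variance) pairs from a sweep of autoencoders; the arrays below are placeholders, so substitute measurements from your own runs:

```python
# Fit a line to log(L0) vs. log(unexplained variance) across a sweep
# of autoencoders. If the frontier claim above holds, the fit should
# be tight, and points sitting well above the line are off-frontier.
import numpy as np

# Placeholder sweep results; replace with your measured values.
l0 = np.array([10.0, 20.0, 40.0, 80.0, 160.0])               # mean active features per input
unexplained_var = np.array([0.30, 0.21, 0.15, 0.10, 0.07])   # 1 - fraction of variance explained

log_l0 = np.log(l0)
log_uv = np.log(unexplained_var)

# Least-squares line in log-log space.
slope, intercept = np.polyfit(log_l0, log_uv, 1)
residuals = log_uv - (slope * log_l0 + intercept)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"max |log-space residual|: {np.abs(residuals).max():.3f}")
# Runs that land noticeably above the fitted line (positive residual)
# are candidates for undertraining per the comment above.
```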