It’s just that eval and training are so damn similar, and all other problems are so different. So while it is technically not overfitting (to this problem), it is certainly overfitting to this specific problem, and it certainly isn’t measuring generalization in any sense of the word. Certainly not in the sense of helping us debug alignment for all problems.
This is an error that, imo, all current papers make, though! So it’s not a criticism so much as an interesting debate, and a nudge to use a harder test or OOD set in your benchmarks next time.
But you can’t say they’re more scalable than SAEs, because SAEs don’t have to have 8 times the number of features.
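To make that concrete, here is a minimal sketch of a standard ReLU + L1 sparse autoencoder where the dictionary width is an explicit hyperparameter, so nothing forces an 8x expansion. The class name, `expansion_factor`, and `l1_coeff` are my own illustrative choices, not anyone’s published implementation.

```python
# Minimal SAE sketch (assumed ReLU encoder + L1 sparsity penalty).
# The dictionary size is d_model * expansion_factor, a free choice:
# it can be 1x, 2x, or 8x the activation width.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion_factor: float = 2.0, l1_coeff: float = 1e-3):
        super().__init__()
        d_dict = int(d_model * expansion_factor)  # dictionary width is a hyperparameter
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, x):
        f = torch.relu(self.encoder(x))            # sparse feature activations
        x_hat = self.decoder(f)                    # reconstruction of the input
        recon_loss = (x_hat - x).pow(2).mean()     # reconstruction error
        sparsity_loss = self.l1_coeff * f.abs().mean()  # L1 penalty on features
        return x_hat, recon_loss + sparsity_loss

# e.g. a 2x dictionary instead of 8x:
# sae = SparseAutoencoder(d_model=768, expansion_factor=2.0)
```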
Yeah, good point. I just can’t help but think there must be a way of using unsupervised learning to force a compressed, human-readable encoding. Going uncompressed just seems wasteful, and like it won’t scale. But I can’t think of a machine-learnable, unsupervised, human-readable encoding. Any ideas?