Huh, what is up with the ultra low frequency cluster? If the things are actually firing on the same inputs, then you should really only need one output vector. And if they’re serving some useful purpose, then why is there only one and not more?
Idk man, I am quite confused. It’s possible they’re firing on different inputs—even with the same encoder vector, if you have a different bias then you’ll fire somewhat differently (the lower-threshold unit fires on a superset of what the higher-threshold unit fires on). And cosine sim 0.975 is not the same as 1, so maybe the error term matters...? But idk, my guess is it’s a weird artifact of the autoencoder training process that’s latching onto some weird property of transformers. Being shared across random seeds is by far the weirdest result, which suggests it can’t just be a random artifact of one training run
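The nesting claim above can be sketched numerically. This is a minimal toy, assuming the common SAE encoder convention act = ReLU(w·x + b): two units sharing one encoder vector w but with different biases fire on nested input sets, since each fires iff w·x exceeds -b. (The direction of "lower bias" vs. superset depends on sign convention; the point is strict nesting.)

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared encoder vector, two different biases (toy values, not from the paper)
w = rng.normal(size=16)
w /= np.linalg.norm(w)
b_loose, b_strict = -0.1, -0.6  # more negative bias = higher firing threshold

def fires(x, b):
    # Unit is "active" when its pre-ReLU activation is positive
    return float(w @ x + b) > 0.0

xs = rng.normal(size=(10_000, 16))
strict_set = {i for i, x in enumerate(xs) if fires(x, b_strict)}
loose_set = {i for i, x in enumerate(xs) if fires(x, b_loose)}

# The higher-threshold unit's firing set is strictly inside the other's
print(strict_set <= loose_set, len(strict_set), len(loose_set))
```

So "same direction, different bias" really does give you distinguishable (nested, not identical) firing patterns, which is why identical decoder directions alone don't prove the units are redundant.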