Did you use the initialization scheme in our paper where the decoder is initialized to the transpose of the encoder (and then columns unit normalized)? There should not be any dead latents with topk at small scale with this init.
Also, if I understand correctly, leaky topk is similar to the multi-topk method in our paper. I’d be interested in a comparison of the two methods.
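For concreteness, here is a minimal PyTorch sketch of that tied-transpose initialization as I read it: the decoder weight starts as the transpose of the encoder weight, and each decoder column is then rescaled to unit L2 norm. The layer sizes and variable names are placeholders, not anything from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_latents = 512, 4096  # hypothetical sizes

encoder = nn.Linear(d_model, n_latents, bias=True)   # weight: (n_latents, d_model)
decoder = nn.Linear(n_latents, d_model, bias=True)   # weight: (d_model, n_latents)

with torch.no_grad():
    # Tie at init: decoder weight is the transpose of the encoder weight.
    decoder.weight.copy_(encoder.weight.T)
    # Normalize each decoder column (one column per latent) to unit L2 norm.
    decoder.weight.copy_(F.normalize(decoder.weight, dim=0))
    decoder.bias.zero_()
```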
I did not use your initialization scheme, since I was unaware of your paper at the time I was running those experiments. I will definitely try that soon!
Yeah, I can see how leaky topk and multi-topk are doing similar things. I wonder if leaky topk also gives a progressive code past the value of k used in training. That definitely seems worth looking into. Thanks for the suggestions!
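For anyone wanting to run that comparison, here is a rough sketch of a multi-topk-style objective, assuming it amounts to summing TopK reconstruction losses at more than one value of k; the specific multiple and weight below (4k, 1/8) are illustrative stand-ins rather than the paper's exact recipe. A leaky-topk variant could then be compared by swapping in a different activation while keeping the rest of the training loop fixed.

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest pre-activations per example, zero out the rest."""
    vals, idx = torch.topk(pre_acts, k, dim=-1)
    acts = torch.zeros_like(pre_acts)
    acts.scatter_(-1, idx, vals)
    return acts

def multi_topk_loss(x: torch.Tensor, encoder, decoder, k: int) -> torch.Tensor:
    """Sum of TopK reconstruction losses at several sparsity levels (hypothetical weights)."""
    pre_acts = encoder(x)
    loss = 0.0
    for k_i, weight in [(k, 1.0), (4 * k, 1.0 / 8)]:
        recon = decoder(topk_activation(pre_acts, k_i))
        loss = loss + weight * (recon - x).pow(2).sum(dim=-1).mean()
    return loss
```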