This paper looks relevant. They also show that you can get rid of the FFN by modifying the attention slightly:
https://arxiv.org/abs/1907.01470
Thanks for the link! My read is that they describe an architecture where each attention head has some fixed “persistent memory vectors”, and train a model under that architecture. In contrast, I’m showing how one can convert an existing attention+FFN model to an attention-only model (with only epsilon-scale differences in the output).
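To make the connection concrete, here is a minimal numerical sketch of the observation underlying that paper's framing: a plain FFN layer is already an attention-like lookup over a fixed set of "memory" slots, with ReLU in place of softmax. The variable names are illustrative, and this is the exact-equivalence direction only; the epsilon-scale conversion of a trained attention+FFN model is a separate construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 32  # model width and FFN hidden width (illustrative sizes)

W1 = rng.normal(size=(d_ff, d))  # first FFN weight matrix
W2 = rng.normal(size=(d, d_ff))  # second FFN weight matrix
x = rng.normal(size=d)           # a single token's residual-stream vector

# Standard two-layer FFN: W2 @ relu(W1 @ x)
ffn_out = W2 @ np.maximum(W1 @ x, 0.0)

# The same computation, written as attention over d_ff fixed memory slots:
# keys are the rows of W1, values are the columns of W2, and the softmax
# is replaced by an elementwise ReLU on the scores.
K = W1    # (d_ff, d) persistent keys
V = W2.T  # (d_ff, d) persistent values
scores = K @ x
attn_out = np.maximum(scores, 0.0) @ V

assert np.allclose(ffn_out, attn_out)  # identical up to float rounding
```

The remaining gap between this and real attention is the normalization (softmax vs. ReLU), which is exactly the part the persistent-memory architecture changes at training time.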