This paper looks relevant. They also show that you can get rid of the FFN by modifying the attention slightly:
https://arxiv.org/abs/1907.01470
Thanks for the link! My read is that they describe an architecture where each attention head has some fixed “persistent memory vectors”, and train a model under that architecture. In contrast, I’m showing how one can convert an existing attention+FFN model to an attention-only model (with only epsilon-scale differences in the output).
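To make the connection concrete, here is a minimal numerical sketch of the observation underlying that paper's framing: a plain FFN layer is already an attention-like lookup over a fixed set of "memory" slots, with ReLU in place of softmax. The variable names are illustrative, and this is the exact-equivalence direction only; the epsilon-scale conversion of a trained attention+FFN model is a separate construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 32  # model width and FFN hidden width (illustrative sizes)

W1 = rng.normal(size=(d_ff, d))  # first FFN weight matrix
W2 = rng.normal(size=(d, d_ff))  # second FFN weight matrix
x = rng.normal(size=d)           # a single token's residual-stream vector

# Standard two-layer FFN: W2 @ relu(W1 @ x)
ffn_out = W2 @ np.maximum(W1 @ x, 0.0)

# The same computation, written as attention over d_ff fixed memory slots:
# keys are the rows of W1, values are the columns of W2, and the softmax
# is replaced by an elementwise ReLU on the scores.
K = W1    # (d_ff, d) persistent keys
V = W2.T  # (d_ff, d) persistent values
scores = K @ x
attn_out = np.maximum(scores, 0.0) @ V

assert np.allclose(ffn_out, attn_out)  # identical up to float rounding
```

The remaining gap between this and real attention is the normalization (softmax vs. ReLU), which is exactly the part the persistent-memory architecture changes at training time.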