That paper is new to me, and yes, it is related and interesting. I like their use of a ‘glimpse’: more resolution in the centre, less resolution further away.
About ‘hard’ and ‘soft’: if ‘hard’ and ‘soft’ mean what I think they do, then yes, the attention is ‘hard’. It forces some weights to zero that in a fully connected network could end up non-zero. That might require some care during training, since a network whose attention is ‘way off’ from where it should be gets no gradient to push it toward better solutions.
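To make the ‘no gradient’ point concrete, here is a minimal sketch (assuming PyTorch; the scores, values, and variable names are just made up for illustration) contrasting the two. With soft weights every location still receives some gradient even when the focus is in the wrong place; with a hard zero-one mask, the locations the network is not looking at get exactly zero gradient.

```python
import torch

# Features at three locations, and scores saying where the attention currently points.
# (Illustrative values only.)
values = torch.tensor([1.0, 4.0, 9.0], requires_grad=True)
scores = torch.tensor([2.0, 0.5, -1.0])

# Soft attention: every location keeps a small non-zero weight,
# so every location still gets some gradient signal.
soft_w = torch.softmax(scores, dim=0)
(soft_w * values).sum().backward()
print("soft gradients on values:", values.grad)   # all entries non-zero

values.grad = None  # reset before the second example

# Hard attention: weights away from the chosen location are forced to exactly zero,
# so locations the network is not attending to receive no gradient at all.
hard_w = torch.zeros_like(scores)
hard_w[torch.argmax(scores)] = 1.0   # the argmax choice itself is not differentiable
(hard_w * values).sum().backward()
print("hard gradients on values:", values.grad)   # zero everywhere except the attended slot
```

So if the hard attention starts out looking in the wrong place, nothing in the backward pass tells the ignored locations (or the mechanism that chose them) how to improve, which is the training difficulty I had in mind.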
Thanks for the link to the paper, and for the suggestion to think about the extent to which the attention is or is not differentiable.