So if I understand you, for (1) you’re proposing a “hard” attention over the image, rather than the “soft” differentiable attention which is typically meant by “attention” for NNs.
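To make sure we mean the same thing by the terms, here is a toy contrast in plain NumPy (shapes and sizes are made up, purely illustrative): soft attention takes a differentiable weighted sum over all locations, while hard attention commits to a single location, which cuts the gradient path to "where to look".

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy setup: 16 image patches, each embedded as an 8-dim feature vector.
patches = np.random.randn(16, 8)
scores = np.random.randn(16)   # attention logits, normally produced by the network

# Soft attention: a differentiable weighted sum over all patches.
# Gradients flow through the softmax weights back to the scores.
weights = softmax(scores)
soft_read = weights @ patches             # shape (8,)

# Hard attention: sample (or argmax) a single patch index.
# The indexing step is not differentiable, so plain backprop
# gives no gradient with respect to where to look.
idx = np.random.choice(len(scores), p=weights)
hard_read = patches[idx]                  # shape (8,)
```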
You might find interesting “Recurrent Models of Visual Attention” by DeepMind (https://arxiv.org/pdf/1406.6247.pdf). They use a hard attention over the image with RL to train where to attend. I found it interesting—there’s been subsequent work using hard attention (I thiiink this is a central paper for the topic, but I could be wrong, and I’m not at all sure what the most interesting recent one is) as well.
That paper is new to me—and yes, related and interesting. I like their use of a ‘glimpse’: more resolution in the centre, less resolution further away.
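My rough reading of their glimpse sensor is something like the sketch below: take a few concentric crops around the fixation point and pool the larger ones back down to a common size, so the centre is seen at full resolution and the periphery only coarsely. (The patch size and number of scales here are my own illustrative choices, not the paper’s exact values.)

```python
import numpy as np

def glimpse(image, cy, cx, base=8, scales=3):
    """Extract `scales` concentric square crops centred at (cy, cx).

    The smallest crop is base x base pixels at full resolution; each further
    scale doubles the side length and is average-pooled back to base x base,
    so detail falls off away from the fixation point.
    """
    out = []
    for s in range(scales):
        size = base * (2 ** s)
        half = size // 2
        # Pad so crops near the border don't fall off the image.
        padded = np.pad(image, half, mode="constant")
        crop = padded[cy:cy + size, cx:cx + size]
        # Downsample by averaging non-overlapping k x k blocks.
        k = 2 ** s
        crop = crop.reshape(base, k, base, k).mean(axis=(1, 3))
        out.append(crop)
    return np.stack(out)  # shape: (scales, base, base)

img = np.random.rand(64, 64)
print(glimpse(img, cy=32, cx=20).shape)  # (3, 8, 8)
```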
About ‘hard’ and ‘soft’: if they mean what I think they do, then yes, the attention is ‘hard’. It forces to zero some weights that in a fully connected network could end up non-zero. That might need some care in training, since a network whose attention is ‘way off’ from where it should be gets no gradient to push it toward better solutions.
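If I sketch it to make sure I follow: once the choice is hard, backprop alone says nothing about where to look, so (as you say) they fall back on RL. Something like this toy REINFORCE loop, with a single categorical ‘location’ variable and a made-up reward, is how I picture the idea—purely illustrative, not the paper’s actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(16)        # policy parameters over 16 candidate locations
target = 5                   # the location that actually contains the object
lr = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(500):
    probs = softmax(logits)
    loc = rng.choice(16, p=probs)           # hard, non-differentiable choice
    reward = 1.0 if loc == target else 0.0  # e.g. was the classification correct
    # REINFORCE / score-function estimator:
    # grad of log pi(loc) w.r.t. logits is one_hot(loc) - probs for a softmax policy.
    grad_logp = -probs
    grad_logp[loc] += 1.0
    logits += lr * reward * grad_logp

print(np.argmax(softmax(logits)))  # converges to 5: the policy learns where to look
```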
Thanks for the link to the paper, and for the prompt to think about to what extent the attention is or is not differentiable.