Ok, so the motivation is to learn templates to do correlation at each image location with. But where would you get the idea from to do the same with the correlation map again? That seems non-obvious to me. Or do you mean biological vision?
Nope, didn’t mean biological vision. Not totally sure I understand your comment, so let me know if I’m rambling.
You can think of lower layers (the ones closer to the input pixels) as “smaller” or “more local,” and higher layers as “bigger,” or “more global,” or “composed of nonlinear combinations of lower-level features.” (EDIT: In fact, this restricted connectivity of neurons is an important insight of CNNs, compared to fully connected NNs.)
So if you want to recognize horizontal lines, the lowest layer of a CNN might have a “short horizontal line” feature that is big when it sees a small, local horizontal line. And of course there is a copy of this feature for every place you could put it in the image, so you can think of its activation as a map of where there are short horizontal lines in your image.
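To make that concrete, here’s a rough toy sketch in NumPy of what such a first-layer “short horizontal line” detector could look like. The tiny image, the kernel values, and the `correlate2d` helper are all made up for illustration; a real CNN would learn its kernels rather than have them hard-coded.

```python
import numpy as np

def correlate2d(image, kernel):
    """Valid cross-correlation: slide `kernel` over `image` and take dot products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Toy 6x8 image containing one long horizontal line on row 2.
image = np.zeros((6, 8))
image[2, 1:7] = 1.0

# "Short horizontal line" feature: a 1x3 template of bright pixels.
short_line = np.ones((1, 3))

# Activation map: large wherever a short horizontal segment sits in the image.
layer1 = correlate2d(image, short_line)
print(layer1)
```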
But if you wanted to recognize longer horizontal lines, you’d need to combine several short-horizontal-line detectors together, with a specific spatial orientation (horizontal!). To do this you’d use a feature detector that looked at the map of where there were short horizontal lines, and found short horizontal lines of short horizontal lines, i.e. longer horizontal lines. And of course you’d need to have a copy of this higher-level feature detector for every place you could put it in the map of where there are short lines, so that if you moved the longer horizontal line around, a different copy of this feature detector would light up—the activation of these copies would form a map of where there were longer horizontal lines in your image.
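Continuing the toy sketch from above, that second layer is just another correlation, this time over the first layer’s activation map. The thresholding step and the reused 1x3 kernel here are illustrative stand-ins for a learned layer plus its nonlinearity, not anything a real network would literally use.

```python
# Keep only full-strength short-line hits (a stand-in for the nonlinearity
# a real CNN would apply between layers).
layer1_active = (layer1 >= 3).astype(float)

# Same 1x3 horizontal template, but applied to the layer-1 activation map:
# "short horizontal runs of short-horizontal-line activations", i.e. longer lines.
long_line = correlate2d(layer1_active, short_line)

# long_line peaks where three adjacent short-line detectors all fired,
# i.e. where the image contains a horizontal line roughly 5+ pixels long.
print(long_line)
```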
If you think about the logistics of this, you’ll find that I’ve been lying to you a little bit, and you might also see where pooling comes from. In order for “short horizontal lines of short horizontal lines” to actually correspond to longer horizontal lines, you need to zoom out in spatial dimensions as you go up layers, i.e. pooling or something similar. You can zoom out without pooling by connecting each higher-level feature detector to a complete set of lower-level feature detectors spread across a larger patch of pixels, but this is both conceptually and computationally more complicated.
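Here’s roughly what that zooming out looks like as max pooling, still continuing the same toy sketch (the 2x2 pool size and the exact pooling rule are just illustrative choices):

```python
def max_pool(activation, size=2):
    """Non-overlapping max pooling: keep the strongest response in each
    size x size block, halving (for size=2) the spatial resolution."""
    h, w = activation.shape
    h, w = h - h % size, w - w % size          # trim to a multiple of `size`
    blocks = activation[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Zoom out between layers: each cell of the pooled map summarizes a 2x2 patch
# of layer-1 responses.
pooled = max_pool(layer1_active)
print(pooled)

# The same 1x3 detector applied after pooling now effectively looks for a
# horizontal line spanning a much wider stretch of the original image.
longer_line = correlate2d(pooled, short_line)
print(longer_line)
```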