Generalising CNNs

TL;DR: The invention of convolutional neural networks (CNNs) for image and audio processing was a key landmark in machine learning.
This post is for people who already know what CNNs are and are interested in riffing on, and extending, what is (perhaps) the core reason that CNNs learn faster. Probing this technology question is one ‘sub goal’ in asking where our AI knowledge is heading, and how fast; that matters, in turn, because we want it to progress in a good direction.
Sub Goal
Q: Can the reduction in number of parameters that a CNN introduces be achieved in a more general way?
A: Yes. Here are sketches of two ways:
1) Saccades. Train one network (layer) on attention: train it to learn which local blocks of the image to give attention to. Train the second part of the network on those chosen ‘local blocks’ in conjunction with the coordinates of their locations.
The number of blocks that have large CNN kernels applied to them is much reduced, and those blocks are the ones that matter.
2) Parameter Compression. Give each layer of a neural network more (potential) connections than you expect will actually end up being used. After training for a few cycles, compress the parameter values with a lossy algorithm, always choosing the compression that scores best on some weighting of size and quality. Decompress, and repeat this process until you have worked through the training set.
The number of bits used to represent the parameters is kept low, which helps guard against overfitting.
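To make the compress-train loop concrete, here is a minimal sketch in Python/PyTorch. Everything in it is a placeholder rather than a worked-out method: the uniform quantiser stands in for ‘a lossy algorithm’, the candidate bit widths are arbitrary, and the score weights a user-supplied validation loss against parameter size.

```python
import copy
import torch
import torch.nn as nn

def lossy_compress(weights: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Uniform quantisation to 2**n_bits levels; a stand-in for any lossy codec."""
    lo, hi = weights.min(), weights.max()
    levels = 2 ** n_bits - 1
    scaled = (weights - lo) / (hi - lo + 1e-12) * levels
    return torch.round(scaled) / levels * (hi - lo) + lo  # the "decompressed" values

def compression_round(model: nn.Module, val_loss_fn, lam: float = 1e-6) -> None:
    """Pick the bit width whose size/quality score is best, then overwrite the
    model's weights with the decompressed values and carry on training."""
    best_score, best_state = None, None
    for n_bits in (2, 3, 4, 6, 8):
        candidate = copy.deepcopy(model)
        n_params = 0
        with torch.no_grad():
            for p in candidate.parameters():
                p.copy_(lossy_compress(p, n_bits))
                n_params += p.numel()
        score = val_loss_fn(candidate) + lam * n_bits * n_params  # quality plus a size penalty
        if best_score is None or score < best_score:
            best_score, best_state = score, candidate.state_dict()
    model.load_state_dict(best_state)
```

Training would then alternate: a few cycles of ordinary gradient descent, one call to compression_round, and so on until the training set has been worked through.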
Dialog
[Doubter] This all sounds very hand-wavy. How exactly would you train a saccadic network on the right movements?
[Optimist] One stepping stone, before you get to a true saccadic network whose locus of attention follows a temporal trajectory, is to train a shallow network to classify where to give attention. That stepping stone outputs a weighting for how much attention each location deserves. To be more concrete: it works on a down-sampled image and outputs 0 for no attention, 1 for convolution with a 3x3 kernel, and 2 for convolution with a 5x5 kernel.
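Here is a minimal sketch of that stepping stone, assuming PyTorch and a fixed 4x down-sampling factor (both arbitrary choices). It computes both convolutions everywhere and then masks the results, so it only illustrates the data flow; an implementation that actually saved work would gather and process only the attended blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    """Shallow net over a down-sampled image; per cell, logits for 0 (skip), 1 (3x3), 2 (5x5)."""
    def __init__(self, in_ch: int = 3, down: int = 4):
        super().__init__()
        self.down = down
        self.conv = nn.Conv2d(in_ch, 3, kernel_size=3, padding=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        small = F.avg_pool2d(image, self.down)   # cheap, down-sampled view of the image
        return self.conv(small)                  # (B, 3, H/down, W/down)

class SaccadicStage(nn.Module):
    """Apply a 3x3 or a 5x5 convolution only where the attention head asked for one."""
    def __init__(self, in_ch: int = 3, out_ch: int = 8):
        super().__init__()
        self.k3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.k5 = nn.Conv2d(in_ch, out_ch, 5, padding=2)

    def forward(self, image: torch.Tensor, attention_logits: torch.Tensor) -> torch.Tensor:
        level = attention_logits.argmax(dim=1, keepdim=True).float()        # 0, 1 or 2 per cell
        mask = F.interpolate(level, size=image.shape[-2:], mode="nearest")  # back to full resolution
        out3 = self.k3(image) * (mask == 1)      # 3x3 kernel only where level == 1
        out5 = self.k5(image) * (mask == 2)      # 5x5 kernel only where level == 2
        return out3 + out5                       # level 0 contributes nothing at all
```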
[Doubter] You still haven’t said how you would do that attention training.
[Optimist] You could reward the network for robustness to corruption of the image, and reward it for zeroes in the attention layers.
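One possible way to write that down as a loss (the names, weightings and Gaussian corruption here are all placeholder choices) is to combine the ordinary task term, a penalty on how much the output changes when the input is corrupted, and a sparsity term that rewards ‘zero attention’ cells:

```python
import torch
import torch.nn.functional as F

def attention_training_loss(classifier, attention_head, image, label,
                            noise_std: float = 0.1, sparsity_weight: float = 0.01):
    # classifier is assumed to take (image, attention_logits) and return class
    # logits, as in the saccadic sketch above.
    attn_logits = attention_head(image)
    clean = classifier(image, attn_logits)
    noisy = classifier(image + noise_std * torch.randn_like(image), attn_logits)
    task = F.cross_entropy(clean, label)                    # still has to classify correctly
    robustness = F.mse_loss(noisy, clean)                   # reward robustness to corruption
    attn_mass = F.softmax(attn_logits, dim=1)[:, 1:].sum(dim=1).mean()
    return task + robustness + sparsity_weight * attn_mass  # reward zeroes in the attention map
```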
[Doubter] That’s not clear, and I think there is a Catch-22: you need to have analysed the image already in order to decide where to give it attention.
[Optimist] …but not analysed it in full detail. Use only a few down-sampled layers to decide where to give attention. You save a ton of CPU by only giving more attention where it is needed.
[Doubter] I really doubt that. You will pay for that saving many times over through the less regular pattern of ‘attention’ and the more complex code. It will be really hard to use a GPU to accelerate it as well as is already done with standard CNNs. Besides, even a 16x reduction in total workload (and I actually doubt there would be any reduction at all) is not that significant. What actually matters is the quality of the end result.
[Optimist] We shouldn’t be worrying about the GPU. That’s ‘premature optimisation’. You’re artificially constraining your thinking to the hardware we use right now.
[Doubter] Nevertheless, the GPU is the hardware we have right now, and we want practical systems. An alternative to CNNs that uses a hybrid CPU/GPU approach at least has to come close in speed to current CNNs on GPU, and have some other key advantage.
[Optimist] Explainability in a saccadic CNN is better, since you have the explicit weightings for attention. For any output, you can show where the attention is.
[Doubter] But that is not new. We can already show where attention is by looking at which weights mattered in a classification. See, for example, the way we learned that ‘hands’ were important in detecting dumbbells, or that snow was important in differentiating wolves from dogs.
[Optimist] Right, and those insights into how CNNs classify were really valuable landmarks, weren’t they? Now we would have a more direct way to get at that, because we can go straight to the attention weights, and we can explore better strategies for setting those weights.
[Doubter] You still haven’t explained exactly how the attention layers would be constructed, nor have you explained those later ‘better strategies’, nor how you would progress to temporal attention strategies. I doubt the basic idea would do more than a slightly deeper CNN would. Until I see an actual working example, I’m unconvinced. Can we move on to ‘parameter compression’?
[Optimist] Sure.
-----
[Doubter] So what I am struggling with is that you are throwing away data after a little training. Why ‘lossy compression’ and not ‘lossless compression’?
[Optimist] That’s part of the point of it. We’re trying to reward a low-bit-count description of the weights.
[Doubter] Hold on a moment. You’re talking more like a proponent of evolutionary algorithms than of neural networks. You can’t ‘back propagate’ a reward for a low entropy solution back up the net. All you can do is choose one such parameter set over another.
[Optimist] Exactly. Neural networks are in fact just a particular, rather constrained, case of evolutionary algorithm. I’d contend there is an advantage in exploring new ways of reducing their degrees of freedom. CNNs do reduce the degrees of freedom, but not in a very general way. We need to add something like compression of parameters if we want low degrees of freedom with more generality.
[Doubter] In CNNs that lack of generality is an advantage. Your approach could encode a network with a ridiculously large number of useless non-zero weights—whilst still using very few bits. That won’t work. That would take way longer to compute one iteration. It would be as slow as pitch drops dripping.
[Optimist] Right. So some attention must be paid to exactly what the lossy compression algorithm is. Just as JPEG throws away low-weight coefficients, this compression algorithm could too.
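As a purely illustrative example of such a codec (with the keep fraction and bit width as the knobs of the size/quality trade-off): keep only the largest-magnitude fraction of the weights and coarsely quantise the survivors. Something like this could slot into the compress-train loop sketched earlier.

```python
import torch

def sparsifying_compress(weights: torch.Tensor, keep_fraction: float = 0.1,
                         n_bits: int = 4) -> torch.Tensor:
    """Zero all but the largest-magnitude weights, then quantise what is left."""
    flat = weights.flatten().clone()
    k = max(1, int(keep_fraction * flat.numel()))
    cutoff = flat.abs().topk(k).values[-1]     # magnitude of the k-th largest weight
    flat[flat.abs() < cutoff] = 0.0            # the analogue of JPEG dropping small coefficients
    scale = flat.abs().max() / (2 ** (n_bits - 1) - 1)
    if scale == 0:
        return flat.reshape(weights.shape)
    return (torch.round(flat / scale) * scale).reshape(weights.shape)  # decompressed values
```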
[Doubter] So I have a couple of comments here. You have not worked out the details, right? It also doesn’t sound like this is bio-inspired, which was at least a saving grace of the saccadic idea.
[Optimist] Well, the compression idea wasn’t bio-inspired originally, but later I got to thinking about how genes can create many ‘similar patterns’ of connections locally. That could produce CNN-type connections, but genes can also repeat similar patterns with long-range connections. For example, genes could learn the ideal density of long-range connections relative to short-range connections. That connection plan gets repeated in many places whilst being encoded compactly. In that sense genes are a compression code.
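As a toy illustration of that last point (all names and the two-band distance split are chosen just for the sketch): a ‘genome’ that records only two densities, one for short-range and one for long-range connections, can be decoded into a sparse connection mask that is stamped out identically across many blocks of a network. An evolutionary outer loop could then search over just those two numbers, which is the sense in which the genome acts as a compression code.

```python
import torch

def decode_genome(genome: dict, n_units: int = 64, n_blocks: int = 8,
                  local_radius: int = 4, seed: int = 0) -> torch.Tensor:
    """genome = {'short_density': p, 'long_density': q} -> a repeated sparse connection mask."""
    gen = torch.Generator().manual_seed(seed)
    idx = torch.arange(n_units)
    distance = (idx[:, None] - idx[None, :]).abs()
    prob = torch.where(distance <= local_radius,
                       torch.tensor(genome["short_density"]),
                       torch.tensor(genome["long_density"]))
    block = (torch.rand(n_units, n_units, generator=gen) < prob).float()
    # two numbers in the genome, yet the same connection plan is repeated in every block
    return torch.block_diag(*([block] * n_blocks))
```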
[Doubter] So you are mixing genetic algorithms and neural networks? That sounds like a recipe for more parameters.
[Optimist] …a recipe for new ways of reducing the number of parameters.
[Doubter] I think I see a pattern here, in that both ideas include CNNs as a special case. With saccadic networks the secret sauce is the not-too-clear way you would program the ‘attention’ function. With parameter compression the secret sauce is the choice of lossy compression function. If you ‘got funded’ to do some demo coding, you could keep naive investors happy for a long while with networks that were actually no better than existing CNNs, plus plenty of promises of more to come with more funding. But the ‘more to come later’ would never come. Your deep problem is that the ‘secret sauce’ is more aspiration than anything actually demonstrable.
[Optimist] I think that’s a little unfair. I am not claiming these approaches are implemented, demonstrable improvements, and I am not claiming that I know exactly how to get the details of these two ideas right quickly. You are also losing sight of the overall goal, which is to advance AI as a positive transformative force.
[Doubter] Hmm. I see only a not-too-convincing claim of being able to increase the power of machine learning and an attempt to burnish your ego and your reputation. Where is the focus on positive transformative force?
[Optimist] Breaking the mould on how to think about machine learning is a pretty important subgoal in progressing thought on AI, don’t you think? “Less Wrong” is the best possible place on the internet for engaging in discussion of ethical progression of AI. If this ‘subgoal’ post does not gather any useful feedback at all, then I’ll have to agree with you that my post is not helping progress the possible positive transformative aspects of AI—and try again with another iteration and different post, until I find what works.