How much compute does this take to beat SOTA on PI-MNIST? How does that amount of compute compare to the amount used by the previous SOTA systems?
Have you tried it on other tests besides PI-MNIST? What were the results?
I think that is the wrong question at this point.[1] When a new paradigm is proposed, the question can’t be “Is it faster?” That comes later, when things get optimized. The question is: does it show new ways of thinking about learning? Or maybe: does it allow for different optimizations?

[1] UPDATE: I should have phrased this as:
I’m interested in sniffing out potential new paradigms and this might be the beginnings of one even if it isn’t already faster than the current mainstream alternatives.
I disagree. The other questions are cool too, but only the “Is it faster?” question has the potential to make me drop everything I’m doing and pay attention.
We all have different strategies we follow. Arguments that invalidate one don’t invalidate others.
Yes. I didn’t say your questions were wrong; you said my question was wrong.
Fair. Would you have agreed if I had said something like: “This question has the risk of leading the discussion away from the core insight or innovation of this post.”
I think so? I would definitely have agreed if you had said something like: “I’m interested in sniffing out potential new paradigms and this might be the beginnings of one even if it isn’t already faster than the current mainstream alternatives.”
Thx! Amended.
I don’t think PI-MNIST SOTA is really a thing. The OP even links to the original dropout paper from 2014, which shows this. MNIST SOTA is much less of a thing than it used to be but that’s at 99.9%+, not 98.9%.
That is a question of low philosophical value, but of the highest practical importance.
At line 3,000,000 with the 98.9% setup, the full log contains these two pieces of information:
‘sp: 1640 734 548’ and ‘nbupd: 7940.54 6894.20’ (and the test accuracy is exactly 98.9%).
This means the average spike count per sample is 1640 for the IS-group, 734 for the highest ISNOT-group, and 548 for each of the others. There are 7940.54 weight updates per IS-learn and 6894.20 per ISNOT-learn. With the coding scheme used, the average number of inputs per sample over the four cycles is 2162 (the total number of activated pixels in the input matrix) for 784 pixels. There are 7920 neurons per group, each with 10 connections (so each sees 10⁄784 of the pixel matrix), for a total of 79,200 neurons.
From those numbers:
The average number of integer additions done across all neurons when a sample is presented is 79200 * 2162 * 10⁄784 ≈ 2,184,061 integer additions in total.
For the spike counts: 1640 + 734 + 8*548 = 6758 increments (counting the spikes).
When learning, there are 7940 weight updates for each IS-learn and 6894 for each ISNOT-learn. These are integer additions, so on average (7940 + 9*6894) / 10 ≈ 7000 integer additions.
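The tallies above can be sanity-checked with a few lines of arithmetic. This is just a sketch reproducing the calculation from the quoted log figures; the constant names are mine, not from the original system.

```python
# Sanity-check of the per-sample integer-operation counts, using the
# figures quoted from the log ('sp: 1640 734 548', 'nbupd: 7940.54 6894.20').

TOTAL_NEURONS = 79200      # 10 groups of 7920 neurons
INPUTS_PER_SAMPLE = 2162   # activated pixels over the four cycles
PIXELS = 784
CONNECTIONS = 10           # connections per neuron (10/784 of the pixel matrix)

# Integer additions when a sample is presented:
adds_per_sample = TOTAL_NEURONS * INPUTS_PER_SAMPLE * CONNECTIONS / PIXELS
print(round(adds_per_sample))  # 2184061

# Spike-count increments: IS group + highest ISNOT group + 8 remaining groups
increments = 1640 + 734 + 8 * 548
print(increments)  # 6758

# Weight updates per learning step, averaged over 1 IS-learn + 9 ISNOT-learns
avg_updates = (7940 + 9 * 6894) / 10
print(round(avg_updates))  # 6999, i.e. ~7000
```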
That is to be compared with a network of 3 fully connected layers of, say, 800 units each (to make up for the difference between 98.90% and 98.94%).
That would take at least 800*800*2 + 800*784 = 1,907,200 floating-point multiplications, plus whatever is used for Max-norm, ReLU, etc., which I am not qualified to evaluate but which might roughly double that.
And the same again for each update (a low estimate).
Even with recent work on sparse updates that reduces this by 93%, that is still more than 133,000 floating-point multiplications (against 7000 integer additions).
I have managed to get over 98.5% with 2000 neurons (20,000 connections). I would like to know whether BP/SGD can perform at that level with such a small number of parameters (that would be one fully connected layer of about 25 units). And, as I said in the roadmap, that is what will matter for full real systems.
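The parameter-count equivalence claimed here is easy to verify, assuming “parameters” means one weight per connection on both sides (my assumption for this sketch):

```python
# 2000 neurons with 10 connections each vs. one dense layer of 25 units on 784 inputs
spiking_params = 2000 * 10   # 20000 connections
dense_params = 784 * 25      # 19600 weights -- roughly the same budget
print(spiking_params, dense_params)  # 20000 19600
```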
That is the basic building block, the 1x1x1 Lego brick. A 1.5/1.1 = 36% improvement in error rate for 40 times the resources is useless in practice.
And that misses the real point laid out in the Roadmap: this system CAN and MUST be implemented in analog (hybrid until we get practical memristors), whereas BP/SGD CANNOT.
There is, at least, another order of magnitude in efficiency to be gained there.
There is a lot of effort being invested in industry right now to implement AI at the IC level. Now is the time.