Concretely speaking, are you suggesting that a 2-layer fully connected network with ~100 neurons in each layer (thus ~20,000 weights) would have been uneconomical to train by backpropagation even in the 1960s, even if they had had the algorithm?
I am asking this because the great successes in 1990s connectionism, including LeNet digit recognition, NETtalk, and TD-Gammon, were all on that order of magnitude. They seem within reach for the 1960s.
Concretely: TD-Gammon cost about 2e13 FLOPs to train, and in 1970, 1 million FLOP/sec cost about 1 USD, so with 10,000 USD of hardware it would take about a day to train.
And it's interesting that you mentioned magnetic cores. The MINOS II machine built in 1962 by the Stanford Research Institute group had precisely a grid of magnetic core memory. Couldn't they have scaled it up and built some extra circuitry to allow backpropagation?
Corroborating the calculation: according to some 1960s literature, magnetic core logic could run at up to 10 kHz. So if we have ~2e4 weights each updated 1e4 times a second, that is ~2e8 FLOP/sec right there, and TD-Gammon would take ~1e5 seconds, about a day, the same order of magnitude as the previous calculation.
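Writing the arithmetic out (all of these figures are the rough assumptions from this thread, not measurements):

```python
# Back-of-the-envelope check of the magnetic-core estimate, using the rough
# figures quoted in this thread (assumptions, not measurements).
weights = 2e4          # ~20,000 weights, i.e. two ~100x100 fully connected layers
update_rate = 1e4      # 10 kHz core logic -> each weight updated 1e4 times a second
throughput = weights * update_rate      # ~2e8 weight-updates/sec ~ 2e8 FLOP/sec
train_flops = 2e13                      # TD-Gammon training estimate from above
seconds = train_flops / throughput      # ~1e5 s
print(f"{throughput:.0e} FLOP/s, {seconds:.0e} s, ~{seconds / 86400:.1f} days")
# -> 2e+08 FLOP/s, 1e+05 s, ~1.2 days
```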
magnetic cores. The MINOS II machine built in 1962 by the Stanford Research Institute group had precisely a grid of magnetic core memory
By magnetic memory I was referring to:
The core idea is that you can bulk-store a reasonably large amount of data (5 MB a module for the picture above, up to hundreds of MB in the 1970s) that gets accessed infrequently.
For example, suppose you wanted to calculate a paycheck for an hourly worker, an early application for this type of computer. Every time they use a time-card reader (a punch card with their employee ID), you save the recorded time to a data structure on a hard drive.
Then, once every 2 weeks, for each of the n employees, you retrieve all of their clock-in/clock-out times, retrieve a record with their rate of pay and other meta rules (are they eligible for overtime, etc.), and a relatively simple program can then calculate the payroll record, print it to text, print their paycheck, save the record to a tape drive, and zero out the record for the next payroll cycle.
So you have a relatively large dataset, n employee records with m timestamps each plus a few bits for meta rules, and a relatively tiny program, and you only touch each record once per cycle.
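As a sketch, the whole job is one sequential pass over the records; the record layout and pay rules here are purely illustrative, not any particular historical system:

```python
from dataclasses import dataclass, field

@dataclass
class EmployeeRecord:
    employee_id: int
    hourly_rate: float
    overtime_eligible: bool
    punches: list = field(default_factory=list)   # (clock_in, clock_out) pairs, in hours

def run_payroll(records, regular_hours=80.0, overtime_multiplier=1.5):
    for rec in records:                           # one sequential pass per pay cycle
        hours = sum(out - in_ for in_, out in rec.punches)
        regular = min(hours, regular_hours)
        overtime = max(0.0, hours - regular_hours) if rec.overtime_eligible else 0.0
        pay = regular * rec.hourly_rate + overtime * rec.hourly_rate * overtime_multiplier
        print(f"employee {rec.employee_id}: {hours:.1f} h, pay {pay:.2f}")  # stand-in for the printed paycheck
        rec.punches.clear()                       # zero out the record for the next cycle

run_payroll([EmployeeRecord(1, 5.25, True, [(0.0, 8.0), (24.0, 33.5)])])
```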
Neural networks have the opposite problem: any practical one is pretty large, and every weight must be accessed on every inference.
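For contrast, here is a toy two-layer network at the ~100-neuron scale in question; every weight gets read on every single inference, so the whole thing has to sit in fast memory (the sizes are just the figures from this thread):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((100, 100))    # input layer -> first hidden layer
W2 = rng.standard_normal((100, 100))    # first hidden layer -> second layer

def forward(x):
    h = np.tanh(W1 @ x)                 # reads all 10,000 W1 weights
    return np.tanh(W2 @ h)              # reads all 10,000 W2 weights

y = forward(rng.standard_normal(100))   # one inference = ~2e4 multiply-accumulates
print(W1.size + W2.size, "weights touched per inference")
```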
great successes in 1990s connectionism, including LeNet digit recognition, NETtalk, and TD-Gammon
These are not useful applications. The closest is digit recognition.
From the 1970s until somewhere in the 2000s, OCR was done with classical computer vision. You extract features from each letter, such as lines via Hough transforms or corners via Harris detectors, and you map these features into a relative coordinate space. Then you use some type of classifier, an SVM or similar, to identify the letter.
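Roughly like this; a modern-toolkit sketch of that kind of pipeline, with made-up feature choices and thresholds rather than any specific production OCR system:

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def glyph_features(img):
    # img: 32x32 uint8 grayscale image of a single segmented character
    edges = cv2.Canny(img, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 15)          # (rho, theta) line candidates
    corners = cv2.cornerHarris(np.float32(img), 2, 3, 0.04)    # corner response map
    n_lines = 0 if lines is None else len(lines)
    rho, theta = lines[0][0] if lines is not None else (0.0, 0.0)
    # Coarse 4x4 grid of corner responses stands in for the "relative coordinate space"
    grid = cv2.resize(corners, (4, 4), interpolation=cv2.INTER_AREA).ravel()
    return np.concatenate([[n_lines, rho, theta], grid])

# Hypothetical training data: train_imgs is a list of glyph images, train_labels their characters
# clf = SVC(kernel="rbf")
# clf.fit([glyph_features(im) for im in train_imgs], train_labels)
# predicted = clf.predict([glyph_features(test_img)])
```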
I'm not quite sure of the FLOP requirements for a CV solution, since calculating these transforms is a handful of matrix operations, but I can say that the kernels involved are much smaller than a neural network.
You can only detect printed characters reliably this way, so an earlier LeNet would have saved the post office a little bit of money.
And just to reiterate: as bad as Builder the Robot was at navigation, Minsky was right. Avionics, factory control systems, motor control, automotive throttle-by-wire: I have worked on a variety of these systems, and they all work as a "society of mind" of small specialized programs. That approach doesn't scale to large, complex robots, but it doesn't matter; startups driving big robots today have immensely more compute and memory to do it with. RT-X is getting closer to a useful capability level, and it's ~50B weights.
The main limitation is that without neural networks you need really clean perception: resolvers for joint position, for example, or GPS or older systems for avionics. The machine needs the exact coordinates or rotation angle; it can't use a camera.
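A toy version of one such specialized program, assuming it gets an exact joint angle (the gains and the simulated joint here are made up):

```python
KP = 4.0          # proportional gain (1/s), illustrative
MAX_RATE = 1.5    # actuator rate limit (rad/s), illustrative
DT = 0.01         # control period (s)

def run(target, angle=0.0, steps=500):
    for _ in range(steps):
        error = target - angle                             # exact angle, e.g. from a resolver
        rate = max(-MAX_RATE, min(MAX_RATE, KP * error))   # simple P control, clamped
        angle += rate * DT                                 # stand-in for the real joint moving
    return angle

print(run(target=1.0))   # settles near 1.0 rad; no camera, no learned model needed
```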