I wanted to make a comment here on alternate histories from a computer architecture perspective.
I’m a computer engineer and I’ve written many small control systems that interact with other systems: exactly the kind of ‘society of the mind’ that Builder the Robot used. I have also built robots that can pick up blocks and mimic human gestures, the same basic exercises 45 years later.
So the core of modern computers, which Ted Hoff et al. developed into the 4004 in 1971, is the ALU. I’ve built one in VHDL. The 4004 had only 2,300 transistors and is Turing complete.
Structure-wise, you have bytecode with <opcode, data> pairs, and you keep as much of the program and data as possible on magnetic storage (tape or hard drives). The ALU is a small, reused chunk of silicon that can in theory execute any program (in practice, address space limited the practical size).
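To make the reuse concrete, here is a toy sketch of that structure in C. The opcodes and layout are made up for illustration (this is not the actual 4004 instruction set); the point is that one fetch-decode-execute loop and one ALU path service whatever program and data you load into memory.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy <opcode, data> machine: one ALU, one accumulator, reused for any
   program. Illustrative only; not the real 4004 ISA. */
enum { OP_LOAD, OP_ADD, OP_SUB, OP_STORE, OP_HALT };

int main(void) {
    uint8_t mem[16] = {7, 5, 0};               /* data lives alongside the program */
    struct { uint8_t op, addr; } prog[] = {
        {OP_LOAD, 0}, {OP_ADD, 1}, {OP_STORE, 2}, {OP_HALT, 0},
    };
    uint8_t acc = 0;
    for (size_t pc = 0; ; pc++) {              /* fetch-decode-execute loop */
        switch (prog[pc].op) {
        case OP_LOAD:  acc = mem[prog[pc].addr]; break;
        case OP_ADD:   acc += mem[prog[pc].addr]; break;  /* the ALU step */
        case OP_SUB:   acc -= mem[prog[pc].addr]; break;
        case OP_STORE: mem[prog[pc].addr] = acc; break;
        case OP_HALT:  printf("mem[2] = %d\n", mem[2]); return 0;
        }
    }
}
```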
This is the problem with an alternate history where neural networks work 20 years early: silicon consumption. With very slow clock speeds, and parallel architectures where you need many unnecessary connections (something we’re still stuck with right now, because irregular sparsity is not supported in current hardware), you need probably unaffordable amounts of silicon area. The early computers, and in fact the later ones up until the slowdown of Moore’s law, pretty much all work by minimizing usage of the expensive resource: refined, patterned silicon.
You absolutely could build “special hardware boards” that implemented perceptrons and backprop, but they would also be unavoidably expensive*. Minsky’s small programs consume much less memory, and they are the backbone of control systems from the 1970s to the present (everything except things like autonomous-car perception stacks and similar).
Another way to view it: with the microprocessor plus small serial program design, you repeatedly access the same cached instructions and small data kernel, and operate across a larger amount of data.
With a neural network design, you end up accessing every weight in the network to generate every output**. This is extremely cache-unfriendly, and it means current hardware is often limited by memory bandwidth, which you need in extreme quantities.
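A sketch of why, assuming a single dense layer: computing one output vector reads all IN×OUT weights exactly once, so once the matrix outgrows the cache, the loop is paced by memory bandwidth rather than arithmetic.

```c
#include <stdio.h>

#define IN  256
#define OUT 256

/* Dense layer: every one of IN*OUT weights is read once per inference.
   Memory traffic ~ sizeof(weights) per output vector, so the loop turns
   bandwidth-bound as soon as the weight matrix exceeds the cache. */
static float w[OUT][IN], x[IN], y[OUT];

int main(void) {
    for (int o = 0; o < OUT; o++) {
        float acc = 0.0f;
        for (int i = 0; i < IN; i++)
            acc += w[o][i] * x[i];     /* each weight touched exactly once */
        y[o] = acc;
    }
    printf("weights read: %d (%zu bytes) for one inference\n",
           IN * OUT, (size_t)IN * OUT * sizeof(float));
    return 0;
}
```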
On a topic related to everyone’s favorite AI risk topic: the above choice of computer architecture is the culprit for many of the cybersecurity vulnerabilities we are still stuck with. (The structure of the early-1970s language C is responsible for many vulnerabilities as well.)
*they were. Various early 2000s AI experiments used FPGAs for this purpose
**unless MoE or other bleeding edge techniques are in use
Concretely speaking, are you suggesting that a 2-layer fully connected network trained by backpropagation, with ~100 neurons in each layer (thus ~20,000 weights), would have been uneconomical even in the 1960s, even if they had had backprop?
I am asking this because the great successes in 1990s connectionism, including LeNet digit recognition, NETtalk, and TD-Gammon, were all on that order of magnitude. They seem within reach for the 1960s.

For scale, TD-Gammon cost about 2e13 FLOPs to train, and in 1970, 1 million FLOP/sec cost 1 USD, so with 10,000 USD of hardware, it would take about 1 day to train.
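For concreteness, here is a minimal sketch of such a network in C, shrunk to XOR size so it fits in a comment; at ~100 neurons per layer only the loop bounds and the ~2e4 weight count change. The seed, rate, and hidden size are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Minimal 2-layer fully connected net trained by plain backprop/SGD.
   May need a different seed on an unlucky initialization. */
#define NI 2
#define NH 4

static double frand(void)   { return (double)rand() / RAND_MAX - 0.5; }
static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

int main(void) {
    double w1[NH][NI + 1], w2[NH + 1];            /* last slot = bias */
    const double X[4][NI] = {{0,0},{0,1},{1,0},{1,1}};
    const double T[4] = {0, 1, 1, 0};
    const double lr = 0.5;

    srand(1);
    for (int h = 0; h < NH; h++) {
        for (int i = 0; i <= NI; i++) w1[h][i] = frand();
        w2[h] = frand();
    }
    w2[NH] = frand();

    for (int epoch = 0; epoch < 20000; epoch++)
        for (int s = 0; s < 4; s++) {
            double hid[NH], z = w2[NH];
            for (int h = 0; h < NH; h++) {        /* forward: read every weight */
                double a = w1[h][NI];
                for (int i = 0; i < NI; i++) a += w1[h][i] * X[s][i];
                hid[h] = sigmoid(a);
                z += w2[h] * hid[h];
            }
            double out = sigmoid(z);
            double dout = (out - T[s]) * out * (1.0 - out);
            for (int h = 0; h < NH; h++) {        /* backward: touch every weight again */
                double dhid = dout * w2[h] * hid[h] * (1.0 - hid[h]);
                w2[h] -= lr * dout * hid[h];
                for (int i = 0; i < NI; i++) w1[h][i] -= lr * dhid * X[s][i];
                w1[h][NI] -= lr * dhid;
            }
            w2[NH] -= lr * dout;
        }

    for (int s = 0; s < 4; s++) {
        double z = w2[NH];
        for (int h = 0; h < NH; h++) {
            double a = w1[h][NI];
            for (int i = 0; i < NI; i++) a += w1[h][i] * X[s][i];
            z += w2[h] * sigmoid(a);
        }
        printf("%g XOR %g -> %.2f\n", X[s][0], X[s][1], sigmoid(z));
    }
    return 0;
}
```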
And it’s interesting that you mention magnetic cores. The MINOS II machine built in 1962 by the Stanford Research Institute group had precisely a grid of magnetic-core memory. Couldn’t they have scaled it up and built some extra circuitry to allow backpropagation?
Corroborating the calculation: according to some 1960s literature, magnetic-core logic could go up to 10 kHz. So if we have ~1e4 weights each updated 1e4 times a second, that is 1e8 FLOP/sec right there. TD-Gammon would then take ~2e5 seconds, about 2 days, the same order of magnitude as the previous calculation.
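The arithmetic, spelled out (all inputs are the rough figures above, so treat the output as an order-of-magnitude estimate):

```c
#include <stdio.h>

/* Back-of-envelope check of the estimate above; the inputs are the
   comment's own rough numbers, not measurements. */
int main(void) {
    double weights    = 1e4;          /* ~1e4 weights */
    double update_hz  = 1e4;          /* 10 kHz magnetic-core logic */
    double train_flop = 2e13;         /* rough TD-Gammon training cost */
    double flop_per_s = weights * update_hz;      /* 1e8 FLOP/sec */
    double seconds    = train_flop / flop_per_s;  /* 2e5 s */
    printf("%.0e FLOP/s -> %.0e s (~%.1f days)\n",
           flop_per_s, seconds, seconds / 86400.0);
    return 0;
}
```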
> magnetic cores. The MINOS II machine built in 1962 by the Stanford Research Institute group had precisely a grid of magnetic-core memory
By magnetic memory I was referring to:

[picture of a magnetic storage module]
The core idea is that you can bulk store a reasonably large amount of data (5 MB a module for the picture above, up to hundreds of MB in the 1970s) that gets accessed infrequently.
For example, suppose you wanted to calculate a paycheck for an hourly worker, an early application for this type of computer. Every time they use a time-card reader (a punchcard with their employee ID), you save the recorded time to a data structure on a hard drive.
Then, once every 2 weeks, for each of n employees, you retrieve the data for all the clock-in/clock-out times, retrieve a record with their rate of pay and other meta rules (are they eligible for overtime, etc.), and then a relatively simple program can calculate the payroll record, print it to text, print their paycheck, save the record to a tape drive, and zero out the record for the next payroll cycle.
So you have a relatively large dataset (n employee records with m times each, plus bits for meta rules) and a relatively tiny program, and you only touch each record once.
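A minimal sketch of that shape in C, with a hypothetical record layout (the rates and the 80-hour overtime rule are illustrative): a tiny resident program streaming once over the stored records.

```c
#include <stdio.h>

/* Hypothetical payroll record; stands in for data retrieved from the
   drive once per 2-week cycle. */
#define MAX_PUNCHES 32

struct employee {
    int    id;
    double rate;                   /* hourly rate from the master record */
    int    overtime_eligible;      /* one of the "meta rules" */
    double hours[MAX_PUNCHES];     /* clock-in/out deltas from the drive */
    int    n;
};

static double pay_for(const struct employee *e) {
    double total = 0.0;
    for (int i = 0; i < e->n; i++) total += e->hours[i];
    double paid = total * e->rate;
    if (e->overtime_eligible && total > 80.0)    /* 2-week period */
        paid += (total - 80.0) * 0.5 * e->rate;  /* time-and-a-half */
    return paid;
}

int main(void) {
    struct employee staff[] = {    /* stand-in for "retrieve each record once" */
        {1001, 12.50, 1, {40, 45}, 2},
        {1002, 20.00, 0, {38, 41}, 2},
    };
    for (size_t i = 0; i < sizeof staff / sizeof staff[0]; i++) {
        printf("employee %d: $%.2f\n", staff[i].id, pay_for(&staff[i]));
        staff[i].n = 0;            /* zero out for the next payroll cycle */
    }
    return 0;
}
```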
Neural networks have the issue that any practical one is pretty large, and every weight must be accessed on every inference.
> great successes in 1990s connectionism, including LeNet digit recognition, NETtalk, and TD-Gammon
These are not useful applications. The closest is digit recognition.
From the 1970s to somewhere in the 2000s, OCR was done with classical computer vision. You collect features from each letter, such as lines via Hough transforms and corners via Harris detectors, and you map these features into a relative coordinate space. Then you use some type of classifier, SVMs or similar, to detect the letter.
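As a sketch of the feature-extraction step, here is a toy Hough line transform in C over a hypothetical 8x8 glyph (a real OCR pipeline would add corner features and a trained classifier on top):

```c
#include <math.h>
#include <stdio.h>

/* Toy Hough transform: every inked pixel votes for the (theta, rho)
   lines through it; accumulator peaks are the strokes of the letter.
   The whole kernel is a few hundred bytes of accumulator, far smaller
   than a network's weights. */
#define W 8
#define H 8
#define NTHETA 16
#define NRHO   24                     /* rho shifted by NRHO/2 to stay non-negative */

int main(void) {
    const double PI = 3.14159265358979;
    const char img[H][W + 1] = {      /* a vertical stroke, roughly an 'I' */
        "...X....", "...X....", "...X....", "...X....",
        "...X....", "...X....", "...X....", "...X....",
    };
    int acc[NTHETA][NRHO] = {{0}};

    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            if (img[y][x] != 'X') continue;
            for (int t = 0; t < NTHETA; t++) {    /* vote over angles */
                double theta = t * PI / NTHETA;
                int rho = (int)lround(x * cos(theta) + y * sin(theta));
                acc[t][rho + NRHO / 2]++;
            }
        }

    int bt = 0, br = 0;                           /* strongest line = peak */
    for (int t = 0; t < NTHETA; t++)
        for (int r = 0; r < NRHO; r++)
            if (acc[t][r] > acc[bt][br]) { bt = t; br = r; }
    printf("strongest line: theta=%.0f deg, rho=%d, votes=%d\n",
           bt * 180.0 / NTHETA, br - NRHO / 2, acc[bt][br]);
    return 0;
}
```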
I’m not quite sure of the FLOP requirements for a CV solution, since calculating these transforms involves some matrix operations, but I can say that the kernels to do it are much smaller than a neural network.
You can only detect printed characters reliably this way—so an earlier LeNet would have saved a little bit of money for the post office.
And just to reiterate: as bad as Builder the Robot was at navigation, Minsky was right. Avionics, factory control systems, motor control, automotive throttle-by-wire: I have worked on a variety of these systems, and they all work as a “society of the mind” of small specialized programs. They don’t scale to large, complex robots, but that doesn’t matter; startups driving big robots today have immensely more compute and memory to do it. RT-X is getting closer to a useful capability level, and it’s 50B weights.
The main limitation is that without neural networks, you need really clean perception: for example, resolvers for joint position, or GPS or older systems for avionics. The machine needs the exact coordinates or rotational angle; it can’t use a camera.
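A minimal sketch of one such small specialized program, assuming the clean perception described above (read_resolver() is a hypothetical stand-in for a real resolver-to-digital interface; the gains are illustrative, not tuned for any actual joint):

```c
#include <stdio.h>

/* PID position loop fed by exact joint angles from a resolver. */
static double read_resolver(void) {
    static double angle = 0.0;        /* pretend the joint creeps upward */
    return angle += 0.5;
}

int main(void) {
    const double kp = 2.0, ki = 0.1, kd = 0.05;
    const double dt = 0.001;          /* 1 kHz control loop */
    const double target = 30.0;       /* degrees */
    double integral = 0.0, prev_err = 0.0;

    for (int tick = 0; tick < 5; tick++) {
        double angle = read_resolver();            /* exact angle, no camera */
        double err = target - angle;
        integral += err * dt;
        double deriv = tick ? (err - prev_err) / dt : 0.0;
        double cmd = kp * err + ki * integral + kd * deriv;
        prev_err = err;
        printf("tick %d: angle=%5.1f cmd=%7.2f\n", tick, angle, cmd);
        /* a real system would write cmd to the motor driver here */
    }
    return 0;
}
```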