Logical or Connectionist AI?
Previously in series: The Nature of Logic
People who don’t work in AI, who hear that I work in AI, often ask me: “Do you build neural networks or expert systems?” This is said in much the same tones as “Are you a good witch or a bad witch?”
Now that’s what I call successful marketing.
Yesterday I covered what I see when I look at “logic” as an AI technique. I see something with a particular shape, a particular power, and a well-defined domain of useful application where cognition is concerned. Logic is good for leaping from crisp real-world events to compact general laws, and then verifying that a given manipulation of the laws preserves truth. It isn’t even remotely close to the whole, or the center, of a mathematical outlook on cognition.
But for a long time, years and years, there was a tremendous focus in Artificial Intelligence on what I call “suggestively named LISP tokens”—a misuse of logic to try to handle cases like “Socrates is human, all humans are mortal, therefore Socrates is mortal”. For many researchers, this one small element of math was indeed their universe.
And then along came the amazing revolution, the new AI, namely connectionism.
In the beginning (1957) was Rosenblatt’s Perceptron. It was, I believe, billed as being inspired by the brain’s biological neurons. The Perceptron had exactly two layers: a set of input units and a single binary output unit. You multiplied the inputs by the weightings on those units, added up the results, and took the sign: that was the classification. To learn from the training data, you checked the current classification on an input, and if it was wrong, you dropped a delta on all the weights to nudge the classification in the right direction.
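As a rough sketch of that learning rule (not Rosenblatt's original formulation): the bias term, the choice of +1/-1 labels, the step size, and the names `predict` and `train_perceptron` are all assumptions made for illustration.

```python
def predict(weights, bias, x):
    """Multiply inputs by weights, add them up, and take the sign (+1 or -1)."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total >= 0 else -1


def train_perceptron(data, delta=1.0, max_epochs=100):
    """data is a list of (inputs, label) pairs, with label in {+1, -1}.

    The bias is an assumed extra weight on a constant input of 1.
    """
    n_inputs = len(data[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in data:
            if predict(weights, bias, x) != label:
                # Wrong classification: nudge every weight toward the right side.
                weights = [w + delta * label * xi for w, xi in zip(weights, x)]
                bias += delta * label
                mistakes += 1
        if mistakes == 0:
            break  # every training point is classified correctly
    return weights, bias


# AND is linearly separable; XOR is not.
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
XOR_DATA = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
```

Training on `AND_DATA` settles on weights that classify all four points. Training on `XOR_DATA` never does, however many epochs you allow.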
The Perceptron could only learn to deal with training data that was linearly separable—points in a hyperspace that could be cleanly separated by a hyperplane.
And that was all that this amazing algorithm, “inspired by the brain”, could do.
In 1969, Marvin Minsky and Seymour Papert pointed out that Perceptrons couldn’t learn the XOR function because it wasn’t linearly separable. This killed off research in neural networks for the next ten years.
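A short sketch of why XOR defeats a single linear unit. Write the output as the sign of $w_1 x_1 + w_2 x_2 + b$, and say that a sum of at least zero means "1". The bias $b$ and that threshold convention are assumptions added here; the description above does not specify them. The four XOR cases then demand:

```latex
\begin{align*}
(0,0) \mapsto 0 &: \quad b < 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b < 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b \ge 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b \ge 0
\end{align*}
% Adding the first two lines: w_1 + w_2 + 2b < 0.
% Adding the last two lines:  w_1 + w_2 + 2b \ge 0.
```

The two sums contradict each other, so no choice of weights satisfies all four cases at once.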
Now, you might think to yourself: “Hey, what if you had more than two layers in a neural network? Maybe then it could learn the XOR function?”
Well, but if you know a bit of linear algebra, you’ll realize that if the units in your neural network have outputs that are linear functions of input, then any number of hidden layers is going to behave the same way as a single layer—you’ll only be able to learn functions that are linearly separable.
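To make the linear-algebra point concrete, here is a minimal sketch with two made-up weight matrices (`W1`, `W2`, and the helper names are hypothetical, and biases are left out for brevity). Two linear layers applied in sequence give the same answer as one layer whose weights are the product of the two matrices.

```python
def matvec(matrix, vector):
    """Apply one purely linear layer: each output is a weighted sum of inputs."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


# Illustrative weights only: 2 inputs -> 3 hidden units -> 2 outputs.
W1 = [[1, -2], [0, 3], [2, 1]]
W2 = [[1, 0, -1], [2, 1, 0]]

# The single layer that behaves identically to the two-layer stack.
COMBINED = matmul(W2, W1)


def two_layer(x):
    return matvec(W2, matvec(W1, x))


def one_layer(x):
    return matvec(COMBINED, x)
```

For any input `x`, `two_layer(x)` and `one_layer(x)` agree, so the hidden layer buys nothing until a nonlinearity sits between the layers.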
Okay, so what if you had hidden layers and the outputs weren’t linear functions of the input?
But you see—no one had any idea how to train a neural network like that. Cuz, like, then this weight would affect that output and that other output too, nonlinearly, so how were you supposed to figure out how to nudge the weights in the right direction?
Just make random changes to the network and see if it did any better? You may be underestimating how much computing power it takes to do things the truly stupid way. It wasn’t a popular line of research.
Then along came this brilliant idea, called “backpropagation”:
You handed the network a training input. The network classified it incorrectly. So you took the partial derivative of the output error (in layer N) with respect to each of the individual nodes in the preceding layer (N − 1). Then you could calculate the partial derivative of the output error with respect to any single weight or bias in the layer N − 1. And you could also go ahead and calculate the partial derivative of the output error with respect to each node in the layer N − 2. So you did layer N − 2, and then N − 3, and so on back to the input layer. (Though backprop nets usually had a grand total of 3 layers.) Then you just nudged the whole network a delta: that is, you moved each weight or bias by delta times the partial derivative of the output error with respect to it, in the direction that reduced the error.
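A minimal sketch of that procedure for a three-layer net, under assumptions the text does not fix: sigmoid units, squared error, one output unit, and random starting weights. All names here (`make_net`, `backprop`, `nudge`, and so on) are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_net(n_inputs, n_hidden, rng):
    """Random starting weights in [-1, 1]; the layout is an illustrative choice."""
    return {
        "w_hidden": [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "w_out": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b_out": rng.uniform(-1, 1),
    }


def forward(net, x):
    """Return the hidden activations and the single output for input x."""
    hidden = [sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    output = sigmoid(net["b_out"]
                     + sum(w * h for w, h in zip(net["w_out"], hidden)))
    return hidden, output


def error(net, x, target):
    return 0.5 * (forward(net, x)[1] - target) ** 2


def backprop(net, x, target):
    """Partial derivative of the error with respect to every weight and bias."""
    hidden, output = forward(net, x)
    # Output layer: derivative of the error w.r.t. the output unit's summed input.
    d_out = (output - target) * output * (1.0 - output)
    # Hidden layer: carry that derivative back through each outgoing weight.
    d_hidden = [d_out * w * h * (1.0 - h)
                for w, h in zip(net["w_out"], hidden)]
    return {
        "w_out": [d_out * h for h in hidden],
        "b_out": d_out,
        "w_hidden": [[d * xi for xi in x] for d in d_hidden],
        "b_hidden": d_hidden,
    }


def nudge(net, grads, delta):
    """Move every parameter a small step down the slope of the error."""
    net["w_out"] = [w - delta * g for w, g in zip(net["w_out"], grads["w_out"])]
    net["b_out"] -= delta * grads["b_out"]
    net["w_hidden"] = [[w - delta * g for w, g in zip(ws, gs)]
                       for ws, gs in zip(net["w_hidden"], grads["w_hidden"])]
    net["b_hidden"] = [b - delta * g
                       for b, g in zip(net["b_hidden"], grads["b_hidden"])]


XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def train(net, data, delta=0.5, epochs=5000):
    for _ in range(epochs):
        for x, target in data:
            nudge(net, backprop(net, x, target), delta)


if __name__ == "__main__":
    net = make_net(2, 3, random.Random(0))
    train(net, XOR)
    for x, target in XOR:
        print(x, target, round(forward(net, x)[1], 3))
```

Whether a given run actually learns XOR depends on the starting weights and step size. The part that does not depend on luck is that `backprop` returns the true slope of the error, which you can confirm by comparing it against a finite-difference estimate.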
It says a lot about the nonobvious difficulty of doing math that it took years to come up with this algorithm.
I find it difficult to put into words just how obvious this is in retrospect. You’re just taking a system whose behavior is a differentiable function of continuous parameters, and sliding the whole thing down the slope of the error function. There are much more clever ways to train neural nets, taking into account more than the first derivative, e.g. conjugate gradient optimization, and these take some effort to understand even if you know calculus. But backpropagation is ridiculously simple. Take the network, take the partial derivative of the error function with respect to each weight in the network, slide it down the slope.
If I didn’t know the history of connectionism, and I didn’t know scientific history in general—if I had needed to guess without benefit of hindsight how long it ought to take to go from Perceptrons to backpropagation—then I would probably say something like: “Maybe a couple of hours? Lower bound, five minutes—upper bound, three days.”
“Seventeen years” would have floored me.
And I know that backpropagation may be slightly less obvious if you don’t have the idea of “gradient descent” as a standard optimization technique bopping around in your head. I know that these were smart people, and I’m doing the equivalent of complaining that Newton only invented first-year undergraduate stuff, etc.
So I’m just mentioning this little historical note about the timescale of mathematical progress, to emphasize that all the people who say “AI is 30 years away so we don’t need to worry about Friendliness theory yet” have moldy jello in their skulls.
(Which I suspect is part of a general syndrome where people’s picture of Science comes from reading press releases that announce important discoveries, so that they’re like, “Really? You do science? What kind of important discoveries do you announce?” Apparently, in their world, when AI finally is “visibly imminent”, someone just needs to issue a press release to announce the completion of Friendly AI theory.)
Backpropagation is not just clever; much more importantly, it turns out to work well in real life on a wide class of tractable problems. Not all “neural network” algorithms use backprop, but if you said, “networks of connected units with continuous parameters and differentiable behavior which learn by traveling up a performance gradient”, you would cover a pretty large swathe.
But the real cleverness is in how neural networks were marketed.
They left out the math.
To me, at least, it seems that a backprop neural network involves substantially deeper mathematical ideas than “Socrates is human, all humans are mortal, Socrates is mortal”. Newton versus Aristotle. I would even say that a neural network is more analyzable—since it does more real cognitive labor on board a computer chip where I can actually look at it, rather than relying on inscrutable human operators who type “|- Human(Socrates)” into the keyboard under God knows what circumstances.
But neural networks were not marketed as cleverer math. Instead they were marketed as a revolt against Spock.
No, more than that—the neural network was the new champion of the Other Side of the Force—the antihero of a Manichaean conflict between Law and Chaos. And all good researchers and true were called to fight on the side of Chaos, to overthrow the corrupt Authority and its Order. To champion Freedom and Individuality against Control and Uniformity. To Decentralize instead of Centralize, substitute Empirical Testing for mere Proof, and replace Rigidity with Flexibility.
I suppose a grand conflict between Law and Chaos beats trying to explain calculus in a press release.
But the thing is, a neural network isn’t an avatar of Chaos any more than an expert system is an avatar of Law.
It’s just… you know… a system with continuous parameters and differentiable behavior traveling up a performance gradient.
And logic is a great way of verifying truth preservation by syntactic manipulation of compact generalizations that are true in crisp models. That’s it. That’s all. This kind of logical AI is not the avatar of Math, Reason, or Law.
Both algorithms do what they do, and are what they are; nothing more.
But the successful marketing campaign said,
“The failure of logical systems to produce real AI has shown that intelligence isn’t logical. Top-down design doesn’t work; we need bottom-up techniques, like neural networks.”
And this is what I call the Lemon Glazing Fallacy, which generates an argument for a fully arbitrary New Idea in AI using the following template:
Major premise: All previous AI efforts failed to yield true intelligence.
Minor premise: All previous AIs were built without delicious lemon glazing.
Conclusion: If we build AIs with delicious lemon glazing, they will work.
This only has the appearance of plausibility if you present a Grand Dichotomy. It doesn’t do to say “AI Technique #283 has failed for years to produce general intelligence—that’s why you need to adopt my new AI Technique #420.” Someone might ask, “Well, that’s very nice, but what about AI technique #59,832?”
No, you’ve got to make 420 and ¬420 into the whole universe—allow only these two possibilities—put them on opposite sides of the Force—so that ten thousand failed attempts to build AI are actually arguing for your own success. All those failures are weighing down the other side of the scales, pushing up your own side… right? (In Star Wars, the Force has at least one Side that does seem pretty Dark. But who says the Jedi are the Light Side just because they’re not Sith?)
Ten thousand failures don’t tell you what will work. They don’t even say what should not be part of a successful AI system. Reversed stupidity is not intelligence.
If you remove the power cord from your computer, it will stop working. You can’t thereby conclude that everything about the current system is wrong, and an optimal computer should not have an Intel processor or Nvidia video card or case fans or run on electricity. Even though your current system has these properties, and it doesn’t work.
As it so happens, I do believe that the type of systems usually termed GOFAI will not yield general intelligence, even if you run them on a computer the size of the moon. But this opinion follows from my own view of intelligence. It does not follow, even as suggestive evidence, from the historical fact that a thousand systems built using Prolog did not yield general intelligence. So far as the logical sequitur goes, one might as well say that Silicon-Based AI has shown itself deficient, and we must try to build transistors out of carbon atoms instead.
Not to mention that neural networks have also been “failing” (i.e., not yet succeeding) to produce real AI for 30 years now. I don’t think this particular raw fact licenses any conclusions in particular. But at least don’t tell me it’s still the new revolutionary idea in AI.
This is the original example I used when I talked about the “Outside the Box” box—people think of “amazing new AI idea” and return their first cache hit, which is “neural networks” due to a successful marketing campaign thirty goddamned years ago. I mean, not every old idea is bad—but to still be marketing it as the new defiant revolution? Give me a break.
And pity the poor souls who try to think outside the “outside the box” box—outside the ordinary bounds of logical AI vs. connectionist AI—and, after mighty strains, propose a hybrid system that includes both logical and neural-net components.
It goes to show that compromise is not always the path to optimality—though it may sound Deeply Wise to say that the universe must balance between Law and Chaos.
Where do Bayesian networks fit into this dichotomy? They’re parallel, asynchronous, decentralized, distributed, probabilistic. And they can be proven correct from the axioms of probability theory. You can preprogram them, or learn them from a corpus of unsupervised data—using, in some cases, formally correct Bayesian updating. They can reason based on incomplete evidence. Loopy Bayes nets, rather than computing the correct probability estimate, might compute an approximation using Monte Carlo—but the approximation provably converges—but we don’t run long enough to converge...
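For a sense of what "reasoning from incomplete evidence" means here, a toy sketch: a three-node network (rain and sprinkler both influence wet grass) with exact inference by enumeration. The network, the probabilities, and the function names are all invented for illustration and are not taken from any real system.

```python
from itertools import product

# Made-up numbers for a toy network: Rain -> Wet <- Sprinkler.
P_RAIN = 0.2
P_SPRINKLER = 0.1
P_WET = {  # P(wet | rain, sprinkler)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.05,
}


def joint(rain, sprinkler, wet):
    """Probability of one complete assignment, using the network's factorization."""
    p = (P_RAIN if rain else 1 - P_RAIN)
    p *= (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
    p_wet = P_WET[(rain, sprinkler)]
    return p * (p_wet if wet else 1 - p_wet)


def posterior_rain(evidence):
    """P(rain | evidence), where evidence maps some of 'sprinkler', 'wet' to bools.

    Any variable missing from the evidence is summed over.
    """
    totals = {True: 0.0, False: 0.0}
    for rain, sprinkler, wet in product([True, False], repeat=3):
        world = {"sprinkler": sprinkler, "wet": wet}
        if all(world[name] == value for name, value in evidence.items()):
            totals[rain] += joint(rain, sprinkler, wet)
    return totals[True] / (totals[True] + totals[False])
```

`posterior_rain({"wet": True})` answers the question without knowing whether the sprinkler ran. Adding `"sprinkler": True` to the evidence lowers the estimate, since the sprinkler now accounts for the wet grass.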
Where does that fit on the axis that runs from logical AI to neural networks? And the answer is that it doesn’t. It doesn’t fit.
It’s not that Bayesian networks “combine the advantages of logic and neural nets”. They’re simply a different point in the space of algorithms, with different properties.
At the inaugural seminar of Redwood Neuroscience, I once saw a presentation describing a robot that started out walking on legs, and learned to run… in real time, over the course of around a minute. The robot was stabilized in the Z axis, but it was still pretty darned impressive. (When first exhibited, someone apparently stood up and said “You sped up that video, didn’t you?” because they couldn’t believe it.)
This robot ran on a “neural network” built by detailed study of biology. The network had twenty neurons or so. Each neuron had a separate name and its own equation. And believe me, the robot’s builders knew how that network worked.
Where does that fit into the grand dichotomy? Is it top-down? Is it bottom-up? Calling it “parallel” or “distributed” seems like kind of a silly waste when you’ve only got 20 neurons—who’s going to bother multithreading that?
This is what a real biologically inspired system looks like. And let me say again, that video of the running robot would have been damned impressive even if it hadn’t been done using only twenty neurons. But that biological network didn’t much resemble—at all, really—the artificial neural nets that are built using abstract understanding of gradient optimization, like backprop.
That network of 20 neurons, each with its own equation, built and understood from careful study of biology—where does it fit into the Manichaean conflict? It doesn’t. It’s just a different point in AIspace.
At a conference yesterday, I spoke to someone who thought that Google’s translation algorithm was a triumph of Chaotic-aligned AI, because none of the people on the translation team spoke Arabic and yet they built an Arabic translator using a massive corpus of data. And I said that, while I wasn’t familiar in detail with Google’s translator, the little I knew about it led me to believe that they were using well-understood algorithms (Bayesian ones, in fact) and that if no one on the translation team knew any Arabic, this was no more significant than Deep Blue’s programmers playing poor chess.
Since Peter Norvig also happened to be at the conference, I asked him about it, and Norvig said that they started out doing an actual Bayesian calculation, but then took a couple of steps away. I remarked, “Well, you probably weren’t doing the real Bayesian calculation anyway—assuming conditional independence where it doesn’t exist, and stuff”, and Norvig said, “Yes, so we’ve already established what kind of algorithm it is, and now we’re just haggling over the price.”
Where does that fit into the axis of logical AI and neural nets? It doesn’t even talk to that axis. It’s just a different point in the design space.
The grand dichotomy is a lie—which is to say, a highly successful marketing campaign which managed to position two particular fragments of optimization as the Dark Side and Light Side of the Force.