In the extreme case imagine that the brain is a pure ULM, such that the genetic prior information is close to zero or is simply unimportant. In this case it is vastly more likely that successful AGI will be built around designs very similar to the brain, as the ULM architecture in general is the natural ideal, vs the alternative of having to hand engineer all of the AI’s various cognitive mechanisms.
Not necessarily. There are very different structures that are conceptually equivalent to a UTM (cellular automata, lambda calculus, recursive functions, Wang carpets etc.) In the same manner there can be AI architectures very different from the brain which are ULM-equivalent in a relevant sense.
The ULH suggests that most everything that defines the human mind is cognitive software rather than hardware: the adult mind (in terms of algorithmic information) is 99.999% a cultural/memetic construct.
Frankly, this sounds questionable. For example, do you suggest sexual attraction is a cultural/memetic construct? It seems to me that your example of one part of the brain taking over the function of another implies little regarding the flexibility of the goal system.
One key idea—which I proposed five years ago—is that the AI should not know it is in a sim.
How do you suggest preventing it from discovering on its own that it is in a sim?
Future superintelligences will exist, but their vast and broad mental capacities will come mainly from vast mental content and computational resources. By comparison, their general architectural innovations will be minor additions.
The ULM supports this conclusion.
It seems to me that the fact we have no conscious awareness of the workings of our brain and no way to consciously influence them suggests that the brain is at best an approximation of a ULM. It seems to me that an ideal ULM wouldn’t need to invent the calculator. Therefore, while there might be a point beyond which general architectural innovations are minor additions, this point lies well beyond human intelligence.
Current ANN engines can already train and run models with around 10 million neurons and 10 billion (compressed/shared) synapses on a single GPU, which suggests that the goal could soon be within the reach of a large organization.
This assumes current ANN agents are already ULMs, which I seriously doubt.
There are very different structures that are conceptually equivalent to a UTM (cellular automata, lambda calculus, recursive functions, Wang carpets etc.) In the same manner there can be AI architectures very different from the brain which are ULM-equivalent in a relevant sense.
Of course—but all of your examples are not just conceptually equivalent—they are functionally equivalent (they can emulate each other). They are all computational foundations for constructing UTMs—although not all foundations are truly practical and efficient. Likewise there are many routes to implementing a ULM—biology is one example, modern digital computers are another.
Frankly, this sounds questionable. For example, do you suggest sexual attraction is a cultural/memetic construct?
Well I said “most everything”, and I stressed several times in the article that much of the innate complexity budget is spent on encoding the value/reward system and the learning machinery (which are closely intertwined).
Sexual attraction is an interesting example, because it develops later in adolescence and depends heavily on complex learned sensory models. Current rough hypothesis: evolution encodes sexual attraction as a highly compressed initial ‘seed’ which unfolds over time through learning. It identifies/finds and then plugs into the relevant learned sensory concept representations which code for attractive members of the opposite sex. The compression effect explains the huge variety in human sexual preferences. Investigating/explaining this in more detail would take its own post—it’s a complex, interesting topic.
One key idea—which I proposed five years ago—is that the AI should not know it is in a sim.
How do you suggest preventing it from discovering on its own that it is in a sim?
I should rephrase—it isn’t necessarily a problem if the AI suspects it’s in a sim. Rather the key is that knowing one is in a sim and then knowing how to escape should be difficult enough to allow for sufficient time to evaluate the agent’s morality, worth/utility to society, and potential future impact. In other words, the sandbox sim should be a test for both intelligence and morality.
Suspecting or knowing one is in a sim is easy. For example—the gnostics discovered the sim hypothesis long before Bostrom, but without understanding computers and computation they had zero idea how to construct or escape sims—it was just mysticism. In fact, the very term ‘gnostic’ means “one who knows”—and this was their self-identification; they believed they had discovered the great mystery of the universe (and claimed the teaching came from Jesus, although Plato had arguably hit upon an earlier version of the idea, and the term demiurge in particular comes from Plato).
It seems to me that the fact we have no conscious awareness of the workings of our brain and no way to consciously influence them suggests that the brain is at best an approximation of a ULM.
We certainly have some awareness of the workings of our brain—to varying degrees. For example you are probably aware of how you perform long multiplication, such that you could communicate the algorithm and steps. Introspection and verbalization of introspective insights are specific complex computations that require circuitry—they are not somehow innate to a ULM, because nothing is.
Current ANN engines can already train and run models with around 10 million neurons and 10 billion (compressed/shared) synapses on a single GPU, which suggests that the goal could soon be within the reach of a large organization.
This assumes current ANN agents are already ULMs, which I seriously doubt.
Sorry, I should have clarified—we will probably soon have the computational power to semi-affordably simulate ANNs with billions of neurons. That doesn’t necessarily have anything to do with whether current ANN systems are ULMs. That being said, some systems—such as Atari’s DRL agent—can be considered simple early versions of ULMs.
There is probably still much research and engineering work to do in going from simple basic ULMs up to brain-competitive systems. But research sometimes moves quickly—punctuated equilibrium and all that.
Here is a useful analogy: a simple abstract Turing machine is to a modern GPU as a simple abstract ULM is to the brain. There is a huge engineering gap between the simplest early version of an idea and a subsequent scaled-up, complex, practical, efficient version.
Current rough hypothesis: evolution encodes sexual attraction as a highly compressed initial ‘seed’ which unfolds over time through learning. It identifies/finds and then plugs into the relevant learned sensory concept representations which code for attractive members of the opposite sex.
How does this “seed” find the correct high-level sensory features to plug into? How can it wire complex high-level behavioral programs (such as courtship behaviors) to low-level motor programs learned by unsupervised learning? This seems unlikely.
For example you are probably aware of how you perform long multiplication, such that you could communicate the algorithm and steps.
But long multiplication is something that you were taught in school, which most humans wouldn’t be able to discover independently. And you are certainly not aware of how your brain performs visual recognition; the little you know was discovered through experiments, not introspection.
That being said, some systems—such as Atari’s DRL agent—can be considered simple early versions of ULMs.
Not so fast.
The Atari DRL agent learns a good mapping between short windows of frames and button presses. It has some generalization capability which enables it to achieve human-level or sometimes even superhuman performance on games based on eye-hand coordination (after all, it’s not burdened by the intrinsic delays of the human body), but it has no reasoning ability and fails miserably at any game which requires planning ahead more than a few frames.
Despite the name, no machine learning system, “deep” or otherwise, has been demonstrated to be able to efficiently learn any provably deep function (in the sense of boolean circuit depth-complexity), such as the parity function which any human of average intelligence could learn from a small number of examples.
I see no particular reason to believe that this could be solved by just throwing more computational power at the problem: you can’t fight exponentials that way.
UPDATE:
Now it seems that Google DeepMind managed to train even feed-forward neural networks to solve the parity problem. See my other comment down-thread.
Despite the name, no machine learning system, “deep” or otherwise, has been demonstrated to be able to efficiently learn any provably deep function (in the sense of boolean circuit depth-complexity), such as the parity function which any human of average intelligence could learn from a small number of examples.
I had a guess that recurrent neural networks can solve the parity problem, which Google confirmed. See http://cse-wiki.unl.edu/wiki/index.php/Recurrent_neural_networks where it says:
Even non-sequential problems may benefit from RNNs. For example, the problem of determining the parity of a set of bits [15]. This is very simple with RNNs, but doing it with a feedforward neural network would require excessive complexity.
See also PyBrain’s parity learning RNN example.
The algorithm I was referring to can easily be represented by an RNN with one hidden layer of a few nodes; the difficult part is learning it from examples.
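To make the “easy to represent” half of that concrete, here is a minimal sketch (my own illustration, not code from this thread) of a hand-wired recurrent net: a couple of threshold units compute the parity exactly, while learning these weights from raw examples is the hard part under discussion.

```python
import random

def step(z):
    # Heaviside threshold unit
    return 1.0 if z > 0 else 0.0

def parity_rnn(bits):
    # Hand-wired RNN: the recurrent state s carries the parity of the bits
    # seen so far; two hidden threshold units implement s XOR x at each step.
    s = 0.0
    for x in bits:
        a = step(s + x - 0.5)   # fires if s OR x
        b = step(s + x - 1.5)   # fires if s AND x
        s = step(a - b - 0.5)   # a AND NOT b  ==  s XOR x
    return int(s)

if __name__ == "__main__":
    for _ in range(5):
        bits = [random.randint(0, 1) for _ in range(16)]
        assert parity_rnn(bits) == sum(bits) % 2
    print("hand-wired RNN matches sum(bits) % 2")
```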
The examples for the n-parity problem are input-output pairs where each input is an n-bit binary string and its corresponding output is a single bit representing the parity of that string.
However, if I understand correctly, the code you linked solves a different machine learning problem: here the examples are input-output pairs where both the inputs and the outputs are n-bit binary strings, with the i-th output bit representing the parity of the input bits up to the i-th one.
It may look like a minor difference, but actually it makes the learning problem much easier, and in fact it basically guides the network to learn the right algorithm: the network can first learn how to solve parity on 1 bit (identity), then parity on 2 bits (xor), and so on. Since the network is very small and has an ideal architecture for that problem, after learning how to solve parity for a few bits (perhaps even two) it will generalize to arbitrary lengths.
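For concreteness, here is a rough sketch (my own, with hypothetical helper names) of the two training-set formulations being contrasted: whole-string parity versus the per-prefix supervision that effectively reduces the task to learning 2-bit XOR.

```python
import random

def hard_parity_example(n):
    # Whole-string formulation: n input bits -> a single target bit.
    x = [random.randint(0, 1) for _ in range(n)]
    return x, sum(x) % 2

def prefix_parity_example(n):
    # Per-prefix formulation: the i-th target bit is the parity of x[0..i],
    # so each step only has to learn the XOR of the previous target with one bit.
    x = [random.randint(0, 1) for _ in range(n)]
    y, p = [], 0
    for b in x:
        p ^= b
        y.append(p)
    return x, y
```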
By using this kind of supervision I bet you can also train a feed-forward neural network to solve the problem: use a training set as above, except with the input and output strings presented as n-dimensional vectors rather than sequences of individual bits, and make sure that the network has enough hidden layers. If you use a specialized architecture (e.g. decrease the width of the hidden layers as their depth increases and connect the i-th output node to the i-th hidden layer) it will learn quite efficiently; if you use a more standard architecture (hidden layers of constant width and output layer connected only to the last hidden layer) it will probably also work, although you will need quite a few training examples to avoid overfitting.
The parity problem is artificial, but it is a representative case of problems that necessarily ( * ) require a non-trivial number of highly non-linear serial computation steps. In a real-world case (a planning problem, maybe), we wouldn’t have access to the internal state of a reference algorithm to use as supervision signals for the machine learning system. The machine learning system will have to figure the algorithm on its own, and current approaches can’t do it in a general way, even for relatively simple algorithms.
You can read the (much more informed) opinion of Ilya Sutskever on the issue here (Yoshua Bengio also participated in the comments).
( * at least for polynomial-time execution, since you can always get constant depth at the expense of an exponential blow-up of parallel nodes)
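As a quick illustration of the footnote’s trade-off (my own example): a depth-2 OR-of-ANDs circuit for parity needs one AND term per odd-weight input, i.e. 2^(n-1) terms.

```python
from itertools import product

def parity_dnf_terms(n):
    # One AND term (minterm) per odd-weight assignment: 2**(n-1) of them.
    return [bits for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1]

assert len(parity_dnf_terms(4)) == 2 ** 3   # constant depth, exponential width
```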
Your comments made me curious enough to download PyBrain and play around with the sample code, to see if I could modify it to learn the parity function without intermediate parity bits in the output. In the end, I was able to, by trial and error, come up with hyperparameters that allowed the RNN to learn the parity function reliably in a few minutes on my laptop (many other choices of hyperparameters caused the SGD to sometimes get stuck before it converged to a correct solution). I’ve posted the modified sample code here. (Notice that the network now has 2 input nodes, one for the input string and one to indicate end of string, 2 hidden layers with 3 and 2 nodes, and an output node.)
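A rough reconstruction of the described data encoding (my own sketch, not the posted PyBrain code): each timestep feeds the bit plus an end-of-string flag, and the parity target is only supervised on the final step.

```python
import random

def parity_sequence_example(n):
    # Two inputs per timestep: (bit, end_of_string_flag).
    bits = [random.randint(0, 1) for _ in range(n)]
    inputs = [(float(b), 0.0) for b in bits] + [(0.0, 1.0)]   # last step is the EOS marker
    # Only the final timestep carries a meaningful target: the parity of the whole string.
    targets = [None] * n + [float(sum(bits) % 2)]
    return inputs, targets
```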
The machine learning system will have to figure the algorithm on its own, and current approaches can’t do it in a general way, even for relatively simple algorithms.
I guess you’re basically correct on this, since even with the tweaked hyperparameters, on the parity problem RNN+SGD isn’t really doing any better than a brute force search through the space of simple circuits or algorithms. But humans arguably aren’t very good at learning algorithms from input/output examples either. The fact that RNNs can learn the parity function, even if barely, makes it less clear that humans have any advantage at this kind of learning.
Nice work!
Anyway, in a paper published on arXiv yesterday, the Google DeepMind people report being able to train a feed-forward neural network to solve the parity problem, using a sophisticated gating mechanism and weight sharing between the layers. They also obtain state-of-the-art or near-state-of-the-art results on other problems.
This result makes me update my belief about the generality of neural networks in the increasing direction.
Ah you beat me to it, I just read that paper as well.
Here is the abstract for those that haven’t read it yet:
This paper introduces Grid Long Short-Term Memory, a network of LSTM cells arranged in a multidimensional grid that can be applied to vectors, sequences or higher dimensional data such as images. The network differs from existing deep LSTM architectures in that the cells are connected between network layers as well as along the spatiotemporal dimensions of the data. It therefore provides a unified way of using LSTM for both deep and sequential computation. We apply the model to algorithmic tasks such as integer addition and determining the parity of random binary vectors. It is able to solve these problems for 15-digit integers and 250-bit vectors respectively. We then give results for three empirical tasks. We find that 2D Grid LSTM achieves 1.47 bits per character on the Wikipedia character prediction benchmark, which is state-of-the-art among neural approaches. We also observe that a two-dimensional translation model based on Grid LSTM outperforms a phrase-based reference system on a Chinese-to-English translation task, and that 3D Grid LSTM yields a near state-of-the-art error rate of 0.32% on MNIST.
Also, relevant to this discussion:
It is core to the problem that the k-bit string is given to the neural network as a whole through a single projection; considering one bit at a time and remembering the previous partial result in a recurrent or multi-step architecture reduces the problem of learning k-bit parity to the simple one of learning just 2-bit parity.
The version of the problem that humans can learn well is this easier reduction. Humans can not easily learn the hard version of the parity problem, which would correspond to a rapid test where the human is presented with a flash card with a very large number on it (60+ digits to rival the best machine result) and then must respond immediately. The fast response requirement is important to prevent using much easier multi-step serial algorithms.
You can read the (much more informed) opinion of Ilya Sutskever on the issue here (Yoshua Bengio also participated in the comments).
That is the most cogent, genuinely informative explanation of “Deep Learning” that I’ve ever heard. Most especially so regarding the bit about linear correlations: we can learn well on real problems with nothing more than stochastic gradient descent because the feature data may contain whole hierarchies of linear correlations.
How does this “seed” find the correct high-level sensory features to plug into? How can it wire complex high-level behavioral programs (such as courtship behaviors) to low-level motor programs learned by unsupervised learning?
This particular idea is not well developed yet in my mind, and I haven’t really even searched the literature yet. So keep that in mind.
Leaving courtship aside, let us focus on attraction—specifically, evolution needs to encode detectors which can reliably distinguish high-quality mates of the opposite sex from all kinds of other objects. The problem is that a good high-quality face recognizer is too complex to specify in the genome—it requires many billions of synapses, so it needs to be learned. However, the genome can encode an initial crappy face detector. It can also encode scent/pheromone detectors, and it can encode general ‘complexity’ and/or symmetry detectors that sit on top, so even if it doesn’t initially know what it is seeing, it can tell when something is about yea complex/symmetric/interesting. It can encode the equivalent of: if you see an interesting face-sized object which appears for many minutes at a time and moves at this speed, and you hear complex speech-like sounds, and smell human scents, it’s probably a human face.
Then the problem is reduced in scope. The cortical map will grow a good face/person model/detector on its own, and then after this model is ready certain hormones in adolescence activate innate routines that learn where the face/person model patch is and help other modules plug into it. This whole process can also be improved by the use of the weak top-down prior described above.
That being said, some systems—such as Atari’s DRL agent—can be considered simple early versions of ULMs.
Not so fast.
Actually, on consideration I think you are right and I did get ahead of myself there. The Atari agent doesn’t really have a general memory subsystem. It has an episode replay system, but not general memory. DeepMind is working on general memory—they have the NTM paper and whatnot—but the Atari agent came before that.
I largely agree with your assessment of the Atari DRL agent.
Despite the name, no machine learning system, “deep” or otherwise, has been demonstrated to be able to efficiently learn any provably deep function (in the sense of boolean circuit depth-complexity), such as the parity function which any human of average intelligence could learn from a small number of examples.
I highly doubt that—but it all depends on what your sampling class for ‘human’ is. An average human drawn from the roughly 10 billion alive today? Or an average human drawn from the roughly 100 billion who have ever lived? (most of whom would have no idea what a parity function is).
When you imagine a human learning the parity function from a small number of examples, what you really imagine is a human who has already learned the parity function, and thus internally has ‘parity function’ as one of perhaps a thousand types of functions they have learned, such that if you give them some data, it is one of the obvious things they may try.
Training a machine on a parity data set from scratch and expecting it to learn the parity function is equivalent to it inventing the parity function—and perhaps inventing mathematics as well. It should be compared to raising an infant without any knowledge of mathematics or anything related, and then training them on the raw data.
However, the genome can encode an initial crappy face detector.
It’s not that crappy given that newborns can not only recognize faces with significant accuracy, but also recognize facial expressions.
The cortical map will grow a good face/person model/detector on its own, and then after this model is ready certain hormones in adolescence activate innate routines that learn where the face/person model patch is and help other modules plug into it.
Having two separate face recognition modules, one genetically specified and another learned seems redundant, and still it’s not obvious to me how a genetically-specified sexual attraction program could find how to plug into a completely learned system, which would necessarily have some degree of randomness.
It seems more likely that there is a single face recognition module which is genetically specified and then it becomes fine tuned by learning.
I highly doubt that—but it all depends on what your sampling class for ‘human’ is. An average human drawn from the roughly 10 billion alive today? Or an average human drawn from the roughly 100 billion who have ever lived? (most of whom would have no idea what a parity function is).
Show a neolithic human a bunch of pebbles, some black and some white, laid out in a line. Ask them to add a black or white pebble to the line, and reward them if the number of black pebbles is even. Repeat multiple times.
Even without a concept of “even number”, wouldn’t this neolithic human be able to figure out an algorithm to compute the right answer? They just need to scan the line, flipping a mental switch for each black pebble they encounter, and then add a black pebble if and only if the switch is not in the initial position.
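The described “mental switch” procedure, spelled out as code (purely illustrative):

```python
def pebble_to_add(line):
    # Scan the line, toggling a one-bit "switch" on each black pebble.
    switch = False
    for pebble in line:
        if pebble == "black":
            switch = not switch
    # If the switch moved (odd count of black pebbles so far), add a black
    # pebble to make the total even; otherwise add a white one.
    return "black" if switch else "white"
```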
Maybe I’m overgeneralizing, but it seems unlikely to me that people able to invent complex hunting strategies, to build weapons, tools, traps, clothing, huts, to participate in tribe politics, etc. wouldn’t be able to figure something like that.
It’s not that crappy given that newborns can not only recognize faces with significant accuracy, but also recognize facial expressions.
Do you have a link to that? ‘Newborn’ can mean many things—the visual system starts learning from the second the eyes open, and perhaps even before that, through pattern generators projected onto the retina which help to ‘pretrain’ the visual cortex.
I know that infants have initial face detectors from the second they open their eyes, but from what I remember reading—they are pretty crappy indeed, and initially can’t tell a human face apart from a simple cartoon with 3 blobs for eyes and mouth.
It seems more likely that there is a single face recognition module which is genetically specified and then it becomes fine tuned by learning.
Except that it isn’t that simple, because—amongst other evidence—congenitally blind people still learn a model and recognizer for attractive people, and can discern someone’s relative beauty by scanning faces with their fingertips.
Even without a concept of “even number”, wouldn’t this neolithic human be able to figure out an algorithm to compute the right answer?
Not sure—we are getting into hypothetical scenarios here. Your visual version, with black and white pebbles laid out in a line, implicitly helps simplify the problem and may guide the priors in the right way. I am reasonably sure that this setup would also help any brain-like AGI.
Even without a concept of “even number”, wouldn’t this neolithic human be able to figure out an algorithm to compute the right answer? They just need to scan the line, flipping a mental switch for each black pebble they encounter, and then add a black pebble if and only if the switch is not in the initial position.
Well, given how hard it is for Haitians to understand numerical sorting...
If I understand correctly, in the post you linked Scott is saying that Haitians are functionally innumerate, which should explain the difficulties with numerical sorting.
My point is that the parity function should be learnable even without basic numeracy, although I admit that perhaps I’m overgeneralizing.
Anyway, modern machine learning systems can learn to perform basic arithmetic such as addition and subtraction, and I think even sorting (since they are used for preordering in statistical machine translation), hence the problem doesn’t seem to be a lack of arithmetic knowledge or skill.
Note that both addition and subtraction have constant circuit depth (they are in AC0) while parity has logarithmic circuit depth.
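To illustrate the depth claim (my own sketch): with 2-input XOR gates, a balanced reduction tree computes parity in about log2(n) layers, whereas addition can be done in constant depth with unbounded fan-in carry-lookahead.

```python
def parity_log_depth(bits):
    # Balanced XOR tree: each pass halves the number of wires, so the circuit
    # depth is roughly log2(len(bits)) layers of two-input XOR gates.
    layer = list(bits)
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(0)          # pad odd layers with a neutral 0
        layer = [layer[i] ^ layer[i + 1] for i in range(0, len(layer), 2)]
    return layer[0]

assert parity_log_depth([1, 0, 1, 1, 0]) == 1
```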
Thank you for replying!
Of course—but all of your examples are not just conceptually equivalent—they are functionally equivalent (they can emulate each other). They are all computational foundations for constructing UTMs—although not all foundations are truly practical and efficient. Likewise there are many routes to implementing a ULM—biology is one example, modern digital computers are another.
Universal computers are equivalent in the sense that any two can simulate each other in polynomial time. ULMs should probably be equivalent in the sense that each can efficiently learn to behave like the other. But it doesn’t imply the software architectures have to be similar. For example I see no reason to assume any ULM should be anything like a neural net.
Well I said “most everything”, and I stressed several times in the article that much of the innate complexity budget is spent on encoding the value/reward system and the learning machinery (which are closely intertwined).
Any value hard-coded in humans will have to be transferred to the AI in a way other than universal learning. And another thing: teaching an AI values by placing it in a human environment and counting on reinforcement learning can fail spectacularly if the AI’s intelligence grows much faster than that of a human child.
Rather the key is that knowing one is in a sim and then knowing how to escape should be difficult enough to allow for sufficient time to evaluate the agent’s morality, worth/utility to society, and potential future impact.
This is an assumption which might or might not be correct. I would definitely not bet our survival on this assumption without much further evidence.
Introspection and verbalization of introspective insights are specific complex computations that require circuitry—they are not somehow innate to a ULM, because nothing is.
OK, but a ULM is supposed to be able to learn anything. A human brain is never going to learn to rearrange its low level circuitry to efficiently perform operations like numerical calculation.
Here is a useful analogy: a simple abstract Turing machine is to a modern GPU as a simple abstract ULM is to the brain. There is a huge engineering gap between the simplest early version of an idea and a subsequent scaled-up, complex, practical, efficient version.
The difference is that we have a solid mathematical theory of Turing machines whereas ULMs, as far as I can see, are only an informal idea so far.
But it doesn’t imply the software architectures have to be similar. For example I see no reason to assume any ULM should be anything like a neural net.
Sure—any general model can simulate any other. Neural networks have strong practical advantages. Their operator base is built on fmads (fused multiply-adds), which is a good match for modern computers. They allow explicit search of program space in terms of the execution graph, which is extremely powerful because it allows one to a priori exclude all programs which don’t halt—you can constrain the search to focus on programs with exact known computational requirements.
Neural nets make deep factoring easy, and deep factoring is the single most important huge gain in any general optimization/learning system: it allows for exponential (albeit limited) speedup.
And another thing: teaching an AI values by placing it in a human environment and counting on reinforcement learning can fail spectacularly if the AI’s intelligence grows much faster than that of a human child.
Yes. There are pitfalls, and in general much more research to do on value learning before we get to useful AGI, let alone safe AGI.
A human brain is never going to learn to rearrange its low level circuitry to efficiently perform operations like numerical calculation.
This is arguably a misconception. The brain has a 100 Hz clock rate at most. For general operations that involve memory, it’s more like 10 Hz. Most people can do basic arithmetic in less than a second, which roughly maps to a dozen clock cycles or so, maybe less. That actually is comparable to many computers—for example, on the current Maxwell GPU architecture (Nvidia’s latest and greatest), even the simpler instructions have a latency of about 6 cycles.
Now, obviously the arithmetic ops that most humans can do in less than a second are very limited—it’s like a minimal 3-bit machine. But some atypical humans can do larger scale arithmetic at the same speed.
Point is, you need to compare everything adjusted for the six-order-of-magnitude speed difference.
...They allow explicit search of program space in terms of the execution graph, which is extremely powerful because it allows one to a priori exclude all programs which don’t halt—you can constrain the search to focus on programs with exact known computational requirements.
Right. So Boolean circuits are a better analogy than Turing machines.
Neural nets make deep factoring easy, and deep factoring is the single most important huge gain in any general optimization/learning system: it allows for exponential (albeit limited) speedup.
I’m sorry, what is deep factoring? A reference perhaps?
There are pitfalls, and in general much more research to do on value learning before we get to useful AGI, let alone safe AGI.
I completely agree.
This is arguably a misconception. The brain has a 100 Hz clock rate at most. For general operations that involve memory, it’s more like 10 Hz...
Good point! Nevertheless, it seems to me very dubious that the human brain can learn to do anything within the limits of its computing power. For example, why can’t I learn to look at a page full of exercises in arithmetic and solve all of them in parallel?
Right. So Boolean circuits are a better analogy than Turing machines.
They are of course equivalent in theory, but in practice directly searching through a Boolean circuit space is much wiser than searching through a program space. Searching through analog/algebraic circuit space is even better, because you can take advantage of fmads instead of having to spend enormous circuit complexity emulating them. Neural nets are even better than that, because they enforce a mostly continuous/differentiable energy landscape which helps inference/optimization.
I’m sorry, what is deep factoring? A reference perhaps?
It’s the general idea that you can reuse subcomputations amongst models and layers. Solomonoff induction is impractical for a number of reasons, but one is this: it treats every function/model as entirely distinct. So if you have, say, one high level model which has developed a good cat detector, that isn’t shared amongst the other models. Deep nets (of various forms) automatically share submodel components AND subcomputations/subexpressions amongst those submodels. That massively speeds up the search. That is deep factoring.
All the successful multi-layer models use deep factoring to some degree. This paper on Sum-Product Networks explains the general idea pretty well.
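A toy sketch of the factoring idea (my own illustration, hypothetical names): several task heads reuse one shared feature trunk, so the trunk’s parameters and computation are amortized across all the submodels instead of being duplicated per model.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Shared trunk: computed once per input and reused by every head.
W_trunk = rng.normal(size=(64, 128))
W_cat_head = rng.normal(size=(128, 1))   # "cat detector" submodel
W_dog_head = rng.normal(size=(128, 1))   # "dog detector" submodel

def factored_forward(x):
    h = relu(x @ W_trunk)                # the shared subcomputation
    return h @ W_cat_head, h @ W_dog_head

# The unfactored alternative would rebuild the trunk separately inside each
# submodel, roughly doubling both the parameters and the per-input compute.
x = rng.normal(size=(1, 64))
cat_score, dog_score = factored_forward(x)
```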
Good point! Nevertheless, it seems to me very dubious that the human brain can learn to do anything within the limits of its computing power. For example, why can’t I learn to look at a page full of exercises in arithmetic and solve all of them in parallel?
There are a lot of reasons. First, due to nonlinear foveation your visual system can only read/parse a couple of words/symbols during each saccade—only those right in the narrow center of the visual cone, the fovea. So it takes a number of clock cycles or steps to scan the entire page, and your brain only has limited working memory to put stuff in.
Second, the bigger problem is that even if you already know how to solve a math problem, just parsing many math problems requires a number of steps; and then, when actually solving them—even if you know the ideal algorithm that requires the minimal number of steps—that minimal number can still be quite large.
Many interesting problems still require a number of serial steps to solve, even with an infinite parallel machine. Sorting is one simple example.
...Neural nets are even better than that, because they enforce a mostly continuous/differentiable energy landscape which helps inference/optimization.
I wonder whether this is a general property, or whether the success of continuous methods is limited to problems with natural continuous models like vision.
Deep nets (of various forms) automatically share submodel components AND subcomputations/subexpressions amongst those submodels.
Yes, this is probably important.
First, due to nonlinear foveation your visual system can only read/parse a couple of words/symbols during each saccade—only those right in the narrow center of the visual cone, the fovea. So it takes a number of clock cycles or steps to scan the entire page, and your brain only has limited working memory to put stuff in.
Scanning the page is clearly not the bottleneck: I can read the page much faster than I can solve the exercises. “Limited working memory” sounds like a claim that higher cognition has far fewer computing resources than low-level tasks. Clearly visual processing requires much more “working memory” than solving a couple dozen exercises in arithmetic. But if we accept this constraint, then does the brain still qualify as a ULM? It seems to me that if there is a deficiency in the brain’s architecture that prevents higher cognition from enjoying the brain’s full power, fixing this deficiency definitely counts as an “architectural innovation”.
This is arguably a misconception. The brain has a 100 Hz clock rate at most. For general operations that involve memory, it’s more like 10 Hz.
Mechanical calculators were slower than that, and still they were very much better at numeric computation than most humans, which made them incredibly useful.
Now, obviously the arithmetic ops that most humans can do in less than a second are very limited—it’s like a minimal 3-bit machine. But some atypical humans can do larger scale arithmetic at the same speed.
Indeed these are very rare people. The vast majority of people, even if they worked for decades in accounting, can’t learn to do numeric computation as fast and accurately as a mechanical calculator does.
The vast majority of people, even if they worked for decades in accounting, can’t learn to do numeric computation as fast and accurately as a mechanical calculator does.
The problems aren’t even remotely comparable. A human is solving a much more complex problem—the inputs are in the form of visual or auditory signals which first need to be recognized and processed into symbolic numbers. The actual computation step is trivial and probably only involves a handful of cycles, or even a single one.
I admit that I somewhat let you walk into this trap by not mentioning it earlier … this example shows that the brain can learn near optimal (in terms of circuit depth or cycles) solutions for these simple arithmetic problems. The main limitation is that the brain’s hardware is strongly suited to approximate inference problems, and not exact solutions, so any exact operators require memoization. This is actually a good thing, and any practical AGI will need to have a similar prior.
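A loose illustration of the memoization point (my own analogy in code): an exact operator such as single-digit multiplication becomes a one-step associative lookup once the table has been memorized, rather than something recomputed serially each time.

```python
# Memoized "times table": the exact operator is a learned lookup.
TIMES_TABLE = {(a, b): a * b for a in range(10) for b in range(10)}

def multiply_digits(a, b):
    # One "cycle": a single associative retrieval instead of repeated addition.
    return TIMES_TABLE[(a, b)]

assert multiply_digits(7, 8) == 56
```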
Very thought provoking. Thank you.