The interesting thing is that they have essentially implemented a (sort of) Turing machine with differentiable functions, which enables training using gradient-based techniques.
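To make that concrete, here is a minimal sketch (mine, not the paper's full architecture) of the core trick: instead of reading one memory cell at a discrete address, the controller reads a weighted blend of all cells, so the result is differentiable with respect to the addressing weights. The erase/add write rule follows the NTM paper; the sizes and the random "controller outputs" are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 4                    # toy sizes: 8 memory slots, each of width 4
memory = rng.normal(size=(N, M))

# Addressing "logits" would come from the controller network; random here.
logits = rng.normal(size=N)
w = np.exp(logits) / np.exp(logits).sum()   # softmax: weights sum to 1

# Soft read: a convex combination of all memory rows instead of a single
# discrete lookup, so gradients flow back through the weights w.
read_vector = w @ memory

# Soft write (erase then add, as in the NTM paper) uses the same weights:
erase, add = rng.uniform(size=M), rng.normal(size=M)
memory = memory * (1 - np.outer(w, erase)) + np.outer(w, add)
```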
In principle you could always train resource-bounded Turing machines using combinatorial optimization techniques such as classical planners, SAT solvers or ILP solvers; the theoretical complexity will be PSPACE-complete or NP-complete depending on the constraints that you impose. In practice, this is extremely computationally expensive and doesn’t scale past small examples.
If this differentiable approach generalizes past toy examples, it will allow efficient training. In addition to the obvious practical applications (*), I think it may also have a deeper philosophical value:
Consider word embeddings and the vector space model: words appear to be discrete objects at the syntactic level, yet they represent concepts which intuitively feel like they lie in some sort of differentiable manifold: “David Cameron”, “Angela Merkel” and “Jürgen Schmidhuber” are all quite similar to each other on the “species” dimension; Cameron and Merkel are more similar to each other than to Schmidhuber on the “job” dimension; Cameron and Schmidhuber are more similar on the “gender” dimension; Merkel and Schmidhuber are more similar on the “nationality” dimension. The concepts of “species”, “job”, “gender” and “nationality” also have their own topology, and so on.

It is plausible that this ability to perceive an essentially continuous topology over discrete things is a fundamental property of how our mind works.

Modern techniques in natural language processing and information retrieval are able to learn these kinds of representations from data, but these are generally representations of individual words or short phrases. It is hypothesized that you can represent the meaning of arbitrarily complex sentences, and thus ideas, by combining the representations of their constituents, but nobody currently has a clue how to do it besides simple averaging (which yields the general topic) and a few other tricks.
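A toy illustration of the geometric picture (hand-made vectors of my own invention; real embeddings like word2vec or GloVe have hundreds of dimensions and are learned from corpora, not written by hand):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors over the dimensions [species, job, gender, nationality].
cameron     = np.array([1.0, 1.0, 1.0, 0.0])  # human, politician, male, UK
merkel      = np.array([1.0, 1.0, 0.0, 1.0])  # human, politician, female, Germany
schmidhuber = np.array([1.0, 0.0, 1.0, 1.0])  # human, scientist,  male, Germany

# Each pair agrees on "species" plus exactly one other dimension.
print(cosine(cameron, merkel))       # pulled together along "job"
print(cosine(cameron, schmidhuber))  # pulled together along "gender"
print(cosine(merkel, schmidhuber))   # pulled together along "nationality"

# The crude "averaging" trick mentioned above: the mean of the constituent
# vectors captures the general topic of a phrase but discards its structure.
phrase = np.mean([cameron, merkel, schmidhuber], axis=0)
```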
What about algorithms? It seems plausible that algorithms are the most general way of describing ideas. Algorithms also look like extremely discrete and structured things, at least when you look at them through the usual mathematical formalisms and the programming languages that derive from them. But we still have an intuitive understanding that algorithms should fit in a differentiable manifold, where it is possible to consider their similarity along various dimensions.
Recurrent neural networks are a way of embedding algorithms in a vector space where, as you move continuously along a dimension, the performance measure also varies continuously. In principle they are Turing-complete (up to resource bounds). In practice they can only represent a certain class of algorithms well (the ones that compute functions which are “Markovian” w.r.t. their input). Outside that class, RNNs tend to become chaotic and can’t be effectively trained or even practically computed.
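A sketch of what “embedding an algorithm in a vector space” means here: in a vanilla RNN cell, the weight matrices are continuous parameters, and moving them continuously moves the function the network computes (toy sizes, random weights purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5                                 # toy dimensions
W_x = rng.normal(scale=0.1, size=(n_hid, n_in))    # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(n_hid, n_hid))   # recurrent weights
b   = np.zeros(n_hid)

def rnn_step(h, x):
    """One step of a vanilla RNN: the hidden state h is the entire 'tape'."""
    return np.tanh(W_x @ x + W_h @ h + b)

h = np.zeros(n_hid)
for x in rng.normal(size=(4, n_in)):   # a length-4 input sequence
    h = rnn_step(h, x)

# Every parameter is a real number, so the "program" this net computes
# lives in a continuous space and can be nudged by gradient descent.
```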
“Long short term memory” neural networks (by the aforementioned Schmidhuber et al.) try to overcome this issue using an architecture more similar to that of conventional digital circuits. Their basic element is a differentiable variant of the flip-flop unit. Just like you can use flip-flops and feed-forward Boolean circuits to represent any finite-state machine in a discrete way, you can use LSTM cells and feed-forward MLP units to represent any finite-state machine in a differentiable way, which usually avoids chaotic dynamics.
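A minimal sketch of the flip-flop analogy (toy dimensions, random weights, biases omitted): the forget and input gates decide, differentiably, whether the cell state is held or overwritten, much like the enable line of a latch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
# One weight matrix per gate, acting on the concatenated [h, x] vector.
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
                      for _ in range(4))

def lstm_step(h, c, x):
    """One LSTM step. The cell state c is the 'flip-flop': gates saturated
    near 0 or 1 make it hold or overwrite its contents, yet everything
    remains differentiable."""
    hx = np.concatenate([h, x])
    f = sigmoid(W_f @ hx)              # forget gate: keep the old state?
    i = sigmoid(W_i @ hx)              # input gate: admit a new value?
    o = sigmoid(W_o @ hx)              # output gate: expose the state?
    c = f * c + i * np.tanh(W_c @ hx)
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):   # a length-5 input sequence
    h, c = lstm_step(h, c, x)
```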
In practice, LSTMs do well on certain problems but don’t scale past a certain difficulty, possibly for the same reason that custom digital electronics doesn’t really scale as a solution for increasingly complex problems: past a certain point, you want to “program” things.
This “Neural Turing Machine” is an attempt to do that. If it turns out that it generalizes past toy problems, then it means that it is a “natural” way to embed algorithms of practical interest in a differentiable manifold. It will be a way to access the hidden topology of algorithms, and thus arbitrary ideas, that we intuitively “see” but weren’t able to mathematically model so far.
(*) And presumably scaring the shit out of MIRI once Google plugs it into a reinforcement learning framework :)
Another advantage is that neural nets in most cases allow fairly straightforward incorporation of additional information (side channels, lower layers). If this is combined with algorithm learning, it allows fairly general integrative capabilities.
That’s a very nice intuitive overview.