Some good comments on related work in https://www.reddit.com/r/MachineLearning/comments/2kth6d/googles_secretive_deepmind_startup_unveils_a/
(After reading the paper, I still don’t really get how you wire some RAM into a neural network. Maybe it makes more sense to others.)
I would naively assume it works similarly to the way we interface with our own memory: some kind of mapping of a subset of the memory at a time onto a sensory input of the network, which then just needs to remember for itself how to look up the relevant subsets.
It’s not really RAM, but rather a tape (like a doubly linked list). The LSTM controller can’t address an arbitrary location in logarithmic space/time. They do add multiple tape readers at one point, though.
This tape could be compared to the phonological loop and the stream of consciousness.
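For concreteness, here is a rough sketch (my own toy Python, not the paper’s exact equations) of what that tape-style head movement can look like: the head position is a soft weighting over memory cells, and the controller can only nudge it left or right by blending shifted copies of that weighting.

```python
import numpy as np

def shift_head(weights, shift_probs):
    """Nudge a soft head position along the tape.

    weights:     (N,) soft attention over N memory cells, sums to 1
    shift_probs: (3,) probabilities of moving left, staying, moving right
    """
    shifted = np.zeros_like(weights)
    for offset, p in zip((-1, 0, 1), shift_probs):
        shifted += p * np.roll(weights, offset)
    return shifted

# Head mostly over cell 2; controller says "probably move right".
w = np.array([0.0, 0.1, 0.8, 0.1, 0.0])
s = np.array([0.05, 0.15, 0.80])   # P(left), P(stay), P(right)
print(shift_head(w, s))            # most of the mass drifts to cell 3
```

In this sketch the head can only diffuse to neighbouring cells each step, so reaching a distant address takes time proportional to the distance; that is the sense in which it behaves like a tape rather than RAM.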
A neural network is just a function of some sort (inputs to outputs). So is a finite state machine (which is what a regular Turing machine uses). Put that way, nothing conceptually interesting is going on. As usual, the devil is in the details here.
The interesting thing is that they have essentially implemented a (sort of) Turing machine with differentiable functions, which enables training using gradient-based techniques.
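A minimal sketch of what “differentiable” buys you here (my own simplification, not the paper’s exact formulation): instead of reading or writing one discrete cell, every access is a weighted blend over all cells, so the result is a smooth function of both the memory and the addressing weights, and gradients can flow through it.

```python
import numpy as np

def soft_read(memory, weights):
    """Blended read: a convex combination of all memory rows."""
    return weights @ memory                             # differentiable in both arguments

def soft_write(memory, weights, erase, add):
    """Blended write: every cell is partially erased, then partially updated."""
    memory = memory * (1 - np.outer(weights, erase))    # soft erase
    return memory + np.outer(weights, add)              # soft add

M = np.zeros((4, 3))                                    # 4 cells of width 3
w = np.array([0.05, 0.05, 0.85, 0.05])                  # head mostly over cell 2
M = soft_write(M, w, erase=np.ones(3), add=np.array([1.0, 2.0, 3.0]))
print(soft_read(M, w))                                  # a blurred, attenuated [1, 2, 3]
```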
In principle you could always train resource-bounded Turing machines using combinatorial optimization techniques such as classical planners, SAT solvers, or ILP solvers; the theoretical complexity is PSPACE-complete or NP-complete, depending on the constraints you impose. In practice, this is very computationally expensive and doesn’t scale past small examples.
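To get a feel for why the combinatorial route blows up, here is a toy illustration (entirely my own, nothing to do with any real planner or solver): brute-forcing the transition table of even a tiny finite-state transducer means searching a space that grows exponentially with the table size.

```python
from itertools import product

# Tiny finite-state transducer: 2 states, binary alphabet.
# Each table entry (state, input) maps to a choice of (next_state, output).
STATES, SYMBOLS = (0, 1), (0, 1)
ENTRIES = [(s, a) for s in STATES for a in SYMBOLS]      # 4 table entries
CHOICES = [(ns, o) for ns in STATES for o in SYMBOLS]    # 4 options per entry

def run(table, bits):
    state, out = 0, []
    for b in bits:
        state, o = table[(state, b)]
        out.append(o)
    return out

target = lambda bits: [1 - b for b in bits]              # try to "learn" NOT
examples = [[0, 1, 1, 0], [1, 1, 0, 1]]

consistent = sum(
    all(run(dict(zip(ENTRIES, cand)), xs) == target(xs) for xs in examples)
    for cand in product(CHOICES, repeat=len(ENTRIES))    # 4**4 = 256 candidates
)
print(consistent)   # a handful of tables fit; the search space is |CHOICES| ** |ENTRIES|
```

With a few more states or symbols this enumeration is hopeless, which is roughly the wall the discrete approaches hit.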
If this differentiable approach generalizes past toy examples, it will allow efficient training. In addition to the obvious practical applications (*), I think it may also have a deeper philosophical value:
Consider word embeddings and the Vector space model: words appear to be discrete objects at the syntactic level, while they represent concepts which intuitively feel like they lie in some sort of differentiable manifold: “David Cameron”, “Angela Merkel” and “Jürgen Schmidhuber” are all quite similar to each other on the “species” dimension, Cameron and Merkel are more similar to each other than to Schmidhuber on the “job” dimension, Cameron and Schmidhuber are more similar on the “gender” dimension, and Merkel and Schmidhuber are more similar on the “nationality” dimension. The concepts of “species”, “job”, “gender” and “nationality” also have their own topology, and so on.
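A toy illustration of that geometry (the vectors below are invented for the example; real learned embeddings don’t have neatly labelled coordinates, but the idea of “close along one direction, far along another” is the same):

```python
import numpy as np

# Hypothetical coordinates: [species, job, gender, nationality]
vectors = {
    "Cameron":     np.array([1.0, 0.9, 0.8, 0.2]),
    "Merkel":      np.array([1.0, 0.9, 0.1, 0.7]),
    "Schmidhuber": np.array([1.0, 0.2, 0.8, 0.6]),
}

def dist(a, b, dims):
    """Distance restricted to a subset of dimensions (smaller = more similar)."""
    return float(np.linalg.norm(vectors[a][dims] - vectors[b][dims]))

print(dist("Cameron", "Merkel", [1]) < dist("Cameron", "Schmidhuber", [1]))  # True: closer on "job"
print(dist("Cameron", "Schmidhuber", [2]) < dist("Cameron", "Merkel", [2]))  # True: closer on "gender"
print(dist("Merkel", "Schmidhuber", [3]) < dist("Merkel", "Cameron", [3]))   # True: closer on "nationality"
```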
It is plausible that this ability to perceive an essentially continuous topology over discrete things is a fundamental property of how our mind works.
Modern techniques in natural language processing and information retrieval are able to learn these kinds of representations from data, but these are generally representations of individual words or short phrases. It is hypothesized that you can represent the meaning of arbitrarily complex sentences, and thus ideas, by combining the representations of their constituents, but nobody currently has a clue how to do it besides simple averaging (which yields the general topic) and a few other tricks.
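The “simple averaging” mentioned above, in its most naive form (a sketch with made-up vectors standing in for real learned embeddings):

```python
import numpy as np

# Stand-in word vectors (invented for illustration).
word_vecs = {
    "neural":  np.array([0.9, 0.1, 0.0]),
    "turing":  np.array([0.7, 0.2, 0.6]),
    "machine": np.array([0.8, 0.0, 0.5]),
}

def sentence_vec(words):
    """Crudest sentence representation: the mean of its word vectors.

    This captures the rough topic but discards word order and structure,
    which is exactly the limitation described above.
    """
    return np.mean([word_vecs[w] for w in words], axis=0)

print(sentence_vec(["neural", "turing", "machine"]))
```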
What about algorithms? It seems plausible that algorithms are the most general way of describing ideas. Algorithms also look like extremely discrete and structured things, at least when you view them through the usual mathematical formalisms and the programming languages that derive from these formalisms.
But we still have an intuitive understanding that algorithms should fit in a differentiable manifold, where it is possible to consider their similarity along various dimensions.
Recurrent neural networks are a way of embedding algorithms in a vector space where you can move continuously along one dimension and the performance measure varies continuously. In principle they are Turing-complete (up to resource bounds). In practice they can only represent a certain class of algorithms well (the ones that compute functions which are “Markovian” w.r.t. their input). Outside that class of algorithms, RNNs tend to become chaotic and can’t be effectively trained, or even practically computed.
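For concreteness, a bare-bones sketch of that embedding (a vanilla RNN step, my own toy code): the “algorithm” is literally the point (Wx, Wh, Wo) in weight space, and nudging that point continuously nudges the function it computes.

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, Wo):
    """One step of a vanilla RNN: new hidden state from input + previous state."""
    h_new = np.tanh(Wx @ x + Wh @ h)   # the hidden state carries information forward
    y = Wo @ h_new                     # readout
    return y, h_new

rng = np.random.default_rng(0)
Wx, Wh, Wo = rng.normal(size=(4, 2)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))

h = np.zeros(4)
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):   # a short input sequence
    y, h = rnn_step(np.array(x), h, Wx, Wh, Wo)
print(y)
```

Iterating that tanh(Wh @ h + …) update many times is also where the chaotic dynamics mentioned above come from.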
“Long short term memory” neural networks (by the aforementioned Schmidhuber et al.) try to overcome this issue using an architecture more similar to that of conventional digital circuits. Their basic element is a differentiable variant of the flip-flop unit. Just like you can use flip-flops and feed-forward boolean circuits to represent any finite-state machine in a discrete way, you can use LSTM cells and feed-forward MLP units to represent any finite-state machine in a differentiable way, which usually avoids chaotic dynamics.

In practice they do well on certain problems but don’t scale past a certain difficulty, possibly for the same reason that custom digital electronics doesn’t really scale as a solution for increasingly complex problems: past a certain point, you want to “program” things.
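A rough sketch of the “differentiable flip-flop” intuition (heavily simplified; a real LSTM cell has input, forget and output gates plus a squashed readout): a gate near 0 or 1 decides whether the cell holds its old value or latches a new one, but the decision stays smooth, so gradients can pass through it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_latch(cell, candidate, gate_logit):
    """Differentiable analogue of a flip-flop's hold-vs-load behaviour.

    gate near 0 -> keep the stored value; gate near 1 -> overwrite with candidate.
    Unlike a real flip-flop the blend is smooth, so gradients flow through it.
    """
    g = sigmoid(gate_logit)
    return (1 - g) * cell + g * candidate

c = 0.0
c = soft_latch(c, candidate=1.0, gate_logit=+6.0)   # ~load: c becomes ≈ 1.0
c = soft_latch(c, candidate=0.3, gate_logit=-6.0)   # ~hold: c stays ≈ 1.0
print(c)
```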
This “Neural Turing Machine” model is an attempt at exactly that kind of “programming”. If it turns out to generalize past toy problems, then it means it is a “natural” way to embed algorithms of practical interest in a differentiable manifold. It would be a way to access the hidden topology of algorithms, and thus of arbitrary ideas, that we intuitively “see” but haven’t been able to model mathematically so far.
(* and presumably scaring the shit out of MIRI once Google plugs it into a reinforcement learning framework :) )
Another advantage is that neural nets in most cases allow for fairly straightforward incorporation of additional information (side channels, lower layers). If this is combined with algorithm learning, it allows fairly general integrative capabilities.
That’s a very nice intuitive overview.
If you want to put it that way, nothing conceptually interesting is going on in any neural network paper—we already know they’re universal.
My problem is the details: I visualize neural networks as like pachinko machines or Galton’s quincunx—you drop a bunch of bits (many balls) into the top (bottom) layer of neurons (pins) and they cascade down to the bottom based on the activation functions (spacing of pins & how many balls hit a pin simultaneously), and at the bottom (top) is emitted a final smaller output (1 ball somewhere). I don’t get the details of what it means to add a memory to this many-to-one function.
Recurrent Neural Networks feed their outputs back into their inputs, and run continuously. Instead of visualizing a river flowing directly downstream, visualize a river with a little cove off to one side, where vortices swirl around.
I think another way to look at neural networks is that they are nested non-linear regression models.
I am probably in the minority here, but I don’t think the stuff in the OP is that interesting.
The same way you add memory to many-to-one boolean functions: feedback loops.
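A sketch of that point (my own toy example): wire part of the output back in as an extra input at the next step, and the same memoryless boolean function starts remembering — here it turns into a set/reset latch.

```python
def sr_core(set_bit, reset_bit, stored):
    """A memoryless boolean function of three inputs."""
    return (stored or set_bit) and not reset_bit

# Feedback loop: route the output back in as the 'stored' input at the next step.
stored = False
for set_bit, reset_bit in [(True, False), (False, False), (False, True), (False, False)]:
    stored = sr_core(set_bit, reset_bit, stored)
    print(stored)   # True, True (still remembered), False, False
```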
As is everything else which can actually be built.