How big a deal is this? What, if anything, does it signal about when we get smarter-than-human AI?
It shows that Monte-Carlo tree search meshes remarkably well with neural-network-driven evaluation (“value networks”) and decision pruning/policy selection (“policy networks”). This means that if you have a planning task to which MCTS can be usefully applied, sufficient data to train networks for state evaluation and policy selection, and substantial computational power (a distributed cluster, in AlphaGo’s case), you can significantly improve performance on your task (from “strong amateur” to “human champion” level). It’s not an AGI-complete result, however, any more than Deep Blue or TD-Gammon were AGI-complete.
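To make the combination concrete, here is a minimal sketch (in Python, not AlphaGo’s actual code) of how a policy network’s move priors and a value network’s position evaluations would plug into MCTS node selection; the `policy_net` interface and the constants are illustrative assumptions, not the published system.

```python
# Minimal sketch of MCTS guided by a policy prior and a value estimate (PUCT-style
# selection). The policy_net interface below is a hypothetical stand-in, not AlphaGo's.
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a): move probability from the policy network
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # W(s, a): accumulated value-network evaluations
        self.children = {}        # action -> Node

    def q(self):
        # Mean evaluation of this move so far
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.0):
    """Pick the child maximising Q + U, where U favours high-prior, rarely visited moves."""
    total_visits = sum(child.visits for child in node.children.values())
    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        return child.q() + u
    return max(node.children.items(), key=score)

def expand(node, state, policy_net):
    """Create children from the policy network's (move, prior) pairs -- hypothetical interface."""
    for action, prior in policy_net(state):
        node.children[action] = Node(prior)

def backup(path, leaf_value):
    """Propagate the value network's leaf evaluation up the visited path."""
    for node in path:
        node.visits += 1
        node.value_sum += leaf_value
```

The point is only that the two networks replace hand-written evaluation and move-ordering heuristics inside an otherwise standard search loop.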
The “training data” factor is a biggie; we lack this kind of data entirely for things like automated theorem proving, which would otherwise be quite amenable to this ‘planning search + complex learned heuristics’ approach. In particular, writing provably correct computer code is a minor variation on automated theorem proving. (Neural networks can already write incorrect code, but this is not good enough if you want a provably Friendly AGI.)

Humans need extensive training to become competent, as will AGI, and this should have been obvious to anyone with a good understanding of ML.
The interesting thing about that code-writing RNN you linked is that it shouldn’t work at all. It was just given text files of code and told to predict the next character. It wasn’t taught how to program, it never got to see an interpreter, it doesn’t know any English yet has to work with English variable names, and it only has a few hundred neurons to represent its entire knowledge state.
The fact that it is even able to produce legible code is amazing, and suggests that we might not be that far off from NNs that can write actually usable code. Still some ways away, but not multiple decades.
The fact that it is even able to produce legible code is amazing
Somewhat. Look at what happens when you generate code from a simple character-level Markov language model (that’s just a look-up table that gives the probability of the next character conditioned on the last n characters, estimated by frequency counts on the training corpus).
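For concreteness, here is a minimal sketch of that kind of look-up-table model: count how often each character follows each length-n context in the training text, then sample from those frequencies (file names and parameter values are just placeholders).

```python
# Minimal sketch of an order-n character-level Markov model estimated by frequency counts.
import random
from collections import Counter, defaultdict

def train(corpus, n):
    """Look-up table: context of n characters -> counts of the next character."""
    counts = defaultdict(Counter)
    for i in range(len(corpus) - n):
        counts[corpus[i:i + n]][corpus[i + n]] += 1
    return counts

def generate(counts, n, length=500, seed=None):
    """Sample text by repeatedly drawing the next character from the counts table."""
    context = seed or random.choice(list(counts))
    out = context
    for _ in range(length):
        dist = counts.get(context)
        if not dist:              # context never seen in training: the table is silent
            break
        chars, freqs = zip(*dist.items())
        out += random.choices(chars, weights=freqs)[0]
        context = out[-n:]
    return out

# e.g. (placeholder file name): model = train(open("corpus.txt").read(), 20); print(generate(model, 20))
```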
An order-20 language model generates fairly legible code, with sensible use of keywords, identifier names and even comments. The main difference with the RNN language model is that the RNN learns to do proper indentation and bracket matching, while the Markov model can’t do it except at short range.
While, as remarked by Yoav Goldberg, it is impressive that the RNN could learn to do this, learning to match brackets and indent blocks seems very far from learning to write correct and purposeful code.
Anyway, this code generation example is pretty much a stunt, not a very interesting task. If you gave the Linux kernel source code to a human who has never programmed and doesn’t speak English and asked them to write something that looks like it, I doubt that they would be able to do much better.
Better examples of code generation using NNs (actually, log-bilinear models) or Bayesian models exist (ref, ref). In these works syntactic correctness is already guaranteed and the ML model only focuses on semantics.
The difference with Markov models is that they tend to overfit at that level. At 20 characters of context, you are just copying and pasting large sections of existing code and text, not generating genuinely unseen samples. You can do a similar thing with RNNs by training them on only one document: they will be able to reproduce that document exactly, but nothing else.
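One rough way to check the copy-and-paste claim (assuming the count-table sketch above) is to measure how many fixed-size windows of a generated sample occur verbatim in the training corpus:

```python
# Rough memorisation check: what fraction of fixed-size windows of a generated
# sample appear verbatim somewhere in the training corpus?
def copied_fraction(sample, corpus, window=40):
    spans = [sample[i:i + window] for i in range(0, len(sample) - window + 1, window)]
    return sum(span in corpus for span in spans) / max(len(spans), 1)
```

A heavily memorising order-20 model should score close to 1 here; a model that is actually generalising should score noticeably lower.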
To properly compare with a Markov model, you’d need to first tune it so it doesn’t overfit. That is, when it’s looking at an entirely unseen document, its guess of what the next character should be is most likely to be correct. The best setting for that is probably only 3-5 characters, not 20. And when you generate from that, the output will be much less legible. (And even that’s kind of cheating, since Markov models can’t give any prediction for contexts they’ve never seen before.)
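That tuning could be done, for instance, by scoring each order n on held-out text by how often the table’s most likely next character is actually correct, counting unseen contexts as misses (this reuses the train() sketch above; the variable names are placeholders):

```python
# Score an order-n counts table by next-character prediction accuracy on held-out text.
# Contexts never seen in training give no prediction at all, so they count as misses.
def heldout_accuracy(counts, n, heldout):
    hits, total = 0, 0
    for i in range(len(heldout) - n):
        dist = counts.get(heldout[i:i + n])
        total += 1
        if dist:
            hits += max(dist, key=dist.get) == heldout[i + n]
    return hits / total if total else 0.0

# e.g. best_n = max(range(1, 21), key=lambda n: heldout_accuracy(train(train_text, n), n, heldout_text))
```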
Generating samples is just a way to see what patterns the RNN has learned. And while it’s far from perfect, it’s still pretty impressive. It’s learned a lot about syntax, a lot about variable names, a lot about common programming idioms, and it’s even learned some English from just code comments.
The best setting for that is probably only 3-5 characters, not 20.
In NLP applications where Markov language models are used, such as speech recognition and machine translation, the typical setting is 3 to 5 words. 20 characters correspond to about 4 English words, which is in this range.
Anyway, I agree that in this case the order-20 Markov model seems to overfit (Googling some lines from the snippets in the post often locates them in an original source file, which doesn’t happen as often with the RNN snippets). This may be due to the lack of regularization (“smoothing”) in the probability estimation and the relatively small size of the training corpus: 474 MB versus the >10 GB corpora which are typically used in NLP applications. Neural networks need lots of data, but still less than plain look-up tables.
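The “smoothing” in question can be as simple as additive (Laplace) smoothing: give every character a small pseudo-count in every context, so that sparse counts don’t produce all-or-nothing probabilities. A sketch, again on top of the counts table above (the value of alpha is an arbitrary illustrative choice):

```python
# Additive (Laplace) smoothing for the counts table: each character gets a small
# pseudo-count alpha, so probabilities never collapse to 0 or 1 on sparse contexts.
def smoothed_prob(counts, context, char, alphabet, alpha=0.1):
    dist = counts.get(context, {})
    total = sum(dist.values())
    return (dist.get(char, 0) + alpha) / (total + alpha * len(alphabet))
```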