The fact that it is even able to produce legible code is amazing
Somewhat. Look at what happens when you generate code from a simple character-level Markov language model (that’s just a look-up table that gives the probability of the next character conditioned on the last n characters, estimated by frequency counts on the training corpus).
An order-20 language model generates fairly legible code, with sensible use of keywords, identifier names and even comments. The main difference from the RNN language model is that the RNN learns to do proper indentation and bracket matching, while the Markov model can’t do it except at short range.
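To make that concrete, here is a minimal sketch of such a character-level order-n Markov model in Python (the corpus file name and the order-20 setting in the usage comment are just placeholders): build a look-up table of next-character counts for every n-character context, then sample from it.

```python
import random
from collections import defaultdict, Counter

def train(corpus: str, n: int) -> dict:
    """Count next-character frequencies for every n-character context."""
    table = defaultdict(Counter)
    for i in range(len(corpus) - n):
        context = corpus[i:i + n]
        table[context][corpus[i + n]] += 1
    return table

def generate(table: dict, seed: str, n: int, length: int = 500) -> str:
    """Sample one character at a time, conditioning on the last n characters."""
    out = seed
    for _ in range(length):
        counts = table.get(out[-n:])
        if counts is None:  # unseen context: a plain look-up table has no prediction
            break
        chars, weights = zip(*counts.items())
        out += random.choices(chars, weights=weights)[0]
    return out

# Hypothetical usage, e.g. an order-20 model trained on kernel source:
# code = open("kernel_sources.txt").read()
# model = train(code, n=20)
# print(generate(model, seed=code[:20], n=20))
```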
While, as remarked by Yoav Goldberg, it is impressive that the RNN could learn to do this, learning to match brackets and indent blocks seems very far from learning to write correct and purposeful code.
Anyway, this code generation example is pretty much a stunt, not a very interesting task. If you gave the Linux kernel source code to a human who has never programmed and doesn’t speak English and asked them to write something that looks like it, I doubt that they would be able to do much better.
Better examples of code generation using NNs (actually, log-bilinear models) or Bayesian models exist (ref, ref). In these works syntactic correctness is already guaranteed and the ML model only focuses on semantics.
The difference with Markov models is that they tend to overfit at that depth. At 20 characters of context, you are just copying and pasting large sections of existing code and language, not generating entirely unseen samples. You can do a similar thing with RNNs by training them on only one document. They will be able to reproduce that document exactly, but nothing else.
To properly compare with a Markov model, you’d need to first tune it so it doesn’t overfit. That is, when it’s looking at an entirely unseen document, its guess of what the next character should be is most likely to be correct. The best setting for that is probably only 3-5 characters, not 20. And when you generate from that, the output will be much less legible. (And even that’s kind of cheating, since Markov models can’t give any prediction for sequences they have never seen before.)
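As a rough illustration of that kind of tuning (reusing the train/generate sketch above; the variable names and candidate orders are hypothetical), you could pick the order by measuring next-character prediction accuracy on a held-out document:

```python
def heldout_accuracy(table: dict, heldout: str, n: int) -> float:
    """Fraction of held-out positions where the model's most likely next
    character is the actual next character (unseen contexts count as wrong)."""
    correct = total = 0
    for i in range(len(heldout) - n):
        counts = table.get(heldout[i:i + n])
        total += 1
        if counts and counts.most_common(1)[0][0] == heldout[i + n]:
            correct += 1
    return correct / total if total else 0.0

# Hypothetical tuning loop: keep the order that predicts unseen text best.
# for n in (3, 5, 10, 20):
#     model = train(train_text, n)
#     print(n, heldout_accuracy(model, heldout_text, n))
```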
Generating samples is just a way to see what patterns the RNN has learned. And while it’s far from perfect, it’s still pretty impressive. It’s learned a lot about syntax, a lot about variable names, a lot about common programming idioms, and it’s even learned some English from just code comments.
The best setting for that is probably only 3-5 characters, not 20.
In NLP applications where Markov language models are used, such as speech recognition and machine translation, the typical setting is 3 to 5 words. 20 characters correspond to about 4 English words (the average English word is roughly 5 characters, counting the following space), which is in this range.
Anyway, I agree that in this case the order-20 Markov model seems to overfit (Googling some lines from the snippets in the post often locates them in an original source file, which doesn’t happen as often with the RNN snippets). This may be due to the lack of regularization (“smoothing”) in the probability estimation and the relatively small size of the training corpus: 474 MB versus the >10 GB corpora which are typically used in NLP applications. Neural networks need lots of data, but still less than plain look-up tables.
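For illustration, the simplest such smoothing is add-k (“Laplace”) smoothing, which reserves some probability mass for characters never seen after a given context. This is just the generic textbook recipe applied to the sketch above, not necessarily the fix the original experiment needed:

```python
def smoothed_prob(table: dict, context: str, char: str,
                  k: float = 1.0, vocab_size: int = 256) -> float:
    """Add-k smoothed estimate of P(char | context):
    (count + k) / (total + k * |vocabulary|), so unseen events keep nonzero mass."""
    counts = table.get(context, {})
    total = sum(counts.values())
    return (counts.get(char, 0) + k) / (total + k * vocab_size)
```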