The difference with Markov models is that they tend to overfit at that level. At 20 characters deep, you are just copying and pasting large sections of existing code and language, not generating entirely unseen samples. You can do a similar thing with RNNs by training them on only one document: they will be able to reproduce that document exactly, but nothing else.
To properly compare with a Markov model, you’d need to first tune it so it doesn’t overfit. That is, when it’s looking at an entirely unseen document, its guess of what the next character should be is most likely to be correct. The best setting for that is probably only 3-5 characters, not 20. And when you generate from that, the output will be much less legible. (And even that’s kind of cheating, since a Markov model can’t give any prediction for sequences it has never seen before.)
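To make that failure mode concrete, here is a minimal sketch of an order-k character Markov model (an illustration only, not the code behind either experiment; the toy corpus and order are made up for the example). Once the context is something the training data never covered, the look-up table simply has no entry:

```python
from collections import defaultdict, Counter

def train_char_markov(text, order):
    """Count next-character frequencies for every length-`order` context."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context][text[i + order]] += 1
    return model

def predict_next(model, context):
    """Most likely next character, or None for a context never seen in training."""
    counts = model.get(context)
    if counts is None:
        return None  # a plain Markov model has nothing to say about unseen contexts
    return counts.most_common(1)[0][0]

# Toy corpus and order, purely for illustration.
corpus = "for (int i = 0; i < n; i++) { sum += a[i]; }"
model = train_char_markov(corpus, order=3)
print(predict_next(model, "for"))  # seen context   -> ' '
print(predict_next(model, "xyz"))  # unseen context -> None
```

At order 20, almost every context in a fresh document is unseen, so the model either fails outright or regurgitates the one continuation it memorized.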
Generating samples is just a way to see what patterns the RNN has learned. And while it’s far from perfect, it’s still pretty impressive. It’s learned a lot about syntax, a lot about variable names, a lot about common programming idioms, and it’s even learned some English from just code comments.
> The best setting for that is probably only 3-5 characters, not 20.
In NLP applications where Markov language models are used, such as speech recognition and machine translation, the typical setting is 3 to 5 words. 20 characters correspond to roughly 4 English words (at an average of about 5 characters per word, counting the space), which is within this range.
Anyway, I agree that in this case the order-20 Markov model seems to overfit (Googling some lines from the snippets in the post often locates them in an original source file, which doesn’t happen as often with the RNN snippets). This may be due to the lack of regularization (“smoothing”) in the probability estimation and the relatively small size of the training corpus: 474 MB versus the >10 GB corpora which are typically used in NLP applications. Neural networks need lots of data, but still less than plain look-up tables.
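The simplest form of that smoothing is add-alpha (Laplace) estimation, which reserves a little probability mass for continuations that were never observed. Production NLP systems use more sophisticated schemes (Kneser-Ney and friends), but a sketch of the basic idea, with made-up counts and vocabulary size, looks like this:

```python
from collections import Counter

def smoothed_prob(counts, vocab_size, char, alpha=1.0):
    """Add-alpha (Laplace) smoothed estimate of P(char | context).

    `counts` holds the observed next-character counts for one context.
    Every character gets a pseudo-count of `alpha`, so no continuation
    has probability exactly zero, even after a rarely seen context.
    """
    total = sum(counts.values()) + alpha * vocab_size
    return (counts[char] + alpha) / total

# Unsmoothed, 'x' after this context would get probability 0;
# smoothed, it gets a small but nonzero share of the mass.
next_chars = Counter({"a": 8, "b": 2})
print(smoothed_prob(next_chars, vocab_size=96, char="a"))  # 9/106  ~ 0.085
print(smoothed_prob(next_chars, vocab_size=96, char="x"))  # 1/106  ~ 0.009
```

Without something like this, an unsmoothed high-order model assigns all of its probability to the exact continuations it saw in training, which is precisely the memorization behavior described above.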