I’m currently a post-doc doing language technology/NLP type stuff. I’m considering quitting soon to work full time on a start-up. I’m working on three things at the moment.
The start-up is a language learning web app: http://www.cloze.it . What sets it apart from other language-learning software is my knowledge of linguistics, proficiency with text processing, and willingness to code detailed language-specific features. Most tools want to be as language neutral as possible, which limits their scope a lot. So they tend to all have the same set of features, centred around learning basic vocab.
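For context, a cloze exercise just blanks out a target word in a sentence for the learner to fill in. A minimal sketch of the idea (illustrative only, not the app’s actual code):

```python
def make_cloze(sentence, target):
    """Blank out every occurrence of `target` in the sentence,
    keeping the first letter as a hint for the learner."""
    blank = target[0] + "_" * (len(target) - 1)
    return " ".join(blank if w == target else w for w in sentence.split())
```

A language-specific version of this would need real tokenization and morphology rather than naive whitespace splitting, which is exactly the kind of per-language work most tools avoid.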
Something that’s always bugged me about being an academic is that we’re terrible at communicating with people outside our field. This means that whenever I see a post using an NLP tool, they’re using a crap tool. So I wrote a blog post explaining a simple POS tagger that was better than the stuff in e.g. nltk (nltk is crap): http://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/ The POS tagger post has gotten over 15k views (mostly from reddit), so I’m writing a follow-up about a concise parser implementation. The parser is 500 lines, including the tagger, and faster and more accurate than the Stanford parser (the Stanford parser is also crap).
I’m doing minor revisions for a journal article on parsing conversational speech transcripts, and detecting disfluent words. The system gets good results when run on text transcripts. The goal is to allow speech recognition systems to produce better transcripts, with punctuation added and stutters etc. removed. I’m also working on a follow-up paper to that one, with further experiments.
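As a toy illustration of the kind of clean-up involved (nothing like the actual statistical system in the paper, which treats this as a structured prediction problem; this is just a heuristic sketch):

```python
# Filled pauses a crude clean-up pass might drop.
FILLERS = {"um", "uh", "er", "ah"}

def cleanup_transcript(text):
    """Crude disfluency clean-up: drop filled pauses and collapse
    immediate word repetitions ("I I went" -> "I went").
    A real system uses a trained model over the whole sequence;
    this heuristic is only a toy."""
    words = [w for w in text.split() if w.lower().strip(",.") not in FILLERS]
    out = []
    for w in words:
        if out and w.lower() == out[-1].lower():
            continue  # collapse a stutter repetition
        out.append(w)
    return " ".join(out)
```

The hard cases, of course, are restarts and repairs ("I went, I mean, we went"), which is why this needs a model rather than a regex.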
Overall the research is going well, and I find it very engaging. But I’m at the point where I have to start writing grant applications, and selling software seems like a much better expected-value bet.
Something that’s always bugged me about being an academic is that we’re terrible at communicating with people outside our field. This means that whenever I see a post using an NLP tool, they’re using a crap tool.
Why is that, do you think? This doesn’t seem to be the case in the ML community as far as I can judge (though I’m not an expert). What’s special about NLP? What prevents the nltk people from doing what you did?
In ML, everyone is engaging with the academics, and the academics are doing a great job of making the field accessible, e.g. through MOOCs. ML is one of the most popular targets of “ongoing education”, because it’s popped up and it’s a useful feather to have in your cap. It greatly extends the range of programs you can write. Many people realise that, and are doing what it takes to learn. So even if there are some rough spots in the curriculum, the learners are motivated, and the job gets done.
The cousin of language processing is computer vision. The problem we have as academics is that we need to communicate current best-of-breed solutions to software engineers, while also communicating underlying principles to our students and to each other.
If you look at nltk, it’s really a tool for teaching our grad students. And yet it’s become a software engineering tool-of-choice, when it should never have been billed as industrial strength at all. Check out the results in my blog post:
NLTK POS tagger: 94% accuracy, 236s
My tagger: 96.8% accuracy, 12s
Both are pure Python implementations. I do no special tricks; I just keep things tight and simple, and don’t pay costs from integrating into a large framework.
The problem is that the NLTK tagger is part of a complicated class hierarchy that includes a dictionary-lookup tagger, etc. These are useful systems to explain the problem to a grad student, but shouldn’t be given to a software engineer who wants to get something done.
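To show the shape of the alternative: the tagger in the post is built on an averaged perceptron. Here is a minimal sketch of that core update rule (class and feature names are mine, not the post’s; the real tagger adds context features, a tag dictionary, and more):

```python
from collections import defaultdict

class AveragedPerceptron:
    """Toy averaged perceptron: one weight per (feature, tag), plus
    running totals so that after training, each weight is replaced by
    its average value over all updates -- the trick that makes such a
    simple model accurate."""

    def __init__(self, tags):
        self.tags = tags
        self.weights = defaultdict(lambda: defaultdict(float))
        self._totals = defaultdict(lambda: defaultdict(float))
        self._stamps = defaultdict(lambda: defaultdict(int))
        self.i = 0  # number of examples seen

    def predict(self, features):
        # Score each tag by summing the weights of the active features.
        scores = {tag: 0.0 for tag in self.tags}
        for feat in features:
            for tag, w in self.weights[feat].items():
                scores[tag] += w
        return max(self.tags, key=lambda t: (scores[t], t))

    def update(self, truth, guess, features):
        self.i += 1
        if truth == guess:
            return
        for feat in features:
            for tag, delta in ((truth, 1.0), (guess, -1.0)):
                # Lazily accumulate the running total before changing
                # a weight, so averaging stays O(updates) not O(i).
                self._totals[feat][tag] += (
                    (self.i - self._stamps[feat][tag]) * self.weights[feat][tag]
                )
                self._stamps[feat][tag] = self.i
                self.weights[feat][tag] += delta

    def average(self):
        # Replace each weight by its average over all examples seen.
        for feat, weights in self.weights.items():
            for tag, w in weights.items():
                total = self._totals[feat][tag] + (self.i - self._stamps[feat][tag]) * w
                self.weights[feat][tag] = total / self.i
```

That’s the whole learning machinery: predict, and nudge weights toward the truth and away from the mistake. There’s no class hierarchy to wade through.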
There’s no reason why we can’t have a software package that just gets it done. Which is why I’m writing one :). The key difference is that I’ll be shipping one POS tagger, one parser, etc. The best one! If another algorithm comes out on top, I’ll rip out the old one and put the current best one in.
That’s the real difference between ML and NLP or computer vision. In NLP, we really really should be telling people, “just use this one”. In ML, we need to describe a toolbox.
TIL: NLP can mean Natural Language Processing, as well as Neuro Linguistic Programming. I was confused for a while there.