IBM’s “Watson” program to compete against “Jeopardy” champions tonight
It was mentioned before on LessWrong, but I feel people might appreciate a reminder:
http://www-03.ibm.com/innovation/us/watson/what-is-watson/countdown-to-jeopardy.html
It’s a bit of a cheesy PR thing; I’d be a lot more interested if they connected the program to the Internet and allowed anyone to ask it general questions, rather than mixing the program with voice recognition and (heh) buzzer-pushing. Trivia tests are also probably among the easier challenges to tackle, since keyword filtering alone is very efficient at narrowing down the candidate space.
Still, I’m going to watch it if I can: if anybody knows of a streaming link that is accessible to non-US viewers, that would be appreciated.
(Silly aside: is anyone else annoyed by how “Jeopardy” pretends to invert the traditional question-answer format, when all it actually does is move the “what is” from the former to the latter, even if the result makes no sense? I suppose to US viewers this is a rather old complaint, but I only learnt about the show today and I’m rather bugged by this feature.)
Actually, Watson receives the prompts as text (source).
Because the category titles are usually fairly complex puns, Watson is built to infer the theme of a category from the questions as it goes along. The humans would benefit from starting at the highest dollar value and working towards the lowest, so as to maximize the advantage of their superior pun skills. I wonder if they’ll do that.
IBM also sets a threshold on certainty for Watson to answer a question. The machine is perfectly capable of ‘howlers’, answers that are completely unrelated to the question. These would embarrass it if that threshold is too low, and possibly frighten clients of whatever technology this evolves into. Put the threshold too high, and it might be too cautious to win. I’m sure the tech geeks are mostly interested in winning, but the PR department might be interested in playing with style.
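To make the tradeoff concrete, here’s a toy sketch (my own illustration, nothing from IBM): only buzz in when the top candidate’s confidence clears the cutoff.

```python
def should_buzz(candidates, threshold=0.5):
    """candidates: list of (answer, confidence) pairs, confidence in [0, 1]."""
    best_answer, best_conf = max(candidates, key=lambda c: c[1])
    return best_conf >= threshold, best_answer, best_conf

# A cautious cutoff stays quiet on a shaky 13%-confidence answer:
print(should_buzz([("Toronto", 0.13), ("Chicago", 0.11)]))
# (False, 'Toronto', 0.13)
```

Raise the threshold and the howlers disappear, but so do a lot of correct answers; the interesting engineering is in picking that number.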
From what I saw, it seems they figured out that that was their best bet (somehow) fairly quickly. Once Watson lost control, the other two lost very little time in going for the big points.
Watching the first episode, there was an interlude where they had snippets of interviews with the creators. They happened to mention that the machine learnt within categories, and when play resumed the humans immediately switched to picking the most valuable categories first. I was impressed by that. It was good to see they were thinking strategically, and were listening closely.
Yeah, originally Jeopardy! (which should always be rendered with the exclamation point, btw) was Merv Griffin’s idea for a “guess the question” TV show, which eventually degenerated into a regular trivia quiz show with weird phrasing.
At this point though, Jeopardy! so dominates American quiz show consciousness that contestants on other quiz shows often accidentally phrase their answers as a question.
I don’t know about streaming, but you’ll probably be able to find a torrent of the episode by tomorrow at the latest.
Followup: I was able to attend a panel discussion tonight with several members of the team working on Watson. (My university is hosting panels for all three nights, as many of the team members were once students here. See watson.rpi.edu for recordings of the panel discussions.)
I spoke with one person from IBM after the episode aired, and confirmed that Watson is programmed with statistics from every Jeopardy episode. That allows it to search for the daily double efficiently, and afterward it does specifically turn to the lowest-point questions in each category in order to learn in which it might do best. It also employs game theory to determine how much to bet on daily doubles and final jeopardy, and which questions to pick when it has control.
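As an aside, a confidence-driven wagering rule could be as simple as the following toy (my own guess at the flavor; Watson’s actual model reportedly simulates game states and opponent behavior):

```python
def wager(bankroll, confidence):
    # Bet more the further confidence rises above a coin flip.
    edge = max(0.0, confidence - 0.5) * 2.0   # 0 at 50%, 1 at 100%
    return int(bankroll * edge)

print(wager(10000, 0.90))  # 8000: very confident, bet big
print(wager(10000, 0.55))  # 1000: barely confident, bet small
```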
They explained to us why Watson missed the final jeopardy in tonight’s game. The category was “U.S. Cities” and the answer was “Its largest airport was named for a World War II hero; its second for a World War II battle.” Watson learned that the category names don’t always strictly imply the answer type, so it didn’t consider that to be a strong indicator. It recognized that the clue was in two parts, but the second part was missing the noun and verb from the first, so Watson couldn’t really get anything from it. Toronto’s largest airport is named after a WWII vet, and there are cities named Toronto in the U.S. We were told that its confidence on Toronto was ~13%, and its second choice was Chicago (the correct answer) with a confidence of ~11%.
We were also told that its confidences are very well calibrated, so that, e.g., it will be right on average 9 out of 10 of the times that it displays 90% confidence.
The confidences are supposed to be probabilities? But they often summed to > 100%.
Or is it “the procedure for generating the confidences is such that it’ll be well calibrated for the highest ranking answer”?
No, sorry, that should say confidences everywhere, not probabilities. I had written it out incorrectly and then edited it, but I missed that one. Fixed now.
What I meant was “for the top three answers, the confidences would sometimes sum to > 100%, so how does that work?”
Is the procedure defined as well calibrated only for the top answer, or is there something I’m missing?
The confidence level compares the answer to other answers Watson’s given in the past, based on how much the answer is supported by the evidence Watson has and uses. All the answers are generated and scored in parallel. It’s not a comparison among the answers generated for a specific question, so it shouldn’t necessarily add up to 100.
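In other words (the details here are my assumption, but this is how I understood the explanation), each confidence is an independent function of that answer’s own evidence, with no normalization across candidates:

```python
import math

def confidence(evidence_score):
    return 1.0 / (1.0 + math.exp(-evidence_score))  # logistic squash

# Each candidate is scored in isolation against its own evidence:
scores = {"Toronto": 0.4, "Chicago": 0.2, "Denver": -1.0}
confs = {ans: confidence(s) for ans, s in scores.items()}
print(confs)
print(sum(confs.values()))  # ~1.42: not a probability distribution
```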
Quote from Chris Welty at last night’s panel: “When [Watson] says ‘this is my answer, 50% sure,’ half the time he’s right about that, and half the time he’s wrong. When he says 80%, 20% of the time he’s wrong.”
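Operationally, “well calibrated” can be checked like this (my own sketch, with made-up data): bucket past answers by stated confidence and compare against the actual hit rate.

```python
from collections import defaultdict

def calibration(history):
    """history: list of (stated_confidence, was_correct) pairs."""
    buckets = defaultdict(list)
    for conf, correct in history:
        buckets[round(conf, 1)].append(correct)
    return {b: sum(hits) / len(hits) for b, hits in sorted(buckets.items())}

history = [(0.8, True)] * 8 + [(0.8, False)] * 2 + [(0.5, True), (0.5, False)]
print(calibration(history))  # {0.5: 0.5, 0.8: 0.8} -- matches the quote
```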
Ah, thanks.
My observations on tonight’s competition:
The way Watson “sniped” the daily double immediately upon taking control of the board was very interesting to me. I suspect that it was programmed with a statistical distribution of past daily doubles (this site strongly suggests that their placement is not random). Otherwise, its behavior seems inconsistent, given that afterward it went after the 200-point question in each category, presumably in order to get a better read on the qualities of each category.
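If that suspicion is right, the “sniping” could be as simple as applying a placement prior over the board; a toy version with made-up numbers (see the linked site for real statistics):

```python
# Hypothetical P(daily double in row), rows indexed top (cheapest) to bottom:
ROW_PRIOR = [0.01, 0.09, 0.26, 0.39, 0.25]

def likeliest_dd_square(unrevealed):
    """unrevealed: iterable of (row, col) squares still on the board."""
    return max(unrevealed, key=lambda sq: ROW_PRIOR[sq[0]])

board = [(r, c) for r in range(5) for c in range(6)]
print(likeliest_dd_square(board))  # (3, 0): heads straight for the fourth row
```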
Its lack of audio input led it to repeat a wrong answer, which was one (probably predictable) flaw. It had trouble with the decades and alternate word meanings, which isn’t surprising given the explanation that its algorithms are based on word association. Also, it needs to read more Harry Potter.
This roughly maps to the issues I noticed. Looking forward to the next 2 days of this.
There is going to be a media frenzy after this; I expect some spillover to SIAI. Hope it’s well used.
It seems to me that the media, and even LessWrong, are being rather overly impressed by, in effect, heuristic database lookups. A trivia quiz is surely a textbook example of playing to the strengths of the computer; and even so the humans at least had a fighting chance. Can someone convince me that this is more impressive than it looks? Isn’t it just a case of building a big-enough database with associative keyword nets, which was already well underway in the eighties and turned out to be a dead end?
Recall why those seemed to be a dead end. Two major reasons were (as I understand it) that the databases had to be massive to be useful, and that there was no easy way of adding facts and relations without human intervention. Watson helps show directly that the first is no longer as big a deal, and it will become even less of a problem as computers improve. The second is also less of a problem now, since Watson can take large data dumps and then develop associations more or less by itself (at least, that’s how it was explained to me).
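The flavor of “developing associations itself” can be as crude as co-occurrence counting over a raw text dump (my own toy illustration, not Watson’s actual pipeline):

```python
from collections import Counter
from itertools import combinations

def cooccurrences(sentences):
    counts = Counter()
    for s in sentences:
        words = sorted(set(s.lower().split()))
        counts.update(combinations(words, 2))  # every word pair in the sentence
    return counts

dump = ["Pearson airport serves Toronto", "O'Hare airport serves Chicago"]
print(cooccurrences(dump).most_common(1))
# [(('airport', 'serves'), 2)]: associations fall out of raw text, no human tagging
```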
The episode is on Youtube (before they take it down) here: Part 1, Part 2
A somewhat vague paper describing the system: http://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf
No voice recognition, and the mechanical buzzer pushing thing was kinda silly, but still, not bad. Certainly at least a decent step in natural language processing, right?
I know this is not at all anything like general AI (though I gather from the descriptions that there’s at least some form of reinforcement learning going on when it gets stuff wrong; I may be wrong on that, though), but still, I feel at least a bit impressed. (I wish there’d been auditory speech processing too, instead of receiving text, but...)
EDIT: either way, it’s still just plain cool! :)
It’s substantially better than other question answering systems, so in that sense yes. On the other hand, it’s probably still movement towards a local maximum, rather than a robust, general strategy.
Robust, general natural language processing requires a proper grammar and parser, to get back an actionable semantic representation. That’s the kind of language processing module you’ll be able to reuse in lots of different applications.
Watson probably isn’t doing that. The system probably cherry-picks keywords, and uses a statistical classifier to predict the category of answer expected. Maybe a syntactic parse is used to help find clues, but I doubt it’s the main method.
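Roughly this shape, I’d guess (a deliberately crude sketch of the strategy I’m describing, not Watson’s actual architecture):

```python
STOPWORDS = {"the", "a", "an", "of", "for", "its", "was", "is", "in"}

def keywords(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def rank(clue, evidence):
    """evidence: dict mapping candidate answer -> associated text."""
    kw = keywords(clue)
    return sorted(evidence, key=lambda ans: -len(kw & keywords(evidence[ans])))

evidence = {
    "Toronto": "Pearson airport largest airport in Canada",
    "Chicago": "O'Hare airport named for World War II hero Butch O'Hare",
}
clue = "Its largest airport was named for a World War II hero"
print(rank(clue, evidence))  # ['Chicago', 'Toronto']
```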
I did my PhD on statistical parsing and am continuing to work on it as a post-doc. We’re getting better, but it’s still usually less practical than an ad hoc strategy for any given task. That doesn’t mean Watson isn’t impressive, of course. Watson shows us what can be done right now, and apparently what can be done is pretty damn sweet.
Huh, thanks.
Though it’s doing more than just individual keyword stuff. I think one major point is that it’s looking at context (i.e., I think it’s supposed to have at least a basic ability to deal with puns and such).
Also, I think it is set up to learn the theme of a category if it’s not initially sure (via associated questions and answers), and to use that info to get an idea of what types of answers are being sought in a particular category.
Even if it’s not parsing, even if it’s just doing keyword analysis rather than any analysis of grammar, it’s going way beyond judging the keywords individually. (Not to mention, it’s parsing enough to at least figure out which words to use for its keyword search, I think.)
Do you think that Watson is anywhere near the local maximum associated with the strategies you think are being used by that system, incidentally?
Having looked through their overview paper, I’m no longer sure. They do have modules that do parsing and semantic role labelling and such. But their model is a mixture of dozens of individual models. So it’s tough to say much about how things are fitting together. They use more sophisticated techniques than I thought, although I don’t know how much contribution those techniques actually make in the final decision.
Thanks for looking at the paper and passing on the info about what’s actually going on inside it, btw.
I’ll confess that I still have a hard time shaking the intuition that this is how general AI will be arrived at, if and when it is: a bunch of things like this, more impressive with each generation, until it gradually occurs to us over the course of a few decades or centuries that our computers can do everything we can do.
That describes the notion of non-sentient actors that adapt perfectly to the human living environment. A somewhat terrifying idea.
There is a “Jeopardy!” practice match video (with “Watson”).
My only exposure to the “Jeopardy” show was indirectly via, I believe, “The Nanny”. Maxwell’s resentful muttering of “Andrew Lloyd Webber” got a negative response. Maxwell was looking hopeful until Fran told him the correct answer was “WHO IS Andrew Lloyd Webber?” From this I deduced that a) Jeopardy is lame and makes you answer with questions and b) Maxwell is funny. :P