Followup: I was able to attend a panel discussion tonight with several members of the team working on Watson. (My university is hosting panels for all three nights, as many of the team members were once students here. See watson.rpi.edu for recordings of the panel discussions.)
I spoke with one person from IBM after the episode aired, and confirmed that Watson is programmed with statistics from every Jeopardy episode. That allows it to search for the Daily Doubles efficiently, and afterward it deliberately turns to the lowest-value clues in each category to learn which categories it's likely to do best in. It also employs game theory to determine how much to bet on Daily Doubles and Final Jeopardy, and which clues to pick when it has control of the board.
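For a sense of what "search for the Daily Doubles efficiently" could mean in practice, here's a toy sketch (mine, not IBM's) of picking clues from prior placement statistics. The row frequencies below are made-up placeholders, not real Jeopardy data:

    # Toy sketch of Daily Double hunting from prior placement statistics.
    # The numbers are placeholders, NOT real Jeopardy data.

    # Prior probability that a Daily Double sits in each row (1 = cheapest clue).
    ROW_PRIOR = {1: 0.02, 2: 0.10, 3: 0.28, 4: 0.38, 5: 0.22}

    def next_pick(unrevealed):
        """Pick the unrevealed (category, row) cell most likely to hide a Daily Double.

        `unrevealed` is an iterable of (category_index, row) tuples still on the board.
        """
        return max(unrevealed, key=lambda cell: ROW_PRIOR[cell[1]])

    board = [(c, r) for c in range(6) for r in range(1, 6)]
    print(next_pick(board))  # a row-4 clue, the most likely spot under this toy prior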
They explained to us why Watson missed the Final Jeopardy in tonight’s game. The category was “U.S. Cities” and the clue was “Its largest airport was named for a World War II hero; its second for a World War II battle.” Watson has learned that category names don’t always strictly imply the answer type, so it didn’t treat the category as a strong indicator. It recognized that the clue was in two parts, but the second part was missing the noun and verb from the first, so Watson couldn’t really get anything from it. Toronto’s largest airport is named after a war hero (Lester Pearson), and there are cities named Toronto in the U.S. We were told that its confidence in Toronto was ~13%, and its second choice was Chicago (the correct answer) with a confidence of ~11%.
We were also told that its confidences are very well calibrated, so that, e.g., it will be right on average 9 out of 10 of the times that it displays 90% confidence.
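Calibration in that sense is something you can check empirically: bucket the answers by displayed confidence and see whether the accuracy in each bucket matches. A quick sketch of that check in Python (my own illustration with fake data, nothing from the panel):

    from collections import defaultdict

    def calibration_table(predictions):
        """Group (confidence, was_correct) pairs into 10% bins and report accuracy per bin.

        For a well-calibrated system, accuracy in each bin should roughly match
        the bin's confidence (e.g. ~90% correct among answers shown at ~90%).
        """
        bins = defaultdict(list)
        for confidence, correct in predictions:
            bins[min(int(confidence * 10), 9) / 10].append(correct)
        return {b: sum(hits) / len(hits) for b, hits in sorted(bins.items())}

    # Fake data: (displayed confidence, whether the answer turned out to be right).
    history = [(0.92, True), (0.95, True), (0.88, False), (0.55, True), (0.50, False)]
    print(calibration_table(history))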
The confidences are supposed to be probabilities? But they often summed to > 100%
Or is it “the procedure for generating the confidences is such that it’ll be well calibrated for the highest ranking answer”?
No, sorry, that should say confidences everywhere, not probabilities. I had written it out incorrectly and then edited it, but I missed that one. Fixed now.
What I meant was “for the top three answers, the confidences would sometimes sum to > 100, so how does that work?”
Is the procedure defined as well calibrated only for the top answer, or is there something I’m missing?
The confidence level compares the answer to other answers Watson has given in the past, based on how strongly the answer is supported by the evidence Watson has and uses. All the candidate answers are generated and scored in parallel. It’s not a comparison among the answers generated for a specific question, so the confidences needn’t add up to 100%.
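In other words, each candidate answer’s score gets mapped to a confidence independently, by comparison against historically similar answers, rather than being normalized across the candidates for one clue. A toy illustration of why the numbers can then sum past 100% (my own sketch, not Watson’s actual scoring):

    # Each candidate's raw evidence score is mapped to a confidence via a
    # calibration curve fit on PAST answers, independently of the other
    # candidates for this clue, so nothing forces the confidences to sum to 1.

    # Pretend calibration curve: (score threshold, historical accuracy of past
    # answers whose score fell in that band). Made-up numbers.
    CALIBRATION = [(0.0, 0.05), (0.3, 0.20), (0.5, 0.45), (0.7, 0.70), (0.9, 0.90)]

    def calibrated_confidence(raw_score):
        """Map a raw evidence score to the accuracy of past answers with similar scores."""
        conf = CALIBRATION[0][1]
        for threshold, accuracy in CALIBRATION:
            if raw_score >= threshold:
                conf = accuracy
        return conf

    candidates = {"candidate A": 0.75, "candidate B": 0.72}  # made-up raw scores
    confidences = {a: calibrated_confidence(s) for a, s in candidates.items()}
    print(confidences, "sum =", sum(confidences.values()))  # 0.7 + 0.7 = 1.4 > 1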
Quote from Chris Welty at last night’s panel: “When [Watson] says ‘this is my answer, 50% sure,’ half the time he’s right about that, and half the time he’s wrong. When he says 80%, 20% of the time he’s wrong.”
Ah, thanks.