How fast can Watson buzz in?

Within three months, he had been joined by new Watson staffers, mostly technologists in the fields of natural language processing and machine learning. Healthcare had already been suggested as the first industry Watson should target for commercial offerings, but there were no plans to confine it just to medicine.

Any information-intensive industry was fair game: anywhere there were huge volumes of unstructured and semi-structured data that Watson could ingest, understand and process quicker than its human counterparts.

Healthcare might be a starting point, but banking, insurance, and telecoms were all in the firing line. But how do you turn a quiz-show winner into something more business-like? The first job for the Watson team was to get to grips with the machine it had inherited from IBM Research, understand the 41 separate subsystems that went into Watson, and work out what needed to be fixed up before Watson could put on its suit and tie.

In the Watson unit's first year, the system was sped up and slimmed down. A system that had been the size of a master bedroom now runs in one the size of the vegetable drawer in a double-drawer refrigerator. Another way of looking at it: a single Power server, measuring nine inches high, 18 inches wide and 36 inches deep. Having got the system down to a more manageable size for businesses, IBM set about finding customers to take it on. IBM had healthcare pegged as Watson's first vertical from the time of the Jeopardy win.

However, while Jeopardy Watson and healthcare Watson share a common heritage, they're distinct entities: IBM forked the Watson code for its commercial incarnation. Jeopardy Watson had one task: get an answer, understand it, and find the question that went with it. It was a single-user system; had three quizmasters put three answers to it at once, it would have been thrown into a spin. Watson had to be retooled for a scenario where tens, hundreds, however many clinicians would be asking questions at once, and not single questions either: complex conversations with several related queries one after the other, all asked in non-standard formats.

And, of course, there was the English language itself, with all its messy complexity. What we inherited was the core engine, and we said 'Okay, let's build a new thing that does all sorts of things the original Jeopardy system wasn't required to do'. To get Watson from Jeopardy to oncology, the Watson team went through three processes: content adaptation, training adaptation, and functional adaptation. Or, to put it another way: feeding it medical information and having it weighted appropriately; testing it out with some practice questions; then making any technical adjustments needed, tweaking taxonomies, for example.

The content adaptation for healthcare followed the same path as getting Watson up to speed for the quiz show: feed it information, show it what right looks like, then let it guess what right looks like and correct it if it's wrong.

In Jeopardy, that meant feeding it with thousands of question and answer pairs from the show, and then demonstrating what a right response looked like. Then it was given just the answers, and asked to come up with the questions. When it went wrong, it was corrected. Through machine learning, it would begin to get a handle on this answer-question thing, and modify its algorithms accordingly.
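
To make the mechanics concrete, here is a minimal sketch of that guess-and-correct loop. The features, numbers and use of scikit-learn are illustrative assumptions, not anything IBM has published:

```python
# A minimal sketch (not IBM's code) of the train/guess/correct loop described
# above: the system proposes a response, it is compared against the known
# right response, and the model's weights are nudged accordingly.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")  # probabilistic, supports online updates
classes = np.array([0, 1])              # 0 = wrong candidate, 1 = right candidate

# Each training example: an invented feature vector for one candidate response
# (e.g. search rank, type match, passage support) plus whether it was right.
stream = [
    (np.array([[0.9, 1.0, 0.8]]), np.array([1])),
    (np.array([[0.4, 0.0, 0.2]]), np.array([0])),
    (np.array([[0.7, 1.0, 0.6]]), np.array([1])),
    (np.array([[0.2, 0.0, 0.1]]), np.array([0])),
]

for features, label in stream:
    model.partial_fit(features, label, classes=classes)  # "correct it if it's wrong"

# After training, the model's probability acts as the confidence in a new guess.
print(model.predict_proba(np.array([[0.8, 1.0, 0.7]]))[0, 1])
```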

Some data came from what IBM describes as a Jeopardy-like game called Doctor's Dilemma, whose questions include 'the syndrome characterized by joint pain, abdominal pain, palpable purpura, and a nephritic sediment'. The training, says Kohn, "is an ongoing process, and Watson is rapidly improving its ability to make reasonable recommendations the oncologists think are helpful." By then, two healthcare organisations had started piloting Watson. WellPoint, one of the biggest insurers in the US, was one of the pair of companies that helped define the application of Watson in health.

And it was this relationship that helped spur Watson's first commercial move: working in the field of cancer therapies. While using Watson as a diagnostic tool might be its most obvious application in healthcare, using it to assist in choosing the right therapy for a cancer patient made even more sense. Memorial Sloan Kettering Cancer Center (MSKCC) was a tertiary referral centre; by the time patients arrived, they already had their diagnosis. So Watson was destined first to be an oncologist's assistant, digesting reams of data - MSKCC's own, medical journals, articles, patients' notes and more - along with patients' preferences to come up with suggestions for treatment options.

Each option would be weighted accordingly, depending on how relevant Watson calculated it to be. Unlike its Jeopardy counterpart, healthcare Watson also has the ability to go online: not all its data has to be stored locally. And while Watson had some two million pages of medical data to swallow, it could still make use of the general knowledge garnered for Jeopardy (details from Wikipedia, for example). What it doesn't use, however, is the Urban Dictionary. In healthcare, doctors want to understand what sources Watson consulted in the literature and what connections it made between, say, a medical test and a therapy: where did that information come from, and why did Watson conclude that?

In other areas such as drug discovery it's the same thing: the biologist or chemist wants to understand why Watson concluded that such-and-such a molecule would be the next big drug. They want to explore the data before making a decision. How does the latest Watson compare to its game-show predecessor? It is markedly faster and smaller. Some of that is through hardware optimization, but a lot of it is also tuning the underlying machine-learning algorithms to make them much more efficient.

That lets us run Watson on a much smaller, yet more powerful, system than we did two years ago. What will the personal versions of cognitive systems look like? There are a couple of ways cognitive systems will interact on a personal level.

These cognitive systems will also be important in networks of people. In social media, one could envision a cognitive system as just another node in your personal network, one that enters into discussions between you and other people and offers its opinion on certain issues based on the available information. Is it possible to shrink such cognitive systems down to a size you could wear or carry around? Watson has shrunk considerably in two years. Fast-forward another few years, and the power of the Watson that played on Jeopardy! won't need to be anywhere near you at all.

You just need to be able to reach it electronically to get all of the power of Watson. Cognitive systems at the beginning of this era of computing were about the size of a room.

The Jeopardy! system was no exception. The human brain uses about 20 watts of power; it took thousands of times more energy for that computer to win just that one game than it took a human brain. Every night, all three contestants have passed a very hard test to be there. Ergo, nearly all the contestants know nearly all the answers nearly all the time. So it just comes down to buzzer mojo. Which is why Watson won so handily. Wait a minute, what?

Did Jennings just say that the computer won based on a "physical" advantage, specifically sounding a buzzer faster than a human is capable of? Watson, created by IBM, is, in Jennings' own words, supposed to be a "giant leap forward in the field of natural-language processing," much as Ask Jeeves was billed about ten years ago. So do its light-speed buzzer abilities give it an unfair advantage?

Jennings seemed to reference it again in answering a question about losing to another contestant, Brad Rutter. Is the Jeopardy! buzzer really that decisive? According to Jennings' website, it is. As can be seen in the graph, if a question-answering system were to answer the 50 percent of questions it had highest confidence for, it would get 80 percent of those correct. We refer to this level of performance as 80 percent precision at 50 percent answered.

The lower line represents a system without meaningful confidence estimation. Since it cannot distinguish between which questions it is more or less likely to get correct, its precision is constant for all percent attempted.
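
For readers who want the metric pinned down, here is a small illustrative sketch of "precision at N percent answered"; the toy data is invented:

```python
# A small sketch of the metric described above: sort question attempts by
# confidence, keep the top X percent, and measure precision on that subset.
def precision_at_percent_answered(results, percent):
    """results: list of (confidence, was_correct) pairs, one per question."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    k = max(1, int(len(ranked) * percent / 100))
    attempted = ranked[:k]
    return sum(1 for _, correct in attempted if correct) / k

# Toy run: a system with useful confidence estimates answers its most
# confident half of the questions at much higher precision than it would
# achieve by answering everything.
toy = [(0.95, True), (0.9, True), (0.85, True), (0.8, True),
       (0.6, False), (0.5, True), (0.4, False), (0.3, False),
       (0.2, False), (0.1, False)]
print(precision_at_percent_answered(toy, 50))   # 0.8 on the top half
print(precision_at_percent_answered(toy, 100))  # 0.5 if it answers everything
```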

Developing more accurate confidence estimation means a system can deliver far higher precision even with the same overall accuracy. Figure 2 shows precision versus percentage attempted for perfect confidence estimation (upper line) and no confidence estimation (lower line). A compelling and scientifically appealing aspect of the Jeopardy Challenge is the human reference point. Figure 3 contains a graph that illustrates expert human performance on Jeopardy. It is based on our analysis of a large number of historical Jeopardy games.

Each point on the graph represents the performance of the winner in one Jeopardy game. In contrast to the system evaluation shown in figure 2, which can display a curve over a range of confidence thresholds, the human performance shows only a single point per game based on the observed precision and percent answered the winner demonstrated in the game.

A further distinction is that in these historical games the human contestants did not have the liberty to answer all questions they wished.

Rather the percent answered consists of those questions for which the winner was confident and fast enough to beat the competition to the buzz.

The system performance graphs shown in this paper are focused on evaluating QA performance, and so do not take into account competition for the buzz. Ken Jennings had an unequaled winning streak in 2004, in which he won 74 games in a row. Based on our analysis of those games, he acquired on average 62 percent of the questions and answered with 92 percent precision.

Human performance at this task sets a very high bar for precision, confidence, speed, and breadth. Our metrics and baselines are intended to give us confidence that new methods and algorithms are improving the system, or to inform us when they are not, so that we can adjust research priorities. A requirement of the Jeopardy Challenge is that the system be self-contained and not link to live web search. The most relevant prior baseline is the TREC question-answering evaluation, developed in part under the U.S. government's AQUAINT program.

Most notably, TREC participants were given a relatively small corpus (1M documents) from which answers to questions had to be justified; TREC questions were in a much simpler form than Jeopardy questions; and the confidences associated with answers were not a primary metric.

Furthermore, the systems were allowed to access the web and had a week to produce results for the full question set. The reader can find details in the TREC proceedings and numerous follow-on publications. The experiment focused on precision and confidence. It ignored issues of answering speed and aspects of the game like betting and clue values. The questions used were randomly sampled Jeopardy clues from episodes aired in the past 15 years.

The corpus that was used contained, but did not necessarily justify, answers to more than 90 percent of the questions.

In our experiments on TREC data, OpenEphyra answered 45 percent of the questions correctly using a live web search. It did not produce reliable confidence estimates and thus could not effectively choose to answer questions with higher confidence. Clearly, the precision and confidence estimation were far below the requirements of the Jeopardy Challenge, and a larger investment in tuning and adapting these baseline systems to Jeopardy would improve their performance; however, we limited this investment since we did not want the baseline systems to become significant efforts.

In figure 5 we show two other baselines that demonstrate the performance of two complementary approaches on this task. The light gray line shows the performance of a system based purely on text search, using terms in the question as queries and search engine scores as confidences for candidate answers generated from retrieved document titles.

The black line shows the performance of a system based on structured data, which attempts to look the answer up in a database by simply finding the named entities in the database related to the named entities in the clue.

These two approaches were adapted to the Jeopardy task, including identifying and integrating relevant content. The results form an interesting comparison. The search-based system has better performance at 100 percent answered, suggesting that the natural language content and the shallow text search techniques delivered better coverage. However, the flatness of the curve indicates the lack of accurate confidence estimation.
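
As a rough illustration of the text-search baseline idea, the following sketch uses a clue's terms as a query, retrieves the best-matching titled snippet, and treats the retrieval score as the confidence. The documents, titles and use of scikit-learn's TF-IDF utilities are illustrative assumptions, not the actual baseline system:

```python
# Sketch of the search baseline: the title of the best-matching snippet plays
# the role of the candidate answer, and the cosine score becomes its confidence.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["Abraham Lincoln", "George Washington", "Theodore Roosevelt"]
bodies = [
    "16th president, led the Union through the Civil War, issued the Emancipation Proclamation",
    "first president of the United States, commanded the Continental Army",
    "26th president, known for the Square Deal and trust busting",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(bodies)

clue = "He issued the Emancipation Proclamation during the Civil War"
scores = cosine_similarity(vectorizer.transform([clue]), doc_matrix)[0]

best = scores.argmax()
print(f"candidate: {titles[best]}, confidence (search score): {scores[best]:.2f}")
```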

To be a high-performing question-answering system, DeepQA must demonstrate both of these properties, achieving high precision and high recall together with accurate confidence estimation. We devoted many months of effort to encoding algorithms from the literature. Our investigations ran the gamut from deep logical-form analysis to shallow machine-translation-based approaches.

We integrated them into the standard QA pipeline that went from question analysis and answer type determination to search and then answer selection. It was difficult, however, to find examples of how published research results could be taken out of their original context and effectively replicated and integrated into different end-to-end systems to produce comparable results. Our efforts failed to have significant impact on Jeopardy or even on prior baseline studies using TREC data.
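
The shape of that standard pipeline can be sketched in a few lines; every stage below is a placeholder stub rather than a real component:

```python
# A schematic sketch of the "standard" pipeline named above, showing only the
# flow: question analysis -> answer type detection -> search -> answer selection.
def analyze_question(question: str) -> dict:
    return {"text": question, "keywords": question.lower().split()}

def detect_answer_type(analysis: dict) -> str:
    # e.g. "person", "country", "date"; a real system uses parses and classifiers
    return "person" if "who" in analysis["keywords"] else "thing"

def search(analysis: dict) -> list[str]:
    # stand-in for passage and document retrieval
    return ["candidate A", "candidate B"]

def select_answer(candidates: list[str], answer_type: str) -> str:
    # stand-in for candidate scoring and ranking
    return candidates[0]

def answer(question: str) -> str:
    analysis = analyze_question(question)
    answer_type = detect_answer_type(analysis)
    candidates = search(analysis)
    return select_answer(candidates, answer_type)

print(answer("Who wrote Moby-Dick?"))
```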

We ended up overhauling nearly everything we did, including our basic technical approach, the underlying architecture, metrics, evaluation protocols, engineering practices, and even how we worked together as a team. As our results dramatically improved, we observed that system-level advances allowing rapid integration and evaluation of new ideas and new components against end-to-end metrics were essential to our progress. OAQA (Open Advancement of Question Answering) is intended to directly engage researchers in the community to help replicate and reuse research results and to identify how to more rapidly advance the state of the art in QA (Ferrucci et al.).

The Jeopardy Challenge was described earlier as one addressing dimensions including high precision, accurate confidence determination, complex language, breadth of domain, and speed. The system we have built and are continuing to develop, called DeepQA, is a massively parallel probabilistic evidence-based architecture.

For the Jeopardy Challenge, we use more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses. What is far more important than any particular technique we use is how we combine them in DeepQA such that overlapping approaches can bring their strengths to bear and contribute to improvements in accuracy, confidence, or speed.

DeepQA is an architecture with an accompanying methodology, but it is not specific to the Jeopardy Challenge. We have begun adapting it to different business applications and additional exploratory challenge problems including medicine, enterprise search, and gaming. The overarching principles in DeepQA are massive parallelism, many experts, pervasive confidence estimation, and integration of shallow and deep knowledge. Massive parallelism: Exploit massive parallelism in the consideration of multiple interpretations and hypotheses.

Many experts: Facilitate the integration, application, and contextual evaluation of a wide range of loosely coupled probabilistic question and content analytics. Pervasive confidence estimation: No component commits to an answer; all components produce features and associated confidences, scoring different question and content interpretations.

An underlying confidence-processing substrate learns how to stack and combine the scores. Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies. Figure 6 illustrates the DeepQA architecture at a very high level.
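
A toy sketch of how such a substrate might stack component scores, assuming invented component names and a simple logistic-regression combiner in place of DeepQA's actual machinery:

```python
# Each component contributes feature scores rather than a final verdict, and a
# learned combiner stacks them into a single confidence per hypothesis.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: candidate answers for past questions.
# Columns: scores from hypothetical components, e.g.
# [type_coercion, passage_support, popularity, temporal_consistency]
component_scores = np.array([
    [0.9, 0.8, 0.6, 1.0],
    [0.1, 0.3, 0.9, 0.0],
    [0.8, 0.7, 0.2, 1.0],
    [0.3, 0.2, 0.5, 0.0],
])
was_correct = np.array([1, 0, 1, 0])

combiner = LogisticRegression().fit(component_scores, was_correct)

# At run time, every hypothesis gets a merged confidence; no individual
# component ever "commits" to an answer on its own.
new_hypothesis = np.array([[0.85, 0.75, 0.4, 1.0]])
print(combiner.predict_proba(new_hypothesis)[0, 1])
```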

The remaining parts of this section provide a bit more detail about the various architectural roles. The first step in any application of DeepQA to solve a QA problem is content acquisition, or identifying and gathering the content to use for the answer and evidence sources shown in figure 6.

Content acquisition is a combination of manual and automatic steps. The first step is to analyze example questions from the problem space to produce a description of the kinds of questions that must be answered and a characterization of the application domain. Analyzing example questions is primarily a manual task, while domain analysis may be informed by automatic or statistical analyses, such as the LAT (lexical answer type) analysis shown in figure 1. Given the kinds of questions and broad domain of the Jeopardy Challenge, the sources for Watson include a wide range of encyclopedias, dictionaries, thesauri, newswire articles, literary works, and so on.

Given a reasonable baseline corpus, DeepQA then applies an automatic corpus expansion process. The process involves four high-level steps: (1) identify seed documents and retrieve related documents from the web; (2) extract self-contained text nuggets from the related web documents; (3) score the nuggets based on whether they are informative with respect to the original seed document; and (4) merge the most informative nuggets into the expanded corpus.
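
A schematic sketch of those four steps, with web retrieval stubbed out and a naive lexical-overlap score standing in for the real informativeness scoring:

```python
# Illustrative corpus-expansion skeleton; every helper is a simplified stand-in.
def retrieve_related_documents(seed_title: str) -> list[str]:
    # placeholder for step 1's web retrieval
    return ["Some page text about " + seed_title + " ...", "Another related page ..."]

def extract_nuggets(document: str) -> list[str]:
    # step 2: split into self-contained text nuggets (here: naive sentence split)
    return [s.strip() for s in document.split(".") if s.strip()]

def score_nugget(nugget: str, seed_text: str) -> float:
    # step 3: how informative is the nugget with respect to the seed document?
    seed_words = set(seed_text.lower().split())
    nugget_words = set(nugget.lower().split())
    return len(seed_words & nugget_words) / max(1, len(nugget_words))

def expand_corpus(seed_title: str, seed_text: str, threshold: float = 0.2) -> list[str]:
    expanded = [seed_text]
    for doc in retrieve_related_documents(seed_title):        # step 1
        for nugget in extract_nuggets(doc):                   # step 2
            if score_nugget(nugget, seed_text) >= threshold:  # step 3
                expanded.append(nugget)                       # step 4: merge
    return expanded

print(len(expand_corpus("Abraham Lincoln", "Abraham Lincoln was the 16th president ...")))
```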

The live system itself uses this expanded corpus and does not have access to the web during play. In addition to the content for the answer and evidence sources, DeepQA leverages other kinds of semistructured and structured content.

Another step in the content-acquisition process is to identify and collect these resources, which include databases, taxonomies, and ontologies, such as DBpedia, WordNet (Miller 1995), and the YAGO ontology. The first step in the run-time question-answering process is question analysis. During question analysis the system attempts to understand what the question is asking and performs the initial analyses that determine how the question will be processed by the rest of the system. The DeepQA approach encourages a mixture of experts at this stage, and in the Watson system we produce shallow parses, deep parses (McCord 1990), logical forms, semantic role labels, coreference, relations, named entities, and so on, as well as specific kinds of analysis for question answering.

Most of these technologies are well understood and are not discussed here, but a few require some elaboration. Question classification is the task of identifying question types or parts of questions that require special processing. This can include anything from single words with potentially double meanings to entire clauses that have certain syntactic, semantic, or rhetorical functionality that may inform downstream components with their analysis.

Question classification may identify a question as a puzzle question, a math question, a definition question, and so on. It will identify puns, constraints, definition components, or entire subclues within questions. As discussed earlier, a lexical answer type is a word or noun phrase in the question that specifies the type of the answer without any attempt to understand its semantics. Determining whether or not a candidate answer can be considered an instance of the LAT is an important kind of scoring and a common source of critical errors.

An advantage of the DeepQA approach is that it can exploit many independently developed answer-typing algorithms. However, many of these algorithms are dependent on their own type systems. We found the best way to integrate preexisting components is not to force them into a single, common type system, but to have them map from the LAT to their own internal types.
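
The mapping idea can be illustrated with a small sketch in which two hypothetical typing components translate the LAT into their own internal types and each contribute a score; the type names, mappings and scores are invented:

```python
# Rather than one common type system, each typer maps the LAT into its own
# internal types and scores the candidate there; downstream ranking combines
# the resulting features.
WORDNET_STYLE_MAP = {"president": "person", "city": "location", "novel": "creative_work"}
DB_STYLE_MAP = {"president": "Politician", "city": "PopulatedPlace", "novel": "WrittenWork"}

def wordnet_style_typer(lat: str, candidate: str) -> float:
    internal = WORDNET_STYLE_MAP.get(lat)
    # a real component would check the candidate against its taxonomy;
    # here we just return a stand-in score when a mapping exists
    return 0.9 if internal else 0.0

def db_style_typer(lat: str, candidate: str) -> float:
    internal = DB_STYLE_MAP.get(lat)
    return 0.8 if internal else 0.0

def type_match_features(lat: str, candidate: str) -> dict[str, float]:
    return {
        "wordnet_style_score": wordnet_style_typer(lat, candidate),
        "db_style_score": db_style_typer(lat, candidate),
    }

print(type_match_features("president", "Abraham Lincoln"))
```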

The focus of the question is the part of the question that, if replaced by the answer, makes the question a stand-alone statement.

Most questions contain relations, whether they are syntactic subject-verb-object predicates or semantic relationships between entities. Watson uses relation detection throughout the QA process, from focus and LAT determination, to passage and answer scoring. Watson can also use detected relations to query a triple store and directly generate candidate answers. In Jeopardy the broad domain makes it difficult to identify the most lucrative relations to detect.
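
As a toy illustration of that candidate-generation path, the sketch below looks up the missing argument of a detected relation in a tiny in-memory triple store standing in for a real one built from sources such as DBpedia:

```python
# Detect a relation and its known argument in the clue, then look the missing
# argument up in a triple store; the triples and matching here are invented.
TRIPLES = [
    ("Abraham Lincoln", "authorOf", "Gettysburg Address"),
    ("Herman Melville", "authorOf", "Moby-Dick"),
    ("Harper Lee", "authorOf", "To Kill a Mockingbird"),
]

def candidates_from_triples(relation: str, obj: str) -> list[str]:
    """Return subjects s such that (s, relation, obj) is in the store."""
    return [s for s, p, o in TRIPLES if p == relation and o == obj]

# Clue: "This author wrote Moby-Dick" -> detected relation authorOf(?, "Moby-Dick")
print(candidates_from_triples("authorOf", "Moby-Dick"))  # ['Herman Melville']
```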

In 20,000 Jeopardy questions, for example, we found the distribution of Freebase relations to be extremely flat (figure 7). Roughly speaking, even achieving high recall on detecting the most frequent relations in the domain can at best help in about 25 percent of the questions, and the benefit of relation detection drops off fast with the less frequent relations.

Broad-domain relation detection remains a major open area of research. Figure 7 shows the distribution of Freebase relations across Jeopardy clues. As discussed above, an important requirement driven by analysis of Jeopardy clues was the ability to handle questions that are better answered through decomposition.

DeepQA uses rule-based deep parsing and statistical classification methods both to recognize whether questions should be decomposed and to determine how best to break them up into subquestions. The operating hypothesis is that the correct question interpretation and derived answer(s) will score higher after all the collected evidence and all the relevant algorithms have been considered.
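
A very rough sketch of that recognize-and-split step, using a crude rule in place of DeepQA's parses and classifiers:

```python
# A rule-based check decides whether a clue looks decomposable, and a naive
# splitter breaks it into subclues whose answers could be scored independently
# and recombined. This is only a shape-of-the-idea illustration.
import re

def looks_decomposable(clue: str) -> bool:
    # crude stand-in for the rule-based / statistical recognizer
    return bool(re.search(r"\band\b|;", clue, flags=re.IGNORECASE))

def decompose(clue: str) -> list[str]:
    if not looks_decomposable(clue):
        return [clue]
    return [part.strip() for part in re.split(r"\band\b|;", clue, flags=re.IGNORECASE) if part.strip()]

clue = "This state borders Canada and was the last of the 48 contiguous states admitted to the Union"
for sub in decompose(clue):
    print(sub)
```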
