David Ferrucci: The Story of IBM Watson Winning in Jeopardy | AI Podcast Clips
4Hx15WVxvII • 2019-10-12
So one of the greatest accomplishments in the history of AI is Watson competing in a game of Jeopardy against humans, and you were a lead in that, a critical part of that. So let's start with the very basics: what is the game of Jeopardy? The game for us humans, human versus human.

Right, so it's to take a question and answer it. Actually, no, it's not. It's really to get a question and answer it, but it's what we call a factoid question. The notion is that the question really relates to some fact, and few people would argue about whether the fact is true or not. In fact, most of Jeopardy kind of counts on the idea that these statements have factual answers. And the idea is, first of all, to determine whether or not you know the answer, which is sort of an interesting twist.

So first of all, you have to understand the question: what is it asking? And that's a good point, because the questions are not asked directly, right?

The way the questions are asked is nonlinear. It's a little bit witty, a little bit playful; sometimes it's a little bit tricky.

Yeah, they're asked in numerous witty, tricky ways, exactly. What they're asking is not obvious. It takes even an experienced human a while to go, "what is it even asking?" It's sort of an interesting realization you have when someone says, "oh, Jeopardy is a question-answering show," and you go, "I know a lot," and then you read the clue, and you're still trying to process the question while the champions have answered and moved on. They're like three questions ahead by the time you've figured out what the question even meant. So there's definitely an ability there to just parse out what the question even is. That was certainly challenging.

It's interesting historically, though. If you look back at the Jeopardy games much earlier, you know, like the sixties, the questions were much more direct. They weren't quite like that. The way they were asked got more and more interesting, more subtle and nuanced and humorous and witty over time, which really required the human to make the right connections and figure out what the question was even asking.

So yeah, you have to figure out what the question is even asking, then you have to determine whether or not you think you know the answer. And because you have to buzz in really quickly, you sort of have to make that determination as quickly as you possibly can; otherwise you lose the opportunity to buzz in.

Maybe even before you really know if you know the answer.

I think a lot of humans do that. They'll process the question very superficially at first: what's the topic, what are some keywords, and just, "do I know this area or not?" before they actually know the answer. Then they'll buzz in and think about it. It's interesting what humans do. Now, some people who know all things, like Ken Jennings or something, or the more recent big Jeopardy player, I suppose they just assume they know all of Jeopardy and buzz in. Watson, interestingly, didn't even come close to knowing all of Jeopardy, right? Watson didn't, even at its peak.

For example, we had this thing called recall, which is: of all the Jeopardy questions, for how many could we even find the right answer, anywhere? Could we come up with it if we looked? We had a big body of knowledge, something on the order of several terabytes. From a web scale that was actually very small, but on a book scale we're talking about millions of books: encyclopedias, dictionaries, books, a ton of information. And I think for only about 85% of the questions was the answer anywhere to be found.
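To make that recall measurement concrete, here is a minimal Python sketch of measuring answer "containment": for what fraction of question/answer pairs does the gold answer appear anywhere in the collection? The documents, questions, and naive substring matching are all illustrative assumptions, not Watson's actual corpus or matching logic.

```python
# Hypothetical sketch: estimate the answer "containment recall" of a corpus,
# i.e., for how many question/answer pairs the gold answer text appears
# anywhere in the collection at all. Data and matching are toy assumptions.

corpus = [
    "Emily Dickinson was an American poet born in Amherst, Massachusetts.",
    "The Eiffel Tower is located in Paris, France.",
]

qa_pairs = [
    ("This reclusive Amherst poet wrote nearly 1,800 poems.", "Emily Dickinson"),
    ("This city is home to the Eiffel Tower.", "Paris"),
    ("This probe flew past Pluto in 2015.", "New Horizons"),  # not in corpus
]

def contains_answer(corpus, answer):
    """Crude containment test: does the answer string occur in any document?"""
    answer = answer.lower()
    return any(answer in doc.lower() for doc in corpus)

found = sum(contains_answer(corpus, ans) for _, ans in qa_pairs)
recall = found / len(qa_pairs)
print(f"containment recall: {recall:.0%}")  # 2 of 3 -> 67% here; Watson's corpus hit ~85%
```

No matter how good the downstream answering is, this containment number is a hard ceiling on end-to-end accuracy, which is why it mattered so much at the start.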
Hmm. So you're already down at that level just to get started, right?

Right. And so it was important to get a very quick sense of, "do you think you know the right answer to this question?" We had to compute that confidence as quickly as we possibly could. So in effect we had to answer it, or at least spend some time essentially answering it, then judge the confidence that our answer was right, and then decide whether or not we were confident enough to buzz in. And that would depend on what else was going on in the game, because it was a risk. If you're in a situation where you have to take a guess and have very little to lose, then you'll buzz in with less confidence.

So that was accounting for the financial standings of the different competitors?

Correct: how much of the game was left, how much time was left, where you were in the standings, things like that.
How many hundreds of milliseconds are we talking about here? Do you have a sense of what the targets were?

Yeah, so we targeted answering in under three seconds, and buzzing in.

So the decision to buzz in and the actual answering, are those two different stages?

Yes, those were two different things. In fact, we had multiple stages: we would first estimate our confidence, which was sort of a shallow answering process, then ultimately decide to buzz in, and then we might take another second or so to go in there and refine that. But by and large we were saying: we can't play the game, we can't even compete, if we can't on average answer these questions in around three seconds or less.
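As a rough illustration of that staged decision, here is a small, hypothetical sketch of a buzz policy: a cheap confidence estimate is compared against a threshold that loosens when the player is behind. The thresholds and game-state rules are invented for illustration; the interview only establishes that the decision depended on confidence, risk, and the state of the game.

```python
# Hypothetical sketch of a buzz-in policy: a shallow confidence estimate is
# computed quickly, then compared to a risk threshold that depends on game
# state (trailing badly with little time left -> buzz on less evidence).
# All numbers here are invented for illustration.

def buzz_threshold(my_score: int, leader_score: int, clues_remaining: int) -> float:
    deficit = leader_score - my_score
    if clues_remaining <= 5 and deficit > 0:
        return 0.3   # desperate: little to lose, take a guess
    if deficit > 10000:
        return 0.4   # far behind: accept more risk
    return 0.6       # comfortable: only buzz when fairly sure

def should_buzz(shallow_confidence: float, my_score: int,
                leader_score: int, clues_remaining: int) -> bool:
    return shallow_confidence >= buzz_threshold(my_score, leader_score, clues_remaining)

print(should_buzz(0.45, my_score=8000, leader_score=20000, clues_remaining=12))   # True
print(should_buzz(0.45, my_score=19000, leader_score=20000, clues_remaining=12))  # False
```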
So there are these three humans playing a game, and you stepped in with the idea that IBM Watson would replace one of the humans and compete against the other two. Can you tell the story of Watson taking on this game? It seems exceptionally difficult.

Yeah, so the story was that it was coming up on, I think, the ten-year anniversary of Deep Blue. IBM wanted to do another kind of really fun, public challenge that could bring attention to IBM Research and the cool stuff we were doing. I had been working in AI at IBM for some time. I had a team doing what's called open-domain factoid question answering, which is: we're not going to tell you what the questions are, we're not even going to tell you what they're about; can you go off and get accurate answers to these questions? It was an area of AI research that I was involved in, and it was a very specific passion of mine. Language understanding had always been a passion of mine, and one narrow slice of whether or not you could do anything with language was this notion of open domain, meaning I could ask anything about anything, and factoid, meaning it essentially had an answer, and being able to do that accurately and quickly. So that was a research area my team had already been in.

And completely independently, several IBM execs were asking, "what are we going to do? What's the next cool thing to do?" Ken Jennings was on his winning streak, this was, whatever it was, 2004 I think, and someone thought, "hey, it would be really cool if the computer could play Jeopardy." So this was in 2004. They were shopping this thing around, and everyone was telling the research execs, "no way, this is crazy." We had some pretty senior people saying, "no, this is crazy." And it came across my desk, and I said, "but that's kind of what I'm really interested in doing." But there was such a prevailing sense of "this is nuts, we're not going to risk IBM's reputation on this, we're just not doing it." And this happened in 2004, it happened in 2005. At the end of 2006 it was coming around again. I was still doing the open-domain question-answering stuff, but I was coming off a couple of other projects and had a lot more time to put into this, and I argued that it could be done. I argued it would be crazy not to do this.

Can I ask: you can be honest at this point. Even though you argued for it, what was the confidence you had yourself, privately, that this could be done? We just told the story of how you tell stories to convince others. How confident were you? What was your estimation of the problem at that time?
So, I thought it was possible, and a lot of people thought it was impossible. I thought it was possible. The reason I thought it was possible was that I had done some brief experimentation. I knew a lot about how we were approaching open-domain factoid question answering; we'd been doing it for some years. I looked at the Jeopardy stuff and said, this is going to be hard, for a lot of the reasons you mentioned earlier: hard to interpret the question, hard to do it quickly enough, hard to compute an accurate confidence. None of this had been done well enough before, but a lot of the technologies we were building were the kinds of technologies that should work.

But more to the point, what was driving me was: I was at IBM Research, I was a senior leader in IBM Research, and this is the kind of stuff we were supposed to do. We were basically supposed to take the moonshot. We were supposed to take things and say, "this is an active research area; it's our obligation, if we have the opportunity, to push it to the limits, and if it doesn't work, to understand more deeply why we can't do it." And so I was very committed to that notion, saying: folks, this is what we do. It's crazy not to do this. This is an active research area; we've been in it for years. Why wouldn't we take this grand challenge and push it as hard as we can? At the very least we'd be able to come out and say, "here's why this problem is way hard, here's what we tried, and here's how we failed." So I was very driven as a scientist from that perspective.

And then I also argued, based on a feasibility study we did, why I thought it was hard but possible, and I showed examples of where it succeeded, where it failed, why it failed, and a sort of high-level architectural approach for why we should do it. But for the most part, at that point the execs really were just looking for someone crazy enough to say yes, because for several years everyone had said, "no, I'm not willing to risk my reputation and my career on this thing."

Clearly you did not have such fears.

I did not.

And yet, from what I understand, it was performing very poorly in the beginning. So what were the initial approaches, and why did they fail?
Well, there were lots of hard aspects to it. One of the reasons why prior approaches we had worked on in the past failed was that the questions were difficult to interpret: what are you even asking for? Very often, if the question was very direct, like "what city" or "what person," even then it could be tricky, but at least when it named the type very clearly, you would know it. And if there were just a small set of them, in other words, we're only going to ask about these five types (the answer will be a city in this state or in this country, or a person of this type, like an actor, or whatever it is), you could handle it. But it turns out that in Jeopardy there were tens of thousands of these answer types, and it was a very, very long tail, meaning it just went on and on. So even if you focused on encoding the types at the very top, say the five most frequent, you'd still cover only a very small percentage of the data. So you couldn't take the approach of saying, "I'm just going to collect facts about these five or ten or twenty or fifty types." That was one of the first problems: what do you do about that? We came up with an approach toward it, the approach looked promising, and we continued to improve our ability to handle that problem throughout the project.
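The long-tail point is easy to demonstrate numerically. In this toy Python sketch (the counts are invented, not the real Jeopardy distribution), covering only the most frequent answer types still captures a small share of questions:

```python
from collections import Counter

# Invented toy distribution of answer types; the real Jeopardy data had tens
# of thousands of distinct types with a very long tail.
type_counts = Counter({"city": 300, "person": 250, "country": 150,
                       "film": 90, "actor": 60})
# Simulate the long tail: thousands of rare types seen only once each.
for i in range(5000):
    type_counts[f"rare_type_{i}"] = 1

total = sum(type_counts.values())
top5 = sum(count for _, count in type_counts.most_common(5))
print(f"top 5 types cover {top5 / total:.0%} of questions")  # ~15% here
```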
The other issue was that, right from the outset, I committed to doing this in three to five years. We did it in four, so I got lucky. But one of the things about putting that stake in the ground was, and I knew how hard the language-understanding problem was, I said: we're not going to actually understand language to solve this problem. We are not going to interpret the question and the domain of knowledge the question refers to, and reason over that to answer these questions. Obviously we were not going to be doing that. At the same time, simple search wasn't good enough to confidently answer with a single correct answer.

First of all, that's brilliant. That's such a great mix of innovation and practical engineering, three to four years. So you're not trying to solve the general NLU problem; you're saying, let's solve this in any way possible.
Oh yeah, no. I was committed to saying, look, we're solving the open-domain question-answering problem; we're using Jeopardy as a driver for that.

Hard enough. A big benchmark.

Exactly. And we were going to do it however: just figure out what works. Because I wanted to be able to go back to the academic, the scientific community and say, "here's what we tried, here's what worked, here's what didn't work." I didn't want to go in and say, "oh, I only have one technology, I'm only going to use this." I was going to do whatever it takes, think out of the box, do whatever it takes.

One other thing: I believed that the fundamental NLP technologies and machine-learning technologies would be adequate, and that this was a question of how we enhance them, how we integrate them, how we advance them. I had one researcher, who had been working on question answering with me for a very long time, who came to me and said: if we need Maxwell's equations for question answering, if we need some fundamental formula that breaks new ground in how we understand language, we're screwed. We're not going to get there from here. My assumption was: I'm not counting on some brand-new invention. What I'm counting on is the ability to take everything that has been done before, figure out an architecture for how to integrate it well, and then see where it breaks and make the necessary advances until this thing works.

Yeah, push it hard to see where it breaks, and then patch it up. That's how people change the world. That's the Musk approach with the rockets at SpaceX, that's the Henry Ford approach, and so on.

And in this case I happened to be right. But we didn't know; you kind of have to put a stake in the ground for how you're going to run the project.
And backtracking to search: if you were to do the brute-force solution, what would you search over? You have a question; how would you search the possible space of answers?

Look, web search has come a long way, even since then. But at the time, first of all, there were a couple of other constraints around the problem, which is interesting. You couldn't go out to the web; you couldn't search the internet. In other words, the AI experiment was: we want a self-contained device. If the device is as big as a room, fine, it's as big as a room, but we want a self-contained device. You're not going out to the internet; you don't have a lifeline to anything. So it had to kind of fit in a shoebox, if you will, or at least be the size of a few refrigerators, whatever it might be. You couldn't go off-network. So there was that limitation.

But the basic thing was: just go do a web search. The problem was, even when we went and did a web search, and I don't remember the numbers exactly, but on the order of 65% of the time the answer would be somewhere in the top 10 or 20 documents. So first of all, that's not even good enough to play Jeopardy. In other words, even if you could perfectly pull the answer out of the top 20 documents, the top 10 documents, whatever it was, which we didn't know how to do, and even if you knew it was right and had enough confidence in it, so you'd have to pull out the right answer, have enough confidence it was the right answer, and do all of that fast enough to go buzz in, you'd still only get 65% of them right, and that doesn't even put you in the winner's circle. For the winner's circle you have to be up over 70%, and you have to do it really quickly. And now the problem is: even if the answer is somewhere in the top 10 documents, how do I figure out where in those documents the answer is, and how do I compute a confidence over all the possible candidates? It's not like I go in knowing the right answer and just have to pick it. I don't know the right answer. I have a bunch of documents, and somewhere in there is the right answer. How do I, as a machine, go out and figure out which one is right, and then how do I score it?
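The arithmetic behind that point is worth spelling out. In this tiny sketch, the 65% and 70% figures come from the conversation, while the perfect-extraction assumption is a deliberately generous stand-in:

```python
# Upper bound on end-to-end accuracy if you rely on search alone:
# you can never answer correctly more often than the answer appears
# in the retrieved documents, no matter how good extraction is.
search_recall_at_20 = 0.65   # answer somewhere in top 10-20 docs (from the interview)
extraction_accuracy = 1.0    # assume perfect answer extraction (generous)
confidence_accuracy = 1.0    # assume perfect confidence estimation (generous)

ceiling = search_recall_at_20 * extraction_accuracy * confidence_accuracy
needed_to_win = 0.70         # rough winner's-circle precision from the interview
print(f"best case: {ceiling:.0%}, needed: {needed_to_win:.0%}, "
      f"shortfall even with perfect extraction: {needed_to_win - ceiling:.0%}")
```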
And then, how do I deal with the fact that I can't actually go out to the web?

First of all, if you pause and just think about it: if you could go to the web, do you think that problem is solvable, even beyond Jeopardy? The problem of reading text to find where the answer is?

Well, we solved that, in some definition of "solved," given the Jeopardy challenge.

How did you do it? How did you take a body of work on a particular topic and extract the key pieces of information?

So, now forgetting about the huge volumes that are on the web: we had to figure out, and we did a lot of source research on this, what body of knowledge was going to be small enough but broad enough to answer Jeopardy. And we ultimately did find the body of knowledge that did that. It included Wikipedia and a bunch of other stuff: encyclopedia-type material, dictionaries, thesauri, different types of semantic resources like WordNet, as well as some web crawls. In other words, we went out and took that content and then expanded it: statistically producing seeds, using those seeds for other searches, and then expanding from there. Using these expansion techniques we went out and found enough content, and we were like, okay, this is good. We had a thread of research that was always trying to figure out what content we could efficiently include. There's a lot of popular culture, like "what is the Church Lady?" Well, I guess that's probably in an encyclopedia. But then we would take that stuff and expand it; in other words, we'd go find other content that wasn't in the core resources and add it. We grew the amount of content by an order of magnitude, but still, from a web-scale perspective, it was a very small amount of content, and it was very selective.
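Here is a minimal sketch of the seed-and-expand idea as described: extract statistically salient terms ("seeds") from core documents, use them as queries against a larger pool, and pull matching documents into the corpus. The salience scoring and retrieval below are toy stand-ins, not Watson's actual expansion algorithm.

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "of", "in", "is", "and", "was", "to"}

def seed_terms(document: str, k: int = 3) -> list[str]:
    """Pick the k most frequent non-stopword terms as expansion seeds (toy salience)."""
    words = [w for w in re.findall(r"[a-z]+", document.lower()) if w not in STOPWORDS]
    return [term for term, _ in Counter(words).most_common(k)]

def expand_corpus(core_doc: str, candidate_pool: list[str]) -> list[str]:
    """Add pool documents that match any seed term (toy retrieval step)."""
    seeds = seed_terms(core_doc)
    return [doc for doc in candidate_pool if any(s in doc.lower() for s in seeds)]

core = ("The Church Lady, the Church Lady character from Saturday Night Live, "
        "was played by Dana Carvey on Saturday Night Live.")
pool = [
    "Dana Carvey hosted Saturday Night Live several times after leaving the cast.",
    "Photosynthesis converts light energy into chemical energy in plants.",
]
print(expand_corpus(core, pool))  # keeps the related SNL document, drops the other
```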
We then took all that content and pre-analyzed the crap out of it, meaning we parsed it, broke it down into all the individual words, and then did syntactic and semantic parses on it; we had computer algorithms that annotated it, and we indexed all of that in a very rich and very fast index. So we had a relatively huge amount of content, let's say for the sake of argument the equivalent of two to five million books. We analyzed all of it, blowing up its size even more with all this metadata, and then we richly indexed all of that, and, by the way, in a giant in-memory cache: Watson did not go to disk.
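A toy sketch of that "pre-analyze once, serve from memory" design: documents are annotated at ingest time and stored in an in-memory inverted index, so query-time lookups return precomputed metadata without touching disk. The annotator and index structure are simplified assumptions.

```python
from collections import defaultdict

# Toy stand-in for Watson's rich annotators: here we just tokenize and
# record trivially computed "metadata" per document at ingest time.
def analyze(doc_id: int, text: str) -> dict:
    tokens = text.lower().split()
    return {"id": doc_id, "text": text, "tokens": tokens, "length": len(tokens)}

class InMemoryIndex:
    """Inverted index kept entirely in RAM; lookups never touch disk."""
    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of doc ids
        self.docs = {}                    # doc id -> precomputed analysis

    def add(self, doc_id: int, text: str):
        analysis = analyze(doc_id, text)  # pre-analysis happens once, at ingest
        self.docs[doc_id] = analysis
        for token in analysis["tokens"]:
            self.postings[token].add(doc_id)

    def lookup(self, token: str) -> list[dict]:
        # Returns documents with their metadata already attached.
        return [self.docs[d] for d in self.postings.get(token.lower(), ())]

index = InMemoryIndex()
index.add(1, "Emily Dickinson was an American poet")
index.add(2, "Paris is the capital of France")
print([d["id"] for d in index.lookup("poet")])  # [1]
```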
Speaking of the infrastructure component: how tough was it? I mean, this was maybe 2008, 2009, which is kind of a long time ago. How hard was it to use multiple machines? How hard were the infrastructure and the hardware components?

We used IBM hardware. We had something like, I forget exactly, 2,000, close to 3,000, cores, completely connected. So there was a switch where every CPU was connected to every other one.

And they were sharing memory in some kind of way?

A lot of clever shared memory, right. And all this data was pre-analyzed and put into a very fast indexing structure that was all in memory. Then we would take the question and analyze it. All the content was now pre-analyzed, so if I went and tried to find a piece of content, it would come back with all the metadata that we had pre-computed.

How do you connect the question to all that? How do you connect the big knowledge base of metadata and that index to the simple, little, witty, confusing question?
Right, so therein lies the Watson architecture. We would take the question and analyze it, which means we would parse it and interpret it a bunch of different ways. We'd try to figure out what it was asking about; we had multiple strategies to determine what it was asking for. That might be represented as a simple character string, or we would connect back to different semantic types in existing resources. Anyway, the bottom line is we would do a bunch of analysis on the question, and the question analysis had to finish, and had to finish fast. From the question analysis we would then produce searches. We had built, using open-source search engines that we modified, a number of different search engines with different characteristics. We went in there and engineered and modified those search engines, ultimately to take our question analysis, produce multiple queries based on different interpretations of the question, and fire off a whole bunch of searches in parallel. These were passage-search algorithms, so they would come back with passages. Each search would come back with a whole bunch of passages; maybe you had a total of a thousand, or five thousand, passages. For each passage you'd parallelize again: you'd go and figure out whether or not there was a candidate, what we called a candidate answer, in there. So you had a whole bunch of other algorithms that would find candidate answers, possible answers to the question. We called them candidate-answer generators, and there were a whole bunch of those.
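Putting those stages together, here is a skeletal sketch of the fan-out he describes: question analysis producing multiple queries, parallel passage searches, and candidate generation over each passage. Every function body is a trivial placeholder; only the shape of the pipeline follows the description.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_question(question: str) -> list[str]:
    """Placeholder question analysis: emit several query interpretations."""
    words = question.split()
    return [question, " ".join(words[1:]), " ".join(sorted(words))]

def passage_search(query: str) -> list[str]:
    """Placeholder passage-search engine: would return matching passages."""
    return [f"passage {i} for query [{query}]" for i in range(3)]

def generate_candidates(passage: str) -> list[str]:
    """Placeholder candidate-answer generator over one passage."""
    return [f"candidate from ({passage})"]

def answer_pipeline(question: str) -> list[str]:
    queries = analyze_question(question)  # fan-out 1: interpretations
    with ThreadPoolExecutor() as pool:
        # fan-out 2: fire all searches in parallel, flatten to passages
        passages = [p for plist in pool.map(passage_search, queries) for p in plist]
        # fan-out 3: candidate generation over every passage, also in parallel
        candidates = [c for clist in pool.map(generate_candidates, passages) for c in clist]
    return candidates

out = answer_pipeline("this poet wrote Because I could not stop for Death")
print(len(out), "candidate answers")  # 3 queries x 3 passages x 1 candidate = 9
```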
So for every one of these components, the team was constantly doing research: coming up with better ways to generate search queries from the questions, better ways to analyze the question, better ways to generate candidates.

And speed. So "better" means accuracy and speed?

Right, although speed and accuracy, for the most part, were separated; we handled them in separate ways. I focused purely on accuracy: are we ultimately getting more questions right and producing more accurate confidences? And there was a whole other team that was constantly analyzing the workflow to find the bottlenecks and figuring out how to parallelize and drive up the algorithm speed.

But anyway, now think of it like you have this big fan-out, right? Because you had multiple queries, and now you have thousands of candidate answers. Each candidate answer you're going to score. You're going to use all the data that was built up: the question analysis, how the query was generated, the passage itself, and the candidate answer that was generated, and you're going to score all of that.
So now we had a group of researchers coming up with scorers, and there were hundreds of different scorers. So now you're fanning out again, from however many candidate answers you have to all the different scores. If you have 200 different scorers and a thousand candidates, now you have 200,000 scores. And now you've got to figure out: how do I rank these answers, based on the scores that came back, by the likelihood that they are correct answers to the question? Every scorer was its own research project.
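To picture that second fan-out: each of many scorers assigns a value to each candidate, yielding a candidates-by-scorers matrix that a ranking model must turn into an ordering. The two scorers below are invented toys standing in for the hundreds of research-project scorers he mentions.

```python
# Each scorer maps (question, passage, candidate) -> a number in [0, 1].
# Watson had hundreds of these; two invented toys stand in for them here.
def term_overlap_scorer(question: str, passage: str, candidate: str) -> float:
    q, p = set(question.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def candidate_in_passage_scorer(question: str, passage: str, candidate: str) -> float:
    return 1.0 if candidate.lower() in passage.lower() else 0.0

SCORERS = [term_overlap_scorer, candidate_in_passage_scorer]

def score_matrix(question, evidence):
    """evidence: list of (passage, candidate) pairs -> rows of scorer outputs."""
    return [(cand, [s(question, passage, cand) for s in SCORERS])
            for passage, cand in evidence]

question = "this poet wrote Because I could not stop for Death"
evidence = [("Emily Dickinson wrote Because I could not stop for Death", "Emily Dickinson"),
            ("Walt Whitman wrote Leaves of Grass", "Walt Whitman")]
for cand, scores in score_matrix(question, evidence):
    print(cand, scores)
# With 1,000 candidates and 200 scorers this matrix has 200,000 entries.
```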
When you say "score," is that an annotation process, of basically a human being saying whether this answer...?

Think of it in terms of what a human would be doing. A human would be looking at a possible answer, say "Emily Dickinson"; they'd be reading the passage in which it occurred; they'd be looking at the question; and they'd be making a judgment of how likely it is that "Emily Dickinson," given this evidence in this passage, is the right answer to that question.

Got it. So that's the annotation task, the scoring task. But scoring implies zero to one, that kind of thing?

That's right, a zero-to-one score; it's not binary.

And different scorers give different scores, so you have to somehow normalize and deal with all of that?

That depends on what your strategy is. It could be relative; we actually looked at raw scores as well as standardized scores.

But humans are not involved in this?

Humans are not involved.

Sorry, then I'm misunderstanding the procedure: these are passages... where is the ground truth coming from?

The ground truth is only the answers to the questions.

So it's end to end.
It's end to end. I was always driving end-to-end performance, and that was a very interesting engineering approach, and ultimately a scientific and research approach: always drive end to end. Now, that's not to say we wouldn't make hypotheses that individual component performance was related in some way to end-to-end performance. Of course we would, because people would have to build individual components. But ultimately, to get your component integrated into the system, you had to show impact on end-to-end performance, on question-answering performance.

So there are many very smart people working on this, and they're basically trying to sell their ideas as components that should be part of the system?

That's right. They would do research on their component and say things like, "I'm going to improve this as a candidate generator," or "I'm going to improve this as a question scorer, or a passage scorer, or a parser, and I can improve it by two percent on its component metric": a better parse, a better candidate, a better type estimation, whatever it is. And then I would say, I need to understand how the improvement on that component metric is going to affect end-to-end performance. If you can't estimate that, and can't do experiments to demonstrate it, it doesn't get in.
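Operationally, that gate can be read as an ablation experiment: run the full system on a held-out question set with and without the proposed change, and admit the change only if end-to-end accuracy improves. A hypothetical sketch of such a harness, with toy stand-in systems and an invented admission threshold:

```python
import random

def end_to_end_accuracy(system, questions) -> float:
    """Fraction of questions the full pipeline answers correctly."""
    correct = sum(1 for q, gold in questions if system(q) == gold)
    return correct / len(questions)

def admit_component(baseline_system, candidate_system, questions,
                    min_gain: float = 0.005) -> bool:
    """Gate: a component change gets in only if it demonstrably helps end to end."""
    base = end_to_end_accuracy(baseline_system, questions)
    new = end_to_end_accuracy(candidate_system, questions)
    print(f"baseline {base:.1%} -> candidate {new:.1%}")
    return new - base >= min_gain

# Toy stand-ins: the "systems" just guess with different success rates.
random.seed(0)
questions = [(f"q{i}", f"a{i}") for i in range(1000)]
baseline = lambda q: q.replace("q", "a") if random.random() < 0.60 else "wrong"
improved = lambda q: q.replace("q", "a") if random.random() < 0.66 else "wrong"
print("admitted:", admit_component(baseline, improved, questions))
```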
That's like the best-run AI project I've ever heard of. That's awesome. Okay, what breakthrough would you say... I'm sure there were a lot of day-to-day breakthroughs, but was there a breakthrough that really helped improve performance, that made people begin to believe? Or was it just a gradual process?

Well, I think it was a gradual process, but one of the things that I think gave people confidence that we could get there was what happened as we followed this procedure: come up with different ideas, build different components, plug them into the architecture, run the system, see how we do, do the error analysis, and start new research projects to improve things. And the very important idea was that the individual component work did not have to deeply understand everything that was going on with every other component. This is where we leveraged machine learning in a very important way. While individual components could be statistically driven machine-learning components (some were heuristic, some were machine learning), the system as a whole combined all the scores using machine learning. This was critical, because that way you can divide and conquer. You can say: okay, you work on your candidate generator, you work on this approach to answer scoring, you work on this approach to type scoring, you work on this approach to passage search or passage selection, and so forth. Then we'd just plug it all in, and we had enough training data to say: now we can train and figure out how to weigh all the scores relative to each other, based on predicting the outcome, which is right or wrong on Jeopardy. And we had enough training data to do that. So this enabled people to work independently and to let the machine learning do the integration.
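A minimal sketch of that final combination step: treat each candidate's vector of scorer outputs as features, label it 1 if it was the correct answer to its training question, and fit a model whose output becomes the ranking confidence. Logistic regression here is an illustrative stand-in; the interview does not specify Watson's actual learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: each row is one candidate answer's scorer outputs
# (e.g., [term_overlap, type_match, passage_support]); label 1 = correct.
X_train = np.array([
    [0.9, 1.0, 0.8],  # correct candidates tend to score high across scorers
    [0.8, 1.0, 0.7],
    [0.2, 0.0, 0.3],  # wrong candidates tend to score low
    [0.1, 1.0, 0.2],
    [0.3, 0.0, 0.1],
    [0.7, 1.0, 0.9],
])
y_train = np.array([1, 1, 0, 0, 0, 1])

# The learned weights are the "how do we weigh all the scores" part:
# individual scorer authors never need to coordinate with each other.
model = LogisticRegression().fit(X_train, y_train)

candidates = {"Emily Dickinson": [0.85, 1.0, 0.75], "Walt Whitman": [0.4, 1.0, 0.2]}
confidences = {name: model.predict_proba([feats])[0, 1]
               for name, feats in candidates.items()}
best = max(confidences, key=confidences.get)
print(best, f"{confidences[best]:.2f}")  # top-ranked answer with its confidence
```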
Beautiful. So the machine learning is doing the fusion, and then it's like a human-orchestrated ensemble of different approaches. That's great. It's still impressive that you were able to get it done in a few years; it would not have been obvious to me that it was doable if I put myself in that mindset. But when you look back at the Jeopardy challenge, when you're looking up at the stars, what are you most proud of, looking back at those days?

I'm most proud of my commitment, and my team's commitment, to be true to the science, to not be afraid to fail.

That's beautiful, because there's so much pressure, because it is a public event, a public show, and yet you were dedicated to the idea. Do you think it was a success? In the eyes of the world it was a success; by your, I'm sure, exceptionally high standards, is there something you regret, that you would do differently?

It was a success. It was a success for our goal. Our goal was to build the most advanced open-domain question-answering system. We went back to the old problems that we used to try to solve, and we did dramatically better on all of them, as well as beating Jeopardy: we won at Jeopardy. So it was a success. I worried that the world would not understand it as a success, because it came down to only one game, and I knew, statistically speaking, this could be a huge technical success and we could still lose that one game. And that's a whole other theme of the journey. But it was a success. It was not a success in natural language understanding, but that was not the goal.

Yeah, I understand what you're saying in terms of the science, but I would argue that the inspiration of it, right, while not a success in terms of solving natural language understanding, it was a success at being an inspiration to future challenges.

Absolutely, to drive future efforts.