Transcript
oGk1v1jQITw • Deep Learning for Natural Language Processing (Richard Socher, Salesforce)
Kind: captions
Language: en
thank you everybody thanks for coming
back very soon after lunch I'll try to
make it entertaining to avoid some post
food coma so I actually owe a lot to
being here to Andrew and Chris and my
PhD at Stanford here it's really
it's always fun to be back I figured
there's going to be a broad range of
capabilities in the room so I'm sorry I
will probably bore some of you for the
first two-thirds of the talk because
I'll go over the basics of what's NLP
what's natural language processing what's
deep learning and what's really at the
intersection of the two and then the
last third I will talk a little bit
about some exciting new research that's
happening right now so let's get started
with what is natural language processing
it's really a field at the intersection
of computer science AI and linguistics
and you could define a lot of goals and
a lot of these statements here we could
really talk and philosophize a lot about
but I'll move through them pretty
quickly for me the goal of natural
language processing is for computers to
process or scare quotes understand
natural language in order to perform
tasks that are actually useful for
people such as question answering the
caveat here is that really fully
understanding and representing the
meaning of language or even defining it
is quite an elusive goal so whenever I
say the model understands I'm sorry I
shouldn't say that
really these models don't understand in
the sense that we understand language
anything so whenever somebody says they
can read or represent the full meaning
in its entire glory it's usually
not quite true really perfect language
understanding is in some sense AI
complete in the sense that you need to
understand all of visual inputs and
thought and and a lot of other complex
things so a little more concretely as we
try to tackle this overall problem of
understanding language what are sort of
the different levels that we often look
at it often and for many people starts
at speech and then once you have speech
you might say alright now I know what
phonemes are smaller parts of words I
understand
how words form that's morphology or
morphological analysis once I know what
the meanings of words are I might try to
understand how they're put together in
grammatical ways such that the sentences
are understandable or at least
grammatically correct to a lot of
speakers of the language once we go and
we understand the structure we actually
want to get to the meaning and that's
really where I think most of the
interesting most of my interest lies
in semantic interpretation actually
trying to get to the meaning in some
useful capacity and then after that we
might say well if we understand now the
meaning of the whole sentence what's how
do we actually interact
what's the discourse how do we have you
know spoken dialogue system and things
like that where deep learning has really
improved the state of the art
significantly is really in speech
recognition and syntax and semantics and
the interesting thing is that we're kind
of actually skipping some of these
levels deep learning doesn't require
often morphological analysis to create
very useful systems and in some cases
actually skips syntactic analysis
entirely as well it doesn't have to know
about the grammar it doesn't have to be
taught about what noun phrases are or
prepositional phrases it can actually
get straight to some semantically useful
tasks right away and that's going to be
one of the sort of advantages that we
don't have to actually be as inspired by
linguistics as traditional natural
language processing had to be so why is
NLP hard well there's a lot of
complexity in representing and learning
and especially using linguistic
situational world and visual knowledge
really all of these are connected when
it gets to the meaning of language to
really understand what red means can you
do that without visual understanding for
instance if you have for instance this
sentence here Jane hit June and then she
fell or and then she ran depending on
which verb comes after she the
definition the meaning of she actually
changes and this is one subtask
you might look at so-called anaphora
resolution or coreference resolution
in general where you try to understand
who does she actually refer to and it
really depends on the meaning again
somewhat scare quotes here
of the verb that follows this pronoun
similarly there's a lot of ambiguity so
here we have a very simple sentence four
words
I made her duck now that simple sentence
can actually have at least four
different meanings if you can think
about it for a little bit right you made
her a duck that she loves for Christmas
as for dinner
you made her duck like me just now and
so on there are actually four different
meanings and to know which one requires
in some sense situational awareness or
knowledge to really disambiguate what
what is meant here so that's sort of the
high level of NLP now where does it
actually become useful in terms of
applications well they actually range
from very simple things that we kind of
assume are a given now we use them
all the time every day to more and more
complex and then also more in the realm
of research the simple ones are things
like spell checking or keyword search
and finding synonyms in a thesaurus then
the medium sort of difficulty ones
are to extract information from
websites trying to extract sort of
product prices or dates and locations
people or company names so-called named
entity recognition you can go a little
bit above that and try to classify sort
of reading levels for school text for
instance or do sentiment analysis that
can be helpful if you have a lot of
customer emails that come in and you
want to prioritize highly the ones of
customers for really really review right
now and then the really hard ones and I
think in some sense the most interesting
ones are machine translation trying to
actually be able to translate between
all the different languages in the world
question answering clearly something
that is a very exciting and useful piece
of technology especially over very large
complex domains can be used
for automated email replies I know
pretty much everybody here would love to
have some simple automated email reply
system and then spoken dialogue systems
bots are very hip right now these are
all sort of complex things that are
still in the realm of research to do
them really well we're making huge
progress
with deep learning on these three but
they're still nowhere near human
accuracy so let's look at the
representations I mentioned you know we
have morphology and words and syntax and
semantics and so on we can look at one
example namely machine translation and
look at how did people try to solve this
problem of machine translation well it
turns out they actually tried all these
different levels with varying degrees of
success you can try to have a direct
translation of words to other words the
problem is that is often a very tricky
mapping the meaning of one word in
English might have three different words
in German and vice versa
you can have three of the same words in
English meaning all the single same
word in German for instance so then
people said well let's try to maybe do
syntactic transfer where we have whole
phrases like to kick the bucket just
means sterben in German okay not a fun
example and then semantic transfer might
be well let's try to find a logical
representation of the whole sentence the
actual meaning in some human
understandable form and and try to just
find another surface representation of
that now of course that will also get
rid of a lot of the subtleties of
language and so they're tricky problems
in all these kinds of representations
now the question is what does deep
learning do you already saw at least
two methods standard neural networks
before and convolutional neural networks
for vision and in some sense there's
going to be a huge similarity here to
these methods because just like images
that are essentially a long list of
numbers a vector and standard neural
networks where the hidden state is also
just a vector or a list of numbers that
is also going to be the main
representation that we will use
throughout for characters for words for
short phrases for sentences and in some
cases for entire documents they will all
be vectors and with that we are sort of
finishing up the whirlwind of what's NLP
of course you could give an entire
lecture on almost every single
slide I just gave
we're at a very high level but we'll
continue at that speed to try to squeeze
this complex deep learning for NLP
subject area into an hour and a half I
think two of the most
important basic Lego blocks that you
nowadays want to know in order to be
able to sort of creatively play around
with more complex models and those are
going to be word vectors and sequence
models namely recurrent neural networks
and I kind of split this into words
sentences and multiple sentences but
really you could use recurrent neural
networks for shorter phrases as well as
multiple sentences but in many cases
we'll see that they have some
limitations as you move to longer and
longer sequences and just use the
default neural network sequence models
alright so let's start with words and
maybe one last blast from the past here
to represent the meaning of words we
actually used to use a taxonomy like
WordNet that kind of defines each word
in relationship to lots of other ones so
you can for instance define hypernyms
and is-a relationships you might say the
word Panda for instance in its first
meaning as a noun basically goes through
this complex DAG directed acyclic graph
most of it is roughly just a tree and in
the end like everything it is an entity
but it's actually a physical entity a
type of object it's a whole object it's
a living thing it's an organism animal
and so on so you basically can define a
word like this and another way at each
node of this tree you actually have so
called synsets so synonym sets here's an
example for the synonym set of the word
good good can have a lot of different
meanings can actually be both an
adjective as well as an adverb as
well as a noun now what are the problems
with this kind of discrete
representation well they can be great as
a resource if you're a human and you want to
find synonyms but they're
never going to be quite sufficient to
capture all the nuances that we have in
language so for instance the synonyms
here for
good were adept expert practiced
proficient and skillful but of course
you would use these words in slightly
different contexts you would not use the
word expert in exactly all the same
context as you would use the meaning of
good or the word good likewise it will
be missing a lot of new words language
is this interesting living organism we
change it all the time you might have
some kids they say Yolo and all of a
sudden you know you need to update your
dictionary likewise maybe in Silicon
Valley you might see ninja a lot and now
you need to update your dictionary again
and that is basically going to be a
Sisyphus job right nobody will ever be
able to really capture all the meanings
and this living breathing organism
that language is so it's also very
subjective some people might think ninja
should just be deleted from the
dictionary and I don't want to include
it others think nifty or badass is
kind of a silly word and should not be
included in a proper dictionary but it's
being used in real language and so on it
requires human labor as soon as you
change your domain you have to ask
people to update it and it's also hard
to compute accurate word similarities
some of these words are subtly different
and it's really a continuum in which we
can measure their similarities so
instead what we're going to use and what
is also the first step for deep learning
you'll actually realize it's not quite
deep learning in many cases but it is
sort of the first step to use deep
learning in NLP is we will use
distributional similarities so what does
that mean basically the idea is that
we'll use the neighbors of a word to
represent that word itself it's a pretty
old concept and here's an example for
instance for the word banking we might
actually represent banking in terms of
all these other words that are around it
so let's do a very simple example where
we look at a window around each word and
so here the window length let's just
for simplicity say it's one we represent
each word only with the words one to the
left and one to the right of it we'll
just use the symmetric context around
each word and here's a simple example
corpus
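A minimal sketch of this window-based counting, assuming a symmetric window of one word on each side and the three-sentence example corpus from the talk. The helper name `cooccurrence_counts` and the dictionary-of-counters layout are illustrative choices, not something shown in the talk.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count how often each word has another word within `window` positions."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[word][tokens[j]] += 1
    return counts

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
counts = cooccurrence_counts(corpus, window=1)

# Each row of the co-occurrence matrix is one (sparse, count-based) word vector.
vocab = sorted({w for s in corpus for w in s.split()})
rows = {w: [counts[w][c] for c in vocab] for w in vocab}

print(vocab)
print(rows["I"])
print(rows["like"])
```

Each row has one entry per vocabulary word, which is why the vector dimensionality grows as soon as the vocabulary grows.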
so the three sentences in my corpus
of course we would always want to use
corpora with billions of words instead
of just a couple but just to give you an
idea of what's being captured in these
word vectors are I like deep learning I
like NLP and I enjoy flying and now this
is a very simple so-called co-occurrence
statistic you'll just simply see here I
for instance has the word like twice in
its window of size one in its context
and the word
enjoy is once in its context and for
like you have twice to its left I and
once deep and once NLP it turns out if
you just take those vectors now this
could be a vector representation just
each row could be a vector
representation for words unfortunately
as soon as your vocabulary increases
that vector dimensionality would change
and hence you have to retrain your whole
model it's also very sparse and really
it's going to be somewhat noisy if you
use that vector now another better thing
to do might be to run SVD or something
simple like say dimensionality reduction
on such a co-occurrence matrix and that
actually gives you a reasonable first
approximation to word vectors very old
method works reasonably well now what
works even better than simple PCA is
actually a model introduced by Tomas
Mikolov in 2013 called word2vec so
instead of capturing co-occurrence counts
directly out of a matrix like that
you'll actually go through each window
in a large corpus and try to take the
word that's in the center of each window
and use that to predict the words around
it that way you can very quickly train
you can train almost online though few
people do this and add words to your
vocabulary very quickly in this streaming
fashion so now let's look a little bit
at this model word2vec because it's
one a very simple NLP model and two it
is very instructive we won't go into
too many details but at least look at a
couple of equations so again the
main goal is to predict words in a
window
of some length m that we define a
hyperparameter around every word now the
objective function will essentially try
to maximize here the log probability of
any of these context words given the
center word so we go through our entire
corpus of length T a very long sequence and at each
time step t we will basically look at
all the words j in the context of the
current word t and basically try to
maximize here this probability of trying
to be able to predict that word that is
around the current word t and theta are
all the parameters namely all the word
vectors that we'd want to optimize so
now how do we actually define this
probability P here the simplest way to
do this and this is not the actual way
but it's the simplest and first to
understand and derive this model is with
this very simple inner product here and
that's why we can't quite call it deep
there's not going to be many layers of
nonlinearities like we see in deep
neural networks it'll be just a simple
inner product and the higher that inner
product is the more likely these two
will be predicting one another so here
c is the center word
and o is the outside word and
basically this inner product the larger
it is the more likely we are going to
predict this and these are both just
standard n-dimensional vectors and now
in order to get a real probability we'll
essentially apply softmax to all the
potential inner products that you might
have in your vocabulary and one thing
you will notice here is well this
denominator is actually going to be a
very large sum we won't want to sum here
over all potential inner products for
every single window that would be too
slow so now the real methods that we
would use are going to
approximate the sum in a variety of
clever ways now I could literally talk
the next hour and a half just about how
to optimize the details of this equation
but then we'll all deplete our mental
energy for the rest of the day and so
I'm just going to point you to the class
I taught earlier this year CS224d where
we have lots of different slides that go
into all the details of this equation
how to approximate it and then how to
optimize it it's going to be very
similar to the way we optimize any other
neural network we're going to use
stochastic gradient descent we're going
to look at mini batches of a couple of
hundred windows at a time and update
those word vectors and we're just going
to take simple gradients of each of
these vectors as we go through windows
in a large corpus all right now we
briefly mentioned PCA-like methods
based on singular value decomposition often
or standard simple PCA now we also had
this word2vec model there's actually
one model that combines the best of both
worlds namely GloVe or global vectors
introduced by Jeffrey Pennington in
2014 and it has a very similar idea and
you'll notice here there's some
similarity you have this inner product
again for different pairs but this model
will actually go over the co-occurrence
matrix once you have this co-occurrence
matrix it's much more efficient to try
to predict once how often two words
appear next to each other rather than do
it 50 times each time that that pair
appears in an actual corpus so in some
sense you can be more efficient going
through all the co-occurrence statistics and
you're going to basically try to
minimize this subtraction here
and what that basically means is that
each inner product will try to
approximate the log probability of these
two words actually co-occurring now you
have this function here which
essentially will allow us to not overly
weight certain pairs that occur very
very frequently the word the for instance
co-occurs with lots of different words
and you want to basically lower the
importance of all the words that co-occur
with that so you can train this very
fast it scales to gigantic corpora in
fact we train this on common crawl which
is a really great data set of most of
the internet it's many billions of
tokens and it gets also very good
performance
on small corpora because it makes use
very efficiently of these co-occurrence
statistics and that's essentially what
word vectors are always
capturing so if in one sentence you just
want to remember every time you hear
word vectors in deep learning one
they're not quite deep even though we
call them sort of step one of deep
learning and two they're really just
capturing co-occurrence counts how often
does a word appear in the context of
other words so let's look at some
interesting results of these GloVe
vectors here the first thing we do is
look at nearest neighbors so now that we
have these n dimensional vectors usually
you say n between 50 to at most 500 good
general numbers 100 or 200 dimensional
each of these each word is now
represented as a single vector and so we
can look in this vector space for words
that appear close by we started and
looked for the nearest neighbors of frog
and well turned out
these are the nearest neighbors which
was a little confusing since we're not
biologists but fortunately when you
actually look up in Google what what
those mean you'll see that they are
actually all indeed different kinds of
frogs some appear very rarely in the
corpus and others like toad are much more
frequent now one of the most exciting
results that came out of word vectors
are actually these word analogies so the
idea here is can there be linear
relationships between different word
vectors that simply fall out of very
linear and simple addition and
subtraction so the idea here is what is
man to woman equal to king to something
else as in what is the right analogy
when I try to basically fill in here the
last missing word now the way we're
going to do this is very very simple
cosine similarity or basically just take
let's take an example here the vector of
woman we subtract the word vector we
learned of man and we add
the word vector of king and the
resulting vector the argmax for this
turns out to be queen for a lot
of these different models and that was
very surprising again we're capturing
co-occurrence statistics so man might in
its context often have things like
running and fighting other silly things
that men do and then you subtract those
kinds of words from the context and you
add them again and in some sense it's
intuitive though surprising that it
works out that well for so many
different examples so here are some some
other examples similar to the king and
queen example where we basically took
these two hundred dimensional vectors
and we projected them down to two
dimensions again with a very simple
method like PCA and what we find is
actually quite interestingly even in
just the two first principal components
of this space we have some very
interesting sort of female male
relationships so man to woman is similar
to uncle and aunt brother and sister sir
and madam and so on so this is an
interesting semantic relationship that
falls out of essentially co-occurrence
counts in specific windows around each
word in a large corpus here's another
one that's more of a syntactic
relationship we actually have here
superlatives like slow slower slowest is
in a similar vector relationship to
short shorter and shortest or strong
stronger and strongest so this was very
exciting and of course when you see an
interesting qualitative result you want
to try to quantify who can do better in
trying to understand these analogies and
what are the different models and
hyperparameters that modify the performance
now this is something that you will
notice in pretty much every deep
learning project ever which is more data
will give you better performance it's
probably the single most useful thing
you can do to a machine learning or deep
learning system is to train it with more
data and we found that too now there are
different vector sizes too which is a
common hyperparameter like I said
usually between 50 and 500
here we have 300
dimensional that essentially gave us the
best performance for these different
kinds of semantic and syntactic
relationships now in many ways having a
single vector for words can be
oversimplifying right some words have
multiple meanings maybe they should have
multiple vectors sometimes the word
meaning changes over time and so on so
there's a lot of simplifying assumptions
here but again our final goal for deep
NLP is going to be to create useful
systems and it turns out this is a
useful first step to create such systems
that mimic some human language behavior
in order to create useful applications
for us all right so word vectors
are very useful but words of course
never appear in isolation and what we
really want to do is understand words in
their context and so this leads us to
the second section here on recurrent
neural networks so we already went over
the basic definition of standard neural
networks really the main difference
between a standard neural network and a
recurrent neural network which I'll
abbreviate as RNN now is that we will
tie the weights at each time step and
that will allow us to essentially
condition the neural network on all the
previous words in theory in practice
with how we can optimize it it won't
really be all the previous words it's more
like at most the last 30 words but in
theory this is what a powerful model can
do so let's look at the definition of a
recurrent neural network and this is
going to be a very important definition
so we'll go into a little bit of details
here so let's assume for now we have our
word vectors as given and we'll
represent each sequence in the beginning
it's just a list of these word vectors
now what we're going to do is we're
computing a hidden state HT at each time
step and the way we're going to do this
is with a simple neural network
architecture in fact you can think of
this summation here as really just a
single layer neural network if you were
to concatenate the two matrices and these
two vectors
but intuitively we basically will map
our current word vector at that time
step T sometimes I use these square
brackets to denote that we're taking the
word vector from that time step in there
we map that with a linear layer a simple
matrix vector product and we sum
that matrix vector product with another
matrix vector product of the previous
hidden state at the previous time step
we sum those two and apply in one case
a simple sigmoid function to define this
standard neural network layer that will
be h_t and now at each time step we want
to predict some kind of class
probability over a set of potential
events classes words and so on and we
use the standard softmax classifier in some
other communities called a logistic
regression classifier so here we have a
simple matrix W_s for the softmax weights
basically the number of rows is
going to be the number of classes that we
have and the number of columns is the
same as the hidden dimension sometimes
we want to predict the next word in a
sequence in order to be able to identify
the most likely sequence so for instance
if I asked for a speech recognition
system what is the price of wood now in
isolation if you hear would you would
probably assume it's the w-o-u-l-d
auxiliary verb would but in this
particular context the price of it
wouldn't make sense to have a verb
following that and so it's more likely the
w-o-o-d to find the price of wood so
language modeling is a very useful task
and it's also very instructive to use as
an example for where recurrent neural
networks shine so in our case here this
softmax is going to be quite a large
matrix that goes over the entire
vocabulary of all the possible words
that we have so each word is going to be
our class the classes for language
models are the words in our vocabulary
and so we can define here
this y hat t j which is basically
denoting here the probability that the
word at the j-th index will come next after
all the previous words very useful model
again for speech recognition for machine
translation for just finding a prior for
language in general alright
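A minimal sketch of one recurrent step as described here: two matrix-vector products summed, a sigmoid to get the hidden state, then a softmax over the vocabulary. The names `W_hx` and `W_hh`, the tiny dimensions, and the random inputs standing in for word vectors are assumptions for illustration; only the softmax matrix `W_s` is named in the talk.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, h_prev, W_hx, W_hh, W_s):
    """h_t = sigmoid(W_hh h_prev + W_hx x_t); y_hat = softmax(W_s h_t)."""
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_hx, x_t))]
    h_t = sigmoid(pre)
    y_hat = softmax(matvec(W_s, h_t))
    return h_t, y_hat

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
word_dim, hidden_dim, vocab_size = 4, 3, 5   # toy sizes, chosen arbitrarily
W_hx = random_matrix(hidden_dim, word_dim, rng)
W_hh = random_matrix(hidden_dim, hidden_dim, rng)
W_s = random_matrix(vocab_size, hidden_dim, rng)  # rows = classes, cols = hidden dim

h = [0.0] * hidden_dim  # initial hidden state set to all zeros
sequence = [[rng.uniform(-1, 1) for _ in range(word_dim)] for _ in range(6)]
for x_t in sequence:
    # The same three weight matrices are reused at every time step.
    h, y_hat = rnn_step(x_t, h, W_hx, W_hh, W_s)

print(y_hat)  # distribution over the next word, one probability per vocabulary entry
```

In a language model the vocabulary is far larger than five entries, so `W_s` becomes the large matrix mentioned above.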
again main difference to standard
neural networks we just have the same
set of W weights at all the different
time steps everything else is pretty
much a standard neural network we often
initialize the first h0 here just either
randomly or all zeroes and again in
language modeling in particular the next
word is our class of the softmax now we
can measure basically the performance of
language models with a term called
perplexity which really is based here on the
average log likelihood of basically
the probabilities of being able to
predict the next word so you want to
really give the highest probability to
the word that actually will appear next
in a long sequence and then the higher
that probability is the lower your
perplexity and hence the model is less
perplexed to see the next word in some
sense you can think of language modeling
as almost NLP-complete in some silly
sense that you just if you can actually
predict every single word that follows
after any arbitrary sequence of words in
a perfect way you would have
disambiguated a lot of things you can
you can say for instance what is the
answer to the following question ask the
question and then the next couple of
words would be the predicted answer so
there's no way we can actually ever do
a perfect job in language modeling but
there's certain contexts where we can
give a very high probability to the
right next couple of words now this is
the standard recurrent neural network
and one problem with this is that we
will modify the hidden state here at
every time step so even if I have words
like the and a and sentence period and
things like that it will
significantly modify my hidden state now
that can be problematic let's say for
instance I want to train a sentiment
analysis algorithm and I talk about
movies and I talk about the plot for a
very long time then I say oh man this
movie was really wonderful it's great to
watch and then especially the ending and
you talk again for like fifty timesteps
or 50 words or hundred words about the
plot
now all these plot words will
essentially modify my hidden states if
at the end of that whole sequence I want
to classify the sentiment the words
wonderful and great that I mentioned
somewhere in the middle might be
completely gone because I keep updating
my hidden state with all these content
words that talk about the plot now the way
to improve this is by using better kinds
of recurrent units and I'll introduce
here a particular kind so called gated
recurrent units introduced by Cho
and we'll learn more about
the LSTM tomorrow when Quoc gives his
lecture but GRUs are in some sense a
special case of LSTMs and the main idea
is that we want to have the ability to
keep certain memories around without
having the current input modify
them at all so again this example of
sentiment analysis I say something's
great that should somehow be captured in
my hidden state and I don't want all the
content words that talk about the plot in
a movie review to modify that it
actually overall was a great movie and
then we also want to allow error
messages to flow at different strengths
depending on the input so if I say great
I want that to modify a lot of things in
the past so let's define a GRU
fortunately since you already know the
basic Lego block of a standard neural
network there's only really one or two
subtleties here that are different there
are a couple of different steps that
we'll need to compute at every time step
so in the standard RNN
what we did was just have this one
single neural network that we hope would
capture all this complexity of the
sequence instead now we'll first compute
a couple of gates at that time step so
the first thing we'll
compute is the so-called update gate
it's just yet another neural network
layer based on the current input word
vector and again the past hidden state
so these look quite familiar but this
will just be an intermediate value and
we'll call it the update gate then we'll
also compute a reset gate which is yet another
standard neural network layer again just
matrix vector product summation matrix
vector product some kind of
non-linearity here namely a sigmoid it's
actually important in this case that it
is a sigmoid basically both of
these will be vectors with numbers that
are between 0 and 1 now we'll compute a
new memory content an intermediate h
tilde here with yet another neural
network but then we have this little
funky symbol in here basically this will
be an element-wise multiplication so
basically what this will allow us to do
is if that reset gate is 0 we can
essentially ignore all the previous
memory elements and only store the new
word information so for instance if I
talked for a long time about the plot
now I say this was an awesome movie now
you want to basically be able to ignore
if your whole goal of this sequence
classification model is to capture
sentiment you're going to be able to ignore
past content this is of course if this
was entirely a 0 vector now this
will be more subtle this is a long
vector of you know maybe a hundred or
200 dimensions so maybe some dimensions
should be reset but others maybe not
and then here we'll finally have our
final memory and that essentially
combines these two states the previous
hidden state and this intermediate one
at our current time step and what this
will allow us to do is essentially also
say well maybe we want to ignore
everything that's currently happening
and only keep the last time step we
basically copy over the hidden state
of the previous time step and
ignore the current thing again simple
example in sentiment maybe there's a lot
of talk about the plot when a movie was
released and you want to basically have
the ability to ignore that and just copy
what in the beginning they
may have said it was an awesome movie so
here's an attempt at a clean
illustration I have to say personally I
in the end find the equations a little
more intuitive than the visualizations
that we try to do but some people are
more visual here so this is in some
ways basically here we have our word
vector and it goes through different
layers and then some of these layers
will essentially modify other outputs of
previous time steps so this is a pretty
nifty model and it's really the second
most important basic Lego block that
we're going to learn about today and so
just want to make sure we take a little
bit of time I'll repeat this here again
if the reset gate this R value is close
to zero those kinds of hidden dimensions
are basically allowed to be dropped and
if the update gate z basically is one
then we can copy the information of that
unit through many many different time
steps and if you think about
optimization a lot what this will also
mean is that the gradient can flow
through the recurrent neural network
through multiple time steps until it
actually matters and you want to update
a specific word for instance and go all
the way through many different time
steps
so then what this also allows us is to
actually have some units that have
different update frequencies some you
might want to reset every other word
other ones you might really keep like
they have some long-term context and
they stay around for much longer
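[Editor's sketch] The GRU steps described above (update gate, reset gate, new memory content, final memory) can be written out as a minimal pure-Python sketch. The parameter names (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`), the toy dimensions, and the random weights are assumptions for illustration and do not come from the talk; bias terms are left out.

```python
import math
import random


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def final_memory(z, h_prev, h_tilde):
    """Element-wise mix: z keeps the old state, (1 - z) takes the new content."""
    return [zi * hp + (1.0 - zi) * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


def gru_step(x, h_prev, p):
    """One GRU time step; p is a dict of weight matrices (hypothetical names)."""
    # Update and reset gates: sigmoid layers, so every entry lies between 0 and 1.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    # New memory content: the reset gate scales the past state element-wise.
    h_tilde = [math.tanh(a + ri * b)
               for a, ri, b in zip(matvec(p["W"], x), r, matvec(p["U"], h_prev))]
    return final_memory(z, h_prev, h_tilde)


def random_params(input_dim, hidden_dim, seed=0):
    """Small random weights, for illustration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"Wz": mat(hidden_dim, input_dim), "Uz": mat(hidden_dim, hidden_dim),
            "Wr": mat(hidden_dim, input_dim), "Ur": mat(hidden_dim, hidden_dim),
            "W": mat(hidden_dim, input_dim), "U": mat(hidden_dim, hidden_dim)}


# Run the cell over a toy sequence of three 3-dimensional "word vectors".
params = random_params(input_dim=3, hidden_dim=4)
h = [0.0] * 4
for word_vector in ([0.1, 0.2, 0.3], [0.0, -0.4, 0.5], [0.9, 0.1, -0.2]):
    h = gru_step(word_vector, h, params)
```

With an update gate of all ones, `final_memory` returns the previous hidden state unchanged, which is the copy-through behaviour that lets information and gradients survive many time steps.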
all right this is the GRU it's the
second most important building block for
today there are like I said a lot of
other variants of recurrent neural
networks lots of amazing work in that
space right now and tomorrow Quoc will
talk a lot about some more
advanced methods so now that you
understand word vectors and neural
network sequence models you really have
the two most important concepts for deep
NLP
and that's pretty awesome so congrats we
can now in some ways really play around
with those two Lego blocks plus some
slight modifications of them very
creatively and build a lot of really
cool models a lot of the models that
I'll show you and that you can see
when you read the latest papers that are
now coming out almost every week on
arXiv will have some kind of component
of these they will really use these two
components in a major way now this is
one of the few slides now with something
really new because I want to keep it
exciting for the people who already knew
all this stuff and took the class and
everything
this is tackling an important problem
which is in all these models that
you'll see in pretty much most of these
papers we have in the end one final
softmax here right and that softmax is
basically our default way of classifying
what we can see next what kinds of
classes we can predict the problem with
that is of course that that will only
ever predict accurately frequently seen
classes that we had at training time but
in the case of language modeling for
instance where our classes are the words
we may see at test time some completely
new words maybe I'm just going to
introduce to you a new name Srini for
instance and nobody may have like seen
that word at training time but now that
I mentioned him and I will introduce him
to you you should be able to predict the
word Srini and that person in a new
context and so the solution that we're
literally going to release only next
week
in a new paper is to essentially
combine the standard softmax that we can
train with a pointer component and that
pointer component will allow us to point
to previous contexts and then predict
based on that to see that word so let's
for instance take the example you have
language modeling again we may read a
long article about the Fed chair Janet
Yellen and maybe the word Yellen had not
appeared in training time before so we
couldn't ever predict it even though we
just learned about it and now a couple
of sentences later interest rates were
raised and then Mrs. and now we want to
predict that next word now if that
hadn't appeared in our softmax standard
training procedure at training time we
would never be able to predict it what
this model will do and we're kind of
calling it a pointer sentinel mixture
model is it will essentially first try
to see whether any of these previous words
may be the right candidate so we can
really take into consideration the
previous context of say the last hundred
words and if we see that word and that
word makes sense after you know we train
it of course then we might give a lot of
probability mass to just that word at
this current position in our previous
immediate context at test time and then
we have also the sentinel which is
basically going to be the rest of the
probability if we cannot refer to
some of the words that we just saw and
that one will go directly to our
standard softmax and then what we'll
essentially have is a mixture model that
allows us to say either we have one or we
have a combination of both of
essentially words that just appeared in
this context and words that we saw in
our standard softmax language modeling
system so I think this is a pretty
important next step because it will
allow us to predict things we've never
seen at training time and that's
something that's clearly a human
capability that pretty much none
of these language models had before and
so to look at how much it actually helps
it'll be interesting to look at some of
the performance before so again what
we're measuring here is perplexity and
the lower the better because it's
essentially the inverse here of the actual
probability that we assigned to the
correct next word and in just 2010 so
six years ago there was some great
early work by Tomas Mikolov
where he compared to a lot of standard
natural language processing methods
syntactic neural net syntactic models
that essentially tried to predict the
next word and had a perplexity of 107
and he was able to use the standard
recurrent neural networks and actually
an ensemble of eight of them
to really significantly push down the
perplexity especially when you combine
it with standard count based methods for
language modeling so in 2010 he made
great progress by pushing it down to 87
and now this is one of the great
examples of how much progress is being
made in the field thanks to deep
learning where two years ago Wojciech
Zaremba and his
collaborators were able to push that
down even further to 78 with a very
large LSTM similar to a GRU-like model
but even more advanced Quoc will
teach you the basics of LSTMs tomorrow
then last year the
performance was pushed down even further
by Yarin Gal and then this one actually
came out just a couple of weeks ago
variational recurrent highway networks
pushed it down even further but this
pointer sentinel model is able to get
it down to 70 so in just a short amount
of time we pushed it down by more than
10 perplexity points in two years
and that is really an increased speed in
progress that we're seeing now that
deep learning is changing a lot of
areas of natural language processing
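[Editor's sketch] The pointer-plus-softmax mixture and the perplexity metric discussed above can be illustrated with a small pure-Python sketch. This is not the formulation from the paper mentioned in the talk; it only assumes that one softmax is taken over attention scores for the recent context words plus a sentinel, and that the sentinel's share of probability is handed to the ordinary vocabulary softmax. All scores, the toy vocabulary, and the function names are made up for illustration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_prob(word, vocab_probs, context, context_scores, sentinel_score):
    """Probability of `word` under a pointer / vocabulary-softmax mixture."""
    # One softmax over the context positions plus the sentinel (last entry).
    attn = softmax(list(context_scores) + [sentinel_score])
    sentinel_mass = attn[-1]
    # Pointer part: attention mass on every context position holding `word`.
    pointer_mass = sum(a for a, w in zip(attn[:-1], context) if w == word)
    # Vocabulary part: the sentinel's mass scales the standard softmax.
    return pointer_mass + sentinel_mass * vocab_probs.get(word, 0.0)


def perplexity(probs_of_correct_words):
    """Exponentiated average negative log-probability; lower is better."""
    n = len(probs_of_correct_words)
    return math.exp(-sum(math.log(p) for p in probs_of_correct_words) / n)


# "yellen" is absent from the toy vocabulary but present in the recent context.
vocab_probs = {"the": 0.5, "rates": 0.3, "were": 0.2}
context = ["janet", "yellen", "said", "rates"]
context_scores = [0.2, 2.0, -1.0, 0.5]
p_yellen = mixture_prob("yellen", vocab_probs, context, context_scores, 1.0)
```

Here `p_yellen` is non-zero even though the vocabulary softmax alone would assign the word no probability, and the mixture still sums to one over all candidate words.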
alright now we have sort of our basic
Lego blocks the word vectors and the GRU
sequence models and now we can talk a
little bit about some of the ongoing
research that we're working on and I'll
start that with maybe a controversial
question which is could we possibly
reduce all NLP tasks to essentially
question answering tasks over some kind
of input and in some ways that's a
trivial observation that you could do
that but it actually might help us to
think of models that could take any kind
of input a question about that input and
try to produce an output sequence so let
me give you a couple of examples of what
I mean by this so here we have the first
one is a task that we would standardly
associate with
question answering I'll give you a couple of
facts Mary walked to the bathroom
Sandra went to the garden Daniel went
back to the garden Sandra took the milk
there where's the milk and now you might
have to logically reason so I try to
find the sentence about milk
maybe Sandra took the milk there and I
would have to maybe do anaphora
resolution find out what does there
refer to and then you try to find you
know the previous sentence that
mentioned Sandra see that it's garden
and then give an answer garden so this
is a simple logical reasoning question
answering task and that's what most
people in the QA field sort of
associate with some kinds of question
answering but we can also say everybody's
happy and the question is what's the
sentiment and the answer is positive all
right so this is a different subfield of
NLP that tackles sentiment analysis we
can go further and ask what are the
named entities of a sentence like Jane
has a baby in Dresden and you want to
find out that Jane is a person and
Dresden is a location this is an example
of sequence tagging you can even go as
far and say you know I think this model
is incredible and the question is what's
the translation into French and you get
you know je pense que ce modèle est
incroyable and that in some ways would be
phenomenal if we're able to actually
tackle all these different kinds of
tasks with the same kind of model so
maybe it would be an interesting new
goal for NLP to try to develop a single
joint model for general question
answering I think it would push us to
think about new kinds of sequence models
and new kinds of reasoning capabilities
in an interesting way now there are two
major obstacles to actually achieving
the single joint model for arbitrary QA
tasks the first one is that we don't
even have a single model architecture
that gets consistent state-of-the-art
results across a variety of different
tasks so for instance for question
answering and this is a data set called
bAbI that Facebook published
last year strongly supervised memory
networks get the state of the art for
sentiment analysis you had tree-LSTM
models developed by Kai Sheng Tai here at
Stanford last year and for part of
speech tagging you might have
bidirectional LSTM conditional random
fields one thing you do notice is all
the current state-of-the-art methods are
deep learning sometimes they still
connect to other traditional methods
like conditional random fields and
undirected graphical models but there's
always some some kind of deep learning
component in them so that is the first
obstacle the second one is that really
fully joint multitask learning is very
very hard usually when we do do it we
restrict it to lower layers so for
instance in natural language processing
all we're currently able to share in
some principled way are word vectors we
take the same word vectors we trained
for instance with GloVe or word2vec and
we initialize our deep neural network
sequence models with those word vectors
in computer vision we're actually a
little further ahead and you're able to
use multiple of the different layers and
you initialize a lot of your CNN models
with a pre-trained
CNN that was pre-trained on ImageNet for
instance now usually people evaluate
multitask learning with only two tasks
they train on a first task and
then they evaluate the model that they
initialize from the first on the second
task but they often ignore how much the
performance degrades on the original
task so when somebody takes an ImageNet
CNN and applies it to a new problem they
rarely ever go back and say how much did
my accuracy actually decrease on the
original data set and furthermore we
usually only look at tasks that are
actually related and then we find out
look there's some amazing transfer
learning capability going on what we
don't look at often in the
literature and in most people's work is
that when the tasks aren't related to
one another they actually hurt each
other and this is so-called
catastrophic forgetting there's
not too much work
on that right now now I also would like to
say that right now almost nobody uses
the exact same decoder or classifier for
a variety of different kinds of outputs
right we at least replace the softmax to
try to predict different kinds of
problems all right so this is the second
obstacle now for now we'll only tackle
the first obstacle and this is basically
what motivated us to come up with
dynamic memory networks they are
essentially an architecture to try to
tackle arbitrary question-answering
tasks when I talk about dynamic
memory networks it's important to note
here that for each of the different
tasks I'll talk about it'll be a
different dynamic memory network it
won't have the exact same weights it'll
just be the same general architecture so
the high-level idea for DMNs is as
follows imagine you had to read a bunch
of facts like these here they're all
very simple in and of themselves but if
I now ask you a question I showed you
these and I asked where's Sandra you know
it'd be very hard even if you read
all of them it'd be kind of hard to
remember and so the idea here is that
for complex questions we might actually
want to allow you to have multiple
glances at the input and just
like I promised one of our most
important basic Lego blocks will be this
GRU we just introduced in the previous
section now here's this whole model in
all its gory details and we'll dive into
all of that in the next couple of slides
so don't worry it's it's a big model a
couple of observations so the first one
is I think we're moving in deep learning
now to try to use more proper software
engineering principles basically to
modularize encapsulate certain
capabilities and then take those as
basic Lego blocks and build more complex
models on top of them a lot of times
nowadays you just have a CNN that's like
one little block in a complex paper and
then other things happen on top here
we'll have the
GRU or word vectors basically as you
know one module a sub module in these
different ones here and I'm not even
mentioning word vectors anymore but word
vectors still play a crucial role and
each of these words is essentially
represented as this word vector but we
just kind of assume that it's there
okay so let's walk on a very high level
through this model there are essentially
four different modules there's the input
module which will be a neural network
sequence model a GRU and there's a
question module an episodic memory
module and an answering module and
sometimes we also have these semantic
memory modules here but for now these
are really just our word vectors and we'll
ignore that for now so let's go through
this here is our corpus and our question
is where is the football and this is our
input that should allow us to answer
this question now if I ask this question
I will essentially use the final
representation of this question to learn
to pay attention to the right kinds of
inputs that seem relevant given what
I know to answer this question so
where is the football well it would make
sense to basically pay attention to all
the sentences that mention football and
maybe especially the last ones if the
football moves around a lot
so what we'll observe here is that this
last sentence will get a lot of
attention so John put down the football
and now what we'll basically do is that
this hidden state of this recurrent
neural network model will be given as
input to another recurrent neural
network because it seemed relevant to
answer this current question at hand now
we'll basically agglomerate all these
different facts that seem relevant at
the time in this other GRU in this
final vector m and now this vector M
together with the question will be used
to go over the inputs again if the model
deems that it doesn't have enough
information yet to answer the question
so if I ask you where's the football and
it so far only found that John put
down the football you don't know enough
you still don't know where it is but you
now have a new fact namely John seems
relevant to answer the question and that
fact is now represented in this vector M
which is also just the last hidden state
of another recurrent neural
network now we'll go over the inputs
again now that we know that John and the
football are relevant it will learn to
pay attention to John moved to the
bedroom and John went to the hallway
again those are going to get
agglomerated here in this recurrent
neural network and now the model
thinks that it actually knows enough
because it basically intrinsically
captured things about the football John
found a location and so on of course we
didn't have to tell it anything
about there are people there are locations if X
moves to Y and y is in the set of
locations then this happens
none of that you just give it a lot of
stories like that and in its hidden
states it will capture these kinds of
patterns so then we have the final
vector M and we'll give that to an
answer module which produces in our
standard softmax way the answer all
right now let's zoom into the different
modules of this overall dynamic memory
network architecture the input
fortunately is just a standard GRU the
way we defined it before so simple word
vectors hidden states reset gates update
gates and so on the question module is
also just the GRU a separate one with
its own weights and the final vector q
here is just going to be the last hidden
state of that recurrent neural network
sequence model now the interesting
stuff happens in the episodic memory
module which is essentially a sort of
meta gated GRU where this gate is
defined and computed by
the attention mechanism and will
basically say this current
sentence si here seems to matter and the
superscript T is the episode that we
have so each episode basically means
we're going over the input entirely one
time so it starts at g1 here and what
this basically will allow us to do is to
say well if G is
zero then what we'll do is basically
just copy over the past states from the
input nothing will happen and unlike
before in all these GRU equations this G
is just a single scalar number it will
basically say if G is zero then this
sentence is completely irrelevant to my
current question at hand I can
completely skip it
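[Editor's sketch] The gated update just described, where a scalar gate of zero copies the previous hidden state and skips the sentence, can be sketched as follows. The gate values are taken as given here, and `toy_step` is a hypothetical stand-in for the GRU update that only keeps the example short; neither the function names nor the numbers come from the talk.

```python
def gated_pass(facts, gates, h0, step):
    """One pass (episode) over the sentence vectors with a scalar gate per sentence."""
    h = list(h0)
    for s, g in zip(facts, gates):
        candidate = step(s, h)  # what a normal recurrent update would produce
        # g == 0 copies the previous state; g == 1 takes the full update.
        h = [g * c + (1.0 - g) * hp for c, hp in zip(candidate, h)]
    return h


def toy_step(s, h):
    """Hypothetical stand-in for a GRU step: average of input and previous state."""
    return [0.5 * (a + b) for a, b in zip(s, h)]


facts = [[1.0, 1.0], [4.0, 2.0], [9.0, 9.0]]
h0 = [0.0, 0.0]

# Only the second sentence is judged relevant, so only it changes the state.
episode = gated_pass(facts, [0.0, 1.0, 0.0], h0, toy_step)

# With every gate at zero the whole input is skipped and h0 comes back unchanged.
skipped = gated_pass(facts, [0.0, 0.0, 0.0], h0, toy_step)
```

In a full model `step` would be the GRU cell and the gates would come from the attention mechanism rather than being hard-coded.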
all right and there are lots of examples
like Mary traveled to the hallway
that are just completely irrelevant to
answering the current question in those
cases this g will be zero and we're just
copying the previous hidden state of
this recurrent neural network over
otherwise we'll have a standard GRU
model so now of course the big question
is how do we compute this G and this
might look a little ugly but it's quite
simple basically we're going to compute
two vector similarities a multiplicative
and an additive one with absolute values of
all the single values between the sentence
vector that we currently have and the
question vector and the memory
state of the previous pass over the input
in the first pass over the input the
memory state is initialized to be just the
question and then afterwards it
agglomerates relevant facts so
intuitively here if the sentence
mentions John for instance and the
question is or mentions football and the
question is where is the football
then you'd hope that the question vector
Q has some units that are more
active because football was mentioned
and the sentence vector mentions
football so there's some units that are
more active because football is
mentioned and hence some of these inner
products or absolute values of
subtractions are going to be large and
then what we're going to do is just plug
that into a standard
single layer neural network and then a
standard linear layer here and then we
apply a soft max to essentially weight
all of these different potential
sentences that we might have to compute
the final gate so this will basically be a
soft attention mechanism that sums to
one and we'll pay most attention to the
facts that seem most relevant given what
I
know so far and the question then when the
end of the input is reached all these
relevant facts here are summarized in
another GRU that basically moves up here
and you can train a classifier also if
you have the right kind of supervision
to basically train that the model knows
enough to actually answer the question
and stop iterating over the inputs if
you don't have that kind of supervision
you can also just say I will go over the
inputs a fixed number of times and that
that works reasonably well too all right
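[Editor's sketch] The attention gate computation described above can be sketched in pure Python: element-wise products and absolute differences between the sentence vector, the question vector, and the memory vector are fed through one hidden layer and a linear layer, and a softmax over the sentences gives gates that sum to one. The weight names `W1` and `w2`, the toy vectors, the tanh hidden layer, and the random initialisation are assumptions for illustration, not details confirmed by the talk.

```python
import math
import random


def features(s, q, m):
    """Similarity features: products and absolute differences with q and m."""
    return ([a * b for a, b in zip(s, q)] + [a * b for a, b in zip(s, m)]
            + [abs(a - b) for a, b in zip(s, q)] + [abs(a - b) for a, b in zip(s, m)])


def score(z, W1, w2):
    """Single hidden layer (tanh assumed) followed by a linear layer to a scalar."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, z))) for row in W1]
    return sum(w * h for w, h in zip(w2, hidden))


def attention_gates(sentences, q, m, W1, w2):
    """Softmax over the per-sentence scores, so the gates sum to one."""
    scores = [score(features(s, q, m), W1, w2) for s in sentences]
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
dim, hidden_dim = 3, 5
W1 = [[rng.uniform(-1.0, 1.0) for _ in range(4 * dim)] for _ in range(hidden_dim)]
w2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]

sentences = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.3], [0.7, 0.0, 0.6]]
q = [1.0, 0.0, 0.5]
m = list(q)  # on the first pass the memory is initialised to the question

gates = attention_gates(sentences, q, m, W1, w2)
```

With untrained random weights the gates are arbitrary; training is what would push the mass toward sentences that share active units with the question and memory.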
there's a lot to sink in so I'll give
you a couple seconds basically we pay
attention to different facts given a
certain question we iterate over the
input multiple times and we agglomerate
the facts that seem relevant given the
current knowledge and the question now I
don't usually talk about neuroscience
I'm not a neuroscientist but there is a
very interesting relationship here that
a friend of mine Sam Gershman pointed
out which is that the episodic memory in
general for humans is actually the
memory of autobiographical events so
it's the time when we remember the first
time I went to school or something like
that and essentially a collection of our
past personal experiences that occurred
at a particular time in a particular
place and just like our episodic memory
that can be triggered with a variety of
different inputs this is also this
episodic memory is also triggered with a
specific question at hand and what's
also interesting is the hippocampus
which is a seat of the episodic memory
in humans is actually active during
transitive inference so transitive
inference is you know going from A to B
to C to have some connection from A to C
or in this case here with this football
for instance you first had to find facts
about John and the football and then
finding where John was and then find the
location of John so those are examples
of transitive inference and it turns out
that you also need in the dmn these
multiple passes to enable the capability
to do transitive inference now the final
module again is a very simple GRU and
softmax to produce the final
answers the main difference here is that
instead of just having the
previous hidden state a t minus 1 as
input we'll also include the question at
every time step and we will include the
answer that was generated at the
previous time step but other than that
it's our standard softmax with our
standard cross-entropy error to
minimize and now the beautiful thing of
this whole model is that it's end-to-end
trainable these four different modules
will actually all train based on the
cross entropy of that final softmax all
these different modules communicate with
vectors and we'll just have Delta
messages and back propagation to train
them now there's been a lot of work in
the last two years on models like this
in fact quoc will cover a lot of these
really interesting models tomorrow
different types of memory structures and
so on and the dynamic memory network is
in some sense one of those models one
particular model deserves a proper
comparison because there are a lot of
similarities namely memory networks from
Jason Weston those basically also have
inputs and scoring and attention
response mechanisms the main difference
is that they use different kinds of
basic Lego blocks for these different
kinds of mechanisms for input they use
bag of words representations or
non-linear and linear embeddings for the
attention and responses they have
different kinds of iteratively run
functions the main interesting sort of
difference to the dmn is that the dmn
really uses these recurrent neural network
type sequence models for all of these
different modules and capabilities and
in some sense that helps us to have a
broader range of applications that
include things like sequence tagging and
so let me go over a couple of results
and experiments of this model so the
first one is on this bAbI dataset that
Facebook published it basically has a lot
of these kinds of simple logical
reasoning type questions in fact all
these like where's the
football those were examples from the
Facebook bAbI data set and it also
includes things like yes/no questions
simple counting negation some indefinite
knowledge where the answer might be maybe
basic coreference where you have to
realize what does she
who does she refer to or he reasoning
over time if this happened before that
and so on and basically this dynamic
memory network I think is currently the
state of the art on this data set of
simple logical reasoning now the
problem with this data set is that it's
a synthetic data set and so it had only
a certain set of
human defined generative
functions that created certain patterns
and in that sense solving it even with
sometimes a hundred percent accuracy is
only a necessary and not a sufficient
condition for real question
answering so there's still a lot of
complexity the main interesting bit to
point out here is that there are
different numbers of training examples
for each of these different subtasks and
so you have basically a thousand
examples of simple negation for instance
and it's always a similar kind of
pattern and hence you're able to
classify it very well now in real language
you will never have that many examples
for each type of pattern you want to
learn and so general question
answering is still an open problem and
non-trivial now what's cool is this same
architecture of allowing the model to go
over inputs multiple times also got
state of the art in sentiment analysis
very different kind of task and we
actually analyzed whether it's really
helpful to have multiple passes over the
input and it turns out it is so there's
certain things like reasoning over three
facts or Counting where you really have
to have this dynamic this episodic
memory module and it goes over the input
maybe five times for sentiment it
actually turns out it hurts after going
over the input more than two times and
that's actually one of the things we're
now working on is can we find a
model that does the same thing for every
single input with the same weights to
try to learn these different tasks we can
actually look at a couple of fun
examples of this model and what happens
with tough sentiment sentences generally
to be honest sentiment you can probably
get to like seventy five percent
accuracy with some very simple models
that just basically find like great
words like great and wonderful and
awesome and you'll get to something
that's roughly right here are some of the
examples those are the kinds of
examples that you now need to get right
to try to push the state-of-the-art
further in sentiment analysis so here
the sentence is in its ragged cheap and
unassuming way the movie works so this
sentence is classified incorrectly even if you have
the DMN with this whole
architecture but only allow one pass
over the input once you have two passes
over the input it actually learns to pay
attention not just to these very strong
adjectives but in the end actually to
the movie working so here these fields
are essentially the gating function G
that we defined that pays attention to
specific words and the darker it is the
larger that gate is and the more open it
is and the more that word affects the hidden
state in the episodic memory module so
it goes over the input the first time
pays attention to cheap and unassuming
and way and a little bit of works too
but the second time it basically figured
out it agglomerated sort of the
facts of that sentence and then learned to
pay attention more to specific words
that seem more important just one more
example here my response to the film is
best described as lukewarm so in general
sentiment analysis when you look at
unigram scores the word best is
basically one of the
most positive words you could possibly
use in a sentence and the first time the
model passes over the sentence it also
pays most attention
to this incredibly positive word namely
best but then the second time once it
agglomerated the context it actually
realizes well best actually here is not
used in its adjective way but it's
actually an adverb that best describes
something and what it describes is
actually lukewarm and hence it's
actually a negative sentence so those
are the kinds of examples that you need
to get to now to appreciate improvements
in sentiment analysis where we basically
also went from on this particular data
set these are all neural network type
models that started at 82 until then that
same data set existed for around 8 years
and none of the standard NLP models had
reached above 80% accuracy and now we're
basically in the high 80s and
those are the kinds of improvements that
that you see across a variety of
different NLP tasks now that deep
learning has come and deep learning
techniques are being used in NLP and now
the last task in NLP that this model
turned out to also work well on
is part of speech tagging now part of
speech tagging is less exciting of a
task it's more of an intermediate task
but it's still fascinating to see that
after this data set has been around for
over 20 years
you can still improve the state of the
art with the same kind of architecture
that also did well in fuzzy reasoning
of sentiment and discrete logical
reasoning for for question answering now
we had a new person join the group
Caiming and he thought well that's
cool
but he was more of a computer vision
researcher and so he thought well could
I use this great question-answering
model now to do visual
question-answering so combine sort of
some of what was going on in the group in
NLP and apply it to computer vision
and he did not have to know all of the
different aspects of the code all he had
to do was change the input module from
one that gives you hidden states at each
word over a long sequence of you know
words and sentences to an input module
that would give him vectors
for sequences of regions in an
image and he literally did not touch
some of the other parts of the code he
did have to look carefully at this input
module where again here our basic Lego
block that Andrej introduced really well
is a convolutional neural network and
then the convolutional network
will essentially give us 14 by 14 many
vectors in one of its
top states one representing each region
of an image and then what we'll do is
basically take those vectors and now
replace the word vectors we used to have
with CNN vectors and then plug them into a
GRU now again the GRU we know as our
basic Lego block we already defined it
one addition here is that it'll actually
be a bidirectional GRU one will go once
from left to right in this snake-like
fashion and another one goes from right
to left backwards now both of these will
basically have a hidden state and you can
just concatenate the hidden states of
both of these to compute the final
hidden state for each block of
the image and that model too actually
achieved state-of-the-art results this
data set was only released last
year so everybody now works on deep
learning techniques to try to solve it
and I was at first a little skeptical it
was just too good to be true that this
model we developed for NLP would work so
well so we really dug in to looking at
the attention so what I showed you here
these G values again that we computed
with this equation now instead of paying
attention to words it paid attention to
different regions in the image and we
started basically analyzing going
through a bunch of those on the dev set
and analyzing what is it actually paying
attention to again it's being trained
only with the image the question and the
final answer that's what you get at
training time you do not get this sort
of latent representation of where you
should actually pay
attention to in the image in order to
answer that question correctly so when
the question was what is the main color
on the bus it learned to actually pay
attention here to that bus and well okay
maybe that's not that impressive it's
just the main object in the center of
the image and you know what
type of trees are in the background well
maybe it just you know connects tree
with anything that's green and pays
attention to that so that was neat but you
know not super impressive yet is
this in the wild is kind of more
interesting it actually pays attention
to a man-made structure in the
background and correctly answers no
then this one is kind of interesting who
is on both photos the answer is girl now
to be honest I don't think the model
actually knows that there are two people
and tries to match them and so on it just
finds the main person or main object
in the scene the main object is
a little baby girl so it says girl this
one's also relatively trivial what time
of day was this picture taken the
answer is night because it's a very dark
picture at least in the sky now this one
is getting a little more interesting
what is the boy holding the answer is a
surfboard and it actually does pay
attention to both of the arms and then
what's just below that arm so that's a
little more interesting kind of
attention visualization and then for a
while we were also worried well what if in
the data set it just learns really well
from language alone yes it pays
attention to things but maybe it'll just
say things that it often sees in the
text so if I asked you what or what
color are the bananas you don't really
have to look at an image in 95% of the
cases you're right just saying yellow
without seeing an image so it was really
this one I was kind of excited about
because it actually paid attention to
the bananas in the middle and then did
say green and kind of overruled the
prior that it would get from from
language alone what's the pattern on the
cat's fur on its tail pays attention
mostly to the tail and says stripes now
this one here was interesting did
the player hit the ball the answer is
yes though I have to say that we later
had a journalist want to ask his own
question John Markoff from
the New York Times and we had just put together
this demo the night before and he's
like well I want to ask my own question
and I am like okay and he asked is the
girl wearing a hat and you know it
wasn't made for production so it's kind
of slow and the system was cranking and I'm
like well you know like trying to come
up with excuses it's kind of a black
background and a black hat and it
might be kind of hard to see and
fortunately it got it right and said
yes and then after the interview I said
well maybe let's look and see
let me just ask it myself in a less
stressful situation a bunch of questions
on my own and these are all the
questions like the first eight questions
that I could come up with and somewhat
to my surprise it actually got them all
right so what is the girl holding a
tennis racket what's she playing playing
tennis or what's she doing what is the girl
wearing shorts what is the color of the
ground brown then I was like well okay
let's try to break it by asking just
like what's the color of
the smallest object the ball it
actually got that right too what color is her
skirt white also kind of interesting
like when you asked the model what she's
wearing shorts but then you asked about
the skirt and it still sort of is you
know sort of capturing that you might
call this different things
what and then this one was interesting
what did the girl just hit tennis ball
and then I was like well what if I asked is
the girl about to hit the tennis ball
and it said yes and then did the girl just
hit the tennis ball and it said yes
again so then I finally found a way to
break it so it doesn't have enough
co-occurrence statistics to understand and
again scare quotes understand sort of
which angle does the arm have to be at in
order to assume that the ball was just
hit or is about to be hit but what it basically
does show us is that once it saw a lot
of examples on a specific domain it
really can capture quite a lot of
different things now let's see if we can get
the demo up I have to be on VPN to make
it work but so here's
one example the best way to hope for any
chance of enjoying this film is by
lowering your expectations again one of
those kinds of sentences that you have
to now get correct in order to get
improved performance on sentiment and it
actually correctly says that
this is negative now we can also
actually ask that question in Chinese
this is one of the beautiful things
of the DMN and in general really of most
deep learning techniques we don't have
to be experts in a domain or even in a
language to create a very very accurate
model for for that language or that
domain there's no more feature
engineering I'm not going to make a fool
of myself trying to read that one out
loud but that's an interesting example
you can also ask what parts of
speech are there you can have other
things like you know named entities and
other sequence problems I can also ask
what are the men wearing on the head
answers helmets and then maybe a
slightly more interesting question why
are the men wearing helmets and the
answer is safety so especially we're
close to the circle of death here at
Stanford where a lot of bikes crash and
it's a good answer all right with that
I'll leave a couple of minutes for
questions so basically the summary is
word vectors and recurrent neural
networks are super useful building
blocks once you really appreciate and
understand those two building blocks
you're kind of ready to have some fun
and build more complex models really in
the end this DMN is a way to combine
them in a variety of new ways into a
larger more complex model and that's
also where I think the state of deep
learning is for natural language
processing we've tackled a lot of these
smaller sub-problems intermediate tasks
and now we can work on more interesting
complex problems like dialogue and
question answering machine translation
and things like that all right
thank you
all right cool yeah a quick
question in the dynamic memory network
you have the RNN and you also
mentioned that if you have better
assumptions about the input right so you
used to work on the tree LSTM right so
if you change the RNN into a tree
structure would that help it's a good
question I actually love tree structures
I did my whole PhD about tree structures
and somewhat surprisingly in the last
couple of weeks there were actually some new
results on SNLI the Stanford natural
language inference data set where tree
structures are again the state of the
art and I have to say that I think the
the dynamic memory Network by having
this ability in the episodic memory to
keep track of different sub phrases and
pay attention to those and then combine
them over multiple passes I think you
can kind of get away with not having
tree structures so yes you might have a
slight improvement representing
sentences as trees in your input module
but I think it's only going to be
slight and I think the episodic memory
module that has this capability to go
over the input multiple times pay
attention to certain sub phrases will
capture a lot of the kinds of
complexities that you might want to
capture in tree structures so my
short answer is I don't think you
necessarily need it have you tried it we
have not no thanks
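The idea in that answer, an episodic memory that goes over the input several times and attends to different facts on each pass, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the facts are hand-made feature vectors rather than GRU states, the gate is a plain dot-product similarity followed by a softmax rather than a learned gate network, and the memory update is a simple sum rather than a GRU. It is not the DMN implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def episodic_memory(facts, question, passes=2):
    memory = list(question)  # the memory starts out as the question vector
    history = []
    for _ in range(passes):
        # Hypothetical gate: how similar is each fact to question and memory.
        gates = softmax([dot(f, question) + dot(f, memory) for f in facts])
        # Episode: attention-weighted sum of the facts.
        episode = [sum(g * f[d] for g, f in zip(gates, facts))
                   for d in range(len(question))]
        # Simplified memory update (a learned GRU or layer in a real model).
        memory = [m + e for m, e in zip(memory, episode)]
        history.append(gates)
    return memory, history

# Toy hand-made features: [football, john, hallway, mary, kitchen]
facts = [
    [1, 1, 0, 0, 0],  # john put down the football
    [0, 1, 1, 0, 0],  # john went to the hallway
    [0, 0, 0, 1, 1],  # mary went to the kitchen
]
question = [1, 0, 0, 0, 0]  # where is the football

memory, history = episodic_memory(facts, question)
for i, gates in enumerate(history, 1):
    print("pass", i, [round(g, 3) for g in gates])
```

On the first pass the question alone cannot tell the second and third facts apart, so they get equal weight. After the memory has absorbed the football fact, which mentions john, the second pass gives the john fact more weight than the mary fact. That is the kind of multi-pass behaviour the answer relies on instead of an explicit tree structure.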
hi my question is about question
answering say if we want to apply
question answering to some specific domains
like healthcare but we don't
really have the data we don't have
question-answer pairs what should we do
are there any general principles here
it's a great question what do you do if
you want to do question answering on a
complex domain and you don't have the data I
think and this feels maybe like a
cop-out but I think it's very true both
in practice and in theory create the
data like if you cannot possibly create
more than a thousand examples of
anything then maybe automating that
process
is not that important so clearly you
should be able to create some data and
in many cases that is the best use of
your time is just to sit down or ask the
domain expert to create a lot of
questions and then have people find the
answers and then measure how they
actually get to those answers try to
have them in a constrained environment
and so on I think most companies for
instance when you try to do automated
email replies which is in some ways a
little bit similar to question answering
well that's a nice domain
because everybody had already emailed
and those were already answered before so
you can use sort of past behavior now if
you had a search engine where people
asked a lot of questions then you can
also use that to bootstrap and see
where did they actually fail and then
take all those really tough queries
where they failed have some humans sit
there and collect the data so that's
that's the simplest answer now the other
answer is let's work together for the
next like many years on research for
smaller training data set sizes and
complex reasoning the fact of the
matter for that line of research will
still be if a system has never
seen a certain type of reasoning it'll be
hard for the system to pick up that
type of reasoning I think we're going to
get with these kinds of architectures to
the space where at least if it has seen
this type of reasoning a specific type
of transitive reasoning or temporal
reasoning or sort of cause and effect
type reasoning at least like a couple
hundred times then you should be able to
train a system with these kinds of
models to do it are these QA systems
currently robust to false input or
questions for the woman playing tennis
if you asked what's the man holding
would it reply there is no man it
would not and largely because at
training time you never try to mess with
it like that I'm pretty sure if you
added a lot of training examples where
you had those it would probably
eventually pick it up those would be
important for like real-world
implementations and so real-world
implementations of this in security are
actually kind of tricky I think
whenever you train a system we know we
can for instance both steal certain
classifiers by using them a lot
we know we can fool them into
classifying certain images for instance
as others we have folks in the audience
who worked on that exact line of work so
I would be careful using it in security
environments right now yeah I have a
question oh wow
up there yeah I have a question actually
uh there was a slide where you had the
input module and there were a bunch
of sentences so were those sentences
themselves RNNs because you know
a sequence is basically made up of those
individual words in say GloVe you know
representation so were those you know
also RNNs that were you know
stitched together so the answer there
is a little complex because we have
two papers with the DMN and the answer
is different for each in
the simplest form of that it is
actually a single
GRU that goes from the first word
through all the sentences as if they
are one gigantic sequence but it has
access to each sentence period at the
end to pay special attention to the
end of sentences and so yes in the
simplest form it is just a GRU that
goes over all the words is this a normal
process to basically just concatenate
all the sentences into one gigantic you
know so the answer there and this is
kind of why I split the the talk into
three different ones from like words
single sentences and in multiple
sentences I think if you just had a
single gru that goes over everything and
now you try to reason over that entire
sequence it would not work very well
you need to have an additional
structure such as an attention mechanism
or a pointer mechanism that has the
ability to pay attention to specific
parts of your input to do that very
accurately but yeah in general that's
fine as long as you have this additional
mechanism thank you thank you great
question so in the recurrent neural Nets
you're using sigmoids
in visual recognition I guess
rectified linear units are the more
popular non-linearity that's right so
ReLUs are great now when you look
at the GRU equations here and you have
these reset gates and so these reset
gates here
you want them to essentially be
between zero and one so that it can
either ignore this input entirely or you
have it normally be part of the
computation of h tilde so in some cases
you really do want to have sigmoids
there but other ones for instance some
like simpler things where you actually
don't have that much recurrence such as
going from one memory state to another
in the second iteration of this model
actually ReLUs were
good activation functions too did
you guys try to after training this
network try to take these weights for
the images and do object detection again
so these weights would be augmented with
the text vectors did you try to use that
is a very cool idea that we did not
explore no there you go you got to do it
fast
yeah this field is moving fast you
just let the cat out of the box so
those attention models are pretty
powerful when you have an opportunity
data and then you can learn you know to
make make yourself with data but even
though those are some of the tasks are
pretty trivial to a human but they're
hard for a model to learn so what do you
think because right now even right
now we have a lot of knowledge bases on the web
right like Wikipedia we know a
lot about you know common sense so
what do you think about incorporating those
knowledge bases into those models I
actually love that line of research too
and that was kind of what we started out
with this semantic memory module which in the
simplest form is just word vectors I
think one next iteration would
actually be to have knowledge bases also
influence the reasoning there's very
little work on combining text and
knowledge bases to do overall complex
question answering that requires
reasoning I think it's a phenomenally
interesting area of research so are there
any hints or any starting point
about it so there
are some papers that do
reasoning over knowledge bases alone so
we had a paper on recursive neural tensor
networks that basically take a triplet
a vector for an entity might be in
Freebase might be in WordNet
a vector for a relationship and a vector
for another entity and then basically
pipe them into a neural network and say
yes no are these two entities actually
in that relationship and you can have a
variety of different architectures I
think Samy did work on that as well
wait that's a different brother
different Bengio I think over there
all right it's true that's true yeah
Antoine Bordes right that's right
that's right so I think you can also
reason over knowledge graphs and you
could then try to combine that with
reasoning over fuzzy text it
all has been done I think nobody
has yet really combined it in a
principled way
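The triplet scoring mentioned in that answer can be sketched as follows. This is a hedged illustration of a neural tensor network style score, u · tanh(e1ᵀ W e2 + V [e1; e2] + b), with one tensor slice per hidden unit. The entity names, the relation name, the dimensions and the random untrained parameters are all made up for the example, and squashing the score through a sigmoid to get a yes/no probability is an assumption of this sketch rather than something stated in the talk.

```python
import math
import random

def ntn_score(e1, e2, rel):
    """Score one (entity, relation, entity) triplet; higher means more plausible."""
    hidden = []
    for W_k, v_k, b_k in zip(rel["W"], rel["V"], rel["b"]):
        # Bilinear term e1^T W_k e2 lets the two entity vectors interact directly.
        bilinear = sum(e1[i] * W_k[i][j] * e2[j]
                       for i in range(len(e1)) for j in range(len(e2)))
        # Standard linear layer on the concatenation [e1; e2].
        linear = sum(v * x for v, x in zip(v_k, e1 + e2))
        hidden.append(math.tanh(bilinear + linear + b_k))
    return sum(u * h for u, h in zip(rel["u"], hidden))

def random_relation(dim, slices, rng):
    """Untrained, random parameters for one relation (illustration only)."""
    def r():
        return rng.uniform(-0.5, 0.5)
    return {
        "W": [[[r() for _ in range(dim)] for _ in range(dim)] for _ in range(slices)],
        "V": [[r() for _ in range(2 * dim)] for _ in range(slices)],
        "b": [r() for _ in range(slices)],
        "u": [r() for _ in range(slices)],
    }

rng = random.Random(0)
DIM, SLICES = 4, 3  # toy sizes
# Hypothetical entities and relation; real vectors would be learned.
entities = {name: [rng.gauss(0, 1) for _ in range(DIM)] for name in ("cat", "tail")}
has_part = random_relation(DIM, SLICES, rng)

score = ntn_score(entities["cat"], entities["tail"], has_part)
prob = 1.0 / (1.0 + math.exp(-score))  # assumed yes/no squashing
print(round(prob, 3))
```

With trained parameters, each relation gets its own set of weights while the entity vectors are shared, so the same entity vector can take part in many different relationships.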
great question yeah one last question
a whole question so say the model
answers my questions correctly so how do
I check the model actually
understood my question and what
was the model's logic
behind that it's a good question in some
ways it's a common question for
neural network interpretability so
in computer vision sometimes we can
at least visualize the features
right so how about the right and so I
think the best thing that we could do
right now is to show these attention
scores where you know for sentiment
we're like oh how did it come up with the
sentiment oh it paid attention to the
movie working and likewise for question
answering we can see like which facts
which sentences it actually pays
attention to in order to answer that
overall question so that is I think the
best answer that we could come up with
right now but yeah there's certain
other complexities there that's still an
area of open research thank you all
right thank you everybody
so thank you Richard we'll take another
coffee break for 30 minutes so please
come back at 2:45 for a presentation
by Sherry Moore