Transcript
oGk1v1jQITw • Deep Learning for Natural Language Processing (Richard Socher, Salesforce)
Thank you everybody, thanks for coming back so soon after lunch; I'll try to make it entertaining to avoid the post-food coma. I owe a lot to being here, to Andrew and Chris; I did my PhD at Stanford, and it's always fun to be back. I figured there's going to be a broad range of capabilities in the room, so I'm sorry, I will probably bore some of you for the first two-thirds of the talk, because I'll go over the basics of what's NLP, natural language processing, what's deep learning, and what's really at the intersection of the two. In the last third I'll talk a little bit about some exciting new research that's happening right now.

So let's get started with: what is natural language processing? It's really a field at the intersection of computer science, AI, and linguistics. You could define a lot of goals, and a lot of these statements we could philosophize about at length, but I'll move through them pretty quickly. For me, the goal of natural language processing is for computers to process or, scare quotes, "understand" natural language in order to perform tasks that are actually useful for people, such as question answering. The caveat is that fully understanding and representing the meaning of language, or even defining it, is quite an elusive goal. So whenever I say the model "understands", I shouldn't really say that: these models don't understand language in the sense that we understand it. Whenever somebody says they can read or represent the full meaning in its entire glory, it's usually not quite true. Perfect language understanding is in some sense AI-complete, in the sense that you need to understand visual input and thought and a lot of other complex things.

A little more concretely, as we try to tackle this overall problem of understanding language, what are the different levels we often look at? For many people it starts at speech. Once you have speech, you might say: all right, now I know what phonemes are, smaller parts of words; next I want to understand how words are formed, morphology or morphological analysis. Once I know what the meanings of words are, I might try to understand how they're put together in grammatical ways, such that the sentences are understandable, or at least grammatically correct, to a lot of speakers of the language. Once we understand the structure, we actually want to get to the meaning, and that's where most of my interest lies: semantic interpretation, actually trying to get to the meaning in some useful capacity. And after that we might say: well, if we now understand the meaning of the whole sentence, how do we actually interact? What's the discourse, how do we have a spoken dialogue system, and things like that.

Where deep learning has really improved the state of the art significantly is in speech recognition, syntax, and semantics. The interesting thing is that we're actually skipping some of these levels: deep learning often doesn't require morphological analysis to create very useful systems, and in some cases it skips syntactic analysis entirely as well. It doesn't have to know about the grammar, it doesn't have to be taught what noun phrases or prepositional phrases are; it can get straight to some semantically useful tasks right away. That's going to be one of the advantages: we don't have to be as inspired by linguistics as traditional natural language processing had to be.
So why is NLP hard? Well, there's a lot of complexity in representing, learning, and especially using linguistic, situational, world, and visual knowledge. All of these are connected when it comes to the meaning of language; to really understand what "red" means, can you do that without visual understanding? For instance, take the sentence "Jane hit June and then she fell", or "and then she ran". Depending on which verb comes after "she", the meaning of "she" actually changes. This is one subtask you might look at, so-called anaphora resolution, or coreference resolution in general, where you try to understand who "she" actually refers to, and it really depends on the meaning, again somewhat scare quotes, of the verb that follows the pronoun.

Similarly, there's a lot of ambiguity. Here we have a very simple four-word sentence: "I made her duck". That simple sentence can actually have at least four different meanings if you think about it for a little bit: you made her a duck, for Christmas or for dinner; you made her duck, like me just now; and so on. There are actually four different meanings, and to know which one is meant requires, in some sense, situational awareness or knowledge to really disambiguate what is meant here.

So that's the high level of NLP. Now, where does it actually become useful in terms of applications? They range from very simple things that we take as given now and use all the time every day, to more and more complex ones, and then ones more in the realm of research. The simple ones are things like spell checking, or keyword search and finding synonyms in a thesaurus. The medium-difficulty ones are extracting information from websites: trying to extract product prices, or dates, locations, people, or company names, which is called named entity recognition. You can go a little bit above that and try to classify reading levels for school texts, for instance, or do sentiment analysis; that can be helpful if you have a lot of customer emails coming in and you want to prioritize the ones from customers that really need review right now. And then the really hard ones, and I think in some sense the most interesting ones, are machine translation, trying to actually translate between all the different languages in the world; question answering, clearly a very exciting and useful piece of technology, especially over very large, complex domains; automated email replies, and I know pretty much everybody here would love to have a simple automated email reply system; and spoken dialogue systems, bots are very hip right now. These are all complex things that are still in the realm of research to do really well. We're making huge progress with deep learning on them, but they're still nowhere near human accuracy.

So let's look at the representations. I mentioned we have morphology and words and syntax and semantics and so on. We can look at one example, namely machine translation, and ask: how did people try to solve this problem? It turns out they actually tried all these different levels, with varying degrees of success. You can try to have a direct translation of words to other words. The problem is that's often a very tricky mapping: the meaning of one word in English might map to three different words in German, and vice versa, three different words in English might all mean the same single word in German, for instance.
So then people said: well, let's try to do syntactic transfer, where we map whole phrases; "to kick the bucket" just means "sterben" in German, okay, not a fun example. And then semantic transfer might be: let's try to find a logical representation of the whole sentence, the actual meaning in some human-understandable form, and try to find another surface representation of that. Of course, that will also get rid of a lot of the subtleties of language, so there are tricky problems in all these kinds of representations.

Now, the question is: what does deep learning do? You've already seen at least two methods before, standard neural networks and convolutional neural networks for vision, and in some sense there's going to be a huge similarity here to those methods. Just like images, which are essentially a long list of numbers, a vector, and just like standard neural networks, where the hidden state is also just a vector, a list of numbers, that is also going to be the main representation we use throughout: for characters, for words, for short phrases, for sentences, and in some cases for entire documents, they will all be vectors.

With that, we're finishing up the whirlwind tour of what's NLP. Of course, you could give an entire lecture on almost every single slide; we're at a very high level, but we'll continue at that speed to try to squeeze this complex subject of deep learning for NLP into an hour and a half. I think there are two of the most important basic Lego blocks that you nowadays want to know in order to creatively play around with more complex models, and those are going to be word vectors and sequence models, namely recurrent neural networks. I've split this into words, sentences, and multiple sentences, but really you could use recurrent neural networks for shorter phrases as well as for multiple sentences; in many cases we'll see that they have some limitations as you move to longer and longer sequences and just use the default neural network sequence models.

All right, so let's start with words, and maybe one last blast from the past. To represent the meaning of words, we actually used to use a taxonomy like WordNet, which defines each word in relationship to lots of other ones. You can, for instance, define hypernyms, is-a relationships: you might say the word "panda", in its first meaning as a noun, basically goes up through this complex directed acyclic graph (most of it is roughly just a tree), and in the end, like everything, it is an entity; but it's also a physical entity, a type of object, a whole object, a living thing, an organism, an animal, and so on. So you can define a word like this. And at each node of this tree you have so-called synsets, synonym sets. Here's an example of the synonym set of the word "good": "good" can have a lot of different meanings, and can actually be an adjective as well as an adverb as well as a noun.

Now, what are the problems with this kind of discrete representation? They can be great as a resource if you're a human and want to find synonyms, but they're never going to be quite sufficient to capture all the nuances that we have in language. For instance, the synonyms here for "good" were "adept", "expert", "practiced", "proficient", and "skillful", but of course you would use these words in slightly different contexts; you would not use the word "expert" in exactly the same contexts as you would use the word "good".
Likewise, it will be missing a lot of new words. Language is this interesting living organism; we change it all the time. You might have some kids, they say "YOLO", and all of a sudden you need to update your dictionary. Likewise, in Silicon Valley you might see "ninja" a lot, and now you need to update your dictionary again. That is basically going to be a Sisyphean job: nobody will ever be able to really capture all the meanings in this living, breathing organism that language is. It's also very subjective: some people might think "ninja" should just be deleted from the dictionary, or that "nifty" or "badass" is kind of a silly word that should not be included in a proper dictionary, but it's being used in real language, and so on. It requires human labor: as soon as you change your domain, you have to ask people to update it. And it's also hard to compute accurate word similarities; some of these words are subtly different, and it's really a continuum on which we measure their similarities.

So instead, what we're going to use, and what is also the first step for deep learning in NLP (we'll realize it's not quite deep learning in many cases, but it is the first step toward using deep learning in NLP), is distributional similarity. What does that mean? The idea is that we'll use the neighbors of a word to represent the word itself. It's a pretty old concept. Here's an example for the word "banking": we might represent "banking" in terms of all the other words that appear around it.

Let's do a very simple example, where we look at a window around each word. Say the window length, just for simplicity, is one: we represent each word only with the word one to the left and one to the right of it, a symmetric context around each word. And here's a simple example corpus. Of course we would always want to use corpora with billions of words instead of just a couple, but just to give you an idea of what's being captured in these word vectors: "I like deep learning", "I like NLP", and "I enjoy flying". This gives a very simple so-called co-occurrence statistic: "I", for instance, has "like" appear twice in its window of size one, and "enjoy" once in its context. For "like", you have "I" twice to its left, and "deep" once and "NLP" once. It turns out that if you just take those counts, each row could be a vector representation for a word. Unfortunately, as soon as your vocabulary increases, that vector dimensionality changes, and hence you'd have to retrain your whole model; it's also very sparse, and really somewhat noisy if you use that vector directly. A better thing to do is to run SVD, or some similarly simple dimensionality reduction, on such a co-occurrence matrix, and that actually gives you a reasonable first approximation to word vectors: a very old method that works reasonably well.
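To make the window counting concrete, here is a minimal sketch in Python (numpy only) that builds the size-1-window co-occurrence matrix for the toy corpus above and reduces it with a truncated SVD; the variable names are mine, not from the talk.

```python
import numpy as np

corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]

# Vocabulary (sorted for determinism).
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts with window size 1.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                X[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the top k singular directions as k-dim word vectors.
U, s, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * s[:k]          # one k-dimensional row per word
print(dict(zip(vocab, word_vectors.round(2))))
```

Running this reproduces the counts from the talk, for example X["I", "like"] = 2 and X["I", "enjoy"] = 1, before the SVD compresses the sparse rows into dense low-dimensional vectors.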
Now, what works even better than a simple PCA or SVD is a model introduced by Tomas Mikolov in 2013 called word2vec. Instead of capturing co-occurrence counts directly in a matrix like that, you go through each window in a large corpus, take the word in the center of each window, and use it to predict the words around it. That way you can train very quickly; you can train almost online (though few people do this) and add words to the vocabulary very quickly in this streaming fashion.

So let's look a little bit at this word2vec model, because it's the first very simple NLP model and it's very instructive. We won't go into too many details, but let's at least look at a couple of equations. Again, the main goal is to predict the words in a window of some length m, a hyperparameter, around every word. The objective function will essentially try to maximize the log probability of any of these context words given the center word. We go through our entire corpus, a very long sequence of T words, and at each time step t we look at all the words in the context of the current word t and try to maximize the probability of predicting each word around it:

J(θ) = (1/T) Σ_{t=1..T} Σ_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)

where θ are all the parameters, namely all the word vectors, that we want to optimize.

Now, how do we actually define this probability p? The simplest way to do this, and this is not the actual way used in practice, but the simplest and first way to understand and derive the model, is with a very simple inner product. (That's also why we can't quite call this "deep": there aren't many layers of nonlinearities like we see in deep neural networks, just a simple inner product.) Here c is the center word and o is the outside word, and the higher the inner product u_o^T v_c is, the more likely the two are to predict one another; both are just standard n-dimensional vectors. To get a real probability, we apply a softmax over all the potential inner products in the vocabulary:

p(o | c) = exp(u_o^T v_c) / Σ_{w=1..V} exp(u_w^T v_c)

One thing you'll notice is that this denominator is a very large sum: we'd have to sum over all potential inner products for every single window, and that would be too slow. So the real methods approximate this sum in a variety of clever ways. I could literally talk for the next hour and a half just about how to optimize the details of this equation, but then we'd all deplete our mental energy for the rest of the day, so I'll just point you to the class I taught earlier this year, CS224d; we have lots of slides that go into all the details of this equation, how to approximate it, and how to optimize it. It's going to be very similar to the way we optimize any other neural network: we use stochastic gradient descent, look at mini-batches of a couple of hundred windows at a time, update those word vectors, and take simple gradients of each of these vectors as we go through windows in a large corpus.
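As a concrete illustration of that objective, here is a minimal numpy sketch of one stochastic gradient step for the naive full-softmax version of skip-gram described above; real implementations approximate the softmax (negative sampling, hierarchical softmax), and all names here are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n = 5000, 100          # vocabulary size, vector dimension
V_in  = rng.normal(scale=0.1, size=(V, n))   # center-word vectors v_c
U_out = rng.normal(scale=0.1, size=(V, n))   # outside-word vectors u_o

def sgd_step(center, outside, lr=0.05):
    """One step maximizing log p(outside | center) with the full softmax."""
    v_c = V_in[center]
    scores = U_out @ v_c                      # inner products u_w . v_c
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # softmax over the whole vocabulary
    # Gradient of -log p(o|c): expected outside vector minus the true one.
    grad_vc = U_out.T @ p - U_out[outside]
    grad_U  = np.outer(p, v_c)
    grad_U[outside] -= v_c
    V_in[center] -= lr * grad_vc
    U_out        -= lr * grad_U               # touching all V rows: why this is slow
    return -np.log(p[outside])                # loss, for monitoring

# A window over "... the price of wood ..." would yield (center, outside)
# pairs like (price, the) and (price, of); we'd call sgd_step on each.
loss = sgd_step(center=42, outside=1337)
```

Note that the update touches every row of U_out, which is exactly the expensive denominator the talk mentions; the clever approximations exist to avoid that full pass.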
All right. We briefly mentioned PCA-like methods, often based on the singular value decomposition or just standard, simple PCA, and we had this word2vec model. There's actually one model that combines the best of both worlds, namely GloVe, global vectors, introduced by Jeffrey Pennington in 2014. It has a very similar idea, and you'll notice some similarity: you have this inner product again for different pairs of words, but this model goes over the co-occurrence matrix. Once you have the co-occurrence matrix, it's much more efficient to predict once how often two words appear next to each other, rather than doing it fifty times, each time that pair appears in an actual corpus. So in some sense you can go through all the co-occurrence statistics more efficiently. You basically try to minimize this difference: each inner product tries to approximate the log of how often the two words actually co-occur. Then there's this weighting function f, which allows us not to overly weight certain pairs that occur very, very frequently: "the", for instance, co-occurs with lots of different words, and you want to lower the importance of all the words that co-occur with it. You can train this very fast, and it scales to gigantic corpora; in fact, we trained it on Common Crawl, which is a really great dataset of most of the internet, many billions of tokens. It also gets very good performance on small corpora, because it makes very efficient use of these co-occurrence statistics.

That's essentially what word vectors are always capturing. If you just want to remember one sentence every time you hear "word vectors" in deep learning: one, they're not quite deep, even though we call them step one of deep learning; and two, they're really just capturing co-occurrence counts, how often a word appears in the context of other words.

So let's look at some interesting results of these GloVe vectors. The first thing we do is look at nearest neighbors. Now that we have these n-dimensional vectors (usually n is between 50 and at most 500; good general numbers are 100 or 200 dimensions), each word is represented as a single vector, and we can look in this vector space for words that appear close by. We started by looking for the nearest neighbors of "frog", and the results were a little confusing, since we're not biologists; fortunately, when you look up in Google what those words mean, you see that they are indeed all different kinds of frogs. Some appear very rarely in the corpus, and others, like "toad", are much more frequent.

Now, one of the most exciting results that came out of word vectors is these word analogies. The idea is: can there be relationships between different word vectors that simply fall out of very simple, linear addition and subtraction? So: man is to woman as king is to what? What is the right analogy when I try to fill in the last missing word? The way we do this is very simple cosine similarity: take the vector of "woman", subtract the word vector of "man", add the word vector of "king", and the resulting vector, the argmax of cosine similarity to it, turns out to be "queen" for a lot of these different models. That was very surprising. Again, we're capturing co-occurrence statistics: "man" might often have in its context things like running and fighting and other silly things that men do; you subtract those kinds of context words and add others, and in some sense it's intuitive, though surprising, that it works out that well for so many different examples.
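A minimal sketch of that analogy test, assuming you already have a matrix of unit-normalized GloVe-style vectors; the loading details and names are mine.

```python
import numpy as np

def analogy(a, b, c, vecs, vocab, topk=1):
    """Return the word(s) d maximizing cos(vec_d, b - a + c), e.g.
    analogy('man', 'woman', 'king', ...) should give 'queen'."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = vecs[idx[b]] - vecs[idx[a]] + vecs[idx[c]]
    target /= np.linalg.norm(target)
    sims = vecs @ target                  # cosine sims, given unit-norm rows
    for w in (a, b, c):                   # exclude the query words themselves
        sims[idx[w]] = -np.inf
    best = np.argsort(-sims)[:topk]
    return [vocab[i] for i in best]

# vocab: list of words; vecs: (len(vocab), n) array with unit-norm rows.
```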
Here are some other examples, similar to the king and queen one, where we took these 200-dimensional vectors and projected them down to two dimensions, again with a very simple method like PCA. What we find, quite interestingly, is that even in just the first two principal components of this space, we have some very interesting female-male relationships: man is to woman as uncle is to aunt, brother to sister, sir to madam, and so on. So this is an interesting semantic relationship that falls out of essentially co-occurrence counts in specific windows around each word in a large corpus. Here's another one that's more of a syntactic relationship: superlatives. "Slow, slower, slowest" is in a similar vector relationship to "short, shorter, shortest" or "strong, stronger, strongest".

This was very exciting, and of course, when you see an interesting qualitative result, you want to quantify who can do better at these analogies, and what the different models and hyperparameters are that modify performance. This is something you'll notice in pretty much every deep learning project ever: more data will give you better performance. It's probably the single most useful thing you can do for a machine learning or deep learning system, training it with more data, and we found that too. There are different vector sizes, too, which is a common hyperparameter; like I said, usually between 50 and 500. Here, 300 dimensions essentially gave us the best performance for these different kinds of semantic and syntactic relationships.

Now, in many ways, having a single vector per word can be oversimplifying, right? Some words have multiple meanings; maybe they should have multiple vectors. Sometimes the meaning of a word changes over time, and so on. So there are a lot of simplifying assumptions here, but again, our final goal for deep NLP is to create useful systems, and it turns out this is a useful first step toward systems that mimic some human language behavior in order to create useful applications for us.

All right. Word vectors are very useful, but words of course never appear in isolation, and what we really want to do is understand words in their context. So this leads us to the second section, on recurrent neural networks. We already went over the basic definition of standard neural networks. Really, the main difference between a standard neural network and a recurrent neural network, which I'll abbreviate as RNN from now on, is that we tie the weights at each time step, and that allows us to condition the neural network on all the previous words, in theory; in practice, given how we can optimize it, it won't really be all the previous words, more like at most the last 30 words, but in theory that's what this powerful model can do.

So let's look at the definition of a recurrent neural network, and this is going to be a very important definition, so we'll go into a bit of detail. Let's assume for now we have our word vectors as given, and we represent each sequence, in the beginning, as just a list of these word vectors. What we're going to do is compute a hidden state h_t at each time step, with a simple neural network architecture. In fact, you can think of the summation here as really just a single-layer neural network, if you were to concatenate the two matrices and the two vectors. Intuitively, we take our current word vector at time step t (sometimes I use square brackets, x[t], to denote that we're taking the word vector from that time step), and we map it with a linear layer, a simple matrix-vector product,
and we sum that matrix-vector product with another matrix-vector product of the hidden state at the previous time step, then apply, in this case, a simple sigmoid function to define this standard neural network layer:

h_t = σ(W_hh h_{t−1} + W_hx x[t])

That is h_t. Now, at each time step we want to predict some kind of class probability over a set of potential events, classes, words, and so on, and we use the standard softmax classifier (some other communities call it a logistic regression classifier):

ŷ_t = softmax(W_S h_t)

Here W_S is the softmax weight matrix: the number of rows is the number of classes we have, and the number of columns is the same as the hidden dimension. Sometimes we want to predict the next word in a sequence in order to identify the most likely sequence. For instance, suppose a speech recognition system hears "what is the price of wood". If you hear "wood" in isolation, you'd probably assume it's the auxiliary verb "would", w-o-u-l-d; but in this particular context, "the price of", it wouldn't make sense to have a verb following, so it's more likely w-o-o-d, the price of wood. So language modeling is a very useful task, and it's also very instructive as an example of where recurrent neural networks really shine.

In our case here, the softmax is going to be quite a large matrix that goes over the entire vocabulary of all the possible words we have: each word is a class; the classes for language models are the words in our vocabulary. So we can define ŷ_{t,j}, the j-th element, as the probability that the word at index j will come next, after all the previous words. A very useful model, again, for speech recognition, for machine translation, and for just finding a prior over language in general. And again, the main difference from standard neural networks is that we have the same set of weights W at all the different time steps; everything else is pretty much a standard neural network. We often initialize the first hidden state h_0 either randomly or with all zeros, and again, in language modeling in particular, the next word is the class of the softmax.

We can measure the performance of language models in terms of a metric called perplexity, which is based on the average log-likelihood, essentially the probabilities of being able to predict the next word. You want to give the highest probability to the word that actually appears next in a long sequence, and the higher that probability is, the lower your perplexity, and hence the less perplexed the model is to see the next word. In some sense, you can think of language modeling as almost NLP-complete, in some silly sense: if you could actually predict every single word that follows any arbitrary sequence of words perfectly, you would have disambiguated a lot of things. You could, for instance, say "what is the answer to the following question", ask the question, and the next couple of words would be the predicted answer. There's no way we can ever do a perfect job at language modeling, but there are certain contexts where we can give a very high probability to the right next couple of words.
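A minimal numpy sketch of that vanilla RNN language model, with perplexity computed the way just described; dimensions and names are my choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n_in, n_h = 1000, 50, 100            # vocab size, word-vector dim, hidden dim
L    = rng.normal(scale=0.1, size=(V, n_in))    # word vectors (given)
W_hh = rng.normal(scale=0.1, size=(n_h, n_h))   # recurrent weights, tied over time
W_hx = rng.normal(scale=0.1, size=(n_h, n_in))  # input weights
W_s  = rng.normal(scale=0.1, size=(V, n_h))     # softmax weights, one row per word

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lm_forward(word_ids):
    """Run the RNN over a sentence; return per-step next-word log probs."""
    h = np.zeros(n_h)                   # h_0, initialized to zeros
    log_probs = []
    for t in range(len(word_ids) - 1):
        h = sigmoid(W_hh @ h + W_hx @ L[word_ids[t]])   # h_t
        scores = W_s @ h
        p = np.exp(scores - scores.max())
        p /= p.sum()                                    # y_hat_t over the vocab
        log_probs.append(np.log(p[word_ids[t + 1]]))    # prob of the true next word
    return np.array(log_probs)

sentence = [3, 17, 42, 7, 99]           # a toy word-id sequence
lp = lm_forward(sentence)
perplexity = np.exp(-lp.mean())         # lower is better
```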
Now, this is the standard recurrent neural network, and one problem with it is that we modify the hidden state at every single time step. Even for words like "the" and "a" and the sentence-final period, it will keep modifying my hidden state, and that can be problematic. Say, for instance, I want to train a sentiment analysis algorithm and I talk about movies: I talk about the plot for a very long time, then I say "oh man, this movie was really wonderful, it's great to watch, and especially the ending", and then I talk again for fifty or a hundred words about the plot. Now all these plot words will essentially keep modifying my hidden state, and if at the end of that whole sequence I want to classify the sentiment, the words "wonderful" and "great" that I mentioned somewhere in the middle might be completely gone, because I kept updating my hidden state with all these content words about the plot.

The way to improve this is to use better kinds of recurrent units, and I'll introduce a particular kind here: so-called gated recurrent units, GRUs, introduced by Cho and colleagues. We'll learn more about the LSTM tomorrow when Quoc gives his lecture; GRUs are in some sense a special case of LSTMs. The main idea is that we want the ability to keep certain memories around without having the current input modify them at all. Again, take the sentiment analysis example: if I say something is great, that should somehow be captured in my hidden state, and I don't want all the content words about the plot in a movie review to modify the fact that, overall, it was a great movie. We also want to allow error messages to flow back at different strengths depending on the input: if I say "great", I want that to modify a lot of things in the past.

So let's define a GRU. Fortunately, since you already know the basic Lego block of a standard neural network layer, there are really only one or two subtleties here that are different. There are a couple of different quantities we need to compute at every time step. In the standard RNN, we just had this one single neural network layer that we hoped would capture all the complexity of the sequence. Instead, we'll now first compute a couple of gates at each time step. The first thing we compute is the so-called update gate; it's just yet another neural network layer, based on the current input word vector and, again, the past hidden state:

z_t = σ(W_z x[t] + U_z h_{t−1})

These look quite familiar, but this is just an intermediate value, and we call it the update gate. Then we also compute a reset gate, yet another standard neural network layer, again a matrix-vector product plus a matrix-vector product with a nonlinearity:

r_t = σ(W_r x[t] + U_r h_{t−1})

It's actually important in this case that the nonlinearity is a sigmoid: both of these gates are vectors with numbers between 0 and 1. Now we compute a new memory content, an intermediate h̃_t, with yet another neural network layer, but here we have this little funky symbol, an element-wise multiplication:

h̃_t = tanh(W x[t] + r_t ⊙ U h_{t−1})

What this allows us to do is: if the reset gate is 0, we can essentially ignore all the previous memory elements and only store the new word information. For instance, if I talked for a long time about the plot and now I say "this was an awesome movie", and the whole goal of this sequence classification model is to capture sentiment, I want to be able to ignore past content. That's if the reset gate were an entirely-zero vector; in practice it's more subtle, since this is a long vector of maybe 100 or 200 dimensions, so maybe some dimensions should be reset but others not. And then finally we have our final memory, which combines the previous hidden
state and this intermediate one at the current time step:

h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t

What this allows us to do is also say: maybe we want to ignore everything that's currently happening and only keep the last time step, basically copying the previous hidden state over and ignoring the current input. Again, a simple example in sentiment: maybe there's a lot of talk about the plot and about when the movie was released; you want the ability to ignore that and just copy along that the beginning said it was an awesome movie.

So here's an attempt at a clean illustration. I have to say, personally, in the end I find the equations a little more intuitive than the visualizations we try to do, but some people are more visual. Basically, here we have our word vector, it goes through different layers, and some of these layers will modify outputs of previous time steps. This is a pretty nifty model, and it's really the second most important basic Lego block that we're going to learn about today, so I just want to make sure we take a little bit of time. I'll repeat it again: if the reset gate, this r value, is close to zero, those hidden dimensions are basically allowed to be dropped; and if the update gate z is one, then we can copy information in that unit through many, many time steps. If you think about optimization a lot, what this also means is that the gradient can flow through the recurrent neural network across multiple time steps, until it actually matters and you want to update, say, a specific word all the way through many different time steps. This also allows us to have units with different update frequencies: some you might want to reset every other word, while others really capture long-term context and stay around for much longer.

All right, this is the GRU, the second most important building block for today. There are, like I said, a lot of other variants of recurrent neural networks, and lots of amazing work in that space right now; tomorrow Quoc will talk about some more advanced methods. Now that you understand word vectors and neural network sequence models, you really have the two most important concepts for deep NLP, and that's pretty awesome, so congrats. We can now play around with those two Lego blocks, plus some slight modifications of them, very creatively, and build a lot of really cool models. A lot of the models that I'll show you, and that you can see in the latest papers now coming out almost every week on arXiv, will use these two components in a major way.
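Putting the four GRU equations together, here is a minimal numpy sketch of a single GRU step; weight shapes and initialization are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 50, 100
# One weight pair (input, recurrent) per equation: update, reset, candidate.
W_z, U_z = rng.normal(scale=0.1, size=(n_h, n_in)), rng.normal(scale=0.1, size=(n_h, n_h))
W_r, U_r = rng.normal(scale=0.1, size=(n_h, n_in)), rng.normal(scale=0.1, size=(n_h, n_h))
W_h, U_h = rng.normal(scale=0.1, size=(n_h, n_in)), rng.normal(scale=0.1, size=(n_h, n_h))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev):
    z = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate, in (0, 1)
    r = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate, in (0, 1)
    h_tilde = np.tanh(W_h @ x_t + r * (U_h @ h_prev))   # new memory candidate
    # z ~ 1 copies the old state through; z ~ 0 takes the new candidate.
    return z * h_prev + (1.0 - z) * h_tilde

h = np.zeros(n_h)
for x_t in rng.normal(size=(30, n_in)):                 # a 30-step input sequence
    h = gru_step(x_t, h)
```

The element-wise `r * (...)` and the `z`-weighted mixture are exactly the reset-then-copy behavior described above: gradients can pass through the `z * h_prev` path untouched for many steps.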
Now, this is one of the few slides with something really new, because I want to keep it exciting for the people who already knew all this stuff and took the class. It tackles an important problem. In all these models, in pretty much most of these papers, we have at the end one final softmax, and that softmax is our default way of classifying what we see next, what kinds of classes we can predict. The problem with that, of course, is that it will only ever accurately predict frequently seen classes that we had at training time. But in language modeling, for instance, where our classes are the words, we may see at test time some completely new words. Maybe I introduce to you a new name, Srini, for instance; nobody may have seen that word at training time, but now that I've mentioned him and introduced him to you, you should be able to predict the word "Srini" in a new context.

So the solution, which we're literally going to release only next week in a new paper, is to combine the standard softmax that we can train with a pointer component, and that pointer component allows us to point to previous context and predict, based on that, the word we see there. Take language modeling again: we may read a long article about the Fed chair, Janet Yellen, and maybe the word "Yellen" had not appeared at training time, so we could never predict it, even though we just learned about her. A couple of sentences later we read "interest rates were raised" and then "Mrs.", and now we want to predict the next word. If "Yellen" hadn't appeared in our standard softmax training procedure at training time, we would never be able to predict it. What this model, which we're calling a pointer sentinel mixture model, will do is first try to see whether any of these previous words might be the right candidate: we take into consideration the previous context of, say, the last hundred words, and if we see a word there that makes sense (after we train the model, of course), we might give a lot of probability mass to just that word from the immediate previous context at test time. And then we also have the sentinel, which takes the rest of the probability mass when we cannot refer to one of the words we just saw, and that mass goes directly to our standard softmax. So what we essentially have is a mixture model that combines words that just appeared in this context with the words from our standard softmax language modeling system. I think this is a pretty important next step, because it allows us to predict things we've never seen at training time, and that's clearly a human capability that pretty much none of these language models had before.
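A minimal sketch of how such a mixture could combine the two distributions at one prediction step, assuming the pointer attention scores and the gate g (the sentinel's share) come from a trained network; this is my schematic reading of the idea, not the paper's exact architecture.

```python
import numpy as np

def pointer_sentinel_mixture(p_vocab, context_ids, ptr_scores, g, V):
    """p_vocab: softmax distribution over the V training-time words.
    context_ids: word ids of the last N context tokens (may include ids >= V
                 for words never seen at training time, e.g. 'Yellen').
    ptr_scores: unnormalized pointer attention over those N positions.
    g: probability mass the sentinel routes to the standard softmax."""
    p_ptr = np.exp(ptr_scores - ptr_scores.max())
    p_ptr /= p_ptr.sum()                    # attention over context positions
    p_mix = np.zeros(max(V, max(context_ids) + 1))
    p_mix[:V] = g * p_vocab                 # sentinel share -> softmax words
    for pos, w in enumerate(context_ids):   # pointer share -> copied words
        p_mix[w] += (1.0 - g) * p_ptr[pos]
    return p_mix                            # sums to 1: g + (1 - g)

# A word like 'Yellen' (id >= V) gets mass only through the pointer component.
```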
To see how much it actually helps, it's interesting to look at performance over time. Again, what we're measuring is perplexity, and lower is better, because it's essentially the inverse of the probability we assign to the correct next word. In 2010, just six years ago, there was great early work by Tomas Mikolov, where he compared against a lot of standard natural language processing methods, syntactic models that tried to predict the next word and had a perplexity of 107. He was able to use standard recurrent neural networks, actually an ensemble of eight of them, to significantly push down the perplexity, especially when combined with standard count-based methods for language modeling; so in 2010 he made great progress by pushing it down to 87. And this is one of the great examples of how much progress is being made in the field thanks to deep learning: two years ago, Wojciech Zaremba and his collaborators were able to push that down even further, to 78, with a very large LSTM, similar to a GRU-like model but even more advanced (Quoc will teach you the basics of LSTMs tomorrow). Then last year the performance was pushed down even further by Yarin Gal, and then this one, variational recurrent highway networks, came out just a couple of weeks ago and pushed it down further still. But this pointer sentinel model is able to get it down to 70. So in just two years we pushed perplexity down by more than 10 points, and that is really the increased speed of progress we're seeing now that deep learning is changing a lot of areas of natural language processing.

All right. Now we have our basic Lego blocks, the word vectors and the GRU sequence models, and we can talk a little bit about some of the ongoing research we're working on. I'll start with a maybe controversial question: could we possibly reduce all NLP tasks to, essentially, question answering tasks over some kind of input? In some ways that's a trivial observation, but it might actually help us think of models that could take any kind of input, plus a question about that input, and try to produce an output sequence. Let me give you a couple of examples of what I mean.

The first is a task we would standardly associate with question answering. I give you a couple of facts: Mary walked to the bathroom. Sandra went to the garden. Daniel went back to the garden. Sandra took the milk there. Where's the milk? Now you might have to reason logically: I try to find the sentence about milk, maybe "Sandra took the milk there"; I have to do anaphora resolution to find out what "there" refers to; then I find the previous sentence that mentioned Sandra, see that it's the garden, and give the answer: garden. So this is a simple logical reasoning question answering task, and that's what most people in the QA field associate with question answering. But we can also say "everybody's happy", and the question is "what's the sentiment?", and the answer is "positive"; that's a different subfield of NLP, sentiment analysis. We can go further and ask: what are the named entities of a sentence like "Jane has a baby in Dresden"? You want to find out that Jane is a person and Dresden is a location; this is an example of sequence tagging. You can even go as far as to say "I think the smile is incredible", with the question "what's the translation into French?", and you get "Je pense que le sourire est incroyable". In some ways it would be phenomenal if we were able to tackle all these different kinds of tasks with the same kind of model, so maybe an interesting new goal for NLP is to try to develop a single joint model for general question answering. I think it would push us to think about new kinds of sequence models and new kinds of reasoning capabilities in an interesting way.

Now, there are two major obstacles to actually achieving this single joint model for arbitrary QA tasks. The first is that we don't even have a single model architecture that gets consistent state-of-the-art results across a variety of different tasks. For question answering, on a dataset called bAbI that Facebook published last year, strongly supervised memory networks get the state of the art; for sentiment analysis, it's Tree-LSTM models, developed by Kai Sheng Tai here at Stanford last year; and for part-of-speech tagging, it's bidirectional LSTM conditional random fields. One thing you do notice is that all the current state-of-the-art methods are deep learning;
sometimes they still connect to other traditional methods like conditional random fields and undirected graphical models, but there's always some kind of deep learning component in them. So that is the first obstacle. The second is that really fully joint multitask learning is very, very hard. Usually when we do do it, we restrict it to the lower layers. In natural language processing, all we're currently able to share in some principled way is the word vectors: we take the same word vectors, trained for instance with GloVe or word2vec, and initialize our deep neural network sequence models with them. In computer vision you're actually a little further ahead: you're able to use multiple of the different layers, and people initialize a lot of their CNN models with a CNN that was pre-trained on ImageNet, for instance.

Now, usually people evaluate multitask learning with only two tasks: they train on a first task, then evaluate the model initialized from the first on a second task, but they often ignore how much the performance degrades on the original task. When somebody takes an ImageNet CNN and applies it to a new problem, they rarely ever go back and ask how much the accuracy actually decreased on the original dataset. Furthermore, we usually only look at tasks that are actually related, and then we find out, look, there's some amazing transfer learning capability going on. What we don't look at often, in the literature and in most people's work, is that when the tasks aren't related to one another, they actually hurt each other; this is so-called catastrophic forgetting, and there's not too much work on it right now. I'd also like to say that right now almost nobody uses the exact same decoder or classifier for a variety of different kinds of outputs; we at least replace the softmax to predict different kinds of things. So that's the second obstacle.

For now we'll only tackle the first obstacle, and that is basically what motivated us to come up with dynamic memory networks. They are essentially an architecture that tries to tackle arbitrary question answering tasks. When I talk about dynamic memory networks, it's important to note that for each of the different tasks I'll talk about, it's a different dynamic memory network: it won't have the exact same weights, it's just the same general architecture.

The high-level idea for DMNs is as follows. Imagine you had to read a bunch of facts like these. They're all very simple in and of themselves, but if I showed you all of them and then asked a question, "where is Sandra?", even if you had read all of them it would be kind of hard to remember. So the idea is that for complex questions, we might want to allow the model multiple glances at the input. And just like I promised, one of our most important basic Lego blocks here will be the GRU we just introduced in the previous section.

Now, here's the whole model in all its gory detail, and we'll dive into the pieces in the next couple of slides, so don't worry; it's a big model. A couple of observations. First, I think in deep learning we're now moving toward proper software engineering principles: we modularize, encapsulate certain capabilities, and then take those as basic Lego blocks and build more complex models on top of them. A lot of times nowadays you just have a CNN that's one little
block in a complex paper, and other things happen on top. Here we'll have the GRU and word vectors as submodules in these different components. I won't even mention word vectors much anymore, but they still play a crucial role: each word is represented as a word vector, we just assume that it's there.

Okay, so let's walk through this model at a very high level. There are essentially four different modules: an input module, which will be a neural network sequence model, a GRU; a question module; an episodic memory module; and an answer module. Sometimes we also have a semantic memory module, but for now that's really just our word vectors, so we'll ignore it.

So let's go through it. Here is our corpus, the input that should allow us to answer the question, and our question is "where is the football?". If I ask this question, I will essentially use the final representation of the question to learn to pay attention to the right kinds of inputs, the ones that seem relevant, given what I know, for answering the question. So, "where is the football?": it would make sense to pay attention to all the sentences that mention the football, and maybe especially the later ones if the football moves around a lot. What we observe is that this last sentence, "John put down the football", gets a lot of attention. Now, the hidden state of this recurrent neural network at that sentence is given as input to another recurrent neural network, because it seemed relevant to the question at hand. We agglomerate all these different facts that seem relevant, in this episodic memory GRU, into a final vector m. Now this vector m, together with the question, is used to go over the inputs again, if the model deems that it doesn't have enough information yet to answer the question. If I ask you "where's the football?" and so far you've only found that John put down the football, you still don't know where it is; but you now have a new fact, namely that John seems relevant to answering the question, and that fact is now represented in this vector m, which is also just the last hidden state of another network.

So we go over the inputs again, and now that we know that John and the football are relevant, the model learns to pay attention to "John moved to the bedroom" and "John went to the hallway". Again, those get agglomerated in this recurrent neural network, and now the model thinks it knows enough, because it has intrinsically captured facts about the football and John, found a location, and so on. Of course, we didn't have to tell it anything about people or locations, no rules like "if X moves to Y, and Y is in the set of locations, then this happens"; none of that. You just give it a lot of stories like this, and in its hidden states it captures these kinds of patterns. Then we have the final vector m, and we give it to an answer module, which produces the answer in our standard softmax way.

All right, now let's zoom into the different modules of this overall dynamic memory network architecture. The input module, fortunately, is just a standard GRU, the way we defined it before: simple word vectors, hidden states, reset gates, update gates, and so on. The question module is also just a GRU, a separate one with its own weights, and the final vector q is just the last hidden state of that recurrent neural network.
Now, the interesting stuff happens in the episodic memory module, which is essentially a sort of meta, gated GRU, where the gate is computed by the attention mechanism and basically says that the current sentence s_i seems to matter; the superscript t is the episode, where each episode means going over the entire input one more time. What this allows us to do is say: if the gate g is zero, then we just copy over the past state from the input, and nothing happens. Unlike all the gates in the GRU equations before, this g is just a single scalar number. It basically says: if g is zero, this sentence is completely irrelevant to my current question at hand, and I can completely skip it. There are lots of sentences, like "Mary travelled to the hallway", that are just completely irrelevant to answering the current question; in those cases this g is zero and we just copy the previous hidden state of this recurrent neural network over. Otherwise, we have a standard GRU:

h_i^t = g_i^t GRU(s_i, h_{i−1}^t) + (1 − g_i^t) h_{i−1}^t

Now of course the big question is how we compute this g, and while it might look a little ugly, it's quite simple. We compute two kinds of vector similarities, multiplicative ones and additive ones with absolute values, between the sentence vector s we currently have, the question vector q, and the memory state m of the previous pass over the input. On the first pass over the input, the memory state is initialized to be just the question; afterwards, it has agglomerated relevant facts:

z = [s ⊙ q ; s ⊙ m ; |s − q| ; |s − m|]

Intuitively, if the sentence mentions the football and the question is "where is the football?", then you'd hope the question vector q has some units that are more active because "football" was mentioned, and the sentence vector has some units that are more active for the same reason, and hence some of these element-wise products, or absolute values of differences, are going to be large. Then we just plug that into a standard single-layer neural network and a standard linear layer:

g_i = W_2 tanh(W_1 z + b_1) + b_2

and we apply a softmax to weight all of these different candidate sentences and compute the final gates. So this is a soft attention mechanism that sums to one, and it pays most attention to the facts that seem most relevant given what I know so far and the question. When the end of the input is reached, all these relevant facts are summarized in another GRU that moves up a level. You can also train a classifier, if you have the right kind of supervision, so that the model knows when it has enough to actually answer the question and can stop iterating over the inputs; if you don't have that kind of supervision, you can just go over the inputs a fixed number of times, and that works reasonably well too.

All right, that's a lot to sink in, so I'll give you a couple of seconds. Basically: we pay attention to different facts given a certain question, we iterate over the input multiple times, and we agglomerate the facts that seem relevant given the current knowledge and the question.
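A minimal numpy sketch of one episodic pass, with the gate computed from exactly those element-wise similarities; the fact and question vectors are random stand-ins, and the weight shapes (and a single shared dimension) are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40                                     # one shared dimension for simplicity
W1 = rng.normal(scale=0.1, size=(n, 4 * n))
b1 = np.zeros(n)
W2 = rng.normal(scale=0.1, size=(1, n))
b2 = np.zeros(1)
Wg = {k: rng.normal(scale=0.1, size=(n, n))
      for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

def gru_step(x, h):
    sig = lambda a: 1 / (1 + np.exp(-a))
    z = sig(Wg["Wz"] @ x + Wg["Uz"] @ h)
    r = sig(Wg["Wr"] @ x + Wg["Ur"] @ h)
    return z * h + (1 - z) * np.tanh(Wg["Wh"] @ x + r * (Wg["Uh"] @ h))

def gate_score(s, m, q):
    """Scalar relevance of fact s, from similarities to memory m and question q."""
    z = np.concatenate([s * q, s * m, np.abs(s - q), np.abs(s - m)])
    return (W2 @ np.tanh(W1 @ z + b1) + b2)[0]

def episode(facts, m, q):
    """One pass over the facts: soft attention, then gated GRU agglomeration."""
    scores = np.array([gate_score(s, m, q) for s in facts])
    g = np.exp(scores - scores.max())
    g /= g.sum()                           # softmax over sentences, sums to 1
    h = np.zeros(n)
    for g_i, s_i in zip(g, facts):
        # g_i ~ 0: copy h through unchanged; g_i ~ 1: full GRU update.
        h = g_i * gru_step(s_i, h) + (1 - g_i) * h
    return h

facts = [rng.normal(size=n) for _ in range(6)]  # stand-in sentence vectors
q = rng.normal(size=n)
m = q                                      # first pass: memory is the question
for _ in range(2):                         # two passes over the input
    m = episode(facts, m, q)
```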
Now, I don't usually talk about neuroscience, I'm not a neuroscientist, but there is a very interesting relationship here that a friend of mine, Sam Gershman, pointed out. Episodic memory in humans is the memory of autobiographical events: it's when we remember the first time we went to school, or something like that, essentially a collection of our past personal experiences that occurred at a particular time and a particular place. And just as our episodic memory can be triggered by a variety of different inputs, this episodic memory module is also triggered by the specific question at hand. What's also interesting is that the hippocampus, which is the seat of episodic memory in humans, is actually active during transitive inference. Transitive inference is going from A to B to C to get some connection from A to C; in this football example, you first had to find facts connecting John and the football, then find where John was, and then find the location of John. Those are examples of transitive inference, and it turns out that in the DMN you also need these multiple passes to enable the capability to do transitive inference.

Now, the final module is again very simple: a GRU and a softmax to produce the final answers. The main difference here is that, instead of just having the previous hidden state a_{t−1} as input, we also include the question at every time step, as well as the answer word generated at the previous time step. Other than that, it's our standard softmax with the standard cross-entropy error to minimize. And the beautiful thing about this whole model is that it's end-to-end trainable: these four different modules all train based on the cross entropy of that final softmax; all the modules communicate with vectors, and we just have delta messages and backpropagation to train them.
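A minimal sketch of that answer decoder, assuming single-token or short-sequence answers; the concatenation of the previous answer and the question as the GRU input follows the description above, with my own names and shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, V = 40, 200                              # hidden dim, answer vocabulary size
W_a = rng.normal(scale=0.1, size=(V, n))    # answer softmax weights
d_in = V + n                                # input: [previous answer; question]
P = {k: rng.normal(scale=0.1, size=(n, d_in)) for k in ("Wz", "Wr", "Wh")}
Q = {k: rng.normal(scale=0.1, size=(n, n)) for k in ("Uz", "Ur", "Uh")}

def gru_step(x, h):
    sig = lambda v: 1 / (1 + np.exp(-v))
    z = sig(P["Wz"] @ x + Q["Uz"] @ h)
    r = sig(P["Wr"] @ x + Q["Ur"] @ h)
    return z * h + (1 - z) * np.tanh(P["Wh"] @ x + r * (Q["Uh"] @ h))

def answer_module(m, q, max_len=3):
    """Decode answer words from the final memory m, feeding back [y_prev; q]."""
    a, y_prev, answer = m, np.zeros(V), []
    for _ in range(max_len):
        a = gru_step(np.concatenate([y_prev, q]), a)
        scores = W_a @ a
        p = np.exp(scores - scores.max())
        p /= p.sum()                        # standard softmax over answer words
        w = int(np.argmax(p))               # greedy decoding; trained w/ cross-entropy
        answer.append(w)
        y_prev = np.zeros(V)
        y_prev[w] = 1.0                     # feed the chosen word back in
    return answer

answer = answer_module(m=rng.normal(size=40), q=rng.normal(size=40))
```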
Now, the problem with this dataset is that it's synthetic, so it was built from a certain set of human-defined generative functions that created certain patterns, and in that sense solving it, sometimes with a hundred percent accuracy, is only a necessary and not a sufficient condition for real question answering; there's still a lot of complexity beyond it. The main thing to point out is that there are fixed numbers of training examples for each of the subtasks: you have a thousand examples of simple negation, for instance, always in a similar kind of pattern, and hence you're able to classify it very well. In real language you will never have that many examples for each type of pattern you want to learn, so general question answering is still an open and non-trivial problem.

What's cool is that this same architecture of allowing the model to go over the input multiple times also achieved state of the art in sentiment analysis, a very different kind of task. We actually analyzed whether it's really helpful to have multiple passes over the input, and it turns out it is: for certain things, like reasoning over three facts or counting, you really need this episodic memory module, and it may go over the input maybe five times; for sentiment, it actually turns out that going over the input more than two times hurts. That is one of the things we're now working on: can we find a model that does the same thing, with the same weights, for every single input, to learn these different tasks? A rough sketch of the loop over passes is below.
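As a sketch of that loop, reusing the run_episode helper from the earlier snippet: a fixed number of passes over the input, with another GRU updating the memory after each episode. This is again illustrative PyTorch under my own naming assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

hidden_dim = 100
memory_gru = nn.GRUCell(hidden_dim, hidden_dim)

def episodic_memory(S, q, num_passes=2):
    """S: (num_sentences, hidden_dim) fact vectors; q: (hidden_dim,) question.
    Around 2 passes worked best for sentiment, up to ~5 for multi-fact reasoning."""
    m = q.clone()                    # the memory starts out as the question
    for _ in range(num_passes):
        e = run_episode(S, q, m)     # attention-gated pass over all the facts
        m = memory_gru(e.unsqueeze(0), m.unsqueeze(0)).squeeze(0)
    return m                         # final memory, fed to the answer module
```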
We can look at a couple of fun examples of what happens with tough sentiment sentences. To be honest, in sentiment you can probably get to around seventy-five percent accuracy with very simple models that basically just find words like "great" and "wonderful" and "awesome," and you'll get something roughly right. The examples here are the kinds you now need to get right to push the state of the art further in sentiment analysis. Take the sentence "in its ragged, cheap and unassuming way, the movie works." This sentence is classified incorrectly even with this whole architecture if I allow only one pass over the input; with two passes, the model learns to pay attention not just to the very strong adjectives but, in the end, to the movie working. The shaded fields in the visualization are essentially the gating function g that we defined, paying attention to specific words: the darker a word, the larger that gate, the more open it is, and the more that word affects the hidden state in the episodic memory module. On the first pass the model pays attention to "cheap" and "unassuming" and "way," and a little bit to "works" too; on the second pass, having agglomerated the facts of the sentence, it learns to pay more attention to the specific words that seem most important. One more example: "my response to the film is best described as lukewarm." In general, when you look at unigram scores in sentiment analysis, the word "best" is one of the most positive words you could possibly use in a sentence, and on the first pass over the sentence the model also pays most attention to this incredibly positive word. But once it has agglomerated the context, it realizes that "best" here is not used as an adjective; it's actually an adverb that best describes something, and what it describes is "lukewarm," so it's actually a negative sentence.

Those are the kinds of examples you need to get right now to see improvements in sentiment analysis. On this particular dataset, these are all neural-network-type models, starting at 82 percent; the same dataset had existed for around eight years and none of the standard NLP models had reached above 80 percent accuracy, and now we're in the high 80s. Those are the kinds of improvements you see across a variety of different NLP tasks now that deep learning techniques are being used in NLP. The last NLP task this model turned out to work well for is part-of-speech tagging. Part-of-speech tagging is a less exciting, more intermediate task, but it's still fascinating that after the dataset has been around for over twenty years, you can still improve the state of the art with the same kind of architecture that also did well at the fuzzy reasoning of sentiment and the discrete logical reasoning of question answering.

Then we had a new person join the group, Caiming, who was more of a computer vision researcher, and he thought: that's cool, but could I use this question-answering module to do visual question answering, combining what was going on in the group in NLP and applying it to computer vision? He did not have to know all the different aspects of the code; all he had to do was change the input module, from one that gives you hidden states for each word over a long sequence of words and sentences, to an input module that gives you vectors for a sequence of regions in an image, and he literally did not touch some of the other parts of the code. He did have to look carefully at this input module, where again we have the basic Lego block that Andrej introduced really well, the convolutional neural network. The convolutional network essentially gives us 14-by-14-many vectors from one of its top layers, one representing each region of the image. We then take those vectors, replace the word vectors we used to have with these CNN vectors, and plug them into a GRU. The GRU we know as our basic Lego block, and we already defined it; one addition is that it's a bidirectional GRU: one pass goes from left to right in a snake-like fashion over the image regions, and another goes from right to left, backwards. Both have hidden states, and you just concatenate the hidden states of both to compute the final hidden state for each region of the image.
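Here is a rough sketch of that visual input module, assuming PyTorch and torchvision. The 448x448 input size, the VGG feature extractor, and the projection layer are my guesses at concrete choices consistent with the description (a 14-by-14 grid of 512-dimensional region vectors); the snake-like traversal and the bidirectional GRU follow the talk.

```python
import torch
import torch.nn as nn
from torchvision import models

hidden_dim = 100
cnn = models.vgg16(pretrained=True).features  # convolutional feature extractor
project = nn.Linear(512, hidden_dim)          # map CNN features into "word vector" space
bi_gru = nn.GRU(hidden_dim, hidden_dim, bidirectional=True, batch_first=True)

def image_facts(image):
    """image: (1, 3, 448, 448) tensor -> one fact vector per image region."""
    feats = cnn(image)                         # (1, 512, 14, 14): 14x14 region grid
    grid = feats.squeeze(0).permute(1, 2, 0)   # (14, 14, 512)
    # Snake-like traversal: reverse every other row so that regions that
    # are adjacent in the image stay adjacent in the sequence.
    rows = [grid[r] if r % 2 == 0 else grid[r].flip(0) for r in range(grid.size(0))]
    seq = project(torch.cat(rows, dim=0))      # (196, hidden_dim)
    # Bidirectional GRU over the region sequence; forward and backward
    # hidden states are concatenated for each region.
    out, _ = bi_gru(seq.unsqueeze(0))          # (1, 196, 2 * hidden_dim)
    return out.squeeze(0)
```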
That model actually achieved state-of-the-art results. The dataset had only been released the year before, so everybody was working on deep learning techniques to try to solve it, and at first I was a little skeptical; it seemed too good to be true that this model we developed for NLP would work so well. So we really dug into the attention, the same g values we computed with the gating equation, which now, instead of paying attention to words, pay attention to different regions of the image, and we went through a bunch of examples on the dev set, analyzing what the model is actually paying attention to. Again, it's trained only with the image, the question, and the final answer; that's all you get at training time. You do not get this latent representation of where you should pay attention in the image in order to answer the question correctly.

When the question was "what is the main color on the bus," it learned to pay attention to the bus. Okay, maybe that's not that impressive; it's just the main object in the center of the image. "What type of trees are in the background?" Well, maybe it just connects "tree" with anything green and pays attention to that. Neat, but not super impressive yet. "Is this in the wild?" is more interesting: it actually pays attention to a man-made structure in the background and correctly answers no. This one is kind of interesting: "who is on both photos?" The answer is "girl." Now, to be honest, I don't think the model actually knows that there are two people and tries to match them; it just finds the main person or object in the scene, and the main object is a little baby girl, so it says "girl." This one is also relatively trivial: "what time of day was this picture taken?" The answer is "night," because it's a very dark picture, at least in the sky. Now this one gets a little more interesting: "what is the boy holding?" The answer is "surfboard," and it actually pays attention to both of the arms and then to what's just below one arm, so that's a more interesting attention visualization.

For a while we also worried: what if the model just learns really well from language alone? Yes, it pays attention to things, but maybe it just says whatever it often sees in the text. If I ask you "what color are the bananas," you don't really have to look at an image; in 95 percent of cases you're right just saying "yellow" without seeing anything. So this example I was really excited about, because it actually paid attention to the bananas in the middle and said "green," overruling the prior it would get from language alone. "What's the pattern on the cat's fur on its tail?" It pays attention mostly to the tail and says "stripes." Now this one was interesting: "did the player hit the ball?" The answer: yes. Though I have to say, we later had a journalist who wanted to ask his own question, John Markoff from the New York Times. We had put the demo together just the night before, and he said, well, I want to ask my own question, and I was like, okay. He asked "is the girl wearing a hat?", and it wasn't made for production, so it was kind of slow, and while the system was cranking I was trying to come up with excuses: it's kind of a black background and a black hat, it might be kind of hard to see. Fortunately, it got it right and said yes. After the interview I thought, let me just ask it myself, in a less stressful situation, a bunch of questions of my own. These are the first eight or so questions I could come up with, and somewhat to my surprise it got them all right. What is the girl holding? A tennis racket. What is she playing, or what is she doing? Playing tennis. Is the girl wearing shorts? What is the color of the ground? Brown. Then I thought, okay, let's try to break it by asking about the smallest object: what is the color of the ball? It actually got that right too.
"What color is her skirt?" White. That's also kind of interesting: when you ask the model what she's wearing, it says shorts, but when you ask about the skirt it still sort of captures that you might call this garment different things. Then this one was interesting: "what did the girl just hit?" A tennis ball. So I asked "is the girl about to hit the tennis ball?" and it said yes, and then "did the girl just hit the tennis ball?" and it said yes again. So I finally found a way to break it: it doesn't have enough of the co-occurrence statistics to understand, again in scare quotes "understand," which angles the arm has to be in to tell whether the ball was just hit or is about to be hit. But what this basically does show us is that once it has seen a lot of examples in a specific domain, it really can capture quite a lot of different things.

Now let's see if we can get the demo up; I have to be on a VPN to make it work. Here's one example: "the best way to hope for any chance of enjoying this film is by lowering your expectations." Again, one of those kinds of sentences that you now have to get correct in order to improve performance on sentiment, and it correctly says that this is negative. We can also ask questions in Chinese; this is one of the beautiful things about the DMN, and really about most deep learning techniques: we don't have to be experts in a domain, or even in a language, to create a very accurate model for that language or that domain. There's no more feature engineering. I'm not going to make a fool of myself trying to read that one out loud, but it's an interesting example. You can also ask what parts of speech there are, and you can handle other things like named entities and other sequence problems. I can also ask "what are the men wearing on their heads?" The answer: helmets. And then a slightly more interesting question: "why are the men wearing helmets?" The answer is "safety." Especially since we're close to the Circle of Death here at Stanford, where a lot of bikes crash, that's a good answer.

All right, with that I'll leave a couple of minutes for questions. The summary is: word vectors and recurrent neural networks are super useful building blocks. Once you really appreciate and understand those two building blocks, you're ready to have some fun and build more complex models; in the end, the DMN is just a way to combine them in a variety of new ways into a larger, more complex model. That's also where I think the state of deep learning for natural language processing is: we've tackled a lot of the smaller sub-problems and intermediate tasks, and now we can work on more interesting, complex problems like dialogue, question answering, machine translation, and things like that. All right, thank you.

Question: in the dynamic memory network you have the RNN, and you also mentioned that better assumptions about the input can help. You used to work on tree LSTMs; if you changed the RNN into a tree structure, would that help? It's a good question. I actually loved research on tree structures throughout my whole PhD, and somewhat surprisingly, in the last couple of weeks there have been some new results on SNLI, the Stanford Natural Language Inference dataset, where tree structures are again the state of the art.
That said, I think the dynamic memory network, by having this ability in the episodic memory to keep track of different sub-phrases, pay attention to them, and combine them over multiple passes, can kind of get away without tree structures. So yes, you might get a slight improvement by representing sentences as trees in your input module, but I think it would only be slight; the episodic memory module, with its capability to go over the input multiple times and pay attention to certain sub-phrases, will capture a lot of the kinds of complexity you might otherwise want to capture with tree structures. My short answer is that I don't think you necessarily need them. Have we tried it? We have not, no.

Question: about question answering, say we want to apply it to a specific domain such as healthcare, but we don't really have the data, we don't have question-answer pairs. What should we do? Are there any general principles? It's a great question: what do you do if you want question answering on a complex domain and you don't have the data? This may feel like a cop-out, but I think it's true both in practice and in theory: create the data. If you cannot possibly create more than a thousand examples of something, then maybe automating that process is not that important. So clearly you should be able to create some data, and in many cases the best use of your time is just to sit down, or ask a domain expert, to create a lot of questions, then have people find the answers, and measure how they actually get to those answers, keeping them in a constrained environment. For instance, when companies try to do automated email replies, which is in some ways a little bit similar to question answering, that's a nice domain because every email has already been answered before, so you can use past behavior. If you have a search engine where people ask a lot of questions, you can also use that to bootstrap: see where the engine actually failed, take all the really tough queries where it failed, and have some humans sit there and collect the data. That's the simplest answer. The other answer is: let's work together for the next many years on research into smaller training-set sizes and complex reasoning. The fact of the matter for that line of research will still be that if a system has never seen a certain type of reasoning, it will be hard for it to pick that type of reasoning up. I think with these kinds of architectures we will get to the point where, if the system has seen a specific type of reasoning, say transitive reasoning, temporal reasoning, or cause-and-effect reasoning, at least a couple of hundred times, then you should be able to train a system with these kinds of models to do it.

Question: are these QA systems currently robust to false-premise questions? For the woman playing tennis, if you asked "what is the man holding," would it reply that there is no man? It would not, largely because at training time you never try to mess with it like that. I'm pretty sure that if you added a lot of training examples of that kind, it would eventually pick it up, and those would be important for real-world implementations. Real-world implementations of this in security settings are actually kind of tricky: whenever you train a system, we know, for instance, that people can steal certain classifiers by querying them a lot, and we know we can fool them into misclassifying certain images.
We have folks in the audience who worked on that exact line of work, so I would be careful using it in security environments right now.

Question: there was a slide where you had the input module and a bunch of sentences. Are those sentences themselves RNNs, since each sequence is made up of individual words in some representation? Are those also RNNs stitched together? The answer there is a little complex, because we have two papers on the DMN and the answer is different for each. In the simplest form, it is actually a single GRU that goes from the first word through all the sentences as if they were one gigantic sequence, but it has access to each sentence period at the end, so it can pay special attention to the ends of sentences. So yes, in the simplest form it is just a GRU that goes over all the words, and the normal process is to concatenate all the sentences into one gigantic sequence. This is also why I split the talk into three parts, from words to single sentences to multiple sentences: if you just had a single GRU that goes over everything and you tried to reason over that entire sequence, it would not work very well. You really need additional structure, such as an attention mechanism or a pointer mechanism, that has the ability to pay attention to specific parts of your input to do this very accurately. But in general that's fine, as long as you have this additional mechanism. A rough sketch of this simplest input module is below. Thank you, great question.
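Here is a minimal sketch of that simplest input module, in the same illustrative PyTorch style as before: one GRU over the concatenated word vectors, recording the hidden state at each sentence-final period as that sentence's fact vector. Names and dimensions are assumptions for exposition.

```python
import torch
import torch.nn as nn

embed_dim = hidden_dim = 100
gru = nn.GRUCell(embed_dim, hidden_dim)

def input_module(word_vectors, eos_positions):
    """word_vectors: (num_words, embed_dim), all sentences concatenated into
    one gigantic sequence; eos_positions: indices of sentence-final periods."""
    h = torch.zeros(1, hidden_dim)
    facts = []
    for t in range(word_vectors.size(0)):
        h = gru(word_vectors[t].unsqueeze(0), h)
        if t in eos_positions:        # end of a sentence: keep its hidden state
            facts.append(h.squeeze(0))
    return torch.stack(facts)         # (num_sentences, hidden_dim) fact vectors
```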
Question: in the recurrent neural nets you're using sigmoids, while in visual recognition rectified linear units are the more popular nonlinearity. That's right, ReLUs are great. But when you look at the GRU equations, you have these reset gates, and you want them to be between zero and one, so that the model can either ignore an input entirely or let it take part in the computation of h-tilde as usual; in those places you really do want sigmoids. For other, simpler parts where you don't have that much recurrence, such as going from one memory state to the next in the second iteration of this model, ReLUs actually turned out to be good activation functions.

Question: after training this network, did you try to take the weights for the images and do object detection again, so the image weights would be augmented by the text vectors? That is a very cool idea that we did not explore. There you go, you've got to do it fast; this field is moving fast, and you just let the cat out of the box.

Question: those attention models are pretty powerful when you have abundant training data, but some tasks that are pretty trivial for humans are still hard for the models. Right now we have a lot of knowledge bases on the web, like Wikipedia, and we know a lot about common sense. What do you think about incorporating those knowledge bases into these models? I actually love that line of research too, and that is kind of where we started with the semantic memory module, which in its simplest form is just word vectors. I think one next iteration would actually be to have knowledge bases also influence the reasoning. There's very little work on combining text and knowledge bases to do overall complex question answering that requires reasoning, and I think it's a phenomenally interesting area of research. Are there any hints or starting points? There are some papers on reasoning over knowledge bases alone. We had a paper on neural tensor networks that basically takes a triplet, a vector for an entity (which might be in Freebase or WordNet), a vector for a relationship, and a vector for another entity, pipes them into a neural network, and answers yes or no: are these two entities actually in that relationship? You can have a variety of different architectures there, and I think similar work has been done on that as well, by Antoine Bordes, that's right. So I think you can reason over knowledge graphs, and you could then try to combine that with reasoning over fuzzy text; both have been done separately, but I think nobody has yet really combined them in a principled way. Great question.

One last question: when the model answers my questions correctly, how do I check that the model actually understood my question, and what the model's logic was behind the answer? It's a good question, and in some ways it's the general question of neural network interpretability; in computer vision we can sometimes at least visualize the features. I think the best thing we can do right now is to show these attention scores: for sentiment, how did it come up with the answer? It paid attention to the movie working. Likewise for question answering, we can see which facts, which sentences, it actually paid attention to in order to answer the overall question. That is the best answer we can come up with right now; there are other complexities that are still an area of open research. Thank you.

All right, thank you everybody. Thank you, Richard. We'll take another coffee break for 30 minutes, so please come back at 2:45 for a presentation by Sherry Moore.