Transcript
G5RY_SUJih4 • Sequence to Sequence Deep Learning (Quoc Le, Google)
Kind: captions Language: en

This talk is divided in two parts. Number one, how we developed sequence-to-sequence learning; and in the second part I will place sequence-to-sequence in a broader context, because there's a lot of exciting work in this area now. So let's motivate this with an example. A week ago I came back from vacation, and in my inbox I have five hundred and eight emails to reply to, and a lot of those emails basically just require a yes or no answer. So let's see whether we can build a system that can automatically reply to these emails with yes and no. For example, one of the emails would be from my friend: she said "hi" in the subject, and her content was "are you visiting Vietnam for the New Year, Quoc?", and my probable reply would be "yes". You can gather another set like this: you have some input content. For now, let's ignore the author of the email and the subject and focus on the content. Suppose you gather some emails. One input would be "are you visiting Vietnam for the New Year, Quoc?" and the answer would be yes. Another email would be "are you hanging out with us tonight?" and the answer is no, because I'm quite busy. A third email would be "did you read the cool paper on ResNet?" and the answer is yes, because I liked it. Now let's do a little bit of processing. In the previous slide we have "year" and a comma and then "Quoc" and then a question mark, and so on; so let's put a space between "year" and the comma, and between "Quoc" and the question mark, and so on. A lot of people call this step tokenization and normalization. So let's do that with our emails. The second step would be feature representation. In this step, what I'm going to do is the following: I'm going to construct a 20,000-
dimensional vector, where 20,000 is the size of the English vocabulary, and then I'm going to go through the email and count how many times a particular word occurs in it. For example, the word "are" occurs once in my email, so I increase that counter; "you" occurs once, so I increase another counter; etc. And at the end I will reserve a token just to count all the words that are out of vocabulary. Now, if you do this process, you're going to convert all of your emails into input/output pairs, where the input is a fixed-length representation, a 20,000-dimensional vector, and the output is either yes or no. Any questions so far? Okay, good. As somebody in the audience said, in this representation the order of the words doesn't matter, and the answer is yes, that's true; I'm going to get back to that issue later. So that's x and y, and now the job is to find some W such that W times x approximates y, where y is the output, yes or no. Because this problem has two categories, you can think of it as a logistic regression problem. If anybody followed the great CS229 class by Andrew Ng, you can probably formulate this very quickly, but in short the algorithm goes as follows. You come up with a vector for every email; your W is a two-column matrix. The first column will determine the probability that the email should be answered yes, and the second column that it should be answered no, and you take the dot product between w1, the first column, and x. The algorithm is called stochastic gradient descent: you run for iterations one to, like, a million; you run for a long, long time. You sample a random email x and its reply, and if the reply is yes, then you update w1 and w2 such that you increase the probability that the answer is yes, so you increase the first probability.
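Stepping back to the featurization step: here is a minimal sketch of tokenize-then-count, with a toy eight-word vocabulary standing in for the 20,000-word English one, plus the reserved out-of-vocabulary slot at the end.

```python
# Toy stand-in for the 20,000-word vocabulary; the last slot is the
# reserved out-of-vocabulary counter mentioned above.
VOCAB = ["are", "you", "visiting", "vietnam", "for", "the", "new", "year"]
OOV = len(VOCAB)  # index of the reserved out-of-vocabulary bucket

def tokenize(text):
    # crude normalization: lowercase, and put spaces around punctuation
    for p in ",?!.":
        text = text.replace(p, " " + p + " ")
    return text.lower().split()

def bag_of_words(text):
    vec = [0] * (len(VOCAB) + 1)
    index = {w: i for i, w in enumerate(VOCAB)}
    for tok in tokenize(text):
        vec[index.get(tok, OOV)] += 1
    return vec

x = bag_of_words("Are you visiting Vietnam for the New Year, Quoc?")
```

Here "Quoc", the comma, and the question mark all fall into the out-of-vocabulary bucket, so its counter ends up at 3 while each vocabulary word is counted once.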
Now, if the correct reply is no, then you update w1 and w2 so that you increase the probability of the email being answered no, the second probability. Let's call those p1 and p2. Now, what does it mean to "update to increase"? It means you find the partial gradient of the objective function with respect to some parameter. You pick some alpha, which is the learning rate, and you say w1 is equal to w1 plus alpha times the partial derivative of log of p1 with respect to w1. I cheated a little bit here because I used the log function; it turns out that because the log function is monotonically increasing, increasing p1 is equivalent to increasing the log of p1, and with this formulation stochastic gradient descent usually works better. Any questions so far? You also update w2 one way if the reply is yes, and you update differently if the reply is no. Then, if a new email comes in, you take x, convert it into the vector, and compute the first probability: exponential of w1 times x, divided by exponential of w1 times x plus exponential of w2 times x. If that probability is larger than 0.5 you say yes, and if it is less than 0.5 you say no. That's how you do prediction with this. Now, there's a problem with this representation: there's some information loss. Somebody in the audience just said that the order of the words doesn't matter, and that's true. So let's fix this problem by using something called a recurrent network; I think Richard Socher already talked about recurrent networks yesterday, and Andrej as well.
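Before moving on to recurrent networks, the update rule just described can be sketched in a few lines. This is a toy: w1 and w2 are three-dimensional instead of 20,000-dimensional, and the learning rate and step count are just illustrative.

```python
import math

def probs(w1, w2, x):
    # p1 = exp(w1.x) / (exp(w1.x) + exp(w2.x)), the prediction rule above
    s1 = math.exp(sum(a * b for a, b in zip(w1, x)))
    s2 = math.exp(sum(a * b for a, b in zip(w2, x)))
    return s1 / (s1 + s2), s2 / (s1 + s2)

def sgd_step(w1, w2, x, label_is_yes, alpha=0.1):
    # gradient of log p1 (or log p2) for the two-class softmax:
    # d log p1 / d w1 = (1 - p1) x,   d log p1 / d w2 = -p2 x
    p1, p2 = probs(w1, w2, x)
    t = 1.0 if label_is_yes else 0.0
    for j in range(len(x)):
        w1[j] += alpha * (t - p1) * x[j]
        w2[j] += alpha * ((1.0 - t) - p2) * x[j]

w1 = [0.0, 0.0, 0.0]
w2 = [0.0, 0.0, 0.0]
x_yes = [1.0, 0.0, 1.0]          # a toy "yes" email vector
before = probs(w1, w2, x_yes)[0]
for _ in range(100):
    sgd_step(w1, w2, x_yes, True)
after = probs(w1, w2, x_yes)[0]
```

After repeatedly updating on a "yes" example, the first probability rises from 0.5 toward 1, which is exactly the "update to increase" described above.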
The idea of a recurrent network is that you still have a fixed-size representation for your input, but it actually preserves some of the ordering information. The way you compute the hidden units is the following: h0 is the hyperbolic tangent of some matrix U times the word vector for the word "are". Richard also talked about word vectors yesterday; you can take word vectors coming out of word2vec, or you can just randomly initialize them if you want. So suppose that's h0. Then h1 would be a function of h0 and the vector for "you", which is A times h0 plus U times the word vector for "you", and you can keep going with that. This is one of my three most complicated slides, so you should ask questions. No questions? Everybody is familiar with recurrent networks? Okay. To make predictions with this, you tack on the label at the last step, and then you try to predict y. How do you do that? Basically you go the way you did before: you make updates on the W matrix, the classifier at the top, like I said earlier, but you also have to update all the relevant matrices, the matrix U, the matrix A, and the word vectors. So you have to compute the partial derivatives of the loss function with respect to those parameters. That's going to be very complicated, and usually when I do that myself I get it wrong, but there are a lot of tools out there: you can use auto-differentiation in TensorFlow, or you can call Torch or Theano to compute the derivatives, and once you have the derivatives you can just make the update. Yes, the matrix U is shared; let me go back one slide: this matrix U is shared across all the vertical arrows. The sizes you have to determine ahead of time: the number of columns would be the size of the word vectors, and the number of rows is up to you, like a thousand.
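A minimal sketch of this recurrence, with toy sizes (hidden size 4 instead of a thousand) and the word vectors stored in a separate table rather than folded into U; the same A and U matrices are reused at every step, which is the sharing just mentioned.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, embed = 4, 3  # toy sizes; a real model might use ~1000 hidden units
A = rng.normal(0, 0.1, (hidden, hidden))  # shared hidden-to-hidden matrix
U = rng.normal(0, 0.1, (hidden, embed))   # shared input-to-hidden matrix
# word vectors: from word2vec, or randomly initialized as here
word_vecs = {w: rng.normal(0, 0.1, embed) for w in ["are", "you", "ok"]}

def encode(words):
    # h_t = tanh(A h_{t-1} + U v_{w_t}); starting state is all zeros,
    # so h_0 = tanh(U v_{w_0}) as in the slide
    h = np.zeros(hidden)
    for w in words:
        h = np.tanh(A @ h + U @ word_vecs[w])
    return h

h_final = encode(["are", "you", "ok"])
```

The final hidden state `h_final` is the fixed-size representation that the classifier at the top would consume.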
Whether it's a thousand rows or 256 is model selection: it depends on whether you're underfitting or overfitting, so you choose a bigger or a smaller model, and on how much compute power you have to train a larger or smaller model. And the word vectors: the number of word vectors you use is the size of the vocabulary, so you're going to end up with 20,000 word vectors. So that means, for matrix U — sorry, the number of columns is 20,000, and the number of rows you have to determine yourself. Any other questions? Okay, so what's the big picture? I started with bag-of-words representations; then I talked about RNNs as a new way to represent variable-size input that can capture some ordering information; then I talked about auto-differentiation, which you can find in TensorFlow or Theano or Torch, so that you can compute the partial derivatives; and then I talked about stochastic gradient descent as a way to train the neural networks. Any questions so far? You have a question. Oh, that depends on how big your training set is and how big your computer is, and so on, but usually with an RNN, if you use a hidden state of a hundred, it should take like a couple of hours. It largely depends on the size of the training data, because you want to iterate a lot, you sample a lot of emails, and you want your algorithm to see as many emails as possible. Okay. So if you use such an algorithm to just say yes and no, you might end up losing a lot of friends, because we don't just say yes and no. For example, when my friend asks me "are you visiting Vietnam for the New Year, Quoc?", maybe the better answer would be "yes, see you soon"; that's a nicer way to approach this. And then if
my friend asks me "are you hanging out with us tonight?", instead of saying just "no", I would say "no, I'm too busy"; or "did you read the cool paper…", okay. So let's see how we're going to fix this. Before I tell you the solution, I'll say that this problem basically requires you to map between variable-size input and variable-size output. And if you can do something like this, there are a lot of applications: you can do auto-reply, which is what we've been working on so far, but you can also do translation, say between English and French; you can do image captioning, where the input would be a fixed-length vector representation coming from a ConvNet and the output would be "the cat sat on the mat"; you can do summarization, where the input would be a document and the output some summary of it; you can do speech transcription, where the input would be speech frames and the output would be words; you can do conversation, where the input would be the conversation so far and the output could be my reply; or you can do Q&A, etc., etc. So how do we solve this problem? This is hard, so let's check out what Andrej Karpathy has to say about recurrent networks. Andrej says there's more than one way you can configure your network to do things: you can use recurrent networks to map one-to-one, and so on. At the bottom is the input, the green is the hidden state, and the output is what you want to predict. One-to-one is not what we want, because we have many-to-many, so it's probably more like the two to the right. But the solution I like is the one in the red box, and the reason why that's a better solution is that the size of the input and the size of the output can vary a lot: sometimes you have smaller input but larger output, but
sometimes you have larger input and smaller output. If you do the one in the red circle you can be very flexible; if you do the one at the extreme right, then the output has to be smaller than, or at most the same size as, the input, which is what we don't want. So let's construct a solution that looks like that. Okay, here's the solution. The input would be something like "hi how are you", and then let's put in a special token, let's say the token is "end", and then you predict the first output token, which is "i'm", then you predict the second token, "fine", then the third token, "thanks", and you keep going until you predict the word "end", and then you stop. Now, I want to mention that in the previous set of slides I was just talking about yes and no, and with yes and no you have only two choices. Now you have more than two choices, you actually have 20,000 choices, but you can take the logistic regression algorithm and expand it to cover more than two choices; you can have a lot of choices, and the algorithm follows in the same way. Now, this was my first solution when I started working on sequence-to-sequence, but it turns out it didn't work very well, and the reason it didn't work very well is that the model never knows what it actually predicted in the last step: it keeps going, it keeps synthesizing output, but it doesn't know what it said, what decision it committed to in the previous step. A better solution looks like this: you basically feed what the model predicted in the previous step as input to the next step. So for example, in this case I'm going to take "i'm" and feed it into the next step, so that it conditions the prediction of the second word, which is "fine", etc. A lot of people call this concept autoregressive: you eat your own output and
make it your input. Any questions so far? Whenever it produces "end", you just stop; there's a special token, "end". Okay, so the relevant architecture here: people call the recurrent network on the input the encoder, and the decoder would be the recurrent network on the output. Okay, so how do you train this? Again, you run for a million steps; you see all your emails. For each iteration you sample an email x and a reply y — y would be, you know, "i'm fine thanks" — then you sample a random word y_t in y, and you update the RNN encoder and decoder parameters so that you increase the probability that y_t is correct given everything seen before, which is y_{t-1}, y_{t-2}, etc., and also all the x's. And then you have to compute the partial derivatives to make this work; computing the partial derivatives is very difficult, so again I recommend you use something like auto-differentiation in TensorFlow or Torch or Theano. You have a question? In the recurrent network the number of parameters doesn't change, because U, V and A are fixed. So the question in the audience is whether the RNNs are different for different examples, and the answer is yes, the number of steps differs. I have a question there — okay, I'm going to get to that in the next slide. All right, so the question is, in practice, how long would I go, how many steps for the RNN? I would say you usually stop at like 400 steps or something like that, because beyond that it's going to be too long to make the update; it's very expensive to compute, but you can go more if you want. Another question — yes, that's a problem; I'm going to talk about prediction next, so let me go to prediction and then you can ask questions.
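Before moving to prediction, here is a toy forward pass for the training objective just described. It sums the negative log probability over every position of the reply rather than sampling a single y_t, and it feeds the previous correct token at each decoder step (what is usually called teacher forcing); the shapes and the sharing of one recurrent matrix between encoder and decoder are simplifications, and in real training the gradients would come from auto-differentiation as mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<end>", "i'm", "fine", "thanks", "hi", "how", "are", "you"]
V, H = len(VOCAB), 8
idx = {w: i for i, w in enumerate(VOCAB)}

E = rng.normal(0, 0.1, (V, H))  # word vectors (input matrix folded in)
A = rng.normal(0, 0.1, (H, H))  # recurrent matrix (shared here for brevity)
W = rng.normal(0, 0.1, (V, H))  # output classifier over the vocabulary

def step(h, w):
    return np.tanh(A @ h + E[idx[w]])

def loss(x_words, y_words):
    # encoder: absorb the input; decoder: predict each y_t given the
    # previous *correct* tokens, accumulating -log p(y_t)
    h = np.zeros(H)
    for w in x_words:
        h = step(h, w)
    total, prev = 0.0, "<end>"      # the special end/start token
    for w in y_words + ["<end>"]:
        h = step(h, prev)
        logits = W @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()
        total += -np.log(p[idx[w]])
        prev = w
    return total

l = loss(["hi", "how", "are", "you"], ["i'm", "fine", "thanks"])
```

Training would repeatedly lower this quantity by gradient descent on E, A and W.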
So how do you do prediction? The first algorithm you can do is greedy decoding. In greedy decoding, for any incoming email x, I'm going to predict the first word: you find the most likely word and feed it back in, then you find the next most likely word and feed it back in, etc. You keep going until you see the word "end", and then you stop; or if it exceeds a certain length, you stop. Okay, that's greedy; now let's be a little bit less greedy. It turns out that, given x, you can predict more than one candidate, let's say three. You take three candidates, and for each candidate you feed it into the next step and arrive at three more, so at the next step you have nine candidates, and you keep going that way. Here's a picture: given input x, I predict the first token, which could be "hi", "yes", or "please", and given each first token like this, I feed it back into the network and the network produces another three, etc., so you end up with a lot of candidates. So how do you select the best candidate? You can traverse each beam, compute the joint probability at each step, and the sequence with the highest probability is the sequence of choice; that's your reply. Any questions? This is the most complicated slide in my talk. Yes — so the question is what you do with out-of-vocabulary words. It turns out that in this algorithm, for any word that is out of vocabulary, you create a token called "unknown", and you map everything out of vocabulary to unknown. It doesn't seem very nice, but usually it works well, and there are a bunch of algorithms to address this issue, for example breaking words into characters and things like that, and that can fix the problem.
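Going back to the greedy procedure just described, a minimal sketch of the control flow. The `TABLE` and `predict_next` here are fake stand-ins for a trained decoder; only the loop — pick the most likely word, feed it back in, stop on the end token or a length cap — is the point.

```python
# Fake next-word distributions standing in for a trained decoder.
TABLE = {
    (): {"i'm": 0.9, "<end>": 0.1},
    ("i'm",): {"fine": 0.8, "<end>": 0.2},
    ("i'm", "fine"): {"thanks": 0.7, "<end>": 0.3},
    ("i'm", "fine", "thanks"): {"<end>": 0.9, "fine": 0.1},
}

def predict_next(prefix):
    return TABLE.get(tuple(prefix), {"<end>": 1.0})

def greedy_decode(max_len=10):
    out = []
    while len(out) < max_len:           # stop if we exceed a certain length
        dist = predict_next(out)
        best = max(dist, key=dist.get)  # most likely word, fed back in
        if best == "<end>":             # stop on the special end token
            break
        out.append(best)
    return out

reply = greedy_decode()
```

With this toy table the loop produces the reply "i'm fine thanks" and then halts on the end token.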
The cost function — let me go back one slide. The cost function is that you sample a random word y_t. Suppose this is my input so far, and I sample y_t with t equal to 2, which means the word "fine". At the word "fine", I want to increase the probability of the model predicting the word "fine". The model makes a lot of predictions, and a lot of them will be incorrect: you have a probability for the word "a", and so on, all the way down to "zzzz"; you have a lot of probabilities, and you want the probability for the word "fine" to be as high as possible, so you increase that probability. Does that make sense? And you condition on the input: when I'm at "fine", my input would be "hi how are you" and "i'm" — that's all I see — and then I need to make that prediction. And if I'm at the word "thanks", my input would be "hi how are you" and "i'm fine", and I want the probability of "thanks" to be high. Okay. A question here — oh, I haven't thought about that yet; the question is how you personalize. Well, one way to do it is to embed a user as a vector: suppose you have a lot of users, and you embed each user as a vector; that's one way to do it. Another question: suppose my beam size is 10; then it goes from 10 to a hundred to a thousand, and suddenly it grows very quickly — if your sequence is long, you end up with k to the n or something like that. Well, one way to handle it is truncated beam search, where any sequence with very low probability you just drop, you don't use it anymore. So you can go from 3 to 9, then prune down to 7, then back up to 9.
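The truncated beam search just described can be sketched as follows, reusing the same kind of fake `predict_next` table as in the greedy sketch; the truncation here keeps a fixed number of most-probable prefixes at each step, and finished hypotheses are ranked by joint log probability.

```python
import math

# Fake next-word distributions standing in for a trained decoder.
TABLE = {
    (): {"hi": 0.5, "yes": 0.3, "please": 0.2},
    ("hi",): {"there": 0.6, "<end>": 0.4},
    ("yes",): {"<end>": 0.9, "please": 0.1},
    ("please",): {"<end>": 1.0},
    ("hi", "there"): {"<end>": 1.0},
    ("yes", "please"): {"<end>": 1.0},
}

def predict_next(prefix):
    return TABLE.get(tuple(prefix), {"<end>": 1.0})

def beam_search(beam_size=3, max_len=5):
    beams = [((), 0.0)]                 # (prefix, log joint probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            for w, p in predict_next(prefix).items():
                if w == "<end>":
                    finished.append((prefix, lp + math.log(p)))
                else:
                    candidates.append((prefix + (w,), lp + math.log(p)))
        # truncate: keep only the beam_size most probable prefixes
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if not beams:
            break
    best, _ = max(finished, key=lambda c: c[1])
    return list(best)

best = beam_search()
```

Note that "yes" alone has the single most likely first token among completions, but "hi there" has the highest joint probability (0.5 × 0.6 = 0.3 vs. 0.3 × 0.9 = 0.27), which is why you rank whole beams rather than being greedy.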
And you keep going, so that way you don't end up with a huge beam; in practice, a beam size of three or ten usually works just fine. Okay, another question. So, because it's an RNN, we don't have to pad the input; but to be fast we sometimes do pad the input, to make sure that batch processing works well, and we pad with zero tokens. So suppose you have sequences of length ten: you build a graph for ten; when you have a batch of length twenty, you build a graph for twenty, etc. That makes the GPU very happy. A question there — my interpretation of your question is how you insert the user embedding into the model, is that correct? If you want to personalize, then at the beginning you have a vector: that's the vector for Quoc with ID 12345, and if it's Peter, then the vector would be 5678. That's one way to do it; there's more than one way. You can do it at the end, or at the beginning, or you can insert it at every prediction step, but my proposal is just to put it at the beginning; it's simpler. Okay, a question there — yes, that's a very good question. The question is: what if the model derails? If we make a prediction, and it's a bad prediction the model has never seen, then it keeps derailing and it will produce garbage. That's a good question, and I'm going to get to that: this slide is an algorithm called scheduled sampling. In scheduled sampling, instead of always feeding the truth during training, you can feed what is sampled from the softmax — what is generated by the model — back in as input, so that the model learns that if it produces something bad, it can actually recover from it.
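A minimal sketch of the scheduled-sampling idea: at each decoder step, feed the ground-truth previous token with some probability, and otherwise feed what the model itself produced. The `sample_from_model` callable and the token names are placeholders; in practice the truth probability is decayed over the course of training.

```python
import random

random.seed(0)

def decoder_inputs(gold, sample_from_model, p_truth):
    # With probability p_truth feed the ground-truth previous token;
    # otherwise feed the model's own sample (so it learns to recover).
    fed, prev = [], "<end>"
    for gold_tok in gold:
        fed.append(prev)
        model_tok = sample_from_model(fed)
        prev = gold_tok if random.random() < p_truth else model_tok
    return fed

# toy "model" that always emits the unknown token
fed = decoder_inputs(["i'm", "fine", "thanks"],
                     lambda prefix: "<unk>", p_truth=0.5)
```

With `p_truth=1.0` this reduces to ordinary teacher forcing; with `p_truth=0.0` the decoder trains purely on its own outputs.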
So that's one way to address this issue. Does that make sense? Any questions? There's a question here — yes, so in this algorithm, the question is how large the size of the decoder is. Well, my answer is: try to make it as large as possible, but it's going to be very slow. And what happens in this algorithm is that you use a fixed-length embedding to represent a very long input with long-term dependencies, and that's going to be a problem; I'm going to come back to that issue with the attention model in a second. Okay, here's a question — ah, so does the model learn synonyms, is that the question? I see. Well, it turns out that if you visualize the embeddings, "good" and "fine" and so on map very closely in the embedding space, but in the output we don't know what else to do. The other approach is to train the word embeddings using word2vec and then ask the model to regress to those word embeddings; that's one way to address this issue. We tried something like that and it did not work very well, so whatever we have here was pretty good. Okay, I have to keep going, but the algorithm that you've seen so far actually answers some emails: if you use the Smart Reply feature in Inbox, it already uses this system in production. For example, in this email, my colleague Ricardo got an email from his friend saying "hey, we wanted to invite you to join us for an early Thanksgiving on November 22nd, beginning around 2:00 p.m.;
please bring your favorite dish and reply by next week". And then it would propose three answers: the first answer would be "count us in", the second would be "we'll be there", and the third is "sorry, we won't be able to make it". Now, where do these three answers come from? Those are the beams. There's also an algorithm to manage the diversity of the beams, so that you don't end up with very similar answers: a heuristic that makes the beams a little bit more diverse, and then it picks the best three to present to you. Okay, any questions? Yes — there's no guarantee. The question is how I guarantee that the beam terminates with "end". There's no guarantee; it can go on forever, and indeed there are certain cases like that if you don't train the model very well. But if you train the model well, with very good accuracy, then the model usually terminates; I hardly see any cases where it doesn't terminate. There are some corner cases where it will do funny things, but you can stop the model after like a thousand or a hundred steps or something like that, to make sure it doesn't go on crazily. A question here — that's very interesting. It just comes out of the data, because there are a lot of emails, and if you invite someone, there's more than one person; it learns that Thanksgiving just means inviting the whole family, things like that. It just learns from statistics. Okay — so the question is whether I do any post-processing to correct the grammar of the beams; in this algorithm we did not have to do it. I have another question — okay, so the question is how contextual it is. I would say we don't have any user embedding in this, so it's pretty general: the input would be the previous emails, and the output would be the
predicted reply; that's all we have, so it sees the context, which is the thread so far. Okay, did I answer your question? You can catch me after the talk. Oh, I see — so the question is that some emails are not relevant for Smart Reply: maybe they're too long, or you should not reply, or something like that. In fact, we have two algorithms: one algorithm says yes or no to whether to reply at all, and then, after it passes a threshold, there's an algorithm that runs to produce the reply; it's a combination of the two algorithms I presented earlier. I have to get going, but you can come back to the question — there's a lot of more interesting stuff coming along. Okay, so what's the big picture so far? The big picture is that we have an RNN encoder that absorbs the input, and then we have an RNN decoder that tries to predict one token at a time in the output. Everything else works the same way: you can use stochastic gradient descent to train the algorithm, and then you do beam search decoding — usually you do a beam search of about 3 — and you should be able to find a good beam with the highest probability. Now, someone in the audience brought up the issue that we use a fixed-length representation: just before you make a prediction, the hidden state right before you go to the decoder is a fixed-length representation, and you can think of it as a vector that captures everything in the input. It could be a thousand words or it could be five words, and you use a fixed-length representation for a variable-length input, which is kind of not so nice, so we want to fix that issue. There's an algorithm coming along — it was actually invented at the University of Montreal; Yoshua is here — and the idea is to use attention. So how does attention work? In principle, what you want is something like this:
every time before you make a prediction — let's say you predict the word "i'm" — you kind of want to look again at all the hidden states so far; you want to look at everything you've seen in the input so far. When you produce "fine", you also want to see all the hidden states of the input, and so on. Now, how do you do that as a program? Well, you can do this: at h_m you predict a vector c, and let's say that vector has the same dimension as all the h's — so if your h_1 has dimension 100, then c also has dimension 100. Then you take c and compute the dot product with all the h's, and you get coefficients a_0, a_1, and so on up to a_n; those are scalars. After you have those scalars, you compute something called the b's, which is basically a softmax of all the a's: to compute that, b_i is the exponential of a_i divided by the sum of the exponentials. And then you take those b_i, multiply by h_i, take the weighted average — you take the sum — and send it as an additional signal to predict the word "i'm"; and you keep going with that. So at the next step you also predict another c, take that c to compute the dot products, compute the a's, then compute the b's, take the weighted average, and send it to the next prediction; and you use stochastic gradient descent to train everything. This algorithm is implemented in TensorFlow. Okay, so how do you interpret what's going on here? Suppose you want to use this for translation. In translation, the input would be, for example, "hi how are you", and the output is "hola como estas" or something like that. When you produce the first word, you want "hola" to correspond to the word "hi", because there's a one-to-one mapping between the word "hi" and "hola".
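The attention computation just described — dot products, softmax, weighted average — takes only a few lines; the variable names below mirror the a's and b's on the slide, with random vectors standing in for trained hidden states and the predicted query c.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 100                  # 5 input steps, hidden dimension 100
H = rng.normal(size=(n, d))    # encoder hidden states h_0 .. h_{n-1}
c = rng.normal(size=d)         # query vector predicted before this output step

a = H @ c                      # scalar scores: a_i = c . h_i
b = np.exp(a - a.max())        # softmax of the scores ...
b = b / b.sum()                # ... gives the weights b_i
context = b @ H                # weighted average of the hidden states

# `context` is the additional signal sent into the next prediction
```

Because the b's sum to one, `context` is a convex combination of the hidden states, so the model can "look back" at whichever input positions score highest against c.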
If you use the attention model, the b's that you learn will put a strong weight on the word "hi" when producing "hola", and smaller weights on all the other words; and if you keep going, when you say "como", it will focus on "how", etc. So it moves the coefficients around, putting strong emphasis on the relevant word, and especially for translation it's extremely useful, because you know there's a near one-to-one mapping between the input and the output. Any questions so far? This is definitely very complicated. Yes, I have a question — so the question is how I deal with languages where the order is, like, reversed, for example English to Chinese or Japanese, where some of the verbs get moved and things like that. Well, I did not hard-code the b's; they are learned, so by virtue of learning, the model figures out what weights to put on the input, and those are computed basically by gradient descent; it just keeps on learning. Okay, another question — so the question is whether there's any work on putting attention in the output. I think you can do that; I'm not too familiar with any work there, but I think it's possible to do, and I think some people have explored something like that. Another question — so the question is about casing: right now the word "hi" is capitalized at the first character, and that doesn't mean I'm using two entries in the vocabulary. In practice you should do some normalization: if you have a small dataset, what you should do is normalize the text, so "Hi" would be lowercased, etc.; if you have a huge dataset, it doesn't matter — it just learns. A question there — so the question is whether, in a sense, it captures the positional information in the input; yeah, I agree. A question there, about punctuation — so the question is what I do with punctuation.
Right now I've just presented the algorithm as a very simple, very basic implementation, but one thing you can do is, before you train the algorithm, put a space between each word and the punctuation; that step is called tokenization or normalization in language processing, and you can use something like the Stanford NLP package to normalize your text so that it's easy to train. Now, if you have infinite data, it can just learn by itself. Okay, I should get going, because there's a lot of other interesting stuff. So that was the basic implementation, but if you want to get good results and you have big datasets, one thing you can do is make the network deep, and one way to make it deep is the following: you stack your recurrent networks on top of each other. In the first sequence-to-sequence paper we used a network of four layers, but people have gradually been increasing that to six and so on, and they're getting better and better results — just like in ImageNet, if you make the network deep you also get better results. So, if you want to train sequence-to-sequence with attention: a couple of years ago, when many labs were working on this problem, we were behind the state of the art, but right now, in many translation tasks, this model has already achieved state-of-the-art results on a lot of the WMT datasets. To train this model: tip number one is that, as I said, you might end up with a lot of out-of-vocabulary issues, so "Barack Obama" would be seen as unknown, "Hillary Clinton" seen as unknown. You might use something like word segments, where you segment the words into pieces — for example "Barack" would be "bar" and "rack" and so on — or you can use other smart algorithms, for example a word-character split, where you split the words that are unknown into characters and then you treat the
characters as regular tokens. There's some work at Stanford showing that this works very well, so that's one way to do it. Tip number two: when you train this algorithm, during forward propagation and back-propagation you essentially multiply by a matrix many, many times, so you can get explosion of the function value or of the gradient, or implosion as well. One thing you can do is clip the gradient at a certain value: you say that if the magnitude of the gradient is larger than ten, you set it to ten. Tip number three is to use a GRU, or, as in our work, the long short-term memory, the LSTM. So I want to revisit this long short-term memory business a little bit. What is the long short-term memory? In a plain RNN cell you concatenate your input and the hidden state, multiply by some theta, and apply an activation function, let's say a hyperbolic tangent; that's the simple RNN function. In the LSTM, you multiply the concatenation of the input and h by a huge matrix; call it theta, and this theta is four times bigger than the theta in the RNN cell. Then you take the z that comes out and split it into four blocks: from those blocks you compute the gates, and you use something called the cell, and you keep adding the newly computed values to the cell. So this part that I drew as an integral over c, what it does is keep a hidden state that information is added to: it doesn't multiply information, it keeps adding information. You don't need to know all of this if you just want to apply an LSTM, because it's already implemented in TensorFlow. Any questions so far? Okay, so in terms of applications, you can use this to do summarization, and I've started seeing pretty exciting work on summarization, and you can do image captioning.
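The LSTM step just described, one big matrix multiply whose output is split into four blocks, with a cell that is updated additively, can be sketched in plain Python (a minimal illustration with list-based vectors and a hypothetical weight matrix `theta`; real implementations such as TensorFlow's also include bias terms):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x, h, c, theta):
    """One LSTM step. theta maps [x; h] to a vector four times the hidden
    size, which is split into input gate i, forget gate f, output gate o,
    and candidate g; the cell is updated additively: c' = f*c + i*g."""
    concat = x + h                                   # the concatenation [x; h]
    n = len(h)
    z = [sum(row[k] * concat[k] for k in range(len(concat)))
         for row in theta]                           # one big matrix multiply
    i = [sigmoid(v) for v in z[0:n]]                 # input gate
    f = [sigmoid(v) for v in z[n:2*n]]               # forget gate
    o = [sigmoid(v) for v in z[2*n:3*n]]             # output gate
    g = [math.tanh(v) for v in z[3*n:4*n]]           # candidate values
    c_new = [f[j] * c[j] + i[j] * g[j] for j in range(n)]   # additive memory
    h_new = [o[j] * math.tanh(c_new[j]) for j in range(n)]
    return h_new, c_new

# one step with hidden size 1: theta has 4*1 rows and len(x)+len(h) columns
h, c = lstm_cell([1.0], [0.0], [2.0], [[0.0, 0.0]] * 4)
```

With all-zero weights every gate sits at 0.5 and the candidate is 0, so the cell value is simply halved; the point is that the cell accumulates by addition rather than repeated multiplication.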
The input in that case would just be a representation of the image coming out of VGG or GoogLeNet and so on; you send it to the RNN and it does the decoding for you. Or you can use it for speech recognition or transcription, or you can use it for question answering. So, for the next part of the talk, let me say a little bit about speech recognition. In speech recognition the input could be waveforms, and the output could be some words, you know, "hi how's it going". One thing you can do is chop your input into windows, the green boxes on the slide, crop a lot of them, and send them to an RNN, or you convert them into MFCCs or a spectrogram or something like that before you send them to the RNN. Then you use the algorithm I described earlier, with attention, and you do the transcription, predicting one word at a time in the output. Now, the problem with this algorithm is that when it comes to speech you end up with a lot of input; you can end up with thousands and thousands of steps, so back-propagating through time, even with attention, can be difficult. One thing you can do is use a kind of pyramid to shrink the input: if you stack enough layers, you can reduce your input by a factor of eight or sixteen, and then you produce the output. We worked on an implementation where the output is actually characters, like in the Baidu work that uses CTC. I have to say that a strength of this algorithm is that you have an implicit language model in the output: when I predict the word "how", it is conditioned on "hi" and the tokens before it, as well as on the input, so there is already an implicit language model. But the problem is that you have to wait until the end of the input to do the decoding, so the decoding has to be done offline.
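The pyramid trick mentioned above can be sketched with a toy reduction layer (an assumption for illustration: each layer simply concatenates adjacent pairs of frames, halving the sequence length, so three layers shrink the input by a factor of eight):

```python
def pyramid_reduce(frames):
    """Concatenate adjacent pairs of frames, halving the sequence length
    while making each remaining frame twice as wide."""
    if len(frames) % 2:                        # pad with a zero frame if odd
        frames = frames + [[0.0] * len(frames[0])]
    return [frames[i] + frames[i + 1] for i in range(0, len(frames), 2)]

seq = [[float(t)] for t in range(8)]           # 8 one-dimensional frames
for _ in range(3):                             # three pyramid layers
    seq = pyramid_reduce(seq)
# the 8 input frames collapse into a single, wider frame
```

In a real model each layer would also run an RNN over the shortened sequence; only the length reduction is shown here.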
If you use this for voice search, it might not be so nice, because people want to see some output right away. In that case there's an algorithm that can do it in an online fashion, block by block. I also have to mention that in translation, sequence-to-sequence with attention works great, it's among the state of the art, but when it comes to speech it doesn't work as well as CTC; at least in published results we're not as good as CTC, which is what Adam talked about earlier, or some of the HMM-DNN hybrids, which are the most widely used speech systems currently. So I want to pause there and take questions. Any questions? I have a question at the back. So the question is, how does this work in translation? Well, in translation what we do is basically we have pairs of sentences, for example "hi how are you" and "hola como estas", and we have many pairs like this, and we just feed them into sequence-to-sequence with attention. At every step we again predict one word at a time, but before we make a prediction the model has the attention, so it actually sees the input once more before it makes a prediction. That's how it works. Can you repeat: what is the issue with the model again, please? I see; well, I can't quite follow the question, but let's take it offline, is that okay? And then maybe we can do some paper together. Okay, I have a question here. So, the inbox model that I presented was in English, but there's no limitation in the model in terms of language. Let's suppose that in your inbox you sometimes write in English, sometimes in Vietnamese, sometimes in Spanish, whatever, and you personalize by the user embedding that I described; I would say it will just learn your behavior and predict the words that you want. But make sure that your
output vocabulary is large enough that it covers not only the English words but also the Spanish words and the Vietnamese words and so on. Your vocabulary is not going to be 20,000; it's going to be more like a hundred thousand, because you have more choices, and then you have to train your model on those examples. It's a matter of the training data, that's all. Okay, I have a question here. So the question is: in the case of voice search, right now you have to wait until the end to make a prediction; is there any other way? The answer is yes, you can make a prediction block by block. You can figure out a simple algorithm to segment the speech, make a prediction, then take the prediction and feed it in as input at the next block, and keep going like that. So in theory you can do online decoding, but that work is currently work in progress. How about that? Okay, I have a question there. Yes, over here: the question is about the email data. We have some input email and some output email, where the expert-written reply is the output, and you can just train it that way. Okay, I have a couple of questions. The question is that in speech recognition CTC seems to be a very nice framework, because it matches the monotonic alignment between the output and the input, but CTC makes an independence assumption, it doesn't have a language model in it, so maybe sequence-to-sequence can address this. Yeah, I think that's a great idea; maybe we should write a paper together. I haven't seen it done, but I think that's a very good idea. Next question. Okay, great: so the question is, since right now we predict one step at a time, is there any way to look globally at the output and maybe use some kind of reinforcement learning to adjust it? The answer is yes: there is
a recent paper from Facebook, I think called sequence-level training or something like that, where they don't optimize one step at a time; they look at the output globally and try to improve the word error rate, or the BLEU score, or things like that for translation, and it seems to make some improvement in the metrics they care about. If you show the output to humans, though, people still prefer the output from this model, so some of the metrics we use in translation and so on might not be the metrics we should optimize, and next-step prediction seems to produce output that people like a lot in translation. So the question is, can we add a GAN loss? Yeah, I think that's a great idea. I have a question here. So the question is, is there any way to incorporate user input? Let's suppose you want to reply "hola", or rather "hi how are you": as soon as the person types "hola", that actually restricts your beam, so you can condition your beam on the first word "hola" and your beam will be better. Yeah, I think that's a good idea. I have a question: oh, how much data did we use? In translation, for example, we used several of the WMT corpora, and the WMT corpora usually have tens of millions of sentence pairs, something like that, and every sentence has twenty or thirty words on average, I can't remember exactly, but that order of magnitude. I have a question there; I can't really hear. How does it compare to Google search auto-completion? I honestly don't know what they use underneath Google search auto-completion, but I think they should use something like this. Okay, I still have lots of interesting stuff coming along, so let's keep going.
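The point about conditioning decoding on a typed prefix can be sketched with greedy decoding (a toy stand-in: `score` here is a hypothetical function from the context so far to a per-word score, playing the role of the trained network):

```python
def constrained_decode(score, vocab, prefix, max_len=10):
    """Greedy decoding with a forced prefix: the words the user already
    typed are fixed, then the model continues one word at a time."""
    out = list(prefix)
    while len(out) < max_len:
        best = max(vocab, key=lambda w: score(tuple(out), w))
        if best == "</s>":                     # end-of-sentence token
            break
        out.append(best)
    return out

# toy scores: after "hola" the model prefers "como", then "estas", then stops
table = {("hola",): "como",
         ("hola", "como"): "estas",
         ("hola", "como", "estas"): "</s>"}
score = lambda ctx, w: 1.0 if table.get(ctx) == w else 0.0
reply = constrained_decode(score, ["como", "estas", "</s>"], ["hola"])
# reply == ["hola", "como", "estas"]
```

A real system would do the same with beam search, keeping only hypotheses that begin with the typed prefix.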
So what's the big picture? So far I've talked about sequence-to-sequence learning, and yesterday Andrew was talking about the big trends in deep learning; the second trend was basically end-to-end deep learning, and you can characterize sequence-to-sequence learning as end-to-end deep learning as well. The framework is very general, so it should work for a lot of NLP-related tasks, because in a lot of them you have an input sequence and an output sequence; in NLP the input could be some text and the output could be parse trees, for example, and that's also possible. It works great when you have a lot of data. When you don't have enough data, then maybe you want to consider dividing your problem into smaller components, training sequence-to-sequence on the sub-components, and then merging them. Now, if you don't have a lot of data but you have a lot of related tasks, it's also possible to merge all these tasks by combining the data and adding an indicator bit that says "this is translation", "this is summarization", "this is email reply", and training jointly; that should improve your output too. This basically concludes the part about sequence-to-sequence, and in the next part I'm going to place sequence-to-sequence in the big picture of active, ongoing work in neural nets for NLP. If you have any questions you can ask now; I'll take maybe two questions because I think I'm running out of time. I have a question. So the question is, does the model handle emoji? I don't know, but an emoji is just a piece of text, right, so you can feed it in as another extra token; if you make your vocabulary 200,000, you should be able to cover emoji as well. I have a question: if you have new data coming in, should you retrain the model? Well, towards the end of training we lower the learning rate, so if you just add new data the model will not make many useful updates; usually
you can add the new data, increase the learning rate, and then continue to train, and that should work. Okay, I already took two questions, so let's keep going. So this is an active area that is very exciting, which is the area of automatic question answering. You can think of the setup as: can you read a Wikipedia page and then answer a question, or can you read a book and answer a question? In theory you can use sequence-to-sequence with attention to do this task. It's going to look like this: you read the book, one token at a time, and after the book you read the question, and then you use the attention to look back at all the pages, and you predict the answer tokens. Sometimes we do answer questions that way: when we don't have knowledge of the fact, we actually read the book again to find it. But a lot of the time, if you ask me "is Barack Obama the president of the United States", I would say yes, because it's already in my memory. So maybe it's better to augment the RNN with some kind of memory, so that it doesn't have to do this looking back again; it's kind of annoying to look back again. So there's an active area of research here; I'm not the definitive expert, but I'm very aware of it, so I can place you in the right context. Work in this area includes memory networks by Weston and folks at Facebook, Neural Turing Machines at DeepMind, dynamic memory networks, which Richard Socher presented yesterday, stack-augmented RNNs by Facebook again, and so on. Now, I want to show you at a high level what this augmented memory means. Let's think about the attention mechanism. The attention looks like this: in the encoder you look at some input, and you have a controller, which is your h variable, and you keep updating that h variable, but along the side you're
going to write your h1, h2, h3 and so on down into memory; you store them into a memory. And in the decoder, what you do is continue producing some output: you keep updating your controller g, but you read the h's from memory. So again: in the encoder you write to memory, and in the decoder you read from memory. Now let's try to be a little more general, and the generalization would be that at any point in time you can read and write: you have a controller, and you can read and write, read and write, all the time. To do that you have the following architecture: you have some big memory bank, and you can decide to write some information into it, computed from a combination of the memory bank at the previous step and the hidden variable at the previous step, and then you also read from the memory into the hidden state, you make an update, and you can keep going like that forever. This concept is called an RNN with augmented memory. Is that somewhat clear? Any questions? You have a question: the question is, when you read, do you read the entire memory bank? A lot of these algorithms use soft attention, so yes, they look at the entire memory. You could instead predict where to look and read only that block, but the problem with that is that it's not differentiable anymore, because the blocks you don't read don't contribute to the gradient, so it's going to be hard to train; you can use REINFORCE and so on to train it. There's a recent paper on reinforcement learning for Neural Turing Machines that does something like this; not exactly, but it deals with discrete actions. Any questions? No questions, wow.
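The soft read just described, looking at the entire memory bank so that every slot contributes to the gradient, amounts to a softmax over similarity scores followed by a weighted average. A minimal sketch with dot-product scores and list-based vectors (an assumption for illustration, not any particular paper's exact formulation):

```python
import math

def attention_read(query, memory):
    """Soft attention over a memory bank: score every slot against the
    query by dot product, softmax the scores, and return the weighted
    average of the slots."""
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    mx = max(scores)                              # for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    return [sum(w * slot[d] for w, slot in zip(weights, memory))
            for d in range(dim)]

# a query that lines up with the first slot reads back mostly that slot
read = attention_read([10.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With a neutral query the read is just the average of the slots; the sharper the scores, the closer the soft read comes to a hard lookup.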
Okay, so another extension that a lot of people talk about is using an RNN with augmented operations: you want to augment the neural network with some kind of operations, like addition, subtraction, multiplication, the sine function, and so on, lots of functions. To motivate this, think about QA tasks that fall into this category. For example, the context says "the building was constructed in the year 2000", and later "it was then destroyed in the year 2010", and the question is "how long did the building survive?", and the answer is ten years. How would you answer this question? You would say 2010 minus 2000, ten years. Now, a neural net, if you train it with a lot of examples, can do that too; it can learn to subtract numbers and things like that, but it requires a lot of data to do so. So maybe it's better to augment it with functions like addition and subtraction. The way you can do it is that the neural network reads all the tokens so far and pushes the numbers onto a stack, and the net is augmented with a subtraction function and an addition function, and it assigns a probability to each of these two functions; on the slide, the greener the arrow, the higher the probability. So you assign the two probabilities, compute the weighted average of the values coming out of the two functions, take that result, and push it onto the stack for the next step, and in the next step you can call addition and subtraction again, and so on. That's the principle behind something called neural programmers, or neural programmer-interpreters; there were two papers on this last year, one from Google Brain, and it was also discussed yesterday. So that's some of the related work in the area of augmenting recurrent networks with operations, with memory, and so on. Now, what's the big picture? I want to revisit what I said earlier: what I've talked about today is sequence-to-sequence learning, and it's end-to-end deep learning, so it's one of the big trends happening in natural language
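The weighted average over operations described a moment ago can be written out directly (a sketch with only two operations and a scalar probability; in the neural programmer that probability is produced by the network rather than given):

```python
def soft_op(a, b, p_add):
    """Soft selection over operations: the result is the expected value of
    applying addition (with probability p_add) or subtraction (otherwise).
    Because it is a weighted average, it stays differentiable in p_add."""
    return p_add * (a + b) + (1.0 - p_add) * (a - b)

# the building example: with all the probability on subtraction,
# 2010 and 2000 give the answer 10
years = soft_op(2010.0, 2000.0, p_add=0.0)
# years == 10.0
```

Stacking such steps, with each result pushed back for the next step, is the principle the talk describes for neural programmers.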
processing. It's very general, and it's a very supervised learning algorithm, so if you have a lot of supervised data it should work well. If you don't have enough supervised data, then consider dividing your problem and training the different components separately, or you can train jointly in a multi-task setting. People also train it jointly with an autoencoder, namely reading the input sentence and then predicting that sentence again, jointly with all the other tasks, and that works as well. If you go home and want to make an impact at your work tomorrow, what I've covered so far can make some impact. If you want to do research, I think things like memory augmentation and operation augmentation are some of the exciting areas; they still seem to be work in progress, but I would expect a lot of advances in this area in the near future. If you want to know more, there's a blog post about attention and augmented recurrent networks, and I also wrote some pretty simple tutorials. The sequence-to-sequence model with attention for translation is implemented in TensorFlow, so you can download TensorFlow and train what I described today. There's a lot of work going on in this area, and many of these papers are not mine; as you can see from the slide, there are a lot of papers coming along in this area. So I'll pause there; I have five minutes to answer questions. I have a question there. Can you speak into the microphone, because I can't hear very well, and with the microphone I think other people can hear as well. The question: when you're training a Q&A network, taking the example of training from a book to answer questions, let's say Harry Potter, "who was Harry Potter's father?", there could be many books that have a character named Harry, so
there is a context-resolution issue: which Harry should the model answer the question for? How do you solve that context problem when training this kind of Q&A network? I think that's a great question. One thing is that you can always personalize: as I talked about, you can have a representation for the user, and then you know that when this user says "Harry" it's more likely to be Harry Potter, because he has been reading a lot of books about Harry Potter. But for this talk I wanted to keep things as simple as possible, so for now the user has to ask about "Harry Potter" rather than "Harry". What I'm saying is, if you represent user vectors and inject additional knowledge about the user and the context as additional tokens in the input of the net, the net can figure it out by itself. So that's one way to do it. Okay, I have a question. You did some work on doc2vec; do you have an idea of the state of the art in generalizing word2vec to more than one word? I see; I think skip-thought is an interesting direction here. So doc2vec,
that is one way, but there's also skip-thought. The idea of skip-thought, with Ruslan Salakhutdinov as an author on the paper, is basically to use sequence-to-sequence to predict the neighboring sentences: the input would be the current sentence, and the output would be the previous sentence or the next sentence, and you train a model like that; the model is called skip-thought. I have heard a lot of good things about skip-thought: you can take the embedding at the end and do document classification and things like that, and it works very well, so that's probably one place you can go. A colleague of mine at Google is also working on something like an autoencoder: instead of predicting the next sentence, you predict the current sentence, trying to repeat the current sentence, and that kind of works well too. What are my thoughts on how to solve the common-sense reasoning problem? Common sense: I'm deeply interested in common sense, but I've got to say I have no idea. First of all, a lot of knowledge about the world is not captured in text, for example gravity and things like that, so maybe you really need to combine a lot of modalities; that's one way to think about it. The other thing is to make sure that unsupervised learning works; that's another approach. But this is a difficult research area, and I'm just making guesses right now. Is there a good way to have the network use rules, using some soft mechanism? Yes, so the question is how you represent rules. If you think about the neural programmer network that is augmented with addition and subtraction, those are rules, right? You can augment it with a table of rules and ask the network to attend to that table; people have looked in this direction, so that's one way to do it,
basically augmenting it to do some logical reasoning. Okay, great talk, thank you. Is there a practical rule of thumb for how many sequence pairs you need to train such a model successfully, and are there any tips to reduce how many pairs you need? Okay, so usually the bigger the dataset the better, but the corpus people train translation on, for example English to German, is only about three to five million pairs of sentences or something like that, so that's kind of small, three million, and people are still able to reach the state of the art with it, so that's pretty encouraging. Now, if you don't have a lot of data, I would say: pre-train your word vectors with a language model or word2vec, since the word embeddings are one place where you have a lot of parameters; pre-train your model with some kind of language model, and the same goes for the softmax, another place where you have a lot of parameters; or use dropout on the input embeddings, or drop out some random words in the input sentence. Those things can improve the regularization when you don't have a lot of data. Okay, thank you all; we'll reconvene at six o'clock for Yoshua Bengio's closing keynote.