Nuts and Bolts of Applying Deep Learning (Andrew Ng)
F1ka6a13S9I • 2016-09-27
So, you know, when we were organizing this workshop, my co-organizers initially asked me, "Hey Andrew, at the end of the first day, go give a visionary talk." So until several hours ago my talk was advertised as a visionary talk. But as I was preparing this presentation over the last several days, I tried to think about what would be the most useful information to you: the things you could take back to work on Monday and use to do something different at your job next Monday. For context, as Pieter mentioned, I lead Baidu's AI team, a team of about a thousand people working on vision, speech, NLP, lots of applications of machine learning. So what I thought I'd do instead, rather than present the shiniest pieces of deep learning that I know, is take the lessons I've seen at Baidu that are common across so many different areas and applications (autonomous cars, augmented reality, advertising, web search, medical diagnosis) and share the common lessons, the simple, powerful ideas, that I've seen drive a lot of machine learning progress at Baidu. The patterns I see across a lot of projects, I thought, might be the patterns most useful to you as well, whatever you're working on in the next several weeks or months.

One common theme that will appear in this presentation is that the workflow of organizing machine learning projects feels like it's changing in the era of deep learning. For example, one of the ideas I'll talk about is bias and variance. This is a super old idea, and many of you, maybe all of you, have heard of bias and variance, but in the era of deep learning I feel like there have been some changes to the way we think about them. So I want to talk about some of these ideas, which maybe aren't even deep learning per se, but which have been slowly shifting as we apply deep learning to more and more of our applications. Oh, and instead of holding all your questions until the end, if you have a question in the middle, feel free to raise your hand; I'm very happy to take questions in the middle, since this is a more informal whiteboard talk. And also, we want to say hi to all the home viewers. Hi!

So one question I still get asked sometimes, and Andrej alluded to this earlier, is this: a lot of the basic ideas of deep learning have been around for decades, so why are they taking off just now? Why is it that these neural networks we've known about for decades are working so well now? I think the one biggest trend in deep learning is scale: scale drives deep learning progress. Andrej mentioned scale of data and scale of computation, and let me draw a picture that illustrates that concept a little more. If I plot a figure where the horizontal axis is the amount of data we have for a problem and the vertical axis is performance (say, the x-axis is the amount of spam data you've collected and the y-axis is how accurately you can classify spam), then if you apply traditional learning algorithms, what we found was that performance often starts to plateau after a while. It was as if the older generations of learning algorithms, support vector machines, logistic regression, didn't know what to do with all the data we finally had. And what happened over the last ten or twenty years, with the rise of the internet, the rise of mobile, the rise of IoT, was that as a society we sort of marched to the right of this curve, for many problems, not all problems. And so, with all the buzz and all the hype about deep learning, in my opinion the
number one reason deep learning algorithms work so well is this: if you train what I'll call a small neural net, maybe you get slightly better performance; if you train a medium-sized neural net, maybe you get even better performance; and it's only if you train a large neural net, a model with the capacity to absorb all this data we have access to, that you get the best possible performance. I feel like this is a trend we've seen in many verticals, many application areas.

A couple of comments on this. First, when I draw this picture, some people ask me: does this mean a small neural net always dominates a traditional learning algorithm? And the answer is, not really. Technically, if you look at the small-data regime, the left end of this plot, the relative ordering of these algorithms is not that well defined. It depends on who's more motivated to engineer the features better. If the SVM person is more motivated to spend more time engineering features, they might beat out the neural network, because when you don't have much data, a lot of the knowledge in the algorithm comes from hand engineering. But this trend is much more evident in the regime of big data, where you just can't hand-engineer enough features, and a large neural net combined with a lot of data tends to outperform.

The implication of this figure is that in order to get the best performance, in order to hit that target, you need two things: you need to train a very large (or at least reasonably large) neural network, and you need a large amount of data. This in turn has created pressure to train large neural nets as well as to get huge amounts of data. One of the other interesting trends I've seen is that, increasingly, I'm finding it makes sense to build an AI team as well as a computer systems team, and have the two teams sit next to each other. When we started Baidu research, we organized our team that way, and other teams are also organized this way; I think Pieter mentioned that OpenAI also has a systems team and a machine learning team. The reason we're starting to organize our teams that way, I think, is that some of the computer systems work we do (we have an HPC team, a high-performance computing, supercomputing team, at Baidu) involves extremely specialized knowledge that is just incredibly difficult for an AI researcher to learn. Some people are super smart; maybe Jeff Dean is smart enough to learn everything; but it's just difficult for any one human to be sufficiently expert in HPC and sufficiently expert in machine learning. And so we've been finding (and Shubho, actually, one of the co-organizers, is on our HPC team) that bringing talent and knowledge from these multiple sources, multiple communities, allows us to get our best performance.

You've heard a lot of fantastic presentations today, and I want to draw one other picture, which is how I mentally bucket work in deep learning. This might be a useful categorization: when you look at a talk, you can mentally put it into one of the buckets I'm about to draw. I feel like there's a lot of work on what I'll call general DL, general models: basically the type of model Hugo Larochelle talked about this morning, where you have really densely connected, fully connected (FC) layers. There's a huge bucket of models there. Then I think a second bucket is sequence models, 1D sequences, and this is where I'd bucket a lot of the work on RNNs, you know, LSTMs, GRUs,
some of the attention models, which I guess Yoshua Bengio may talk about tomorrow, or maybe others, maybe Quoc, I'm not sure. So the 1D sequence models are another huge bucket. The third bucket is the image models. This is really 2D, and maybe sometimes 3D, and this is where I'd tend to bucket all the work on CNNs, convolutional nets. And then in my mental bucketing there's a fourth one, which is "other," and this includes unsupervised learning, reinforcement learning, as well as lots of other creative ideas being explored: things like slow feature analysis, sparse coding, various models in the "other" category that I still find super exciting.

It turns out that if you look across industry today, almost all the value today is driven by the first three buckets. What I mean is that those three buckets of algorithms are driving much better products, or monetizing very well; they're just incredibly useful for lots of things. In some ways, I think the fourth bucket might be the future of AI. I find unsupervised learning especially super exciting, so I'm actually very excited about this as well, although I think that if on Monday you have a job and you're trying to build a product or whatever, the chance of you using something from one of the first three buckets will be highest. But I definitely encourage you to contribute to research in the fourth as well.

So I said major trend one of deep learning is scale. What I'd say is major trend two (of two trends; this is not going to go on forever) is the rise of end-to-end deep learning, especially for rich outputs. I'll say a little more in a second about exactly what I mean by that, but the examples I'm going to talk about are all from one of the three buckets: general DL, sequence models, and image (2D/3D) models. Let me best illustrate it with a few examples. Until recently, a lot of machine learning used to output just real numbers. So in Richard's example, you have a movie review (actually I had prepared totally different examples, but I was editing mine earlier to be more coherent with the speakers before me): you have a movie review and you output the sentiment, is this a positive or a negative review? Or you might have an image and want to do ImageNet object recognition, so this would be a 0/1 output, or maybe an integer from 1 to 1,000. So until recently, a lot of machine learning was about outputting a single number, maybe a real number, maybe an integer. And I think the number-two major trend, which I'm really excited about, is end-to-end deep learning: algorithms that can output much more complex things than numbers. One example you've seen is image captioning, where instead of taking an image and saying "this is a cat," you can now take an image and output an entire string of text, using an RNN to generate that sequence. I guess Andrej, who spoke just now, Oriol Vinyals, people at Baidu, a whole bunch of people have worked on this problem. Another, which my collaborator Adam Coates will talk about tomorrow (maybe Quoc as well, not sure), is speech recognition, where you take audio as input and directly output the text transcript. When we first proposed using this kind of end-to-end architecture to do speech recognition (we were building on the work of Alex Graves), it was very controversial. The idea of actually putting this in a production speech system was very, very controversial when we first said we wanted to do it, but I think the whole community is coming around to this point
of view. More recently, there's machine translation, say going from English to French, with Quoc and others working on it, and a lot of teams now. Or, given some parameters, synthesize a brand-new image; you saw some examples of image synthesis. So I feel like the second major trend of deep learning that I find very exciting, and that is allowing us to build transformative things we just couldn't build three or four years ago, is this trend toward learning algorithms that output not just a number but very complicated things: a sentence, a caption, a French sentence, an image, or, as in the recent WaveNet paper, audio.

Despite all the excitement about end-to-end deep learning, I think that, sadly, end-to-end deep learning is not the solution to everything. I want to give you some rules of thumb for deciding what exactly end-to-end learning is, when to use it, and when not to use it. The trend toward end-to-end deep learning has been this idea that instead of engineering a lot of intermediate representations, maybe you can go directly from your raw input to whatever you want to predict. For example (I'm going to use speech as a recurring example), for speech recognition, one previously used to go from the audio to hand-engineered features like MFCCs or something, then maybe extract phonemes, and then eventually try to generate the transcript. Oh, for those of you who aren't sure what a phoneme is: if you listen to the word "cat" and the word "kick," the "c" and the "k" are the same sound. Phonemes are these basic units of sound, hypothesized by linguists to be the fundamental units of speech, so "k," "ae," "t" would be maybe the three phonemes that make up the word "cat." Traditional speech systems used to work this way, and I think in 2011 Li Deng and Geoff Hinton made a lot of progress in speech recognition by saying we can use deep learning to do that first step. But the end-to-end approach would be to say: let's forget about phonemes, let's just have a neural net input the audio and output the transcript. So one end is the input, the other end is the output; the phrase "end-to-end deep learning" refers to just having a neural net, or some learning algorithm, go directly from input to output. This end-to-end formula makes for great PR, and it's actually very simple, but it only works sometimes.

Maybe I'll tell an interesting story: this end-to-end story really upset a lot of people. When we were doing this work, I used to go around saying, "I think phonemes are a fantasy of linguists, and we should do away with them." I still remember there was a meeting at Stanford (some of you know who it was) where a linguist was yelling at me in public for saying that. We turned out to be right, though.

The Achilles' heel of a lot of end-to-end deep learning is that you need tons of labeled data. If this is your x and that's your y, then for end-to-end deep learning to work you need a ton of labeled input-output data, (x, y) pairs. To take an example where one may or may not want end-to-end deep learning (this is a problem I learned about just last week from Curtis Langlotz and colleagues, one of whom is in the audience, I think): imagine you want to use X-ray pictures of a hand to predict a child's age. This is a real thing; doctors actually care to look at an X-ray of a child's hand in order to
predict the age of the child. So let me draw an X-ray image. This is the child's hand, and these are the bones (I guess this is why I'm not a doctor, okay, but that's a hand and you can see the bones). A more traditional algorithm might input an image and first extract the bones: figure out, oh, there's a bone here, there's a bone here, there's a bone here, and then measure the lengths of those bones. So, bone lengths. Then maybe apply some formula, some regression, some simple averaging, to go from the bone lengths to an estimate of the age of the child. That's a non-end-to-end approach to solving this problem. An end-to-end approach would be to take the image, run a convnet or whatever, and just try to output the age of the child directly. And I think this is one example of a problem where it's very challenging to get end-to-end deep learning to work, because you just don't have enough data: you just don't have enough X-rays of children's hands annotated with ages. Instead, where we see deep learning coming in is in the first step: going from the image to figuring out where the bones are; use deep learning for that. The advantage of this non-end-to-end architecture is that it lets you hand-engineer in more information about the system, such as how bone lengths map to age, which you can get tables for. There are a lot of examples like this, and I think one of the unfortunate things about deep learning is that, let's see, for suitably sexy values of x and y you can almost always train a model and publish a paper, but that doesn't always mean it's actually a good idea. Pieter? Yes, that's true: Pieter is pointing out that in practice, if the formula is a fixed function f, you could backprop all the way from the age back to the image. Yeah, that's a good idea, actually. Who was it that just said, "you'd better do it quickly"?

Let me give a couple of other examples where it might be harder to backprop all the way through. Take self-driving cars. Most teams are using an architecture where you input an image of what's in front of the car, let's say, and then detect other cars, and also use the image to detect pedestrians (self-driving cars are obviously more complex than this). Then, now that you know where the other cars and the pedestrians are relative to your car, you have a planning algorithm to come up with a trajectory, and now that you know what trajectory you want your car to drive through, you can compute the steering direction, let's say. This is actually the architecture most self-driving car teams are using. There have also been interesting approaches that say: I'm going to input an image and output a steering direction. And I think this is an example where, at least with today's data and technology, I'd be very cautious about the second approach. I think if you had enough data the second approach would work, and you could even prove a theorem showing that it will work, I think. But I don't know that anyone today has enough data to make the second approach really, really work well. And I think Pieter made a great comment just now: some of these components will be incredibly complicated. The planner could be an explicit search, and you could design a really complicated path planner to generate the trajectory, and your ability to hand-code that still has a lot of value. So this is one thing to watch out for. I have seen project teams say, "I can get x, I can get y, I'm going to train deep learning on it," but unless you actually have the data, you know,
some of these things make for great demos if you cherry-pick the examples, but it can be challenging to get them to work at scale. I should say, for self-driving cars this debate is still open. I'm cautious about this; I don't think it will necessarily fail, I just think the data needed to do it will be really immense. So I'd be very cautious about it right now, but it might work if you have enough data.

So, one of the themes that comes up in machine learning: if you work on a machine learning project, one thing that will often come up is that you develop a learning system, train it, and maybe it doesn't work as well as you were hoping yet, and the question is, what do you do next? This is a very common part of a machine learning researcher's or machine learning engineer's life: you train a model, it doesn't do what you want it to yet, so what do you do next? This happens to us all the time. And you face a lot of choices: you could collect more data, maybe train longer, maybe try a different neural network architecture, maybe try regularization, maybe a bigger model, maybe buy some more GPUs. You have a lot of decisions, and I think a lot of the skill of a machine learning researcher or machine learning engineer is knowing how to make these decisions. Your skill at picking between, say, training a bigger model versus trying regularization will have a huge impact on how rapidly you can make progress on an actual machine learning problem.

So I want to talk a bit about bias and variance, since that's one of the most basic concepts in machine learning, and I feel like it's evolving slightly in the era of deep learning. As a motivating example, let's say the goal is to build a human-level speech recognition system. What we would typically do, especially in academia, is get a dataset with a lot of examples, shuffle it, and randomly split it: 70/30 train/test, or maybe 70% train, 15% dev, and 15% test. (Some people use the term "validation set"; I'll just use "dev set," short for development set; it means the same thing as validation set.) That's pretty common. And what I would encourage you to do, if you aren't already, is to measure the following things. First, human-level error. Let me illustrate with an example: let's say that human-level error is 1%. Let's say your training set error is 5%. And let's say your dev set error (the dev set is a proxy for the test set, except that you tune to the dev set) is 6%. This is really a basic step in developing a learning algorithm that I encourage you to do if you aren't already: figure out these three numbers, because these three numbers really help tell you what to do next. In this example, you see that you're doing much worse than human-level performance: there's a huge gap from 1% to 5%, and I'm going to call that gap the bias of your learning algorithm. (For the statisticians in the room: I'm using the terms bias and variance informally, and this doesn't correspond exactly to the way they're defined in textbooks, but I find them useful concepts for deciding how to make progress on your problem.) So I'd say that in this example you have a high-bias classifier: try training a bigger model, maybe try training longer; we'll come back to this in a second. For a different example, suppose human-level error is 1% and training-set
error was 2%, and dev set error was 6%. Then you really have a high-variance problem, an overfitting problem, and this really tells you what to try: try adding regularization, or try early stopping, or, even better, get more data. And then there's also a third case: if you have 1% human-level error, 5% training error, and 10% dev set error, then you have high bias and high variance. High bias and high variance, you know, sucks for you.

So I feel like, when I talk to applied machine learning teams, there's one really simple workflow, almost a flow chart, that is enough to help you make a lot of decisions about what you should be doing on your machine learning application. (If you're wondering why I'm talking about this and what it has to do with deep learning, I'll come back to that in a second: does this change in the era of deep learning?) First, ask yourself: is your training error high? (Oh, and I hope I'm writing big enough that people can see; if you have trouble reading it, let me know and I'll read it back out.) Are you even doing well on your training set? If your training error is high, then you have high bias, and so you have the standard tactics: train a bigger model, a bigger neural network; or maybe try training longer, and make sure your optimization algorithm is doing a good enough job; and then there's also this magical one, a new model architecture, which is a hard one. I'll come back to that in a second. You keep doing that until you're doing well at least on your training set. Once your training error is no longer unacceptably high, then ask: is your dev set error high? If the answer is yes, then you have a high-variance problem, an overfitting problem, and the solutions are: try to get more data, or add regularization, or try a new model architecture. And you keep doing this until you're doing well on both your training set and your dev set, and then, hopefully, you're done.

I think one of the nice things about this era of deep learning is that no matter where you're stuck, with modern deep learning tools we have a clear path for making progress, in a way that was not true, or at least was much less true, in the era before deep learning. In particular, no matter whether your problem is overfitting or underfitting, high bias or high variance or maybe both, you always have at least one action you can take: bigger model, or more data. So in the deep learning era, relative to, say, the logistic regression era or the SVM era, it feels like we more often have a way out of whatever problem we're stuck in. And so I feel like these days people talk less about the bias-variance trade-off (you might have heard that term: bias-variance trade-off, underfitting versus overfitting). The reason we talked a lot about that in the past was that a lot of the moves available to us, like tuning regularization, really did trade off bias against variance: it was zero-sum, you could improve one but that made the other worse. But in the era of deep learning, really, one of the reasons I think deep learning has been so powerful is that the coupling between bias and variance can be weaker, and we now have better tools.
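The three-number diagnosis above fits in a few lines of code. A minimal sketch, assuming error rates given as fractions; the 1% gap threshold and the exact advice strings are my own illustrative choices, not from the talk:

```python
def diagnose(human_err, train_err, dev_err):
    """Map the three measured error rates to suggested next steps.

    bias     ~ gap between training error and human-level error
    variance ~ gap between dev set error and training error
    The 0.01 cutoff is an arbitrary illustrative threshold.
    """
    advice = []
    bias = train_err - human_err
    variance = dev_err - train_err
    if bias > 0.01:
        advice.append("high bias: bigger model / train longer / new architecture")
    if variance > 0.01:
        advice.append("high variance: more data / regularization / new architecture")
    return advice or ["looking good: done, hopefully"]

# The three worked examples from the talk (human, train, dev):
print(diagnose(0.01, 0.05, 0.06))   # the high-bias case
print(diagnose(0.01, 0.02, 0.06))   # the high-variance case
print(diagnose(0.01, 0.05, 0.10))   # both at once
```

Note that both branches can fire at once, which is exactly the "high bias and high variance" case above.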
We can reduce bias without increasing variance, or reduce variance without increasing bias. And really the big one is that you can always train a bigger model, a bigger neural network, in a way that was harder when you were training logistic regression and the move was to come up with more and more features; that was just harder to do. (I'm going to add more to this diagram at the bottom in a second.) And by the way, I've been surprised, honestly: this "new model architecture" move is really hard; it takes a lot of experience. But even if you aren't super experienced with a variety of deep learning models, the things in the blue boxes, bigger model and more data, you can often do, and that will drive a lot of progress. If you do have experience with how to tune a convnet versus a ResNet versus whatever, by all means try those things as well; I definitely encourage you to keep mastering those. But this dumb formula of bigger model plus more data is enough to do very well on a lot of problems.

So, bigger models put pressure on systems, which is why we have a high-performance computing team. More data has led to another interesting set of investments. A lot of us have always had this insatiable hunger for data: we use crowdsourcing for labeling, and we try to come up with all sorts of clever ways to get data. One area where I'm seeing more and more activity (it feels a little bit nascent, but I'm seeing a lot of activity) is automatic data synthesis. Here's what I mean. Once upon a time, people used to hand-engineer features, and there was a lot of skill in hand-engineering features like SIFT or HOG to feed into an SVM. Automatic data synthesis is this little area that is small but feels like it's growing, where some hand engineering is needed, but where I'm seeing quite a lot of progress on multiple problems enabled by hand-engineering synthetic data to feed into the giant maw of your neural network. Let me best illustrate it with a couple of examples. One of the easy ones is OCR. Let's say you want to train an optical character recognition system (and actually I've been surprised that this has tons of users; it's actually one of the most useful APIs that we have). If you imagine firing up Microsoft Word, downloading a random picture off the internet, choosing a random Microsoft Word font, choosing a random word from an English dictionary, typing that word into Word in the random font, and pasting it with a transparent background on top of the random image, then you've just synthesized a training example for OCR. This gives you access to essentially unlimited amounts of data. It turns out the simple idea I just described won't work in its naive form; you actually need to do a lot of tuning to blur the synthesized text into the background and to make sure the color contrast matches your training distribution. So I've found that in practice it can be a lot of work to fine-tune how you synthesize data, but I've seen in many verticals that if you do that engineering work (and sadly, it's painful engineering) you can actually get a lot of progress. Tao Wang, who was a student here at Stanford, engineered this for months with very little progress, and then suddenly he got the parameters right, had huge amounts of data, and was able to build one of the best OCR systems in the world at that time.
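The Word-based recipe above amounts to sampling a (word, font, background, position) tuple whose label comes for free. A minimal sketch of just the sampling step; the function name and field layout are my own, and the hard part described above (rendering, blurring, and contrast-matching the text into the background) is deliberately left to a separate, hypothetical render() step:

```python
import random

def sample_ocr_example(words, fonts, backgrounds, rng=None):
    """Pick the ingredients for one synthetic OCR training example:
    a random dictionary word, a random font, a random background image,
    and a random paste position. The ground-truth label is simply the
    word we chose, which is what makes the data "free"."""
    rng = rng or random.Random()
    word = rng.choice(words)
    return {
        "label": word,                              # ground-truth transcript
        "font": rng.choice(fonts),
        "background": rng.choice(backgrounds),
        "position": (rng.random(), rng.random()),   # fractional (x, y) offset
    }

# Synthesize an effectively unlimited stream of labeled examples:
rng = random.Random(0)
batch = [sample_ocr_example(["cat", "kick", "listen"],
                            ["Arial", "Courier"],
                            ["beach.jpg", "street.jpg"], rng)
         for _ in range(3)]
```

The filenames and font names here are placeholders; in a real pipeline they would come from a large font library and a crawl of background images.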
clean relatively noises audio and take random background sounds and just synthesize what that person's voice would sound like in the presence of that background noise right and this turns out to work remarkably well so if you recall a lot of car noise what the inside of your car sounds like and record a lot of clean audio of someone speaking in a quiet environment um the mathematical operation is actually addition it's superposition of sound but you basically add the two waveforms together and then you get an audio clip that sounds like that person talking in the car and you feed this your learning algorithm and so this has a dramatic effect in in terms of amplifying the training set for speech recognition and has a huge effect can have a we found a huge effect on um performance um and then also NLP you know here here's here's one example actually done by some Stanford students which is um using entend deep learning to do grammar correction so input a ungrammatical English sentence you know maybe written by non-native speaker right and can you automatically have a have a I guess attention RNN input an ungrammatical sentence and correct the grammar just edit the sentence for me um and it turns out that you can synthesize huge amounts of this type of data automatically and so that'll be another example where data synthesis um works very well um and oh and I think uh uh video games in RL right really one of the um well let me just games broadly right one of the most powerful um uh applications of RL deep RL these days is video games and I think if you think supervised learning has an insatable hunger for data wait till you work on AO algorithms right I think the the hunger for data is even greater but when you play video games the advantage of that is you can synthesize almost infinite amounts of data to to feed this even greater more right even greater need that our ARS have um so just one note of caution data synthesis has a lot of limits um I'll tell you one other 
story. You know, let's say you want to recognize cars. There are a lot of video games — I need to play more video games; what's a video game with cars in it? Oh, GTA, Grand Theft Auto. So there are a bunch of cars in Grand Theft Auto; why don't we just take pictures of cars from Grand Theft Auto? You can synthesize lots of cars, lots of orientations there, and give that as training data. It turns out that's difficult to do, because to the human perceptual system there might be 20 car models in a game and it looks great to you — you can't tell whether there are 20 car models in the game or a thousand. So there are situations where the synthetic data set looks great to you, because 20 car models in a video game is plenty — it turns out you don't need a hundred different cars for a human to think it looks realistic — but from the perspective of the learning algorithm this is a very impoverished, very poor data set. So I think there's a lot still to be sorted out for data synthesis.

For those of you that work in companies, one practice I would strongly recommend is to have a unified data warehouse. What I mean is that if your engineering teams and your research teams have to go around trying to accumulate data from lots of different organizations in your company, that's just going to be a pain; it's going to be slow. So at Baidu our policy is: it's not your data, it's the company's data, and if it's user data it goes into one user data warehouse. We should have a discussion about user access rights, privacy, and who can access what data, but at Baidu I felt very strongly about this, so we mandate that data needs to come into one logical warehouse — it's physically distributed across lots of data centers, but it should be in one system. What we should discuss is access rights; what we should not discuss is whether or not to bring data together into as unified a data warehouse as possible. And so this is
another practice that I found makes access to the data much smoother and allows teams to drive performance. So really, if your boss asks, tell them that I said to build a unified data warehouse.

So, I want to take the train/test, you know, bias/variance picture and refine it. It turns out this idea of a 70/30 split — train/test or whatever — was common in machine learning in the past, when, frankly, most of us in academia were working on relatively small data sets. I know there used to be this thing called the UC Irvine repository of machine learning data sets — it drove amazing results at the time, but by today's standards it's quite small — and so you'd download a data set, shuffle it, and split it into train, dev, test, whatever. In production machine learning today, it's much more common for your train and your test distributions to come from different distributions, and this creates new problems and new ways of thinking about bias and variance. So let me briefly talk about that.

Here's a concrete example, and this is a real example from Baidu. Baidu built a very effective speech recognition system, and then — actually quite some time back now — we wanted to launch a new product that uses speech recognition: a speech-enabled rearview mirror. So if you have a car that doesn't have a built-in GPS unit — this is a real product in China — we wanted to let you take out your rearview mirror and put in a new AI-powered, speech-enabled rearview mirror, because it's an easier aftermarket installation. You can speak to the rearview mirror and say, "Dear rearview mirror, navigate me to wherever." So this is a real product. So how do you build a speech recognition system for this in-car, speech-enabled rearview mirror?

Here's our status: we have, you know, let's
call it 50,000 hours of speech recognition data from all sorts of places — a lot of data; we bought some, some is user data that we have permission to use — but collected from all sorts of places, not from your in-car rearview mirror scenario. And then our product managers can go around and, through quite a lot of work — for this example let's say — collect 10 more hours of data from exactly the rearview mirror scenario: you install this thing in a car, drive around, talk to it, and collect 10 hours of data from exactly the distribution you want to test on.

So the question is, what do you do now? Do you throw the 50,000 hours of data away because it's not from the right distribution, or can you use it in some way? In the older, pre-deep-learning days, people used to build very separate models: it was more common to build one speech model for the rearview mirror, one model for the maps voice query, one model for search, and so on. In the era of deep learning it's becoming more and more common to just pour all the data into one model and let the model sort it out, and so long as your model is big enough you can usually do this. If you get the details right, you can usually pile all the data into one model and often see gains, and usually not see any losses.

But the question is: given this data set, how do you split it into train/dev/test? So here's one thing you could do, which is call this your training set, this your dev set, and this your test set. It turns out this is a bad idea — I would not do this — and one of the best practices we've derived is: make sure your development set and test set are from the same distribution. I've been finding that this is one of the tips that really boosts the effectiveness of a machine learning team. So in particular I would make the 50,000 hours the training set, and then of my 10
hours, let me expand this a little bit: much smaller data sets — maybe five hours of dev, five hours of test. The reason for this is that your team will be working to tune things on the dev set, and the last thing you want is for them to spend three months working against the dev set and then realize, when they finally test, that the test set is totally different and a lot of work was wasted. To make an analogy: having different dev and test set distributions is a bit like if I tell you, "Hey everyone, let's go north," and then a few hours later, when all of you are in Oakland, I say, "Wait, why are you here? I wanted you to be in San Francisco." And you go, "What? Why did you tell me to go north? Tell me to go to San Francisco!" So I think having the dev and test sets be from the same distribution is one of the ideas that I found really optimizes a team's efficiency, because the development set — which is what your team is going to be tuning its algorithms to — is really the problem specification. If your problem specification tells them to go here, but you actually want them to go there, you're going to waste a lot of effort. So when possible, have dev and test from the same distribution — it isn't always possible, there are some caveats — but when it's reasonable to do so, this really improves the team's efficiency.

And another thing: once you specify the dev set, that's like your problem specification; once you set the test set, that's your problem specification. Your team might go and collect more training data, or change the training set, or synthesize more training data, but you shouldn't change the test set, if the test set is your problem specification.

So in practice, what I actually recommend is splitting the training set as follows: carve off a small part of your training set — let me just say 20 hours of data — to form what I'm going to call the training-dev set, or train-dev
set but that's basically a development set that's from the same distribution as your training Set uh and then you have your depth set and your test set right so these are what you actually from the distribution you actually care about and these you have your training set $50,000 of all sorts of data and maybe we aren't even entirely sure what data this is uh but split off just a small part of this so I guess this is now what 49980 hours and 20 hours um and then here's the generalization of the bias variance concept um actually let me use this board and and but has say the the um the fact that training and test sets don't match is one of the problems that um Academia doesn't study much there's some work on domain adaptation there is some literature on it but it turns out that when you train and test on different distributions you know it it sometimes it's just random is a little bit luck whether you generalize well to a totally different test set so that's made it hard to study systematically which is why I think um Academia has not studied this particular problem as much as I feel it is important for to to those of us building production systems um but there is some work but but not no no no very widely deployed Solutions yet would be would my sense um but so I think our best practice is if if you now generalize what I was describing just now to the following which is um measure human level performance measure your training set performance measure your training death performance measure your death set performance and measure your test set performance right so now you have kind of five numbers so to take an example let's say human level is 1% error um and I'm going to use very obvious examples for illustration if your training set performance is 10% you know and this is 10.1% right uh 10.1% you know 10.2% right in this example then it's quite clear that you have a huge gap between human level performance and training set performance and so you have a huge bias right 
And so you'd use the bias-fixing types of solutions. I find that in machine learning one of the most useful things is to look at the aggregate error of your system — which in this case is your dev set or test set error — and then break it down into components, to figure out how much of the error comes from where, so you know where to focus your attention. So this difference here, human-level to training error, is maybe 9% of bias, which is a lot, so I would work on bias reduction techniques. This gap here, training to train-dev, is really the variance. This gap here, train-dev to dev, is due to your train/test distribution mismatch. And this one, dev to test, is overfitting of the dev set.

So, just to be really concrete, here's an example where you have high train/test mismatch: human-level performance is 1%, your training error is 2%, your train-dev error is 2.1%, and then on your dev set the error suddenly jumps to 10% — sorry, my x-axis doesn't perfectly line up. If there's a huge gap there, then I would say you have a huge train/test mismatch problem.

So at this basic level of analysis, in the recipe for machine learning from before, instead of "dev" I would replace this with "train-dev," and then in the rest of this recipe I would ask: is your dev error high? If yes, then you have a train/test mismatch problem, and there the solution would be to try to get more data that's similar to the test set, or maybe data synthesis or data augmentation — try to tweak your training set to make it look more like your test set. And then there's always the Hail Mary, I guess, which is a new architecture.

And then finally, just to finish this up — there's not that much more — hopefully if
you're done, hopefully your test set error will be good. And if you're doing well on your dev set but not your test set, it means you've overfit your dev set, so just get some more dev set data — actually, I'll just write this, I guess: test set error high? If yes, then get more dev data, and then done. Sorry if this is not too legible; what I wrote here is: if your dev set error is not high but your test set error is high, it means you've overfit your dev set, so get more dev set data.

One of the effects I've seen with bias and variance is that it sounds so simple, but it's actually much more difficult to apply in practice than it sounds when I talk about it or when you read it in text. So, a tip: for a lot of problems, just calculate these numbers, and this can help drive your analysis in terms of deciding what to do. And I find that it takes surprisingly long to really grok, to really understand, bias and variance deeply, but people who understand bias and variance deeply are often able to drive very rapid progress in machine learning applications. I know it's much sexier to show you some cool new network architecture, but this really helps our teams make rapid progress on things.

So there's one thing I kind of snuck in here without making it explicit, which is that in this whole analysis we were benchmarking against human-level performance. That's another trend, another thing that has been different — again, I'm looking across a lot of projects I've seen in many areas and trying to pull out the common trends — I find that comparing to human-level performance is a much more common theme now than several years ago, with, I guess, Andrej being the human-level benchmark for ImageNet. And at Baidu we really do compare our speech system to human-level performance and
try to exceed it, and so on. So why is that? Why is human-level performance such a common theme in applied deep learning? It turns out that if the x-axis is time — as in how long you've been working on a project — and the y-axis is accuracy, and this line is human-level performance, human-level accuracy on some task, you find that for a lot of projects your teams will make rapid progress up until they get to human-level performance; then often they will maybe surpass human-level performance a bit, and then progress often gets much harder after that. This is a common pattern I see on a lot of problems.

There are multiple reasons why this is the case. I'm curious — why do you think this is the case? Any guesses? [Audience: the labels are coming from humans.] Cool, yep, labels coming from humans. Anything else? [Several more audience guesses follow, among them that neural networks are still very far from the human brain and that human capability represents some natural limit, with some back-and-forth between Ng and the audience.] All right, so — I think those are all lots of great answers, and there are several good reasons for this type of effect. One of them is that for a lot of problems there is some theoretical limit of performance: some fraction of the data is just noisy. In speech recognition a lot of audio clips are just noisy — someone picked up a phone and they're at a rock concert or something, and it's just impossible to figure out what on earth they were saying.
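The "theoretical limit" point can be shown with a toy simulation (my own illustration, not from the talk): if some fraction of examples carry hopelessly noisy labels, then even a model that always predicts the true label is scored wrong at roughly that fraction — an error floor no amount of training can cross.

```python
import random

random.seed(0)

def measured_error_of_perfect_model(n=100_000, noise_rate=0.05):
    """Score a hypothetical perfect model against noisy labels.

    The model always predicts the true label, but the recorded
    (annotator) label is flipped with probability `noise_rate`,
    so the measured error floors near noise_rate.
    """
    wrong = 0
    for _ in range(n):
        true_label = random.random() < 0.5
        # with probability noise_rate the recorded label is flipped
        noisy_label = (not true_label) if random.random() < noise_rate else true_label
        wrong += (true_label != noisy_label)
    return wrong / n

err = measured_error_of_perfect_model()  # close to 0.05, never 0.0
```

So if 5% of your audio clips are genuinely indecipherable, a measured error near 5% may already be at the limit, and the human-level benchmark is a practical proxy for where that limit sits.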
Or some images, you know, are just