Transcript
u6aEYuemt0M • Deep Learning for Computer Vision (Andrej Karpathy, OpenAI)
Yeah, so thank you very much for the introduction. Today I'll speak about deep learning, especially in the context of computer vision. What you saw in the previous talk is neural networks: networks organized into layers, fully connected layers, where neurons within one layer are not connected to each other but are connected fully to all the neurons in the previous layer. We saw that we have this layer-wise structure from input to output, with neurons and nonlinearities and so on.

Now, so far we have not made too many assumptions about the inputs. In particular, we just assumed that an input is some kind of a vector of numbers that we plug into the neural network. That's both a bug and a feature to some extent, because in most real-world applications we can actually make assumptions about the input that make learning much more efficient. In particular, we usually don't just want to plug plain vectors of numbers into neural networks; the inputs actually have some kind of structure. Instead of arbitrary vectors, the numbers are arranged in some kind of layout, like an n-dimensional array. For example, spectrograms are two-dimensional arrays of numbers, images are three-dimensional arrays of numbers, videos would be four-dimensional arrays, and text you could treat as a one-dimensional array. Whenever you have this kind of local connectivity structure in your data, you'd like to take advantage of it, and convolutional neural networks allow you to do that.

Before I dive into convolutional neural networks and all the details of the architectures, I'd like to briefly talk about the history of how this field evolved over time. I like to start off with Hubel and Wiesel and the experiments they performed in the 1960s. They were trying to study the computations that happen in the early visual cortex areas of a cat. So they had a cat, they plugged in electrodes that could record from different neurons, and they showed the cat different patterns of light. They were effectively trying to debug the neurons: show them different patterns and see what they responded to. A lot of these experiments inspired some of the modeling that came afterwards. In particular, one of the early models that tried to take advantage of the results of these experiments was the Neocognitron from Fukushima in the 1980s. That was an architecture that, again, is layer-wise, similar to what you see in the cortex, with simple and complex cells: the simple cells detect small things in the visual field, there is a local connectivity pattern, and the simple and complex cells alternate in this layered architecture throughout. This looks a bit like a convnet because it has some of the same features, like the local connectivity, but at the time it was not trained with backpropagation; the updates were specific, heuristically chosen rules, and it was unsupervised learning back then. The first time backpropagation was actually used to train some of these networks was in the experiments of Yann LeCun in the 1990s.
And so this is an example of one of the networks developed back then, in the 1990s, by Yann LeCun: LeNet-5. This is what you would recognize today as a convolutional neural network. It has convolutional layers alternating with subsampling, and it's a similar kind of design to what you would see in Fukushima's Neocognitron, but this was actually trained with backpropagation, end to end, using supervised learning.

Now, this happened in roughly the 1990s, and we're here in 2016, basically about 20 years later. Computer vision has for a long time worked on larger images, and a lot of these models back then were applied to very small settings, like recognizing digits and zip codes, and they were very successful in those domains. But at least when I entered computer vision, in roughly 2011, a lot of people were aware of these models, but it was thought that they would not scale up naively to large, complex images, and that they would be constrained to these toy tasks for a long time. Or I shouldn't say toy, because these were very important tasks, but certainly smaller visual recognition problems. In computer vision around 2011 it was much more common to use feature-based approaches, and they didn't actually work that well. When I entered my PhD in 2011, working on computer vision, you would run a state-of-the-art object detector on an image and you might get something like this, where cars were detected in trees, and you would kind of just shrug your shoulders and say, "Well, that just happens sometimes." You kind of just accepted it as something that would happen. Of course this is a caricature; things were actually relatively decent, I should say, but there were definitely many mistakes that you would not see today, five years later in 2016.

A lot of computer vision looked much more like this: when you looked at a paper that tried to do image classification, you would find a section on the features that they used. So this is one page of features (they would use GIST and so on), then a second page of features and all their hyperparameters, all kinds of different histograms. You would extract this kitchen sink of features, and then there's a third page. You end up with a very large, complex codebase, because some of these feature types are implemented in MATLAB, some in Python, some in C++. So you have this large codebase for extracting all these features, caching them, and then eventually plugging them into linear classifiers to do some kind of visual recognition task. It was quite unwieldy, but it worked to some extent; there was definitely room for improvement.

A lot of this changed in computer vision in 2012 with the paper from Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. This is the first time someone took a convolutional neural network very similar to the one you saw from 1998 from Yann LeCun (I'll go into the details of how they differ exactly), scaled it up, made it much bigger, and trained it on a much bigger dataset on GPUs, and things basically ended up working extremely well. This is the first time the computer vision community really noticed these models and adopted them to work on larger images.
So we saw that the performance of these models has improved drastically. Here we are looking at the ImageNet ILSVRC visual recognition challenge over the years, and we're looking at the top-5 error, so low is good. You can see that in the beginning, from 2010, these were feature-based methods; then in 2012 we had this huge jump in performance, and that was due to the first convolutional neural network entry in 2012. We've managed to push that down over time, and now we're at about 3.57%. I think the results for the ImageNet 2016 challenge are actually due to come out today, but I don't think they've come out yet. I have this second tab here opened; I was waiting for the result, but I don't think it's up yet. Okay, no, nothing. All right, well, we'll get to find out very soon what happens there, so I'm very excited to see that.

Just to put this in context, because you're just looking at numbers like 3.57%: how good is that? That's actually really, really good. Something I did about two years ago now is that I tried to measure human accuracy on this dataset. What I did for that is I developed a web interface where I would show myself ImageNet images from the test set, and then I had this interface where I would have all the different classes of ImageNet, there are 1,000 of them, with some example images. Basically you go down this list, you scroll for a long time, and you find which class you think that image might be. And then I competed against the convnet at the time, which was GoogLeNet in 2014. So, hot dog is a very simple class; you can do that quite easily. But why isn't the error 0%? Some of the things like hot dog seem very easy, so why isn't it trivial for humans? Well, it turns out that some of the images in the test set of ImageNet are actually mislabeled. But also, some of the images are just very difficult to guess. In particular, if you have this terrier, there are 50 different types of terriers, and it turns out to be a very difficult task to find exactly which type of terrier that is. You can spend minutes trying to find it. It turns out that convolutional neural networks are actually extremely good at this, and so this is where I would lose points compared to the convnet. So I estimate that human error based on this is roughly in the 2 to 5% range, depending on how much time you have, how much expertise you have, how many people you involve, and how much they really want to do this, which is not too much. So really we're doing extremely well; we're down to about 3%. And I think the error rate there, if I remember correctly, was about 1.5%, so if we get below 1.5% on ImageNet I would be extremely suspicious; that would seem wrong.

So to summarize, basically what we've done is this: before 2012, computer vision looked somewhat like this, where we had these feature extractors and then we trained only a small portion at the end, after the feature extraction step. So we only trained this last piece on top of features that were fixed. We've basically replaced the feature extraction step with a single convolutional neural network, and now we train everything completely end to end. This turns out to work quite nicely. I'm going to go into details of how this works in a bit. Also, in terms of code complexity, we kind of went from a setup that looks... whoops, I'm way ahead. Okay.
We went from a setup that looked something like that in papers to something like: instead of extracting all these things, we just say apply 20 layers of 3x3 convolutions, or something like that, and things work quite well. This is of course an over-exaggeration, but I think it's a correct first-order statement to make: we've definitely reduced code complexity quite a lot, because these architectures are so homogeneous compared to what we had before.

So it's also remarkable that we had this reduction in complexity and this amazing performance on ImageNet. One other thing that was quite amazing about the results in 2012, and this is a separate thing that did not have to be the case, is that the features you learn by training on ImageNet turn out to be quite generic and you can apply them in different settings. In other words, this transfer learning works extremely well. Of course, I haven't gone into the details of convolutional networks yet, but we start with an image and we have a sequence of layers, just like in a normal neural network, and at the end we have a classifier. When you pre-train this network on ImageNet, it turns out that the features you learn in the middle are actually transferable: you can use them on different datasets and this works extremely well. And that didn't have to be the case. You might imagine a convolutional network that works extremely well on ImageNet, but when you try to run it on something else, like a birds dataset, it might just not work well. That is not the case, and that's a very interesting finding in my opinion. People noticed this back in roughly 2013, after the first convolutional networks. They noticed that you can take many computer vision datasets (it used to be that you would compete on each of these separately and maybe design features for each of them separately) and shortcut all of those steps: you just take these pre-trained features that you get from ImageNet and train a linear classifier on every single dataset on top of those features, and you obtain many state-of-the-art results across many different datasets. So this was quite a remarkable finding back then, I believe.

So things worked very well on ImageNet, things transferred very well, and the code complexity got much more manageable. All this power is now actually available to you with very few lines of code. If you want to just use a convolutional network on images, it turns out to be only a few lines of code if you use, for example, Keras, which is one of the deep learning libraries that I'm going to mention again later in the talk. Basically, you just load a state-of-the-art convolutional neural network, you take an image, you load it, and you compute your predictions, and it tells you that this is an African elephant inside that image. And this takes a couple hundred milliseconds, or a couple of tens of milliseconds if you have a GPU. So everything got much faster, much simpler, works really well, and transfers really well. This was really a huge advance in computer vision.
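To give you a concrete flavor of what "a few lines of code" means here, the following is a minimal sketch of that kind of prediction script using the Keras applications API. The specific model (VGG16), the filename, and the exact import paths are assumptions for illustration rather than the exact snippet shown on the slide, and the details vary a bit across Keras versions.

```python
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image

# Load a state-of-the-art convnet with weights pre-trained on ImageNet
model = VGG16(weights='imagenet')

# Load an image (hypothetical filename) and resize it to the network's input size
img = image.load_img('elephant.jpg', target_size=(224, 224))
x = np.expand_dims(image.img_to_array(img), axis=0)
x = preprocess_input(x)

# Compute class predictions and decode them into human-readable labels
preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])   # e.g. African elephant with high probability
```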
So, as a result of all these nice properties, convnets today are everywhere. Here is a collection of some of the things I tried to find across different applications. For example, you can search Google Photos for different types of categories, like in this case Rubik's cubes. You can find house numbers very efficiently. This is of course very relevant in self-driving cars, where we're doing perception in the cars; convolutional networks are very relevant there. Medical image diagnosis, recognizing Chinese characters, doing all kinds of medical segmentation tasks, quite random tasks like whale recognition, and more generally many Kaggle challenges: satellite image analysis, recognizing different types of galaxies. You may have seen recently that WaveNet from DeepMind, also a very interesting paper, generates music and speech. That's a generative model, and it's also just a convnet doing most of the heavy lifting there; it's a convolutional network on top of sound. There are other tasks like image captioning, and in the context of reinforcement learning and agent-environment interactions we've also seen a lot of advances using convnets as the core computational building block. So when you want to play Atari games, or AlphaGo, or Doom, or StarCraft, or if you want to get robots to perform interesting manipulation tasks, all of this uses convnets as a core computational block to do very impressive things.

Not only are we using convnets for a lot of different applications, we're also finding uses in art. Here are some examples from DeepDream. You can basically simulate what it looks like, what it feels like maybe, to be on some drugs: you can take images and just hallucinate features using convnets. Or you might be familiar with neural style, which allows you to take arbitrary images and transfer arbitrary styles of different paintings, like Van Gogh, on top of them. This is all using convolutional networks.

The last thing I'd like to note, which I also find interesting, is that in the process of trying to develop better computer vision architectures and basically optimizing for performance on the ImageNet challenge, we've actually ended up converging to something that potentially might function somewhat like your visual cortex in some ways. These are some experiments I find interesting, where they studied macaque monkeys and recorded from a subpopulation of the IT cortex, the part that does a lot of object recognition. So basically they take a monkey and they take a convnet, they show them images, and then you look at how those images are represented at the end of the network, either inside the monkey's brain or at the top of your convolutional network. You look at the representations of different images, and it turns out there's a mapping between those two spaces that seems to indicate, to some extent, that some of the things we're doing have somehow ended up converging to something the brain could be doing as well in the visual cortex.

So that's just some intro. I'm now going to dive into convolutional networks and try to explain briefly how these networks work. Of course, there's an entire class on this that I taught, a convolutional networks class, so I'm going to distill some of those 13 lectures into one lecture. We'll see how that goes; I won't cover everything, of course. Okay. So a convolutional neural network is really just a single function: a function from the raw pixels of some kind of an image. So we take a 224x224x3 image.
The three here is for the color channels, RGB. You take the raw pixels, you put them through this function, and you get 1,000 numbers at the end; in the case of image classification, you're trying to categorize images into 1,000 different classes. And really, functionally, all that's happening in a convolutional network is dot products and max operations. That's everything. But they're wired up together in interesting ways so that you are basically doing visual recognition. In particular, this function f has a lot of knobs in it. These Ws here, which participate in the dot products, in the convolutions, in the fully connected layers and so on, are all parameters of the network. Normally you might have on the order of 10 million parameters, and those are basically the knobs that change this function. We'd like to change those knobs, of course, so that when you put images through the function you get probabilities that are consistent with your training data. So that gives us a lot to tune, and it turns out we can do that tuning automatically with backpropagation, through that search process.

Now, more concretely, a convolutional neural network is made up of a sequence of layers, just as in the case of normal neural networks, but we have different types of layers that we play with: convolutional layers; here I'm using the rectified linear unit, ReLU for short, as the nonlinearity, and I'm making it an explicit layer of its own; pooling layers; and fully connected layers. The core computational building block of a convolutional network, though, is the convolutional layer, with nonlinearities interspersed. We are probably getting rid of things like pooling layers, so you might see them slowly going away over time, and fully connected layers are basically equivalent to convolutional layers as well. So really, in the simplest case, it's just a sequence of conv layers.

Let me explain the convolutional layer, because that's the core computational building block that does all the heavy lifting. The entire convnet is this collection of layers, and these layers don't operate over vectors; they don't transform vectors like a normal neural network, they operate over volumes. A layer takes a volume, a three-dimensional volume of numbers, an array. In this case, for example, we have a 32x32x3 image. Those three dimensions are the width, the height, and what I'll refer to as the depth; we have three channels. That's not to be confused with the depth of a network, which is the number of layers in the network; this is just the depth of a volume. This convolutional layer accepts a three-dimensional volume and produces a three-dimensional volume using some weights.

The way it actually produces the output volume is as follows. We're going to have these filters in a convolutional layer. The filters are always small spatially, say for example 5x5, but their depth always extends through the full depth of the input volume. So since the input volume has three channels, a depth of three, our filters will always match that number: we have a depth of three in our filters as well. And then we can take those filters and convolve them with the input volume. What that amounts to is, we take the filter; oh, and again, the point is just that the channels here must match.
We take that filter and we slide it through all the spatial positions of the input volume, and along the way, as we're sliding this filter, we're computing dot products: w transpose x plus b, where w is the filter, x is a small piece of the input volume, and b is the bias. That's basically the convolution operation: you take the filter, you slide it through all spatial positions, and you compute dot products. When you do this, you end up with an activation map. In this case we get a 28x28 activation map; 28 comes from the fact that there are 28 unique positions to place this 5x5 filter along each dimension of the 32x32 input (32 minus 5 plus 1 is 28). So there are 28 by 28 unique positions you can place that filter in, and in every one of those you get a single number saying how well that filter likes that part of the input. That carves out a single activation map.

Now, in a convolutional layer we don't have just a single filter; we're going to have an entire set of filters. So here's another filter, a green filter, and we slide it through the input volume. It has its own parameters; there are 75 numbers that make up a filter (5x5x3), and this is a different set of 75 numbers. We convolve it through, get a new activation map, and we continue doing this for all the filters in that convolutional layer. So for example, if we had six filters in this convolutional layer, we'd end up with six 28x28 activation maps, and we stack them along the depth dimension to arrive at the output volume of 28x28x6. Really, what we've done is re-represent the original image, which is 32x32x3, as a kind of new image that is 28x28x6, where this new image basically has six channels that tell you how well every filter matches, or likes, every part of the input image.

So let's compare this operation to using a fully connected layer, as you would in a normal neural network. In particular, we saw that we processed a 32x32x3 volume into a 28x28x6 volume. One question you might want to ask is: how many parameters would this require if we wanted a fully connected layer with the same number of output neurons, that is, 28 times 28 times 6 neurons, each fully connected? It turns out that would be quite a few parameters, right? Because every single neuron in the output volume would be fully connected to all of the 32x32x3 numbers. So every one of those 28x28x6 neurons connected to all 32x32x3 inputs turns out to be about 15 million parameters, and also on that order of multiplies. So you'd be doing a lot of compute and introducing a huge number of parameters into your network.

Now, since we're doing convolution instead, think about the number of parameters we've introduced with this example convolutional layer. We had six filters, and every one of them was a 5x5x3 filter. So we just have six 5x5x3 filters; if you multiply that out, we have 450 parameters (here I'm not counting the biases, just the raw weights). So compared to 15 million, we've introduced very few parameters. Also, how many multiplies have we done? Computationally, how many flops are we doing? Well, we have 28 by 28 by 6 outputs to produce, and every one of these numbers is a function of a 5x5x3 region in the original image.
So basically we have 28 times 28 times 6 outputs, and every one of them is computed by doing 5 times 5 times 3 multiplies, so you end up with only on the order of 350,000 multiplies. We've gone from 15 million down to quite a bit less. So we're doing fewer flops and we're using fewer parameters. And really, what we've done here is we've made assumptions. A fully connected layer could compute the exact same thing: a specific setting of those 15 million parameters would produce the exact output of this convolutional layer. But we've done it much more efficiently, and we've done that by introducing these inductive biases, these assumptions.

In particular, since we have these fixed filters that we slide across space, we've assumed that if there's some interesting feature that you'd like to detect in one part of the image, say the top left, then that feature will also be useful somewhere else, like the bottom right, because we fix these filters and apply them at all spatial positions equally. You might notice that this is not always something you want. For example, if your inputs are centered face images and you're doing some kind of face recognition, you might expect that you'd want different filters at different spatial positions: for eye regions you might want eye-like filters, and for the mouth region you might want mouth-specific features, and so on. In that case you might not want to use a convolutional layer, because these features have to be shared across all spatial positions. The second assumption we've made is that these filters are small and local, so we don't have global connectivity; we have this local connectivity. But that's okay, because we end up stacking these convolutional layers in sequence, and the neurons deeper in the convnet grow their receptive field as you stack convolutional layers on top of each other. So at the end of the convnet, those neurons end up being a function of the entire image eventually.

Just to give you an idea of what these activation maps look like concretely, here's an example of an image on the top left; this is part of a car, I believe. And we have 32 different small filters here. If we convolve these filters with this image, we end up with these activation maps. So this filter, if you convolve it, gives you this activation map, and so on. This one, for example, has some orange stuff in it, so when we convolve it with this image, the white here denotes the fact that that filter matches that part of the image quite well. So we get these activation maps, you stack them up, and then that goes into the next convolutional layer. So the way this looks is that we process the image with some kind of convolutional layer, we get some output, we apply a rectified linear unit, some kind of nonlinearity as normal, and then we just repeat that operation. We keep plugging these volumes into the next convolutional layer, and they plug into each other in sequence, and we end up processing the image over time. So that's the convolutional layer.
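To make the sliding dot product picture concrete, here is a minimal NumPy sketch of the forward pass of one convolutional layer, using the numbers from this running example (a 32x32x3 input, six 5x5x3 filters, stride 1, no padding). Real libraries implement this with heavily optimized routines, so treat this purely as an illustration of the arithmetic.

```python
import numpy as np

def conv_forward(x, filters, biases):
    """x: (32, 32, 3) input volume; filters: (num_filters, 5, 5, 3); biases: (num_filters,)."""
    H, W, _ = x.shape
    num_filters, fh, fw, _ = filters.shape
    out_h, out_w = H - fh + 1, W - fw + 1              # 32 - 5 + 1 = 28
    out = np.zeros((out_h, out_w, num_filters))
    for f in range(num_filters):
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i:i + fh, j:j + fw, :]        # a 5x5x3 piece of the input
                out[i, j, f] = np.sum(patch * filters[f]) + biases[f]   # w^T x + b
    return out

x = np.random.randn(32, 32, 3)
filters = np.random.randn(6, 5, 5, 3)                  # six 5x5x3 filters
out = conv_forward(x, filters, np.zeros(6))
print(out.shape)                                        # (28, 28, 6)
print("parameters:", filters.size)                      # 6 * 5*5*3 = 450 (ignoring biases)
print("multiplies:", out.size * 5 * 5 * 3)              # 28*28*6 * 75 = 352,800
```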
Now, you'll notice that there are a few more layer types. In particular, let me explain the pooling layer very briefly. The pooling layer is quite simple. If you've used Photoshop or something like that, you've taken a large image and resized it, you've downsampled the image. Pooling layers do basically exactly that, but they do it on every single channel independently. For every one of the channels in the input volume independently, we pluck out that activation map, we downsample it, and that becomes a channel in the output volume. So it's really just a downsampling operation on these volumes. One of the common ways of doing this, in the context of neural networks especially, is the max pooling operation. In this case it would be common to use, for example, 2x2 filters with a stride of two, and do a max operation. So if this is an input channel in a volume, what that amounts to is we chop it up into these 2x2 regions and we take a max over each group of four numbers to produce one piece of the output. This is a very cheap operation that downsamples your volumes. It's really a way to control the capacity of the network: you don't want too many numbers, you don't want things to be too computationally expensive. A pooling layer lets you downsample your volumes, you end up doing less computation, and it turns out not to hurt the performance too much. So we use it basically as a way of controlling the capacity of these networks.

The last layer I want to briefly mention, of course, is the fully connected layer, which is exactly what you're already familiar with. We have these volumes throughout, as we've processed the image; at the end you're left with this volume, and now you'd like to predict some classes. So what we do is we just take that volume, we stretch it out into a single column, and then we apply a fully connected layer, which really amounts to just a matrix multiplication. That gives us probabilities after applying a softmax or something like that.
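As a small sketch of the two operations just described, here is 2x2 max pooling with stride 2 applied to each channel independently, followed by stretching the resulting volume into a column for a fully connected layer. The 28x28x6 volume and the 10 output classes are just the running example's numbers; this assumes NumPy and is purely illustrative.

```python
import numpy as np

def max_pool_2x2(channel):
    """2x2 max pooling with stride 2 on a single activation map (one channel)."""
    H, W = channel.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            out[i // 2, j // 2] = channel[i:i + 2, j:j + 2].max()   # max over 4 numbers
    return out

volume = np.random.randn(28, 28, 6)
pooled = np.stack([max_pool_2x2(volume[:, :, c]) for c in range(6)], axis=-1)
print(pooled.shape)                      # (14, 14, 6): each channel downsampled independently

# Fully connected layer at the end: stretch the volume into a column and matrix-multiply
column = pooled.reshape(-1)              # 14*14*6 = 1176 numbers
W_fc = np.random.randn(10, column.size) * 0.01
scores = W_fc @ column                   # 10 class scores, then a softmax gives probabilities
```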
Let me now show you briefly a demo of what a convolutional network looks like. This is ConvNetJS, a deep learning library for training convolutional neural networks that is implemented in JavaScript; I wrote it maybe two years ago at this point. What we're doing here is training a convolutional network on the CIFAR-10 dataset. CIFAR-10 is a dataset of 50,000 images; each image is 32x32x3 and there are 10 different classes. So here we are training this network in the browser, and you can see that the loss is decreasing, which means we're classifying these inputs better. Here's the network specification, which you can play with, because this is all running in the browser, so you can just change it and experiment. This is an input image, and for this convolutional network I'm showing all the intermediate activations, all the intermediate activation maps that we're producing. Here we have a set of filters; we're convolving them with the image and getting all these activation maps. I'm also showing the gradients, but I don't want to dwell on that too much. Then you threshold: the ReLU clamps anything below zero to zero, and then you pool. That's just a downsampling operation, and then another convolution, ReLU, pool, conv, pool, and so on, until at the end we have a fully connected layer and then a softmax, so that we get probabilities out, and then we apply a loss to those probabilities and backpropagate. And here we see that I've been training in this tab for the last maybe 30 seconds or a minute, and we're already getting about 30% accuracy on CIFAR-10. These are test images from CIFAR-10 and these are the outputs of the convolutional network, and you can see that it has already learned that this is a car, or something like that. So this trains pretty quickly in JavaScript; you can play with it, change the architecture, and so on.

Another thing I'd like to show you is this video, because it gives you again a very intuitive, visceral feeling of exactly what this is computing. There's a very good video by Jason Yosinski; I'm going to play it in a bit. This is the Deep Visualization Toolbox, so you can download the code and play with it; it's an interactive convolutional network demo. The narration begins: neural networks have enabled computers to better see and understand the world. They can recognize school buses; in the top left corner we show the input to this popular network. So what we're seeing here are activation maps, shown in real time as this demo is running. These are for the conv1 layer of an AlexNet, which we're going to go into in much more detail; these are the different activation maps being produced at this point. The narration continues: a neural network called AlexNet, running in Caffe. By interacting with the network, we can see what some of the neurons are doing. For example, on this first layer, a unit in the center responds strongly to light-to-dark edges. Its neighbor, one neuron over, responds to edges in the opposite direction, dark to light. Using optimization, we can synthetically produce images that light up each neuron on this layer, to see what each neuron is looking for. We can scroll through every layer in the network to see what it does, including convolution, pooling, and normalization layers. We can switch back and forth between showing the actual activations and showing images synthesized to produce high activation. By the time we get to the fifth convolutional layer, the features being computed represent abstract concepts. For example, this neuron seems to respond to faces. We can further investigate this neuron by showing a few different types of information. First, we can artificially create optimized images using new regularization techniques that are described in our paper. These synthetic images show that this neuron fires in response to a face and shoulders. We can also plot the images from the training set that activate this neuron the most, as well as the pixels from those images most responsible for the high activations, computed via the deconvolution technique. This feature responds to multiple faces in different locations, and by looking at the deconv we can see that it would respond more strongly if we had even darker eyes and rosier lips. We can also confirm that it cares about the head and shoulders but ignores the arms and torso. We can even see that it fires to some extent for cat faces. Using backprop or deconv, we can see that this unit depends most strongly on a couple of units in the previous layer, conv4, and on about a dozen or so in conv3. Now let's look at another neuron on this layer. So what's this unit doing?
From the top nine images, we might conclude that it fires for different types of clothing. But examining the synthetic images shows that it may be detecting not clothing per se, but wrinkles. In the live plot, we can see that it's activated by my shirt, and smoothing out half of my shirt causes that half of the activations to decrease. Finally, here's another interesting neuron. This one has learned to look for printed text in a variety of sizes, colors, and fonts. This is pretty cool, because we never asked the network to look for wrinkles or text or faces. The only labels we provided were at the very last layer, so the only reason the network learned features like text and faces in the middle was to support final decisions at that last layer. For example, the text detector may provide good evidence that a rectangle is in fact a book seen on edge, and detecting many books next to each other might be a good way of detecting a bookcase, which was one of the categories we trained the net to recognize. In this video, we've shown some of the features of the Deep Viz Toolbox.

Okay, so I encourage you to play with that; it's really fun. I hope that gives you an idea of exactly what's going on: there are these convolutional layers, we downsample them from time to time, there are usually some fully connected layers at the end, but mostly it's just these convolutional operations stacked on top of each other.

What I'd like to do now is dive into some details of how these architectures are actually put together. The way I'll do this is I'll go over the winners of the ImageNet challenges and tell you about the architectures, how they came about, and how they differ, so you'll get a concrete idea of what these architectures look like in practice. We'll start off with the AlexNet in 2012. Just to give you an idea of the sizes of these networks and the images they process, AlexNet took 227x227x3 images. The first layer of an AlexNet, for example, was a convolutional layer with 11x11 filters applied with a stride of four, and there were 96 of them. I didn't fully explain stride because I wanted to save some time, but intuitively it just means that as you slide the filter across the input you don't have to slide it one pixel at a time; you can jump a few pixels at a time. So we have 11x11 filters with a stride, a skip, of four, and we have 96 of them. You can try to compute, for example, what the output volume is if you apply this convolutional layer to that input volume. I didn't go into the details of how you compute that, but there are formulas for it (you can look at the details in the class, and the arithmetic is sketched just below), and you arrive at a 55x55x96 volume as output. For the total number of parameters in this layer: we have 96 filters, and every one of them is 11x11x3, because three is the input depth of these images. So it basically amounts to 11 times 11 times 3, and you have 96 such filters, so about 35,000 parameters in this very first layer. Then the second layer of an AlexNet is a pooling layer: we apply 3x3 filters at a stride of two, and they do max pooling. You can again compute the output volume size after applying this, and if you do some very simple arithmetic you arrive at 27x27x96. So this is the downsampling operation. You can also think about what the number of parameters in this pooling layer is, and of course it's zero.
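For reference, the output-size arithmetic being alluded to is the standard formula (W - F + 2P) / S + 1, where W is the input width or height, F the filter size, P the padding, and S the stride. Here is a small sketch applying it to the two AlexNet layers above; zero padding is assumed for both layers, which is what makes the numbers come out to 55 and 27.

```python
def output_size(w, f, stride, pad=0):
    """Spatial output size of a conv or pool layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * pad) // stride + 1

# AlexNet conv1: 227x227x3 input, 96 filters of size 11x11x3, stride 4
conv1 = output_size(227, 11, stride=4)        # (227 - 11) / 4 + 1 = 55
conv1_params = 96 * 11 * 11 * 3               # 34,848, i.e. about 35K (plus 96 biases)

# AlexNet pool1: 3x3 max pooling, stride 2, applied to the 55x55x96 volume
pool1 = output_size(55, 3, stride=2)          # (55 - 3) / 2 + 1 = 27
pool1_params = 0                               # pooling computes a fixed function

print(conv1, conv1_params, pool1)              # 55 34848 27
```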
So pooling layers compute a fixed function, a fixed downsampling operation; there are no parameters involved in a pooling layer. All the parameters are in the convolutional layers and the fully connected layers, which are to some extent equivalent to convolutional layers. You can go ahead and, based on the description in the paper (although I think it's non-trivial from the description in this particular paper), decipher what the volumes are throughout, and you can look at the kinds of patterns that emerge in terms of how the number of filters increases in the higher convolutional layers. So we started off with 96, then we go to 256 filters, then to 384, and eventually 4096-unit fully connected layers. You'll also see normalization layers here, which have since become somewhat deprecated; it's not very common anymore to use the normalization layers that were used at the time in the AlexNet architecture.

What's interesting to note is how this differs from the 1998 Yann LeCun network. I usually like to think about four things that hold back progress, at least in deep learning: data, compute, algorithms, and infrastructure (I like to differentiate between algorithms, which feel like research, and infrastructure, which feels like a lot of engineering), and we've had progress on all four fronts. We see that in 1998 the data you could get hold of would maybe be on the order of a few thousand examples, whereas now we have a few million, so three orders of magnitude more data. For compute, GPUs have become available and we use them to train these networks; they are roughly 20 times faster than CPUs, and of course the CPUs we have today are much, much faster than the CPUs they had back in 1998. I don't know exactly what that works out to, but I wouldn't be surprised if it's again on the order of three orders of magnitude of improvement. I'd actually like to skip over algorithms for a moment and talk about infrastructure: here we're talking about NVIDIA releasing the CUDA library, which allows you to efficiently run all these matrix and vector operations on arrays of numbers. That's a piece of software that we rely on and take advantage of that wasn't available before. And finally, algorithms is kind of an interesting one, because in those 20 years there's been much less improvement in algorithms than in the other three pieces. In particular, what we've done relative to the 1998 network is we've made it bigger: you have more channels, and you have a few more layers. The two really new things algorithmically are dropout and rectified linear units. Dropout is a regularization technique developed by Geoff Hinton and colleagues, and rectified linear units are nonlinearities that train much faster than sigmoids and tanhs; the paper actually had a plot showing that rectified linear units trained quite a bit faster than sigmoids. Intuitively, that's because of the vanishing gradient problem: with very deep networks and sigmoids, the gradients vanish, as Hugo was talking about in the last lecture. What's also interesting to note, by the way, is that both dropout and ReLU are basically one- or two-line code changes, so it's about a two-line diff in total over those 20 years. And both of them consist of setting things to zero: with the ReLU you set things to zero when they're below zero, and with dropout you set things to zero at random. So apparently it's a good idea to set things to zero; that's what we've learned. So if you try to find a new cool algorithm, look for one-line diffs that set something to zero. It will probably work better, and we could add you to this list.
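To make the "one-line diffs that set things to zero" point concrete, here is roughly what those two operations look like in NumPy-style code. This is a sketch, not the exact formulation from the original papers; in particular, the dropout version shown uses the common "inverted dropout" scaling so that nothing needs to change at test time.

```python
import numpy as np

def relu(x):
    # ReLU: set activations to zero wherever they are below zero
    return np.maximum(0, x)

def dropout(x, p=0.5, train=True):
    # Dropout: set activations to zero at random during training,
    # scaling the survivors so the expected value stays the same
    if not train:
        return x
    mask = (np.random.rand(*x.shape) > p) / (1.0 - p)
    return x * mask
```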
Now, comparing again and giving you an idea of the hyperparameters in this architecture: it was the first use of rectified linear units, which we hadn't seen as much before. The network used normalization layers, which are not used anymore, at least not in the specific way they were used in this paper. They used heavy data augmentation: you don't just pipe these images into the network exactly as they come from the dataset, you jitter them spatially, you warp them, you change the colors a bit, and you do this randomly, because you're trying to build in some invariance to these small perturbations, and you're basically hallucinating additional data. It was also the first real use of dropout. And you see roughly standard hyperparameters: batch sizes of roughly 128, stochastic gradient descent with momentum, usually 0.9, learning rates of 1e-2 that you reduce in the normal ways, roughly by a factor of 10 whenever the validation error stops improving, and just a bit of weight decay, 5e-4. And ensembling always helps: you train seven independent convolutional networks separately and then just average their predictions, which always gives you an additional roughly 2% improvement. So this is AlexNet, the winner of 2012.

In 2013 the winner was the ZFNet, developed by Matthew Zeiler and Rob Fergus, and this was an improvement on top of the AlexNet architecture. In particular, one of the bigger differences was in the first convolutional layer: they went from 11x11 filters with stride 4 to 7x7 filters with stride 2, so slightly smaller filters applied more densely. They also noticed that if you make the convolutional layers in the middle larger, if you scale them up, you actually gain performance. So they managed to improve a bit. Matthew Zeiler then became the founder of Clarifai, and he worked on this a bit more inside Clarifai and managed to push the performance to 11%, which was the winning entry at the time. But we don't actually know what gets you from 14% to 11%, because Matthew never disclosed the full details of what happened there; he did say it was more tweaking of these hyperparameters and optimizing things a bit. So that was the 2013 winner.

In 2014 we saw a slightly bigger diff. One of the networks introduced then was the VGGNet, from Karen Simonyan and Andrew Zisserman. They explored a few architectures, and the one that ended up working best was this D column, which is why I'm highlighting it. What's beautiful about VGGNet is that it's so simple. You might have noticed that in the previous networks you have these different filter sizes and different layers, you use different amounts of stride, and everything looks a bit hairy; you're not sure where these hyperparameters are coming from. VGGNet is extremely uniform.
All you do is 3x3 convolutions with stride one and pad one, and 2x2 max poolings with stride two, and you do this throughout, a completely homogeneous architecture; you just alternate a few conv layers and a pool layer and you get top performance. They managed to reduce the error down to 7.3% with VGGNet, just with this very simple and homogeneous architecture. I've also written out this D architecture here so you can see it; I'm not sure how instructive it is because it's kind of dense, and you can look at it offline perhaps, but you can see how these volumes develop and the sizes of the filters. They're always 3x3, but the number of filters again grows: we start off with 64 and then go to 128, 256, 512, so we're just doubling it over time.

I also have a few numbers here just to give you an idea of the scale at which these networks normally operate. We have on the order of 140 million parameters; this is actually quite a lot, and I'll show you in a bit that this can be more like five or ten million parameters and work just as well. And it's about 100 megabytes of memory per image for the forward pass, and the backward pass needs roughly that same order again. So those are roughly the numbers we're working with here. You can also note, and this is true of convolutional networks generally, that most of the memory is in the early convolutional layers, while most of the parameters, at least when you use these giant fully connected layers at the top, are up there at the end.
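Here is a back-of-the-envelope sketch of where those parameters and activations live in a VGG-16-style network (configuration D: 3x3 convs with pad 1, 2x2 pools with stride 2, on a 224x224x3 input). The layer list below is the standard one for that configuration; treat the numbers as rough arithmetic supporting the claims above, not an exact reproduction of the slide.

```python
conv_channels = [64, 64, "pool", 128, 128, "pool", 256, 256, 256, "pool",
                 512, 512, 512, "pool", 512, 512, 512, "pool"]

size, in_ch = 224, 3
conv_params, activations = 0, []
for c in conv_channels:
    if c == "pool":
        size //= 2                                  # 2x2 pool with stride 2 halves width/height
    else:
        conv_params += 3 * 3 * in_ch * c            # one 3x3 x in_ch filter per output channel
        in_ch = c
    activations.append(size * size * in_ch)         # numbers in this layer's output volume

# Fully connected layers at the top: 7x7x512 -> 4096 -> 4096 -> 1000
fc_params = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000

print("conv parameters: %.1fM" % (conv_params / 1e6))    # ~14.7M
print("fc parameters:   %.1fM" % (fc_params / 1e6))      # ~123.6M of the ~140M total
print("share of activations in the first six layers (through the second pool): %.0f%%"
      % (100 * sum(activations[:6]) / sum(activations)))  # ~72%: early layers dominate memory
```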
The winner in 2014 was actually not the VGGNet; I only presented it because it's such a simple architecture. The winner was actually GoogLeNet, with a slightly hairier architecture, we should say. It's still a sequence of things, but in this case they've put inception modules in sequence, and this is an example inception module. I don't have too much time to go into the details, but you can see that it consists basically of convolutions with different kinds of strides and so on. So GoogLeNet looks slightly hairier, but it turns out to be more efficient in several respects. For example, it works a bit better than VGGNet, at least at the time, and it only has about five million parameters, compared to VGGNet's 140 million, so a huge reduction. You get that, by the way, mostly by throwing away the fully connected layers: you'll notice in the breakdown I did that those fully connected layers have 100 million and 16 million parameters, and it turns out you don't actually need that; if you take them away, it doesn't hurt the performance too much. So you can get a huge reduction in parameters. We can also compare to the original AlexNet: compared to it, GoogLeNet has fewer parameters, a bit more compute, and much better performance. GoogLeNet was really optimized to have a low footprint, memory-wise, computation-wise, and parameter-wise, but it looks a bit uglier, whereas VGGNet is a very beautiful, homogeneous architecture with some inefficiencies in it. Okay, so that's 2014. Now, in 2015 we had a slightly bigger delta on top of these architectures. Up to this point, if Yann LeCun had looked at these architectures in 1998, he would still recognize everything: everything looks very simple, you've just played with the hyperparameters.

So one of the first bigger departures, I would argue, came in 2015 with the introduction of residual networks. This is work from Kaiming He and colleagues at Microsoft Research Asia. They did not only win the ImageNet challenge in 2015, they won a whole bunch of challenges, and this was all just by applying these residual networks trained on ImageNet and then fine-tuned on all these different tasks; you can basically crush lots of different tasks whenever you get a new, awesome convnet. At this point the performance was at 3.57% with these residual networks. So this is 2015. This paper also tried to argue that if you look at the number of layers, it keeps going up, and they made the point that with residual networks, as we'll see in a bit, you can introduce many more layers and that this correlates strongly with performance. We've since found that you can in fact make these residual networks quite a lot shallower, say on the order of 20 or 30 layers, and they work just as well, so it's not necessarily the depth by itself; I'll go into that in a bit, but in any case you get much better performance.

What's interesting about this paper is this plot, where they compare these residual networks (I'll go into the details of how they work in a bit) with what they call plain networks, which is everything I've explained up to now. The problem with plain networks is that when you try to scale them up and introduce additional layers, they don't get monotonically better. If you take a 20-layer model (these are CIFAR-10 experiments) and you run it, and then you take a 56-layer model, you'll see that the 56-layer model performs worse, and this is not just on the test data, so it's not just an overfitting issue. This is on the training data: the 56-layer model performs worse on the training data than the 20-layer model, even though the 56-layer model could imitate the 20-layer model by setting 36 of its layers to compute identities. So it's basically an optimization problem: you can't find that solution once your problem size grows that much bigger, in the plain net architecture. In the residual networks they proposed, they found that when you wire things up in a slightly different way, you monotonically get better performance as you add more layers. So more layers is always strictly better, and you don't run into these optimization issues.

Comparing residual networks to plain networks: in plain networks, as I've explained already, you have this sequence of convolutional layers, where every convolutional layer operates over the volume from before and produces a new volume. In residual networks, we have a first convolutional layer on top of the raw image, then there's a pooling layer, so at this point we've reduced the original image to 56x56x64, and from there on they have these residual blocks with these funny skip connections, and this turns out to be quite important. So let me show you what these look like. Kaiming's original paper had the architecture shown here under "original"; on the left you see the original residual network design. Since then they've had an additional paper that played with the architecture and found that there's a better arrangement of the layers inside this block that works better empirically.
And so the way this works (concentrate on the proposed block in the middle, since that's the one that works so well) is that you have this pathway where you have the representation of the image, X, and then instead of transforming that representation X into a completely new X to plug in later, we keep this X, we go off and do some computation on the side (that's the residual block doing some computation), and then we add the result back on top of X. So you have this addition operation, and then you go on to the next residual block. You have this X, and you always compute deltas to it. I think it's not intuitive that this should work much better, or why it works much better. I think it becomes a bit more intuitively clear if you understand the backpropagation dynamics and how backprop actually works; this is why I always urge people to implement backprop themselves, to get an intuition for how it works and what it's computing. Because if you understand backprop, you'll see that the addition operation is a gradient distributor: you get a gradient from the top, and that gradient flows equally to all the children that participated in the addition. So you have gradient flowing from the supervision; the supervision is at the very bottom here in this diagram, and the gradient kind of flows upwards, and it flows through these residual blocks and gets added to this stream. But the addition always distributes the gradient identically through, so what you end up with is this kind of gradient superhighway, as I like to call it, where the gradients from your supervision go directly to the original convolutional layer, and then on top of that you get these deltas from all the residual blocks. So these blocks can come online and help out that original stream of information. This is also related, I think, to why LSTMs, long short-term memory networks, work better than plain recurrent neural networks: they also have these kinds of addition operations in the LSTM, and that just makes the gradients flow significantly better.
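As a sketch of the wiring difference (not the exact block from the paper; the real blocks are two or three convolutions with batch normalization, plus a projection when the shapes change, and here a single weight matrix stands in for all of that purely for brevity), the plain versus residual update looks roughly like this:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def plain_block(x, w):
    # Plain net: the layer's output replaces the representation outright
    return relu(w @ x)

def residual_block(x, w):
    # Residual net: the block computes a delta that is added back onto x,
    # so during backprop the addition passes gradients straight through to x
    return x + relu(w @ x)

x = np.random.randn(64)
for w in [np.random.randn(64, 64) * 0.01 for _ in range(10)]:
    x = residual_block(x, w)   # a stack of residual blocks: x <- x + F(x)
```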
Then there were some results on top of residual networks that I thought were quite amusing. Recently, for example, we had this result on deep networks with stochastic depth. The idea is that the authors of this paper noticed that you have these residual blocks that compute deltas on top of your stream, and you can basically randomly throw out layers. So you have, say, 100 residual blocks, and you can randomly drop them out during training, and then at test time, similar to dropout, you introduce all of them and they all work at the same time, but you have to scale things a bit, just like with dropout. It's kind of an unintuitive result, because you can throw out layers at random, and I think it breaks the original notion we had of convnets as these feature transformers that compute more and more complex features over time, or something like that. It seems much more intuitive, at least to me, to think about these residual networks as some kind of dynamical system, where you have this original representation of the image, X, and every single residual block is kind of like a vector field, because it computes a delta on top of your signal. These vector fields nudge your original representation X towards a space where you can decode the answer Y, the class of that X. And so if you drop some of these residual blocks at random, then if one of these vector fields hasn't been applied, the other vector fields that come later can kind of make up for it: they pick up the slack and nudge the representation along anyway. So that's possibly why this works, and it's the image I currently have in mind of how these things operate: much more like dynamical systems. In fact, another experiment people are playing with, which I also find interesting, is that you can share these residual blocks, so it starts to look more like a recurrent neural network. These residual blocks would have shared connectivity, and then you really have a dynamical system where you're running a single RNN, a single vector field that you keep iterating over and over, and then your fixed point gives you the answer. So it's kind of interesting what's happening; it looks very funny.

Okay, we've had many more interesting results; people are playing a lot with these residual networks and improving on them in various ways. As I mentioned already, it turns out that you can make these residual networks much shallower and make them wider: you introduce more channels, and that can work just as well, if not better. So it's not necessarily the depth that is giving you a lot of the performance; you can scale down the depth, and if you increase the width, that can actually work better, and they're also more efficient if you do it that way. There are more funny regularization techniques here: Swapout is a regularization technique that actually interpolates between plain nets, ResNets, and dropout, so that's also a fun paper. We have FractalNets; we have many more different types of nets. People have really experimented with this a lot, and I'm really eager to see what the winning architecture will be in 2016 as a result of all of this.

One of the things that has really enabled this rapid experimentation in the community is that somehow we've luckily developed this culture of sharing a lot of code among ourselves. For example, Facebook has released residual networks code in Torch that is really good, which I believe a lot of these papers have adopted and built on top of, and that allowed them to really scale up their experiments and explore different architectures. So it's great that this has happened. Unfortunately, a lot of these papers are coming out on arXiv, and it's kind of a chaos as they're being uploaded. So at this point I think it's a natural point to very briefly plug my arxiv-sanity.com. This is the best website ever, and what it does is it crawls arXiv, takes all the papers, analyzes the full text of the papers, and creates tf-idf bag-of-words features for all of them. Then you can do things like search for a particular paper, like the residual networks paper here, and look for similar papers on arXiv; this is a sorted list of basically all the residual-networks papers that are most related to that paper. Or you can create a user account and build a library of papers that you like, and then arxiv-sanity will train a support vector machine for you, so you can look at which arXiv papers from the last month you would enjoy the most; that's just computed by arxiv-sanity, and it's like a curated feed specifically for you.
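For what it's worth, the mechanism being described (tf-idf bag-of-words features over paper text, nearest neighbors for "similar papers", and a per-user SVM over liked papers) can be sketched in a few lines with scikit-learn. This is only an illustration of the idea, not the actual arxiv-sanity code; the tiny corpus and the like/ignore labels here are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

papers = ["deep residual learning for image recognition",
          "identity mappings in deep residual networks",
          "support vector machines for text categorization"]
X = TfidfVectorizer().fit_transform(papers)       # tf-idf bag-of-words features

# "Similar papers": cosine-style similarity of tf-idf vectors against a query paper
sims = (X @ X[0].T).toarray().ravel()
print(sims.argsort()[::-1])                        # papers ranked by similarity to paper 0

# Personalized feed: train an SVM on papers the user liked (1) versus ignored (0)
labels = [1, 1, 0]
clf = LinearSVC().fit(X, labels)
print(clf.decision_function(X))                    # score new papers by this user's taste
```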
So I use arxiv-sanity quite a bit and find it useful, so I hope other people do as well. Okay, so we've seen convolutional neural networks: I explained how they work, gave some of the background context and an idea of what they look like in practice, and we went through case studies of the winning architectures over time. But so far we've only looked at image classification, where we categorize images into some number of bins. So I'd like to briefly talk about addressing other tasks in computer vision and how you might go about doing that. The way to think about other tasks is that this convolutional neural network is really a block of compute with a few million parameters in it, and it can represent basically arbitrary, very nice functions over images. It takes an image and gives you some kind of features, and different tasks then look as follows: you want to predict some kind of thing, which differs from task to task; you always have a desired thing; you make the predicted thing closer to the desired thing; and you backpropagate. That's usually the only part that changes from task to task. The ConvNet itself doesn't change much; what changes is the loss function at the very end, and that's what lets you transfer a lot of these winning architectures. You usually use these pre-trained networks and don't worry too much about the details of the architecture, because you're only adding a small piece at the top, changing the loss function, or substituting a new data set. To make this slightly more concrete: in image classification we apply this compute block, get the features, and then predict, say, 1,000 numbers that give the log probabilities of the different classes. Then I have a predicted thing and a desired thing, a particular class, and I can backprop. Image captioning looks very similar. Instead of predicting a single vector of scores, I now have, for example, 10,000 words in some vocabulary, and I predict a sequence of 10,000-dimensional vectors, using a recurrent neural network, which you will hear much more about in Richard's lecture right after this one. Each vector in the sequence gives the probabilities of the different words to be emitted at that time step, and that sequence is the description. Or, for localization, again most of the block stays unchanged, but now we also want some kind of extent in the image. Suppose we don't just want to classify this as an airplane but want to localize it with x, y, width, height bounding box coordinates, and we also make the assumption that there's always exactly one thing in the image, like a single airplane in every image; then you can afford to just predict that. So we predict the softmax scores just like before and apply the cross-entropy loss, and on top of that we predict x, y, width, height with something like an L2 loss or a Huber loss. So again you just have a predicted thing and a desired thing, and you just backprop.
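To make the "predicted thing versus desired thing" pattern concrete, here is a minimal sketch of a classification-plus-localization head. It assumes the tf.keras API; the 2048-dimensional feature vector, the 1,000-class count, and the layer names are illustrative assumptions, not the exact setup from the talk:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Shared "compute block" features feed two small heads: a softmax classifier and an
# (x, y, w, h) box regressor. Feature size, class count and names are made up.
features = layers.Input(shape=(2048,), name="convnet_features")
class_logits = layers.Dense(1000, name="class_logits")(features)   # classification head
box_xywh = layers.Dense(4, name="box_xywh")(features)              # localization head

model = Model(features, [class_logits, box_xywh])
model.compile(
    optimizer="adam",
    loss={
        "class_logits": tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        "box_xywh": tf.keras.losses.Huber(),   # or "mse" for a plain L2 loss
    },
)
```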
If you want to do reinforcement learning, because you want to play different games, then again the setup is that you predict some different thing with different semantics. In this case we might, for example, predict eight numbers that give the probabilities of taking different actions; say there are eight discrete actions in Atari, so we predict eight numbers. Then we train this in a slightly different manner, because in reinforcement learning you don't actually know what the correct action is at any point in time, but you can still get a desired thing eventually: you run rollouts over time, you see what happens, and that informs what the correct answer, the desired thing, should have been at any point in those rollouts. I don't want to dwell on this too much; it's outside the scope of this lecture, and you'll hear much more about reinforcement learning in a later lecture. If you want to do segmentation, then you don't predict a single vector of numbers for the whole image; every single pixel has its own category that you'd like to predict. So a data set will actually be colored like this, with different classes in different areas, and instead of predicting a single vector of class scores you predict an entire array of 224 x 224 (since that's the extent of the original image, for example) x 20, if you have 20 different classes. Then you basically have 224 x 224 independent softmaxes; that's one way you could pose it, and you backpropagate. This one is slightly more involved, because the figure mentions deconvolution layers, which I haven't explained. They're related to convolutional layers and do a very similar operation, but kind of backwards. A convolutional layer does downsampling operations as it computes; a deconvolution layer does upsampling operations as it computes these convolutions. In fact, you can implement a deconv layer using a conv layer: the deconv forward pass is the conv layer's backward pass, and the deconv backward pass is the conv layer's forward pass, so they're basically the identical operation, and the only question is whether you're upsampling or downsampling. So you can use deconv layers, or hypercolumns, and there are different things people do in the segmentation literature, but the rough idea is that you're just changing the loss function at the end. If you want to do autoencoders, so you want to do some unsupervised learning or something like that, you're just trying to predict the original image; you're trying to get the convolutional network to implement the identity transformation. The trick that makes it non-trivial, of course, is that you force the representation to go through a representational bottleneck of 7 x 7 x 512, so the network must find an efficient representation of the original image so that it can decode it later.
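A minimal sketch of such a bottlenecked autoencoder, again assuming the tf.keras API; the filter counts and kernel sizes are made up, and only the 224x224 input and the 7x7x512 bottleneck follow the numbers above:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Convolutional autoencoder: squeeze a 224x224x3 image through a 7x7x512 bottleneck,
# then decode back to the original pixels and train with an L2 reconstruction loss.
img = layers.Input(shape=(224, 224, 3))

x = img
for filters in (32, 64, 128, 256, 512):           # 224 -> 112 -> 56 -> 28 -> 14 -> 7
    x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
bottleneck = x                                     # the 7 x 7 x 512 representation

for filters in (256, 128, 64, 32):                 # 7 -> 14 -> 28 -> 56 -> 112
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(x)
reconstruction = layers.Conv2DTranspose(3, 3, strides=2, padding="same")(x)   # 112 -> 224

autoencoder = Model(img, reconstruction)
autoencoder.compile(optimizer="adam", loss="mse")  # L2 loss between input and reconstruction
# autoencoder.fit(images, images, ...)             # the image is its own target
```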
So that would be an autoencoder: you again have an L2 loss at the end and you backprop. Or, if you want to do variational autoencoders, you introduce a reparameterization layer and append an additional small loss that pushes your posterior toward your prior; it's just like an additional layer, and then you have an entire generative model and can actually sample images as well. If you want to do detection, things get a little more hairy compared to, say, localization. One of my favorite detectors to explain is the YOLO detector, because it's perhaps the simplest one. It doesn't work the best, but it's the simplest to explain, and it has the core idea of how people do detection in computer vision. The way this works is that we reduce the original image to a 7 x 7 x 512 feature map, so there are really 49 discrete locations, and at every single one of those 49 locations YOLO predicts a class; that's shown here on the top right, so every one of the 49 positions has some kind of softmax. Additionally, at every position we predict some number B of bounding boxes. Say B is 10; then we'd be predicting 50 numbers, where the five comes from the fact that every bounding box has five numbers associated with it: you have to describe the x, y, the width and the height, and you also have to give some kind of confidence for that bounding box, which is the fifth number. So you end up predicting these bounding boxes, which have positions, classes, and confidences, and then you have some true bounding boxes in the image: you know there are certain true boxes with certain classes. What you do then is match up the desired thing with the predicted thing. Say, for example, you had one ground-truth bounding box of a cat; then you find the closest predicted bounding box, mark it as a positive, try to make the associated grid cell predict cat, and nudge the predicted box to be slightly closer to the cat box. All of this can be done with simple losses, and you just backpropagate that, and then you have a detector. Or, if you want to get much fancier, you can do dense image captioning, which is a combination of detection and image captioning. This is a paper with my equal co-author Justin Johnson and Fei-Fei Li from last year. What we did is take an image, and it becomes much more complex; I don't want to go into it too much, but the first-order approximation is that it's basically detection, except that instead of predicting fixed classes we predict a sequence of words, using a recurrent neural network there. So you can take an image and both detect and describe everything in a complex visual scene. That's an overview of different tasks that people care about. Most of them consist of just changing the top part: you put in a different loss function and a different data set, but the computational block stays relatively unchanged from task to task.
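For the YOLO-style detector described above, a sketch of just the output head might look like this. The real YOLO uses fully connected layers and its own values of B and the class count, so the 1x1 convolutions and the numbers here (B = 10 as in the talk, a made-up 20-class label set) are simplifying assumptions that only illustrate the output shapes:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# YOLO-style output head on a 7x7x512 feature map: per grid cell, C class scores plus
# B boxes of 5 numbers each (x, y, w, h, confidence). With B = 10 that is 50 box numbers.
C, B = 20, 10
feats = layers.Input(shape=(7, 7, 512))

cell_class_logits = layers.Conv2D(C, 1, name="cell_classes")(feats)   # 7 x 7 x C
cell_boxes = layers.Conv2D(B * 5, 1, name="cell_boxes")(feats)        # 7 x 7 x (B*5)

model = Model(feats, [cell_class_logits, cell_boxes])
# Training matches each ground-truth box to the closest predicted box in its grid cell,
# marks it positive, and nudges that box and the cell's class scores toward the target.
```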
And that's why, as I mentioned, when you do transfer learning you just want to take these pre-trained networks and mostly use whatever works well on ImageNet, because a lot of that does not change too much. Okay. So in the last part of the talk, let me just make sure we're good on time... okay, we're good. In the last part of the talk I wanted to give some hints on practical considerations when you want to apply convolutional networks in practice. The first consideration, if you want to run these networks, is what hardware to use. Some of the options available to you: first of all, you can just buy a machine. For example, NVIDIA has these DIGITS DevBoxes that you can buy; they have Titan X GPUs, which are strong GPUs. If you're much more ambitious, you can buy a DGX-1, which has the newest Pascal P100 GPUs; unfortunately the DGX-1 is about $130,000, so it's kind of an expensive supercomputer, but the DevBox, I think, is more accessible. So that's one option you can go with. Alternatively, you can look at the specs of a DevBox, which are good specs, then buy all the components yourself and assemble it like Lego. That's prone to mistakes, of course, but you can reduce the price, maybe by a factor of two, compared to the NVIDIA machine; of course the NVIDIA machine just comes with all the software installed and all the hardware ready, so you can just do work. There are a few GPU offerings in the cloud, but unfortunately that's not in a good place right now; it's actually quite difficult to get good GPUs in the cloud. Amazon AWS has these GRID K520s; they're not very good GPUs, they're not fast, and they don't have much memory, which is actually kind of a problem. Microsoft Azure is coming up with its own offering soon; I think they've announced it and it's in some kind of beta stage, if I remember correctly, and those would be K80s, which are powerful GPUs. At OpenAI, for example, we use Cirrascale, which is a slightly different model: you can't spin up GPUs on demand, but they let you rent a box in the cloud. What that amounts to is that we have these boxes somewhere in the cloud, I just have the DNS name, I SSH to it, and it's a Titan X box, and you can just do work that way. So those are the hardware options available. In terms of software, there are of course many different frameworks you could use for deep learning; these are some of the more common ones you might see in practice. Different people have different recommendations on this. My personal recommendation right now, for most people who just want to apply this in practical settings, is that 90% of the use cases are probably addressable with something like Keras, so Keras would be my number one thing to look at. Keras is a layer over TensorFlow or Theano; basically it's a higher-level API over either of them. For example, I usually use Keras on top of TensorFlow, and it's a much higher-level language than raw TensorFlow. You can also work in raw TensorFlow, but you'll have to do a lot of low-level stuff; if you need all that freedom, that's great, because it allows you much more freedom in how you design everything.
But raw TensorFlow can be slightly more verbose; for example, you have to create every single weight and give it a name, and so on, so it's just more wordy, though you can work at that level. For most applications, I think Keras would be sufficient. And I've used Torch for a long time; I still really like Torch. It's very lightweight and easy to understand, and it works just fine. So those are the options I would currently consider. Another practical consideration: you might be wondering what architecture to use for your problem. My answer here, and I've already hinted at this, is don't be a hero. Don't go crazy and design your own neural networks and convolutional layers; you probably don't want to do that. The algorithm is actually very simple: look at whatever is currently the latest released thing that works really well in ILSVRC, download that pre-trained model, potentially add or delete some layers on top because you want to do some other task (that usually requires some tinkering at the top), and then fine-tune it on your application. So it's actually a very straightforward process. To a first approximation, for most applications: don't tinker with it too much, you're going to break it. But of course you can also take CS231n, and then you might become much better at tinkering with these architectures. Second: how do I choose the hyperparameters? My answer here, again, is don't be a hero. Look into papers and see what hyperparameters they use; for the most part you'll see that all papers use very similar hyperparameters. With Adam, it's always a learning rate of 1e-3 or 1e-4; you can also use SGD with momentum, with similar kinds of learning rates. So don't go too crazy designing this. The one thing you probably want to play with the most is the regularization, and in particular not the L2 regularization but the dropout rates, because you might have a smaller or a much larger data set. If you have a much smaller data set, then overfitting is a concern, so you want to make sure you regularize properly with dropout. As a second-degree consideration you might want to tune the learning rate a little bit, but that usually doesn't have as much of an effect. So really there are something like two hyperparameters, plus a pre-trained network, and that covers maybe 90% of the use cases, compared to computer vision in 2011, where you might have had hundreds of hyperparameters. Okay, and in terms of distributed training: if you want to work at scale, because you want to train on ImageNet or some other large data set, you might want to train across multiple GPUs. Just to give you an idea, most of these state-of-the-art networks are trained on the order of a few weeks across multiple GPUs, usually four or eight. These GPUs are roughly on the order of $1,000 each, and then you also have to house them, so that adds additional cost. But you almost always want to train on multiple GPUs if possible. Usually you don't end up training across machines; that's much more rare, I think.
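Before the multi-GPU details that follow, here is a rough sketch of the "don't be a hero" fine-tuning recipe just described: grab a network pre-trained on ImageNet, replace the top for your own task, add dropout, and fine-tune with the usual learning rates. It assumes the modern tf.keras API; ResNet50 and the 10-class output are placeholders, not the specific model from the talk:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Take a pre-trained ImageNet network, swap the top, and fine-tune on your own data.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                                    # freeze the pre-trained compute block

x = layers.Dropout(0.5)(base.output)                      # the main regularization knob to tune
predictions = layers.Dense(10, activation="softmax")(x)   # your own task's classes (made up)
model = Model(base.input, predictions)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),   # 1e-3 or 1e-4, as in most papers
              loss="sparse_categorical_crossentropy")
# model.fit(train_images, train_labels, ...)
```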
What's much more common than multi-machine training is that you have a single machine with eight Titan Xs or something like that, and you do distributed training on those eight cards. There are different ways to do distributed training. If you're feeling fancy, you can try some model parallelism, where you split your network across multiple GPUs. I would instead advise some kind of data parallelism. What you usually see in practice is: you have eight GPUs, so I take my batch of, say, 256 images, split it equally across the GPUs, do the forward pass on those GPUs, and then basically add up all the gradients and propagate that through. So you're just distributing the batch; mathematically you're doing exactly the same thing as if you had one giant GPU, you've just split that batch across different GPUs, and you're still doing synchronous SGD training as normal. That's what you'll see most in practice, and I think it's the best thing to do right now for most normal applications. Other considerations that sometimes enter are the bottlenecks to be aware of. In particular, the CPU-to-disk bottleneck: you have a giant data set sitting on some disk, and you probably want that disk to be an SSD, because the GPUs process data very quickly and loading the data can actually be the bottleneck. So in many applications you might want to pre-process your data and make sure it can be read out contiguously, in very raw form, from something like an HDF5 file or some other binary format. Another bottleneck to be aware of is the CPU-to-GPU bottleneck. The GPU is doing the heavy lifting of the neural network while the CPU is loading the data, and you might want to use things like pre-fetching threads, so that while the network is doing the forward and backward passes on the GPU, your CPU is busy loading the data from disk, maybe doing some pre-processing, and making sure it can ship it off to the GPU at the next time step. So those are some of the practical considerations I could come up with for this lecture. If you want to learn much more about convolutional neural networks and a lot of what I've been talking about, I encourage you to check out CS231n. We have lecture videos, notes, slides, and assignments; everything is up and available, so you're welcome to check it out. And that's it. Thank you. [Applause] So I guess I can take some questions. Yeah. Hello. Hi, I'm Kyle Farh from Illumina. I'm using a lot of convolutional nets for genomics. One of the problems we see is that our genomic sequences tend to be of arbitrary length. Right now we're padding with a lot of zeros, but we're curious what your thoughts are on using CNNs for things of arbitrary size, where we can't just downsample to 227 x 227. Yep. So is this a genomic sequence of A, T, C, G, that kind of sequence? Yeah, exactly. Yeah. So some of the options would be: recurrent networks might be a good fit, because they allow arbitrarily sized context. Another option, I would say, is to look at the WaveNet paper from DeepMind: they have audio and they're using convolutional networks to process it, and I would basically adopt that kind of architecture.
They have this clever way of doing what's called atrous, or dilated, convolutions, which allows you to capture a lot of context with few layers. The WaveNet paper has the details, and there's an efficient implementation of it on GitHub that you should be aware of, so you might be able to just drop the fast WaveNet code into your application. That gives you much larger context, though of course not the infinite context you might have with a recurrent network. Yeah, we're definitely checking those out. We also tried RNNs; they're quite slow for these things. Our main problem is that the genes can be very short or very long, but the whole sequence matters, so I think that's one of the challenges we're looking at with this type of problem. Interesting. Yeah, so those would be the two options I would play with; I think those are the two I'm aware of. Yeah, thank you. Thanks for a great lecture. My question is: is there a clear mathematical or conceptual understanding of how many hidden layers should be part of an architecture? Yeah, so the answer to a lot of these "is there a mathematical understanding" questions will likely be no, because we are in the very early phases of doing a lot of empirical, guess-and-check kind of work, and the theory is in some ways lagging behind a bit. I would say that with residual networks, more layers usually works better, so you can take layers out or put them in, and it's mostly a computational consideration of how much you can fit. Our consideration is usually: you have a GPU with maybe 16 or 12 gigs of RAM, you want a certain batch size, and that upper-bounds how many layers you can have or how big they can be. So I use the biggest thing that fits in my GPU, and that's mostly how you choose this. And then you regularize it very strongly. If you have a very small data set, you might end up with a pretty big network for that data set, so you want to make sure you're tuning those dropout rates properly and not overfitting. So I have a question: my understanding is that recent convolutional networks don't use pooling layers, right? So the question is, why don't they use pooling layers, and is there still a place for pooling? Yeah, so certainly, if you look at the residual network, for example, there was a single pooling layer at the very beginning, but mostly pooling layers went away; you're right. Let me see if I can find the slide... I wonder if that's a good idea... okay, let me just find this. Oh, okay, so this was the residual network architecture. You see that they do a first conv and then there's a single pool right there, but certainly the trend has been to throw them away over time. There's also a paper called "Striving for Simplicity: The All Convolutional Net", and the point of that paper is that you can actually do strided convolutions instead and throw away pooling layers altogether, and it works just as well.
So pooling layers are, I would say, a bit of a historical vestige: they were needed to keep things efficient, to control the capacity, and to downsample quite a lot, and we're kind of throwing them away over time. They're not doing anything super useful; they perform a fixed operation, and you want to learn as much as possible, so maybe you don't actually want to throw away that information. So it's probably more appealing, I would say, to get rid of them. You mentioned there's a sort of cognitive or brain analogy, that the brain is doing pooling. Yeah, so I think that analogy is stretched by a lot; I'm not sure the brain is doing pooling. [Laughter] Not for classification, but can we use neural networks for image compression? Image compression? Yeah, I think there's actually really exciting work in this area. One example I'm aware of is recent work from Google where they're using convolutional networks and recurrent networks to come up with variably sized codes for images. Certainly a lot of these generative models are very related to compression, so there's definitely a lot of work in that area that I'm excited about. Also, for example, super-resolution networks: you saw the recent acquisition of Magic Pony by Twitter; they were doing something that basically allows you to compress, because you can send low-resolution streams and upsample them on the client. So there's a lot of work in that area. Yeah, I had one question: can you please comment on scalability with respect to the number of classes? What does it take if we go up to 10,000 or 100,000 classes? Yes. So if you have a lot of classes, you can of course grow your softmax, but that becomes inefficient at some point because you're doing a giant matrix multiply. One of the ways people address this in practice, I believe, is hierarchical softmax and things like that: you decompose your classes into groups, and then you predict one group at a time and converge on the class that way. I see these papers, but I'm not an expert on exactly how this works; I do know that hierarchical softmax is something people use in this setting. It's often used in language models, for example, because you have a huge number of words and you still need to predict them somehow, and I believe Tomas Mikolov has some papers on using hierarchical softmax in that context. Could you talk a little bit about the convolutional functions, like what considerations you should make in selecting the functions used in the convolutional filters? Selecting the functions used in the convolutional filters... so these filters are just parameters, right? We train those filters; they're just numbers that we train with backpropagation. Okay.
Are you talking about the nonlinearities, perhaps, or...? Yeah, I'm just wondering, when you're training to pick up different features within an image, what are those filters actually doing? Oh, I see, you're talking about understanding exactly what those filters are looking for. So there's a lot of interesting work there. For example, Jason Yosinski has the DeepVis toolbox, and I've shown you that you can kind of debug the network that way a bit. There's an entire lecture in CS231n on visualizing and understanding convolutional networks that I encourage you to watch. People use things like deconv or guided backpropagation, or you backpropagate into the image and try to find a stimulus that maximally activates any arbitrary neuron. So different ways of probing the network have been developed, and there's a lecture about it, so I would check that out. Great, thanks. I had a question regarding the size of the fine-tuning data set. For example, is there a ballpark number if you're trying to do classification? How many examples would you need to fine-tune it to your sample set? So the question is how many data points you need to get good performance. Okay, so this is the most boring answer: the more the better, always, and it's really hard to say how many you need. One heuristic people sometimes follow is to look at the number of parameters and ask for the number of examples to be on the order of the number of parameters; that's one way people sometimes break it down, even for fine-tuning. But we'd have an ImageNet model, so I was hoping most of the work would be taken care of there, and you're just fine-tuning, so you might need an order of magnitude less. I see. So when you say fine-tuning, are you fine-tuning the whole network, or freezing some of it, or just the top classifier? Just the top classifier. Yeah. So another way to look at it is: you have some number of parameters, you can estimate the number of bits you think each parameter carries, and then you count the number of bits in your data; that's the kind of comparison you would do. But really, I have no good answer. The more the better; you have to try it, regularize, cross-validate, and see what performance you get, because it's too task-dependent for me to say anything stronger. Hi, I'd like to know how you think ConvNets will work in the 3D case. Is it just a simple extension of the 2D case, or do we need some extra tweaks for 3D? So are you talking specifically about, say, videos, or 3D...? Actually, I'm talking about images that have depth information. Oh, I see, so you have RGB-D input and things like that. I'm not too familiar with what people do, but one thing you can do, for example, is just treat depth as a fourth channel, or maybe you want a separate ConvNet on top of the depth channel and do some fusion later. I don't know exactly what the state of the art in treating that depth channel is right now.
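As a sketch of the first of those two options (depth stacked in as a fourth input channel), assuming tf.keras; everything beyond the first layer is a made-up stand-in for the rest of the network:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# RGB-D as "early fusion": the very first convolution mixes depth with the color channels.
rgbd = layers.Input(shape=(224, 224, 4))                  # R, G, B, depth
x = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(rgbd)
x = layers.GlobalAveragePooling2D()(x)                    # placeholder for the deeper layers
scores = layers.Dense(10, activation="softmax")(x)        # made-up 10-class output
model = Model(rgbd, scores)
```

The other option mentioned, a separate ConvNet on the depth channel with later fusion, would look like the two-input sketch that appears a bit further below.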
Oh, maybe just one more question: what do you think about 3D object recognition? 3D object recognition... so what is the output that you'd like? The output is still the class probability, but we're not dealing with a 2D image; we have a 3D representation of the object. I see. So do you have a mesh or a point cloud? Yeah. I see. Yeah, so this is also not exactly my area, unfortunately, but the problem with these meshes and so on is that there's this rotational degree of freedom, and I'm not sure what people do about that, honestly. There are some obvious things you might want to try, like plugging in all the possible ways you could orient the object and then averaging over them at test time; those would be some of the obvious things to play with, but I'm not actually sure what the state of the art is. Okay, thank you. I have one more question. Okay. Coming back to distributed training: is it possible to do even the classification in a distributed way? My question is, in the future, could our cell phones do these things together for one query? Oh, I see, you're trying to get cell phones to do distributed training. Yes, and also a radical idea: training with many cell phone users. Very radical idea. So a related thought I had recently: I have ConvNetJS, which runs in the browser and trains networks, and I was thinking about similar questions, because you could imagine shipping this off as an ad equivalent: people just include it in their JavaScript, and then everyone's browser is training a small network. So I think that's a related question. But do you think there's too much communication overhead, or could it actually be distributed in an efficient way? Yes. So the problem with distributing it a lot is the stale gradients problem. If you look at some of the papers Google has put out about distributed training, and you look at the number of workers in asynchronous SGD versus the performance improvement you get, it plateaus quite quickly, after something like eight workers, which is quite small. So I'm not sure there are ways of dealing with thousands of workers. The issue is that every worker has a specific snapshot of the weights that it pulled from the master; it does a forward and backward pass using those weights and sends an update, but by the time it has done that and sends the update, the parameter server has already applied lots of updates from thousands of other workers. So your gradient is stale: you've evaluated it at an old location, it's now an incorrect direction, and everything breaks. That's the challenge, and I'm not sure what people are doing about it. Yeah. I was wondering about applications of convolutional nets to two inputs at a time. Say you have two pictures of jigsaw puzzle pieces and you're trying to figure out whether they fit together, or whether one object compares to the other in a specific way. Have you heard of any implementation of this kind? Yes, so you have two inputs instead of one. The common way of dealing with that is to put a ConvNet on each and then do some kind of fusion eventually to merge the information.
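A minimal sketch of that "ConvNet on each input, then fuse" idea, with a shared encoder and late fusion; the input sizes, layer sizes, and the binary "do these fit together" output are all illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Two inputs, late fusion: run a (here shared) small ConvNet on each image, concatenate
# the two feature vectors, and predict whether the pair matches.
def small_convnet():
    inp = layers.Input(shape=(64, 64, 3))
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return Model(inp, x)

encoder = small_convnet()                        # shared weights; use two encoders to untie them
piece_a = layers.Input(shape=(64, 64, 3))
piece_b = layers.Input(shape=(64, 64, 3))
fused = layers.Concatenate()([encoder(piece_a), encoder(piece_b)])   # the fusion step
match = layers.Dense(1, activation="sigmoid")(fused)                 # "do these fit together?"

model = Model([piece_a, piece_b], match)
model.compile(optimizer="adam", loss="binary_crossentropy")
```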
Right, I see. And the same for recurrent neural networks, if you had variable input? For example, in the context of videos, where you have frames coming in. Yeah, so some of the approaches are: you have a convolutional network on each frame, and then at the top you tie it in with a recurrent neural network. You reduce the image to some lower-dimensional representation, and that becomes the input to a recurrent neural network at the top. There are other ways to play with this; for example, you can actually make every single neuron in the ConvNet recurrent. Right now, when a neuron computes its output, it's only a function of a local neighborhood below it, but you can also make it, in addition, a function of its own activation, and perhaps its neighborhood's activations, at the previous time step, the previous frame. So the neuron is not just computing a dot product with the current patch; it's also incorporating its own, and maybe its neighbors', activations from the previous time step. That's like a small RNN update hidden inside every single neuron. Those are the things I think people play with, but I'm not familiar with what is currently working best in this area. Pretty awesome, thank you. Yeah. Hi, thanks for the great talk. I have a question regarding the latency of models trained with many layers. Especially at prediction time, as we add more layers, the forward pass takes longer, so the latency increases. What are the numbers we're seeing these days for the prediction time, the latency of the forward pass, if you can share that? So you're worried about running prediction very quickly; would that be on an embedded device, or in the cloud? Suppose it's a cell phone, and you're identifying objects or doing some image analysis or something. Yeah, so there's definitely a lot of work on this. One way you would approach it is: you have this network that you've trained using floating-point arithmetic, say 32 bits, and there's a lot of work on taking that network, discretizing all the weights into integers, making it much smaller, and pruning connections. One of the works related to this, for example: Song Han here at Stanford has a few papers on getting rid of spurious connections and shrinking the network as much as possible, and then making everything very efficient with integer arithmetic. So basically you achieve this by discretizing all the weights and all the activations, and by throwing away and pruning parts of the network. There are tricks like that that people play, and that's mostly what you would do on an embedded device. The challenge, of course, is that you've changed the network, and now you're just crossing your fingers that it still works well. And so I think what's interesting from a research standpoint is that you'd like your test time to exactly match your training time, right?
So then you get the best performance, and the question becomes how to train with low-precision arithmetic in the first place. There's a lot of work on this as well, for example from Yoshua Bengio's lab, so that's an exciting direction: how you train in the low-precision regime. Do you have any numbers you can share for the state of the art, how much time it takes? Yes, I've seen the papers, but I'm not sure I remember the exact reductions. It's on the order of... okay, I don't want to say, because basically I don't know, and I don't want to try to guess. All right, thank you. All right, we're out of time. Let's thank Andrej. Lunch is outside, and we'll restart at 12:45.