Deep Learning for Computer Vision (Andrej Karpathy, OpenAI)
u6aEYuemt0M • 2016-09-27
Thank you very much for the introduction. Today I'll speak about deep learning, especially in the context of computer vision. What you saw in the previous talk is neural networks: they are organized into these fully connected layers, where neurons within one layer are not connected to each other, but are connected fully to all the neurons in the previous layer. We have this layer-wise structure from input to output, with neurons and nonlinearities and so on.

Now, so far we have not made too many assumptions about the inputs. In particular, we just assumed that an input is some kind of vector of numbers that we plug into the neural network. That's both a bug and a feature to some extent, because in most real-world applications we actually can make some assumptions about the input that make learning much more efficient. In particular, we usually don't just want to plug plain vectors of numbers into neural networks; the numbers actually have some kind of structure, arranged in some kind of layout like an n-dimensional array. For example, spectrograms are two-dimensional arrays of numbers, images are three-dimensional arrays of numbers, videos would be four-dimensional arrays of numbers, and text you could treat as a one-dimensional array of numbers. Whenever you have this kind of local structure in your data, you'd like to take advantage of it, and convolutional neural networks allow you to do that.

Before I dive into convolutional neural networks and all the details of the architectures, I'd like to briefly talk about the history of how this field evolved over time. I usually like to start off with Hubel and Wiesel and the experiments they performed in the 1960s. They were trying to study the computations that happen in the early visual cortex areas of a cat. They took a cat and plugged in electrodes that could record from the different neurons, and then they showed the cat different patterns of light, effectively trying to debug the neurons and see what they responded to. A lot of these experiments inspired some of the modeling that came afterwards. In particular, one of the early models that tried to take advantage of the results of these experiments was the Neocognitron, from Fukushima in the 1980s. It was a layer-wise architecture, similar to what you see in the cortex, with alternating simple and complex cells throughout, where the simple cells detect small things in the visual field and there is a local connectivity pattern. This looks a bit like a ConvNet because it shares some of its features, like the local connectivity, but at the time it was not trained with backpropagation; it used specific, heuristically chosen updates, and it was unsupervised learning back then. The first time backpropagation was actually used to train some of these networks was in the work of Yann LeCun in the 1990s.
This is an example of one of the networks developed back then, in the 1990s, by Yann LeCun: LeNet-5. This is what you would recognize today as a convolutional neural network. It has alternating convolutional layers, a similar kind of design to Fukushima's Neocognitron, but it was actually trained end to end with backpropagation, using supervised learning.

So this happened in roughly the 1990s, and here we are in 2016, about 20 years later. Computer vision has for a long time worked on larger images, and a lot of these models back then were applied to very small settings, like recognizing digits and zip codes, and they were very successful in those domains. But when I entered computer vision, in roughly 2011, a lot of people were aware of these models, yet it was thought that they would not naively scale up to large, complex images; that they would be constrained to these smaller visual recognition problems for a long time. I shouldn't say toy problems, because these were very important tasks, but certainly smaller ones. In computer vision in roughly 2011 it was much more common to use feature-based approaches, and they actually didn't work that well. When I entered my PhD in 2011, working on computer vision, you would run a state-of-the-art object detector on an image and you might get something like this, where cars were detected in trees, and you would kind of just shrug your shoulders and say, "Well, that just happens sometimes." You accepted it as something that would just happen. Of course this is a caricature; things were actually relatively decent, I should say, but there were definitely many mistakes that you would not see today, in 2016, five years later.

A lot of computer vision looked much more like this. When you looked at a paper that tried to do image classification, you would find a section on the features that they used. This is one page of features, GIST and so on; then a second page of features and all their hyperparameters, all kinds of different histograms; you would extract this kitchen sink of features, and a third page here. You would end up with a very large, complex codebase, because some of these feature types were implemented in MATLAB, some in Python, some in C++, and you'd be extracting all these features, caching them, and eventually plugging them into linear classifiers to do some kind of visual recognition task. It was quite unwieldy. It worked to some extent, but there was definitely room for improvement.

A lot of this changed in computer vision in 2012, with the paper from Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. This was the first time someone took a convolutional neural network very similar to the one you saw from Yann LeCun in 1998 (I'll go into the details of how they differ exactly), scaled it up, made it much bigger, and trained it on a much bigger dataset on GPUs, and things basically ended up working extremely well. This was the first time the computer vision community really noticed these models and adopted them to work on larger images.
We saw that the performance of these models improved drastically. Here we are looking at the ImageNet ILSVRC visual recognition challenge over the years, specifically at the top-5 error, so low is good. You can see that in the beginning, from 2010, these were feature-based methods; then in 2012 we had this huge jump in performance, due to the first convolutional neural network entry, and we've managed to push that down over time, and now we're at about 3.57%. I think the results for the ImageNet 2016 challenge are actually due to come out today, but I don't think they've come out yet. I have this second tab open here; I was waiting for the result, but I don't think it's up yet. Okay. No, nothing. All right, well, we'll get to find out very soon, so I'm very excited to see that.

Just to put this in context, because you're just looking at numbers like 3.57%: how good is that? That's actually really, really good. Something I did about two years ago is that I tried to measure human accuracy on this dataset. For that, I developed a web interface where I would show myself ImageNet images from the test set, together with all the different classes of ImageNet (there are 1,000 of them) and some example images, and you go down this list, scroll for a long time, and find which class you think the image might be. I then competed against the ConvNet of the time, which was GoogLeNet in 2014. Hot dog is a very simple class; you can get that quite easily. But then why is the error not 0%? Some classes like hot dog seem very easy, so why isn't this trivial for humans? Well, it turns out that some of the images in the ImageNet test set are actually mislabeled, but also some of the images are just very difficult to guess. In particular, if you have this terrier: there are 50 different types of terriers, and it turns out to be a very difficult task to find exactly which type of terrier that is. You can spend minutes trying to find it. Convolutional neural networks turn out to be extremely good at this, and this is where I would lose points compared to the ConvNet. I estimate that human error based on this is roughly in the 2 to 5% range, depending on how much time you have, how much expertise you have, how many people you involve, and how much they really want to do this, which is not too much. So really, we're doing extremely well: we're down to about 3%, and the error rate, if I remember correctly, was about 1.5%. So if we get below 1.5% on ImageNet, I would be extremely suspicious; that seems wrong.

To summarize, before 2012 computer vision looked somewhat like this: we had these feature extractors, and we trained only a small portion at the end, on top of features that were fixed. We've basically replaced the feature extraction step with a single convolutional neural network, and now we train everything completely end to end. This turns out to work quite nicely. I'll go into the details of how this works in a bit. Also, in terms of code complexity, we kind of went from a setup that looks... whoops, I'm way ahead. Okay.
We went from a setup that looks something like that in papers to something where, instead of extracting all those things, we just say: apply 20 layers of 3x3 convolutions, or something like that, and things work quite well. This is of course an exaggeration, but I think the correct first-order statement is that we've reduced code complexity quite a lot, because these architectures are so homogeneous compared to what we had before.

So it's remarkable: we had this reduction in complexity, and we had this amazing performance on ImageNet. One other thing that was quite amazing about the 2012 results, and a separate thing that did not have to be the case, is that the features you learn by training on ImageNet turn out to be quite generic, and you can apply them in different settings; in other words, transfer learning works extremely well. I haven't gone into the details of convolutional networks yet, but we start with an image, we have a sequence of layers just like in a normal neural network, and at the end we have a classifier. When you pre-train this network on ImageNet, the features you learn in the middle turn out to be transferable: you can use them on different datasets, and this works extremely well. That didn't have to be the case. You might imagine a convolutional network that works extremely well on ImageNet but that, when you run it on something else, like a birds dataset, just doesn't work well; but that is not what happens, and that's a very interesting finding in my opinion. People noticed this in roughly 2013, after the first convolutional networks. It used to be that you would compete on many computer vision datasets separately, maybe designing features for each of them separately; but you can shortcut all of those steps. You can take the pre-trained features you get from ImageNet and just train a linear classifier on top of them for every single dataset, and you obtain many state-of-the-art results across many different datasets. That was quite a remarkable finding back then, I believe.

So things worked very well on ImageNet, things transferred very well, and the code complexity got much more manageable. All this power is now available to you with very few lines of code. If you want to use a convolutional network on images, it turns out to be only a few lines of code, if you use for example Keras, one of the deep learning libraries that I'll mention again later in the talk. You basically just load a state-of-the-art convolutional neural network, you take an image, you load it, and you compute your predictions, and it tells you that there is an African elephant in that image. This takes a couple hundred milliseconds, or a couple of tens of milliseconds if you have a GPU. So everything got much faster and much simpler, works really well, and transfers really well; this was really a huge advance in computer vision.
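To make "a few lines of code" concrete, here is a minimal sketch of what such a script looks like in Keras. The talk doesn't show the exact code, so treat the specific calls as illustrative of current TensorFlow-backed Keras, and elephant.jpg as a placeholder file name:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet")   # load a pretrained state-of-the-art ConvNet

img = image.load_img("elephant.jpg", target_size=(224, 224))  # placeholder image file
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)            # 1,000 ImageNet class probabilities
print(decode_predictions(preds, top=3)[0])  # e.g. [('...', 'African_elephant', 0.9), ...]
```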
As a result of all these nice properties, ConvNets today are everywhere. Here's a collection of some of the things I tried to find across different applications. For example, you can search Google Photos for different categories, like, in this case, Rubik's cubes. You can find house numbers very efficiently. This is of course very relevant in self-driving cars, where we're doing perception in the car; convolutional networks are very relevant there. Medical image diagnosis, recognizing Chinese characters, all kinds of medical segmentation tasks, quite random tasks like whale recognition, and more generally many Kaggle challenges: satellite image analysis, recognizing different types of galaxies. You may have seen WaveNet recently from DeepMind, a very interesting paper where they generate music and speech; that's a generative model, and a ConvNet is doing most of the heavy lifting there too, a convolutional network on top of sound. And there are other tasks like image captioning. In the context of reinforcement learning and agent-environment interaction, we've also seen a lot of advances that use ConvNets as the core computational building block: when you want to play Atari games, or play AlphaGo, or Doom, or StarCraft, or if you want robots to perform interesting manipulation tasks, all of this uses ConvNets as a core computational block to do very impressive things.

Not only are we using ConvNets for a lot of different applications, we're also finding uses for them in art. Here are some examples from DeepDream: you can basically simulate what it looks like, what it feels like maybe, to be on some drugs; you take images and just hallucinate features using ConvNets. Or you might be familiar with neural style, which allows you to take arbitrary images and transfer arbitrary styles from different paintings, like a Van Gogh, onto them. This is all done with convolutional networks.

The last thing I'd like to note, which I also find interesting, is that in the process of trying to develop better computer vision architectures and optimize performance on the ImageNet challenge, we've actually ended up converging on something that potentially might function a bit like your visual cortex in some ways. These are experiments I find interesting, in which researchers studied macaque monkeys and recorded from a subpopulation of the IT cortex, the part that does a lot of object recognition. Basically, they take a monkey and a ConvNet, show both of them images, and look at how those images are represented at the end of the network, whether inside the monkey's brain or on top of your convolutional network. Looking at the representations of different images, it turns out there's a mapping between those two spaces that seems to indicate, to some extent, that some of what we're doing has somehow converged on something the brain could be doing as well in the visual cortex.

So that's just the intro. I'm now going to dive into convolutional networks and try to explain briefly how they work. Of course, there's an entire class on this that I taught, a convolutional networks class, and I'm going to distill some of those 13 lectures into one lecture, so we'll see how that goes; I won't cover everything, of course. Okay. A convolutional neural network is really just a single function, a function from the raw pixels of some kind of image; so we take a 224x224x3 image.
The 3 here is for the RGB color channels. You take the raw pixels, you put them through this function, and you get 1,000 numbers at the end, in the case of image classification, where you're trying to categorize images into 1,000 different classes. Functionally, all that's happening in a convolutional network is dot products and max operations; that's everything. But they're wired up together in interesting ways, so that you're basically doing visual recognition. In particular, this function f has a lot of knobs in it: these Ws, which participate in the dot products, in the convolutions and fully connected layers and so on, are all parameters of the network. Normally you might have on the order of 10 million parameters, and those are basically knobs that change the function. We'd like to set those knobs so that, when you put images through the function, you get probabilities that are consistent with your training data. That gives us a lot to tune, and it turns out we can do that tuning automatically with backpropagation.

More concretely, a convolutional neural network is made up of a sequence of layers, just as with normal neural networks, but we have different types of layers to play with: convolutional layers; the rectified linear unit, ReLU for short, as a nonlinearity, which I'm making explicit as its own layer; pooling layers; and fully connected layers. The core computational building block of a convolutional network, though, is the convolutional layer, with nonlinearities interspersed. We're probably getting rid of things like pooling layers, so you might see them slowly going away over time, and fully connected layers are basically equivalent to convolutional layers as well. So really, in the simplest case, it's just a sequence of convolutional layers.

Let me explain the convolutional layer, since that's the core computational building block that does all the heavy lifting. The entire ConvNet is a collection of layers, and these layers don't operate on vectors, as in a normal neural network; they operate on volumes. A layer takes a three-dimensional volume of numbers, an array. In this case, for example, we have a 32x32x3 image: the three dimensions are the width and height, and I'll refer to the third dimension as the depth. We have three channels. That's not to be confused with the depth of a network, which is the number of layers in that network; this is just the depth of a volume. The convolutional layer accepts a three-dimensional volume and produces a three-dimensional volume using some weights. The way it produces the output volume is as follows. We have a set of filters in a convolutional layer. The filters are always small spatially, say for example 5x5, but their depth always extends through the full depth of the input volume. So since the input volume has three channels, a depth of three, our filters will always match that: we have a depth of three in our filters as well. We then take those filters and convolve them with the input volume; what that amounts to is as follows (note, again, that the channel counts must match).
We take the filter and slide it through all the spatial positions of the input volume, and along the way, as we slide it, we compute dot products: w^T x + b, where w is the filter, x is a small piece of the input volume, and b is the bias. That is the convolution operation: you take the filter, slide it over all spatial positions, and compute dot products. When you do this, you end up with an activation map. In this case we get a 28x28 activation map: 28 comes from the fact that there are 28 unique positions along each axis to place a 5x5 filter in a 32x32 input, so there are 28 by 28 unique positions you can place that filter in, and in every one of them you get a single number expressing how well that filter likes that part of the input. That carves out a single activation map.

Now, in a convolutional layer we don't just have a single filter; we have an entire set of filters. Here's another filter, a green one. We slide it through the input volume, and it has its own parameters: a filter here is made up of 75 numbers (5 x 5 x 3), and this one has its own, different 75 numbers. We convolve it through, get a new activation map, and we keep doing this for all the filters in the convolutional layer. For example, if we had six filters in this convolutional layer, we'd end up with six 28x28 activation maps, and we stack them along the depth dimension to arrive at an output volume of 28x28x6. So really, what we've done is re-represent the original image, which was 32x32x3, as a kind of new image that is 28x28x6, where the six channels tell you how well every filter matches, or likes, every part of the input image.

Let's compare this operation to using a fully connected layer, as in a normal neural network. We just processed a 32x32x3 volume into a 28x28x6 volume. One question you might ask is: how many parameters would this require if we wanted a fully connected layer with the same number of output neurons, i.e. 28 * 28 * 6 of them? It turns out that would be quite a few parameters, because every single neuron in the output volume would be fully connected to all of the 32x32x3 input numbers. So every one of those 28x28x6 neurons connects to 32x32x3 inputs, which works out to about 15 million parameters, and also on that order of multiplies. You'd be doing a lot of compute and introducing a huge number of parameters into your network.

Now think about the number of parameters we've introduced with this example convolutional layer instead. We had six filters, and every one of them is a 5x5x3 filter, so multiplying that out, we have just 450 parameters (I'm not counting the biases, just the raw weights). Compared to 15 million, we've introduced very few parameters. Also, how many multiplies have we done? Computationally, how many FLOPs are we doing? We have 28 x 28 x 6 outputs to produce, and every one of those numbers is a function of a 5x5x3 region in the original image, so each is computed with 5 * 5 * 3 multiplies. You end up with only on the order of 350,000 multiplies, down from roughly 15 million.
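Here is a minimal NumPy sketch of that sliding-filter computation, under the dimensions just described (32x32x3 input, six 5x5x3 filters, stride 1, no padding). It's written for clarity, not speed:

```python
import numpy as np

# Dimensions from the example: 32x32x3 input, six 5x5x3 filters, stride 1, no padding.
x = np.random.randn(32, 32, 3)        # input volume (height, width, channels)
w = np.random.randn(6, 5, 5, 3)       # six filters; 6 * 5 * 5 * 3 = 450 weights total
b = np.random.randn(6)                # one bias per filter

out = np.zeros((28, 28, 6))           # (32 - 5) + 1 = 28 unique positions per axis
for k in range(6):                    # for every filter...
    for i in range(28):               # ...slide over all spatial positions
        for j in range(28):
            patch = x[i:i+5, j:j+5, :]                  # 5x5x3 piece of the input
            out[i, j, k] = np.sum(patch * w[k]) + b[k]  # the dot product w^T x + b

# Cost: 28 * 28 * 6 outputs, each needing 5 * 5 * 3 multiplies = 352,800 multiplies,
# versus roughly (28*28*6) * (32*32*3) ~ 14.5 million weights (and multiplies)
# for a fully connected layer with the same number of outputs.
```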
So we're doing fewer FLOPs and using fewer parameters. Really, what we've done here is make assumptions. A fully connected layer could compute the exact same thing: some specific setting of those 15 million parameters would produce the exact output of this convolutional layer. But we've done it much more efficiently, by building in these biases. In particular, since we have fixed filters that we slide across space, we've assumed that if there's some interesting feature you'd like to detect in one part of the image, say the top left, then that feature will also be useful somewhere else, like the bottom right, because we apply the same filters at all spatial positions equally. You might notice that this is not always something you want. For example, if your inputs are centered face images and you're doing some kind of face recognition, you might actually want different filters at different spatial positions: for the eye regions you might want eye-like filters, and for the mouth region mouth-specific features, and so on. In that case you might not want to use a convolutional layer, because its features have to be shared across all spatial positions. The second assumption we've made is that the filters are small and local, so we don't have global connectivity, only local connectivity. But that's okay, because we stack these convolutional layers in sequence, and the neurons grow their receptive fields as you stack convolutional layers on top of each other (I'll show a quick numerical sketch of this in a moment). So at the end of the ConvNet, those neurons end up being a function of the entire image eventually.

To give you an idea of what these activation maps look like concretely, here's an example image in the top left; this is part of a car, I believe. We have 32 different small filters here, and if we convolve them with this image, we end up with these activation maps: this filter, if you convolve it, gives this activation map, and so on. This one, for example, has some orange stuff in it, so when we convolve it with the image, the white here denotes that the filter matches that part of the image quite well. So we get these activation maps, we stack them up, and they go into the next convolutional layer.

The way this looks, then, is that we process the input with a convolutional layer, we get some output, we apply a rectified linear unit or some other nonlinearity as usual, and then we just repeat that operation: we keep plugging these volumes into the next convolutional layer, and they plug into each other in sequence, so we end up processing the image over time. That's the convolutional layer. Now, you'll notice there are a few more layer types; in particular, the pooling layer, which I'll explain very briefly.
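First, here's that quick receptive-field sketch: assuming stride-1 convolutions and no pooling, each additional k x k conv layer adds k - 1 pixels to the receptive field:

```python
# Receptive field of stacked stride-1 convolutions (no pooling):
# each additional k x k conv layer adds (k - 1) pixels.
def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([5]))        # 5: a single 5x5 conv sees a 5x5 patch
print(receptive_field([3, 3, 3]))  # 7: three stacked 3x3 convs see a 7x7 patch
```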
The pooling layer is quite simple. If you've used Photoshop or something like it, you've taken a large image and resized it, downsampling it. Pooling layers do basically exactly that, but on every single channel independently: for each channel of the input volume, we pluck out that activation map, downsample it, and that becomes the corresponding channel of the output volume. So it's really just a downsampling operation on these volumes. One of the most common ways of doing this, especially in the context of neural networks, is the max pooling operation. Here it would be common, for example, to use 2x2 filters at stride 2 with a max operation. If this is an input channel in a volume, what that amounts to is that we chop it into 2x2 regions and take a max over each group of four numbers to produce one piece of the output. So this is a very cheap operation that downsamples your volumes, and it's really a way of controlling the capacity of the network: you don't want too many numbers, you don't want things to be too computationally expensive, and it turns out pooling lets you downsample your volumes and do less computation without hurting performance too much. So we use pooling basically as a way of controlling the capacity of these networks.

The last layer I want to briefly mention is, of course, the fully connected layer, which is exactly what you're already familiar with. We have these volumes throughout as we process the image; at the end you're left with a volume, and now you'd like to predict some classes. So we take that volume, stretch it out into a single column, and apply a fully connected layer, which really amounts to a matrix multiplication; that gives us probabilities after applying a softmax or something like that.

Let me now briefly show you a demo of what a convolutional network looks like. This is ConvNetJS, a deep learning library for training convolutional neural networks that is implemented in JavaScript; I wrote it maybe two years ago at this point. Here we're training a convolutional network on the CIFAR-10 dataset. CIFAR-10 is a dataset of 50,000 images; each image is 32x32x3, and there are 10 different classes. We're training this network in the browser, and you can see that the loss is decreasing, which means we're classifying these inputs better and better. Here's the network specification, which you can play with, since this is all running in the browser: you can just change it and experiment. This is an input image, and for this convolutional network I'm showing all the intermediate activations, all the intermediate activation maps we're producing. Here we have a set of filters; we convolve them with the image and get all these activation maps. I'm also showing the gradients, but I don't want to dwell on that too much. Then you threshold: the ReLU clamps anything below zero to zero. And then you pool.
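As a quick aside, here's what that max pooling step does numerically: a minimal NumPy sketch of 2x2, stride-2 max pooling on a single channel (height and width assumed even):

```python
import numpy as np

# 2x2 max pooling with stride 2 on one channel.
def max_pool_2x2(a):
    h, w = a.shape
    # carve the map into 2x2 blocks and take the max of each block
    blocks = a.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

a = np.arange(16).reshape(4, 4)
print(max_pool_2x2(a))   # 4x4 map -> 2x2 map; each output is a max over 4 numbers
```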
So this is just a downsampling operation, and then another convolution, ReLU, pool, conv, pool, and so on, until at the end we have a fully connected layer and then our softmax, so that we get probabilities out, and then we apply a loss to those probabilities and backpropagate. You can see that I've been training in this tab for the last maybe 30 seconds or a minute, and we're already getting about 30% accuracy on CIFAR-10. These are test images from CIFAR-10, and these are the outputs of the convolutional network; you can see it has already learned that this is a car, or something like that. So this trains pretty quickly in JavaScript, and you can play with it, change the architecture, and so on.

Another thing I'd like to show you is this video, because it gives you a very intuitive, visceral feeling for exactly what these networks compute. It's a very good video by Jason Yosinski, from the Deep Visualization Toolbox; you can download the code and play with this interactive convolutional network demo yourself. [The video plays:] "Neural networks have enabled computers to better see and understand the world. They can recognize school buses and..." What we're seeing here are activation maps shown in real time as the demo runs; these are for the conv1 layer of an AlexNet, which we'll go into in much more detail, and these are the different activation maps being produced at this point. "...a neural network called AlexNet running in Caffe. By interacting with the network, we can see what some of the neurons are doing. For example, on this first layer, a unit in the center responds strongly to light-to-dark edges. Its neighbor, one neuron over, responds to edges in the opposite direction, dark to light. Using optimization, we can synthetically produce images that light up each neuron on this layer to see what each neuron is looking for. We can scroll through every layer in the network to see what it does, including convolution, pooling, and normalization layers. We can switch back and forth between showing the actual activations and showing images synthesized to produce high activation. By the time we get to the fifth convolutional layer, the features being computed represent abstract concepts. For example, this neuron seems to respond to faces. We can further investigate this neuron by showing a few different types of information. First, we can artificially create optimized images using new regularization techniques that are described in our paper. These synthetic images show that this neuron fires in response to a face and shoulders. We can also plot the images from the training set that activate this neuron the most, as well as the pixels from those images most responsible for the high activations, computed via the deconvolution technique. This feature responds to multiple faces in different locations, and by looking at the deconv, we can see that it would respond more strongly if we had even darker eyes and rosier lips. We can also confirm that it cares about the head and shoulders but ignores the arms and torso. We can even see that it fires to some extent for cat faces. Using backprop or deconv, we can see that this unit depends most strongly on a couple of units in the previous layer, conv4, and on about a dozen or so in conv3. Now let's look at another neuron on this layer. So what's this unit doing?"
"From the top nine images, we might conclude that it fires for different types of clothing. But examining the synthetic images shows that it may be detecting not clothing per se, but wrinkles. In the live plot, we can see that it's activated by my shirt, and smoothing out half of my shirt causes that half of the activations to decrease. Finally, here's another interesting neuron. This one has learned to look for printed text in a variety of sizes, colors, and fonts. This is pretty cool, because we never asked the network to look for wrinkles or text or faces. The only labels we provided were at the very last layer, so the only reason the network learned features like text and faces in the middle was to support final decisions at that last layer. For example, the text detector may provide good evidence that a rectangle is in fact a book seen on edge, and detecting many books next to each other might be a good way of detecting a bookcase, which was one of the categories we trained the net to recognize. In this video, we've shown some of the features of the Deep Viz Toolbox." Okay, so I encourage you to play with that; it's really fun. I hope that gives you an idea of exactly what's going on: there are these convolutional layers, we downsample from time to time, there are usually some fully connected layers at the end, but mostly it's just convolutional operations stacked on top of each other.

What I'd like to do now is dive into some details of how these architectures are actually put together. I'll do this by going over the winners of the ImageNet challenges: I'll tell you about the architectures, how they came about, and how they differ, so you'll get a concrete idea of what these architectures look like in practice.

We'll start off with the AlexNet in 2012. Just to give you an idea of the sizes of these networks and the images they process, the AlexNet took 227x227x3 images. The first layer of the AlexNet, for example, was a convolutional layer with 11x11 filters applied at a stride of 4, and there were 96 of them. I didn't fully explain stride because I wanted to save some time, but intuitively it just means that as you slide the filter across the input, you don't have to move it one pixel at a time; you can jump a few pixels at a time. So we have 11x11 filters with a stride, a skip, of 4, and we have 96 of them. You can try to compute what the output volume is if you apply this convolutional layer to that input volume; I didn't go into the details of how you compute that, but there are formulas for it, and you can look into the details in the class. You arrive at a 55x55x96 output volume. As for the total number of parameters in this layer: we have 96 filters, and every one of them is 11x11x3, because 3 is the input depth of these images. So it amounts to 11 * 11 * 3, times 96 filters: about 35,000 parameters in this very first layer. The second layer of the AlexNet is a pooling layer: we apply 3x3 filters at a stride of 2 and do max pooling. You can again compute the output volume size after applying this to that volume, and with some very simple arithmetic you arrive at 27x27x96. So this is the downsampling operation. You can also think about the number of parameters in this pooling layer, and of course it's zero.
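Those formulas are simple enough to sketch here. The standard output-size rule is out = (W - F + 2P) / S + 1 for input width W, filter size F, stride S, and padding P; applied to these two AlexNet layers (with P = 0):

```python
# Output size of a conv or pool layer: (W - F + 2P) / S + 1
def out_size(W, F, S, P=0):
    return (W - F + 2 * P) // S + 1

# AlexNet conv1: 227x227x3 input, 96 filters of 11x11x3, stride 4
print(out_size(227, 11, 4))   # 55 -> output volume 55x55x96
print(11 * 11 * 3 * 96)       # 34,848 weights (~35K), plus 96 biases

# AlexNet pool1: 3x3 max pooling at stride 2 -> no parameters at all
print(out_size(55, 3, 2))     # 27 -> output volume 27x27x96
```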
So pooling layers compute a fixed downsampling operation; there are no parameters involved in a pooling layer. All the parameters are in the convolutional layers and the fully connected layers, which are to some extent equivalent to convolutional layers. You can go ahead and, based on the description in the paper (although I think it's non-trivial from the description in this particular paper), decipher what the volumes are throughout, and you can look at the patterns that emerge in how the number of filters increases in the higher convolutional layers: we start off with 96, then go to 256 filters, then to 384, and eventually to 4,096-unit fully connected layers. You'll also see normalization layers here, which have since become somewhat deprecated; it's no longer common to use the normalization layers that were used at the time in the AlexNet architecture.

What's interesting to note is how this differs from the 1998 Yann LeCun network. I usually like to think about four things that hold back progress, at least in deep learning: data, compute, and then I like to differentiate between algorithms and infrastructure, algorithms being something that feels like research and infrastructure being something that feels like a lot of engineering. We've had progress on all four of those fronts. We see that in 1998, the data you could get hold of was maybe on the order of a few thousand examples, whereas now we have a few million: three orders of magnitude more data. For compute, GPUs became available and we use them to train these networks; they're roughly 20 times faster than CPUs, and of course the CPUs we have today are much, much faster than the CPUs of 1998. I don't know exactly what that works out to, but I wouldn't be surprised if it's again on the order of three orders of magnitude of improvement. Let me skip over algorithms for a moment and talk about infrastructure: here we're talking about NVIDIA releasing the CUDA library, which lets you efficiently run all these matrix-vector operations on arrays of numbers. That's a piece of software we rely on and take advantage of that wasn't available before. And finally, algorithms is an interesting one, because in those 20 years there's been much less improvement in algorithms than in the other three pieces. What we've done with the 1998 network is mostly make it bigger: more channels, and somewhat more layers. The two really new things algorithmically are dropout and rectified linear units. Dropout is a regularization technique developed by Geoff Hinton and colleagues, and rectified linear units are nonlinearities that train much faster than sigmoids and tanhs. This paper actually had a plot showing that rectified linear units trained a good bit faster than sigmoids, and intuitively that's because of the vanishing gradient problem: in very deep networks with sigmoids, the gradients vanish, as Hugo was discussing in the last lecture. What's also interesting to note is that both dropout and ReLU are basically one- or two-line code changes, so it's about a two-line diff in total over those 20 years, and both of those diffs consist of setting things to zero: with ReLU, you set things to zero when they're below zero, and with dropout, you set things to zero at random.
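Concretely, here is roughly what those two one-line diffs look like in NumPy. This is a sketch, not the original papers' code; the inverted-dropout rescaling shown here is the common modern convention (the original formulation instead rescaled at test time):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)                          # set negative values to zero

def dropout(x, p=0.5, train=True):
    if not train:
        return x                                     # no-op at test time
    mask = (np.random.rand(*x.shape) > p) / (1 - p)  # zero out a random subset, rescale
    return x * mask
```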
So apparently it's a good idea to set things to zero; that's what we've learned. If you're trying to find a cool new algorithm, look for a one-line diff that sets something to zero. It will probably work better, and we could add you to this list.

Now, to give you an idea of the hyperparameters in this architecture: it was the first major use of rectified linear units; it used normalization layers, which are not used anymore, at least not in the specific form from this paper; and it used heavy data augmentation, meaning you don't pipe the images into the network exactly as they come from the dataset, but instead jitter them spatially, warp them, and change the colors a bit, all at random, because you're trying to build in some invariance to these small perturbations, and you're basically hallucinating additional data. It was also one of the first real uses of dropout. And you see roughly standard hyperparameters: batch sizes of roughly 128; stochastic gradient descent with momentum, usually 0.9; learning rates of 1e-2, reduced in the normal way, roughly by a factor of 10 whenever the validation error stops improving; and just a bit of weight decay, 5e-4. And ensembling always helps: you train seven independent convolutional networks separately and just average their predictions, which always gives you an additional 2% improvement. So this is AlexNet, the winner of 2012.

In 2013, the winner was the ZFNet, developed by Matthew Zeiler and Rob Fergus, and it was an improvement on top of the AlexNet architecture. In particular, one of the bigger differences was in the first convolutional layer, where they went from 11x11 stride 4 to 7x7 stride 2: slightly smaller filters, applied more densely. They also noticed that if you make the convolutional layers in the middle larger, you actually gain performance. So they managed to improve a bit. Matthew Zeiler then became the founder of Clarifai, worked on this a bit more there, and managed to push the performance to 11%, which was the winning entry at the time; but we don't actually know what gets you from 14% to 11%, because Matthew never disclosed the full details of what happened there. He did say that it was more tweaking and optimizing of these hyperparameters. So that was the 2013 winner.

In 2014 we saw a slightly bigger diff. One of the networks introduced then was the VGGNet, from Karen Simonyan and Andrew Zisserman. They explored a few architectures, and the one that ended up working best was this "D" column, which is why I'm highlighting it. What's beautiful about the VGGNet is that it's so simple. You might have noticed that in the previous networks you have all these different filter sizes, different layers, different amounts of stride; everything looks a bit hairy, and you're not sure where the hyperparameters are coming from. The VGGNet is extremely uniform.
All you do is 3x3 convolutions with stride 1 and pad 1, and 2x2 max pooling with stride 2, and you do this throughout, a completely homogeneous architecture: you just alternate a few conv layers and a pool layer, and you get top performance. They managed to reduce the error down to 7.3% with the VGGNet, just with this very simple and homogeneous architecture. I've also written out this "D" architecture here so you can see it; I'm not sure how instructive it is because it's kind of dense, and perhaps you can look at it offline, but you can see how the volumes develop and the sizes of these filters. They're always 3x3, but the number of filters again grows: we start off with 64, then go to 128, 256, 512, so we're just doubling it over time.

I also have a few numbers here to give you an idea of the scale at which these networks normally operate. We have on the order of 140 million parameters, which is actually quite a lot; I'll show you in a bit that this can be about five or ten million parameters and work just as well. It's about 100 megabytes of memory per image in the forward pass, and the backward pass also needs roughly that much. So those are roughly the numbers we're working with here. You can also note, and this is true of convolutional networks in general, that most of the memory is in the early convolutional layers, while most of the parameters, at least when you use these giant fully connected layers at the top, are at the end.

The winner in 2014 was actually not the VGGNet; I presented it only because it's such a simple architecture. The winner was GoogLeNet, with a slightly hairier architecture, we should say. It's still a sequence of things, but in this case they put inception modules in sequence, and this is an example inception module. I don't have too much time to go into the details, but you can see it consists basically of convolutions of different sizes and strides and so on. GoogLeNet looks slightly hairier, but it turns out to be more efficient in several respects. For example, it worked a bit better than the VGGNet, at least at the time, and it has only five million parameters, compared to the VGGNet's 140 million: a huge reduction. You get that, by the way, by just throwing away the fully connected layers. You'll notice in this breakdown that the fully connected layers here have 100 million and 16 million parameters; it turns out you don't actually need that, and taking them away doesn't hurt performance too much, so you can get a huge reduction in parameters. We can also compare to the original AlexNet: fewer parameters, a bit more compute, and much better performance. So GoogLeNet was really optimized to have a low footprint, memory-wise, computation-wise, and parameter-wise, whereas the VGGNet is a very beautiful, homogeneous architecture with some inefficiencies in it.
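To make the VGG recipe concrete, here's a Keras-style sketch of that repeating pattern: 3x3 convolutions with stride 1 and "same" padding (the pad-1 equivalent), doubling the filter count, with 2x2/stride-2 max pooling between stages. The stage layout below follows the "D" configuration's convolutional part, but treat this as an illustration rather than a faithful reimplementation:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(keras.Input(shape=(224, 224, 3)))

# The whole network is just this pattern repeated, doubling the filters each stage:
for filters, reps in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
    for _ in range(reps):
        # 3x3 conv, stride 1, "same" padding keeps the spatial size fixed
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
    model.add(layers.MaxPooling2D(pool_size=2, strides=2))  # 2x2 max pool, stride 2

model.summary()  # ends at a 7x7x512 volume; VGG then adds fully connected layers
```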
Okay, so that's 2014. In 2015 we had a slightly bigger delta on top of these architectures. Up to this point, if Yann LeCun had looked at these architectures with his 1998 eyes, he would still have recognized everything: everything looks very simple, and we had mostly just played with the hyperparameters.

One of the first really bigger departures, I would argue, came in 2015 with the introduction of residual networks. This is work from Kaiming He and colleagues at Microsoft Research Asia. They not only won the ImageNet challenge in 2015, they won a whole bunch of challenges, all just by applying these residual networks trained on ImageNet and fine-tuned on the different tasks; you can basically crush lots of different tasks whenever you get a new, awesome ConvNet. At this point the performance was down to 3.57%, from these residual networks; this is 2015. The paper also made the point that the number of layers keeps going up, and that with residual networks, as we'll see in a bit, you can introduce many more layers, which correlates strongly with performance. We've since found that you can in fact make these residual networks quite a lot shallower...
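The core idea behind those residual networks: each block computes a small correction F(x) and adds it back to its input, y = F(x) + x, so gradients can flow straight through the skip connections even in very deep stacks. Here's a minimal Keras-style sketch of one such block; it's an illustration of the idea, not the paper's exact block (which also uses batch normalization, and projections when shapes change):

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    # F(x): two 3x3 convolutions (the real blocks also include batch normalization)
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same")(f)
    # The skip connection: the block's output is F(x) + x.
    # (Assumes x already has `filters` channels so the shapes match.)
    out = layers.Add()([f, x])
    return layers.Activation("relu")(out)
```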