Foundations of Deep Learning (Hugo Larochelle, Twitter)
zij_FTbJHsk • 2016-09-27
All right. Cool. So I was asked to give this presentation on the foundations of deep learning, which is mostly going over basic feed-forward neural networks, motivating deep learning a little bit, and covering some of the more recent developments and some of the topics that you'll see across the next two days. As Andrew mentioned, I have just an hour, so I'm going to go fairly quickly on a lot of these things, which I think will mostly be fine if you're familiar enough with some machine learning and a little bit about neural nets. But if you'd like to go into some of the more specific details, you can go check out my online lectures on YouTube. They're taught by a much younger version of myself; just search for "Hugo Larochelle" — I am not the guy doing a bunch of skateboarding, I'm the geek teaching about neural nets. So go check those out if you want more details. What I'll cover today: I'll start by describing and laying out the notation for feed-forward neural networks, that is, models that take an input vector x, which might be an image or some text, and produce an output f(x). I'll describe forward propagation, the different types of units, and the types of functions we can represent with them. Then I'll talk about how we actually train neural nets, describing things like loss functions and backpropagation, which allows us to get a gradient for training with stochastic gradient descent, and I'll mention a few tricks of the trade.
Some of the things we do in practice to successfully train neural nets. And then I'll end by talking about some developments that are specifically useful in the context of deep learning, that is, neural networks with several hidden layers — things that came out after the beginning of deep learning, say in 2006: dropout, batch normalization, and, if I have some time, unsupervised pre-training. So let's get started and talk about, assuming we have some neural network, how do they actually function, how do they make predictions. Let me lay down the notation. A multi-layer feed-forward neural network is a model that takes as input some vector x, which I'm representing here with a different node for each of the dimensions in my input vector — so each dimension is essentially a unit in that neural network — and it eventually produces an output at its output layer. We'll focus mostly on classification, so you'll have multiple units here, and each unit corresponds to one of the potential classes into which we would want to classify our input. If we're identifying digits in handwritten character images, you'd have the digits zero through nine, so you'd have 10 output units. To produce an output, the neural net goes through a series of hidden layers. Those are essentially the components that introduce nonlinearity and allow us to capture and perform very sophisticated types of classification functions. If we have L hidden layers, the way we compute all the layers in our neural net is as follows. We first compute what I'm going to call a pre-activation, which I'll note a, and I'll index the layers by k. So a^k is the pre-activation at layer k, and it is simply a linear transformation of the previous layer.
I'll note h^k as the activation at layer k, and by default I'll assume layer zero is the input. Using that notation, the pre-activation at layer k corresponds to taking the activation at the previous layer, k−1, and multiplying it by a matrix W^k. Those are the parameters of the layer; they essentially correspond to the connections between units in adjacent layers. And I add a bias vector — that's another parameter of my layer. That gives me the pre-activation. Next, I get the hidden layer activation by applying an activation function, which introduces the nonlinearity in the model. I'll call that function g, and we'll go over a few common choices for it. I do this from layer 1 to layer L, and when it comes to the output layer, I also compute a pre-activation by performing a linear transformation, but then I usually apply a different activation function depending on the problem I'm trying to solve. Having said that, let's go through some of the choices for the activation function. One common one is the sigmoid activation function: it's just one divided by one plus the exponential of minus the pre-activation. It takes the pre-activation, which can vary from minus infinity to plus infinity, and squashes it between zero and one. So it's bounded below by zero and above by one — it's a function that saturates if you have very large magnitude positive or negative pre-activations. Another common choice is the hyperbolic tangent, or tanh, activation function. It also squashes everything, but instead of being between zero and one, it's between minus one and one.
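As a minimal sketch of this forward pass, here is the recursion h^0 = x, a^k = W^k h^(k−1) + b^k, h^k = g(a^k) in NumPy. The layer sizes and random weights are made up purely for illustration, and sigmoid is used at every layer, including the output, just to keep the sketch short (as noted in the talk, a real output layer usually gets its own activation function).

```python
import numpy as np

def sigmoid(a):
    # squashes pre-activations into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases, g=sigmoid):
    """h^0 = x; then a^k = W^k h^(k-1) + b^k and h^k = g(a^k) for each layer."""
    h = x
    for W, b in zip(weights, biases):
        a = W @ h + b   # pre-activation: linear transformation plus bias
        h = g(a)        # activation: elementwise nonlinearity
    return h

# a tiny 2-3-2 network with random weights (sizes chosen only for illustration)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]
out = forward(np.array([1.0, -1.0]), weights, biases)
```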
And one that's become quite popular in neural nets is what's known as the rectified linear activation function — in papers you'll see the "ReLU" unit, which refers to the use of this activation function. This one is different from the others in that it's not bounded above, but it is bounded below, and it outputs exactly zero if the pre-activation is negative. So those are the choices of activation functions for the hidden layers. For the output layer, if we're performing classification, as I said, we'll have as many units as there are classes to which an input could belong. And what we often do is interpret each unit's activation as the probability, according to the neural network, that the input belongs to the corresponding class — that its label y is the corresponding class c, where c is the index of that unit in the output layer. So we need an activation function that produces probabilities — a multinomial distribution over all the different classes. The activation function we use for that is known as the softmax activation function. It's simply as follows: you take your pre-activations and exponentiate them, which gives us positive numbers, and then we divide each exponentiated pre-activation by the sum of all the exponentiated pre-activations. Because I'm normalizing this way, all the values in my output layer sum to one, and they're positive because I took the exponential. So I can interpret them as a multinomial distribution over the C different classes. Okay, that's what I'll use as the activation function at the output layer.
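The softmax just described can be sketched in a few lines. The max-subtraction step is a standard numerical trick (not mentioned in the talk): it doesn't change the result, since a common factor cancels in the normalization, but it avoids overflow for large pre-activations.

```python
import math

def softmax(a):
    """Exponentiate the pre-activations, then normalize so they sum to one."""
    m = max(a)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```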
Now, beyond the math — conceptually, and also in the way we're going to program neural networks — what we'll often do is implement all these different operations, the linear transformations and the different types of activation functions, as objects. These objects take arguments, and the arguments are essentially the other things being combined to produce the next value. For instance, we would have an object corresponding to the computation of the pre-activation, which takes as arguments the weight matrix and bias vector for that layer plus some layer to transform, and this object computes its value by applying the linear transformation. Then we might have objects corresponding to specific activation functions, like a sigmoid object, a tanh object, or a ReLU object. We combine these objects together, chaining them into what ends up being a graph — I'll refer to it as a flow graph — that represents the computation done in a forward pass through your neural network, up until you reach the output layer. I mention it now because the different libraries presented over the weekend essentially exploit this representation of the computation in neural nets, and it will also be handy for computing gradients, which I'll talk about in a few minutes. So that's how we perform predictions in neural networks: we take an input, and we eventually reach an output layer that gives us a distribution over classes, if we're performing classification. If I want to actually classify, I just assign the class corresponding to the unit with the highest activation — that corresponds to classifying into the class with the highest probability according to the neural net.
But then you might ask: what kind of problems can we solve with neural networks? Or, more technically, what kind of functions can we represent, mapping from some input x to some arbitrary output? If you go look at my videos, I try to give more intuition for the result here, but essentially, for a single hidden layer neural network, it's been shown that with a linear output we can approximate any continuous function arbitrarily well, as long as we have enough hidden units. That is, there are values for the biases and weights such that any continuous function can be represented as well as we want — we just need to add enough hidden units. This result applies if you use nonlinear activation functions like sigmoid and tanh. As I said, if you want a bit more intuition as to why that would be, you can go check out the videos. But it's a really nice result: it means that by focusing on this family of machine learning models, neural networks, we can potentially represent pretty much any kind of classification function. However, this result does not tell us how to actually find the weight and bias values that represent a given function — it doesn't tell us how to train a neural network. And that's what we'll discuss next. How do we actually take a data set and train a neural network to perform good classification for that problem? What we'll typically do is use a very generic framework in machine learning known as empirical risk minimization — or structural risk minimization, if you're using regularization. This framework essentially transforms the problem of learning into a problem of optimization. First we choose a loss function, which I'll note l. The loss function compares the output of my model —
— the output layer of my neural network — with the actual target. I'm indexing with a superscript t here, as the index over all the different examples in my training set. So my loss function tells me whether this output is good or bad, given that the label is actually y^t. I'll also define a regularizer over the parameters θ. You can think of θ as just the concatenation of all the biases and all the weights of my neural net — all the parameters of my neural network — and the regularizer penalizes certain values of these weights. As I'll discuss more specifically later, for instance, you might want your weights not to be too far from zero; that's a frequent intuition we implement with a regularizer. The optimization problem we try to solve when learning is to minimize the average loss of my neural network over my training examples — summing over all T training examples — plus a hyperparameter λ, related to what's known as weight decay, times my regularizer. In other words, I'm going to try to make the loss on my training set as small as possible over all the training examples, and also satisfy my regularizer as much as possible. So now we have this optimization problem, and learning corresponds to trying to solve it: finding this argmin over my weights and biases. To do this, I can invoke some optimization procedure from the optimization community, and the one algorithm you'll see constantly in deep learning is stochastic gradient descent. This is the optimization algorithm we'll most often use for training neural networks. So SGD, stochastic gradient descent, functions as follows.
You first initialize all your parameters — finding initial values for all the weight matrices and biases — and then iterate for a certain number of epochs, where an epoch is a full pass over all the training examples. So for a certain number of full iterations over my training set, I draw each training example — a pair of input x and target y — and compute the gradient of my loss with respect to all my parameters, all my weights and biases. That's this notation here: ∇ for the gradient of the loss function, indexed by the parameter with respect to which I want the gradient. So I compute the gradient of my loss function with respect to my parameters, plus λ times the gradient of my regularizer, and that gives me a direction in which to move my parameters. Since the gradient tells me how to increase the loss, I want to go in the opposite direction and decrease it — that's why there's a minus sign here. This δ is the direction in which I move my parameters, by taking a step: a step size α, often referred to as the learning rate, times my direction, which I add to the current values of my parameters, my biases and my weights. That gives me the new values of all my parameters. And I iterate like that: going over all pairs (x, y), computing my gradient, taking a step in the opposite direction, and doing that several times. Okay, so that's how stochastic gradient descent works, and that's essentially the learning procedure. Now, in this algorithm there are a few things we need to specify to be able to implement and execute it. We need a choice of loss function.
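The SGD loop just described can be sketched generically: one update per example, θ ← θ − α(∇l + λ∇Ω), repeated for a number of epochs. The fitting problem at the bottom (a one-parameter least-squares line) is an invented toy, not from the talk, just to show the loop converging.

```python
def sgd(params, grad_loss, data, alpha=0.1, lam=0.0, grad_reg=None, epochs=10):
    """Plain SGD: one parameter update per training example, per epoch."""
    for _ in range(epochs):              # an epoch = one full pass over the data
        for x, y in data:
            g = grad_loss(params, x, y)  # gradient of the loss on this example
            if grad_reg is not None:     # add lambda * gradient of the regularizer
                g = [gi + lam * ri for gi, ri in zip(g, grad_reg(params))]
            # step in the direction opposite to the gradient
            params = [p - alpha * gi for p, gi in zip(params, g)]
    return params

# toy problem (assumed for illustration): fit w in y = w * x by squared error
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
grad = lambda params, x, y: [2.0 * (params[0] * x - y) * x]
w = sgd([0.0], grad, data, alpha=0.05, epochs=50)
```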
We need an efficient procedure for computing the gradient of the loss with respect to the parameters. We need to choose a regularizer, if we want one. And we need a way of initializing the parameters. So next I'll go through each of these four things we need to choose before actually being able to execute stochastic gradient descent. First, the loss function. As I said, we interpret the output layer as assigning probabilities to each potential class in which we can classify our input x. In this case, something natural would be to maximize the probability of the correct class: for the actual class that my example x^t belongs to, I'd like to increase the probability computed by my neural network. Because we set up the problem as minimizing a loss, instead of maximizing the probability, what we'll actually do is minimize the negative log-probability — the negative log-likelihood of assigning x to the correct class y. So given my output layer and the true label y, my loss is minus the log of the probability of y according to my neural net: take the output layer and index the unit corresponding to the correct class — that's why I'm indexing by y here. We take the log because it turns out to be more stable numerically, and we get nicer-looking gradients. In certain libraries, instead of the negative log-likelihood or log-probability, you'll see this referred to as the cross-entropy. That's because you can think of it as performing a sum over all possible classes, where for each class you check whether this potential class is the target class — an indicator function that is one if y is equal to c.
So if my iterated class c is actually equal to the real class, I multiply that indicator by the log of the probability assigned to that class c. This expression is a cross-entropy between the empirical distribution — which assigns probability zero to all the other classes and probability one to the correct class — and the actual distribution over classes my neural net is computing, which is f(x). That's just a technical detail; I only mention it because in certain libraries it's actually called the cross-entropy loss. So that's it for the loss. Next, we need a procedure for computing the gradient of the loss with respect to all the parameters in the neural net — the biases and the weights. You can go look at my videos if you want the actual derivation of all these expressions; I don't have time for that here, and presumably a lot of you have actually seen these derivations — if you haven't, just go check out the videos. In any case, I'll go through what the algorithm is and highlight some key points that will come up later in understanding how backpropagation actually functions. The basic idea is that we compute gradients by exploiting the chain rule: we go from the top layer all the way to the bottom, computing gradients for layers closer and closer to the input as we go, and using the chain rule to reuse previous computations made at upper layers to compute the gradients at the layers below. We usually start by computing the gradient at the output layer — the gradient of the loss with respect to the output layer — and it's actually more convenient to compute it with respect to the pre-activation. It's a very simple expression.
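The equivalence just mentioned — negative log-likelihood versus the cross-entropy written as a sum over classes with an indicator — can be checked directly in a couple of lines:

```python
import math

def nll_loss(probs, y):
    """Negative log-likelihood: -log f(x)_y, indexing the correct class y."""
    return -math.log(probs[y])

def cross_entropy(probs, y):
    """Same loss written as -sum_c 1(y == c) * log f(x)_c."""
    return -sum((1.0 if c == y else 0.0) * math.log(p)
                for c, p in enumerate(probs))

probs = [0.1, 0.7, 0.2]   # example softmax output (made up for illustration)
```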
That's why I compute the gradient, with respect to the vector a^(L+1) — the pre-activation at the very last layer — of the loss function, which is −log f(x)_y. It turns out this gradient is super simple: it's −(e(y) − f(x)), where e(y) is the one-hot vector for class y. So e(y) is just a vector filled with a bunch of zeros and a one at the correct class; if y were the fourth class, it would be the vector with a one at the fourth dimension. We call it the one-hot vector: full of zeros, with a single one at the position corresponding to the correct class. What the e(y) part of the update is essentially saying is that I want to increase the pre-activation that will increase the probability of the correct class. And from it we subtract f(x) — the current probabilities assigned by my neural net to all the classes. That's my output layer, the current beliefs of the neural net about the probability of the input belonging to each class. So that part is trying to decrease the probability of everything, and specifically to decrease it by as much as the neural net currently believes the input belongs to that class. If you think about the subtraction of these two things: for the correct class, I get one minus some number between zero and one (because it's a probability), so that's positive — I'm going to increase the probability of the correct class. For everything else, it's zero minus a positive number, so it's negative — I'm actually going to decrease the probability of everything else. So intuitively it makes sense: this gradient has the right behavior. And I'm going to take that pre-activation gradient.
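A few lines make the sign pattern explicit: the gradient of −log f(x)_y with respect to the output pre-activation is f(x) − e(y), so stepping in the opposite direction raises the correct class and lowers the rest.

```python
def output_grad(fx, y):
    """Gradient of -log f(x)_y w.r.t. the output pre-activation: f(x) - e(y)."""
    e_y = [1.0 if c == y else 0.0 for c in range(len(fx))]  # one-hot vector
    return [p - t for p, t in zip(fx, e_y)]

# fx is an example softmax output (made up for illustration); class 1 is correct
g = output_grad([0.1, 0.7, 0.2], y=1)
```

Note the components sum to zero, since both f(x) and e(y) sum to one.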
I propagate it from the top to the bottom, iterating from the last layer — the output layer, L+1 — all the way down to the first layer. As I go down, I compute the gradients with respect to my parameters, then compute the gradient for the pre-activation at the layer below, and iterate like that. At each iteration of that loop, I take the current gradient of the loss function with respect to the pre-activation at the current layer, and from it I can compute the gradient of the loss with respect to my weight matrix. Not doing the derivation here: it's simply this pre-activation gradient vector — in my notation I assume all vectors are column vectors — multiplied by the transpose of the activations at the layer right below, layer k−1. Because I take the transpose, it's a multiplication like this: if you take the outer product of these two vectors, you get a matrix of the same size as the weight matrix. So it all checks out; that makes sense. It turns out the gradient of the loss with respect to the bias is exactly the gradient of the loss with respect to the pre-activation — very simple. So that gives me the gradients for my parameters, and now I need to compute the gradient of the pre-activations at the layer below. First, I get the gradient of the loss function with respect to the activation at the layer below: that's just taking my pre-activation gradient vector and multiplying it by the transpose of my weight matrix. A super simple operation: just a linear transformation of my gradients at layer k to get the gradients of the activation at layer k−1.
Then, to get the gradient of the pre-activation — before the activation function — I take this gradient of the activation at layer k−1 and apply the gradient corresponding to the partial derivative of my nonlinear activation function. This symbol refers to an elementwise product: I take these two vectors — the activation gradient, and the vector of partial derivatives of the activation function for each unit individually — and do an elementwise product between the two. Now, the key things to notice. First, this pass — computing all the gradients and doing all these iterations — is actually fairly cheap: its complexity is essentially the same as that of a forward pass. All I'm doing are linear transformations, multiplying by matrices (in this case the transpose of my weight matrix), plus this nonlinear operation where I multiply by the gradient of the activation function. The second thing to notice is this elementwise product: if any of these partial derivative terms for a unit is very close to zero, then the pre-activation gradient for the next layer will be close to zero as well. I highlight this point because it's something to think about a lot when you're training neural nets: whenever these partial derivatives come close to zero, the gradient will not propagate well to the next layer, which means you won't get a good gradient to update your parameters. Now, when does that happen — when will you see these terms being close to zero? That's when the partial derivatives of the nonlinear activation functions are close to zero, or exactly zero.
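One step of the backward recursion just described can be sketched as follows; the numbers at the bottom are made up, with the last unit given a zero derivative to show the elementwise product killing its gradient.

```python
import numpy as np

def backprop_layer(grad_a, W, h_prev, gprime_prev):
    """One backprop step, following the recursions in the talk:
       dL/dW^k     = grad_a h^(k-1)^T          (outer product)
       dL/db^k     = grad_a
       dL/dh^(k-1) = W^k^T grad_a
       dL/da^(k-1) = dL/dh^(k-1) * g'(a^(k-1)) (elementwise product)"""
    grad_W = np.outer(grad_a, h_prev)
    grad_b = grad_a.copy()
    grad_h_prev = W.T @ grad_a
    grad_a_prev = grad_h_prev * gprime_prev
    return grad_W, grad_b, grad_a_prev

grad_a = np.array([0.5, -0.5])                   # gradient at layer k
W = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]])
h_prev = np.array([0.2, 0.4, 0.6])               # activations at layer k-1
gprime = np.array([0.16, 0.24, 0.0])             # e.g. sigmoid' = h(1-h); last unit saturated
gW, gb, ga_prev = backprop_layer(grad_a, W, h_prev, gprime)
```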
So let's look at the partial derivative of, say, the sigmoid function. It turns out it's super easy to compute: it's just the sigmoid itself times one minus the sigmoid itself. That means that whenever the activation of a sigmoid unit is close to one or close to zero, I get a partial derivative that's close to zero. You can kind of see it here: the slope is essentially flat at both ends, and that slope is the value of the partial derivative. In other words, if my pre-activations are very negative or very positive — if my unit is very saturated — then gradients will have a hard time propagating to the next layer. Okay, that's the key insight here. Same thing for the tanh function: the partial derivative is also easy to compute — you take the tanh value, square it, and subtract it from one. And indeed, if the activation is close to minus one or close to one, you can see that the slope is flat. So again, if the unit is saturating, gradients will have a hard time propagating to the next layers. For the ReLU, the rectified linear activation function, the gradient is even simpler: you just check whether the pre-activation is greater than zero. If it is, the partial derivative is one; if not, it's zero. So you're either multiplying by one or by zero — you essentially get a binary mask when backpropagating through the ReLU. You can see it: the slope here is flat, and otherwise you have a linear function. So here the shrinking of the gradient towards zero is even harsher: you multiply by exactly zero if a unit is saturating below. Beyond the math, in terms of actually using this in practice: during the weekend you'll see three different libraries that essentially compute these gradients for you. You usually don't write down backprop yourself.
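The three derivative formulas just stated — sigmoid'(a) = σ(a)(1 − σ(a)), tanh'(a) = 1 − tanh²(a), and the ReLU's binary mask — in code:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)            # near zero when the unit saturates

def tanh_prime(a):
    return 1.0 - math.tanh(a) ** 2  # near zero when tanh is near +/- 1

def relu_prime(a):
    return 1.0 if a > 0 else 0.0    # a binary mask: exactly zero below zero
```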
You just use all these modules you've implemented, and it turns out there's a way of automatically differentiating your loss function and getting gradients with respect to your parameters for free — in terms of programming effort. Conceptually, the way you do this — and you'll see the three libraries doing it in slightly different ways — is that you augment your flow graph by adding, at the very end, the computation of your loss function. Then each of these boxes, which are conceptually objects taking arguments and computing a value, is augmented to also have a backprop, or bprop, method — you'll often see that name used. What this method does is take as input the gradient of the loss with respect to the object itself, and then propagate to its arguments — the things that are its parents in the flow graph, the things it uses to compute its own value — their gradients with respect to the loss, using the chain rule. So you start the process by initializing: the gradient of the loss with respect to itself is one. You pass one to the bprop method of the loss node, and it propagates to its argument, using the chain rule, the gradient of the loss with respect to f(x). Then you call bprop on that object, and it computes: I have the gradient of the loss with respect to myself, f(x); from this I compute the gradient of the loss with respect to my argument, which is the pre-activation at layer 2 in this example.
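A minimal version of such a flow-graph node might look like this; the class and method names are a toy sketch of the idea, not any particular library's API.

```python
import math

class Sigmoid:
    """A flow-graph node: fprop computes the node's value; bprop receives the
    gradient of the loss w.r.t. that value and passes it back to the node's
    argument via the chain rule."""
    def fprop(self, a):
        self.h = [1.0 / (1.0 + math.exp(-v)) for v in a]
        return self.h
    def bprop(self, grad_h):
        # chain rule: dL/da = dL/dh * sigmoid'(a), with sigmoid'(a) = h(1-h)
        return [g * h * (1.0 - h) for g, h in zip(grad_h, self.h)]

node = Sigmoid()
out = node.fprop([0.0, 2.0])
grad_in = node.bprop([1.0, 1.0])   # gradient arriving from the node above
```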
So I reuse the computation I just got and update it using what is essentially the Jacobian. Then the pre-activation node, which now knows the gradient of the loss with respect to itself, propagates to the weights, the biases, and the layer below, informing them of the gradient of the loss with respect to themselves. And you continue like this, essentially going through the flow graph but in the opposite direction. The basic Torch library functions like this quite explicitly: you chain these elements together, and when performing backpropagation you go in the reverse order of the chained elements. Then you have libraries like torch-autograd, Theano, and TensorFlow, which you'll learn about, that do things in slightly more sophisticated ways — you'll learn about that later on. Okay, so that's a discussion of how you actually compute gradients of the loss with respect to the parameters. The next component we need in stochastic gradient descent: we can choose a regularizer. One that's often used is L2 regularization — just the sum of the squares of all the weights. The gradient of that is just two times the weight, so it's a super simple gradient to compute. We usually don't regularize the biases. There's no particularly important reason for that; it's just that there are much fewer biases, so it seems less important. This L2 regularization is often referred to as weight decay — if you hear about weight decay, that often refers to L2 regularization. And then finally — and this is also a very important point — you have to initialize the parameters before you actually start doing backprop, and there are a few tricky cases you need to make sure you don't fall into. The biases we often initialize to zero.
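The L2 regularizer and its gradient, as just stated — sum of squared weights, gradient 2w, biases left out:

```python
def l2_reg(weight_matrices):
    """L2 regularizer: the sum of squared weights (biases are not included)."""
    return sum(w * w for W in weight_matrices for row in W for w in row)

def l2_grad(W):
    # the gradient of the regularizer w.r.t. each weight is just 2 * w
    return [[2.0 * w for w in row] for row in W]

W = [[1.0, -2.0], [0.5, 0.0]]   # a tiny example weight matrix
```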
There are certain exceptions, but for the most part we initialize the biases to zero. For the weights, though, there are a few things we can't do. We can't initialize the weights to zero — especially if you have tanh activations. The reason — and I won't explain it here, but it's not a bad exercise to try to figure out why — is essentially that on your first pass you'll get gradients that are zero for all your parameters, so you'll be stuck at this zero initialization. So we can't do that. We also can't initialize all the weights to exactly the same value. Again, if you think about it a little bit, what happens is that all the weights coming into a unit within a layer will have exactly the same gradients, which means they'll be updated exactly the same way, which means they'll stay the same the whole time. It's as if you have multiple copies of the same unit. So you essentially have to break the initial symmetry you would create by initializing everything to the same value. What we end up doing most of the time is initializing the weights to randomly generated values. There are a few other recipes, but one of them is to initialize them from a uniform distribution between a lower and an upper bound. The recipe shown here, which is often used, has some theoretical grounding and was derived specifically for the tanh.
There's a paper by Xavier Glorot and Yoshua Bengio you can check out for some intuition as to how you should initialize the weights, but essentially they should be initially random, to break symmetry, and initially close to zero, so that the units don't start out saturated — because if the units are saturated, no gradients pass through them, and you'll get gradients very close to zero at the lower layers. So that's the main intuition: weights that are small, close to zero, and random. Okay, so those are all the pieces we need for running stochastic gradient descent. That allows us to take a training set, run a certain number of epochs, and have the neural net learn from that training set. Now, there are other quantities in our neural network that we haven't specified how to choose: the hyperparameters. Usually we'll have a separate validation set — most people here are familiar with machine learning, so that's a typical procedure — and then we need to select things like: how many layers do I want? How many units per layer? What's the step size, the learning rate α of my stochastic gradient descent procedure? What weight decay am I going to use? A standard thing in machine learning is to perform a grid search: if I have two hyperparameters, I list out a bunch of values I want to try. For the number of hidden units, maybe I want to try 100, 1,000, and 2,000, say, and for the learning rate maybe 0.01 and 0.001. A grid search would just try all combinations of these three values for the hidden units and these two values for the learning rate. That means that as the number of hyperparameters grows, the number of configurations you have to try blows up — it grows exponentially.
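A sketch of this initialization, using the uniform bound √(6/(n_in + n_out)) from the Glorot & Bengio tanh recipe just mentioned (the exact bound is one of several recipes; the point is small, random, zero-centered weights and zero biases):

```python
import math
import random

def init_layer(n_in, n_out, rng=random):
    """Small random weights, zero biases. The bound sqrt(6/(n_in + n_out))
    is the tanh-oriented recipe from Glorot & Bengio."""
    bound = math.sqrt(6.0 / (n_in + n_out))
    W = [[rng.uniform(-bound, bound) for _ in range(n_in)]
         for _ in range(n_out)]
    b = [0.0] * n_out               # biases start at zero
    return W, b

W, b = init_layer(100, 50)          # e.g. a layer mapping 100 units to 50
```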
So another procedure that is now more and more common, and which is more practical, is to perform a form of random search. In this case, for each hyperparameter you determine a distribution of likely values you'd like to try. For the number of hidden units maybe I use a uniform distribution over all integers from 100 to 1,000, say, or maybe a log-uniform distribution, and for the learning rate maybe again a log-uniform distribution, but from 0.001 to 0.01, say. Then to get an experiment, that is, to get values for my hyperparameters to run an experiment with and get a performance on my validation set, I just independently sample from these distributions for each hyperparameter to get a full configuration. And because I have this way of getting one experiment, I do it independently for all of my jobs, all of the experiments I will do. So in this case, if I know I have enough compute power to do 50 experiments, I just draw 50 independent samples from these distributions over hyperparameters, perform these 50 experiments, and take the best one. What's nice about it is that, unlike grid search, there are never any holes in the grid. You just specify how many experiments you do; if one of your jobs died, well, you just have one less, but there's no hole in your experiment. Also, one reason this approach is particularly useful is that in grid search you can have a specific value for one of the hyperparameters that just makes the experiment not work at all. Learning rates are a lot like this: if you have a learning rate that's too high, it's quite possible that the optimization will not converge. If you're using a grid search, it means that all the experiments that use that specific value of the learning rate are going to be garbage.
They're all not going to be useful. You don't really get this sort of big waste of computation if you do random search, because most likely all the values of your hyperparameters are going to be unique, since they're sampled, say, from a uniform distribution over some range. So that actually works quite well and is quite recommended. There are also more advanced methods, like methods based on machine learning, Bayesian optimization, sometimes known as sequential model-based optimization, that I won't talk about but that work a bit better than random search; investigating some of these more advanced methods is another alternative if you think you have an issue finding good hyperparameters. Now, you do this for most of your hyperparameters, but for the number of epochs, the number of times you go through all of the examples in your training set, what we usually do is not grid search or random search; we use a thing known as early stopping. The idea here is that if I've trained a neural net for 10 epochs, well, training a neural net with all the other hyperparameters kept constant but one more epoch is easy: I just do one more epoch. So I shouldn't start over and do, say, 11 epochs from scratch. What we do instead is track the performance on the validation set as I do more and more epochs. What we will typically see is that the training error will go down, and the validation error will go down too, but eventually go up. The intuition here is that the gap between the performance on the training set and the performance on the validation set will tend to increase, and since the training curve usually cannot go below some bound, eventually the validation error has to go up, or sometimes it won't necessarily go up but will sort of stay stable.
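The random-search procedure above can be sketched as follows. This is a minimal illustration, not from the talk's slides; the hyperparameter names and ranges are the example values mentioned in the text, and the "validation performance" step is left out since it depends on your model.

```python
import numpy as np

def sample_config(rng):
    """Draw one hyperparameter configuration by sampling each
    hyperparameter independently from its own distribution."""
    return {
        # log-uniform over [0.001, 0.01] for the learning rate
        "learning_rate": 10 ** rng.uniform(-3, -2),
        # uniform over the integers 100..1000 for the hidden layer size
        "n_hidden": int(rng.integers(100, 1001)),
    }

rng = np.random.default_rng(0)
# 50 independent experiments: sample a config for each, train and
# evaluate on the validation set, then keep the best-performing one.
configs = [sample_config(rng) for _ in range(50)]
```

If one of the 50 jobs dies, you simply have 49 results instead of 50; there is no hole in a grid, which is exactly the robustness property described above.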
So with early stopping, what we do is that if we reach a point where the validation set performance hasn't improved for a certain number of iterations, which we refer to as the lookahead, we just stop, go back to the neural net that had the best performance overall on the validation set, and that's my neural network. So I now have a very cheap way of choosing the number of iterations, or the number of epochs, over my training set. A few more tricks of the trade. It's always useful to normalize your data; it will often have the effect of speeding up training. This applies to real-valued data; binary data you usually keep as it is. What I mean by normalizing is just subtracting, for each dimension, the average of that dimension over the training set, and then dividing by the standard deviation of that dimension, again in my input space. So this can speed up training. We also often use a decay on the learning rate. There are a few methods for doing this. One that's very simple is to start with a large learning rate and track the performance on the validation set, and once the validation set performance stops improving, you decrease your learning rate by some ratio: maybe you divide it by two, and then you continue training for some time. Hopefully the validation set performance starts improving again, and at some point it stops improving, and then you stop, or you divide by two again. So that gives you an adaptive way, using the validation set, of changing your learning rate, and that can work better than having a very small learning rate and waiting for a longer time: you make very fast progress initially and then slower progress towards the end. Also, I've described so far an approach for training neural nets that is based on a single example at a time, but in practice we actually use what are called mini-batches.
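The early-stopping rule with a lookahead can be sketched as below. This is an illustrative simplification, not the talk's code: it takes a precomputed list of validation errors, whereas a real training loop would train one epoch per step and checkpoint the best model so it can be restored.

```python
def early_stopping(val_errors, lookahead=5):
    """Return (best_epoch, best_error): scan validation errors epoch by
    epoch and stop once there has been no improvement for `lookahead`
    consecutive epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= lookahead:
            break  # no improvement within the lookahead window: stop
    return best_epoch, best_err

# Validation error drops, then rises: training stops early and we keep
# the model from the best epoch (epoch 3 here).
epoch, err = early_stopping(
    [0.9, 0.5, 0.3, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
```

The key point is that the number of epochs is chosen for free as a by-product of training, rather than by grid or random search.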
That is, we compute the loss function on a small subset of examples, say 64 or 128, and then we take the average of the loss over all the examples in that mini-batch, and we compute the gradient of this average loss on that mini-batch. The reason we do this is that it turns out you can very efficiently implement the forward pass over all of these 64 or 128 examples in one pass: instead of doing matrix-vector multiplications when we compute the pre-activations, we do matrix-matrix multiplications, which are faster than doing multiple matrix-vector multiplications. So in your code there will often be this other hyperparameter, the number of examples in your mini-batch, which is mostly optimized for speed, in terms of how quickly training will proceed. Other things to improve optimization might be using momentum. That is, instead of using as the descent direction the gradient of the loss function, I'm actually going to track a descent direction which I compute as the current gradient, for my current example or mini-batch, plus some fraction beta of the previous update direction. Beta is now a hyperparameter you have to optimize. What this does is that if the update directions agree across multiple updates, then it will start picking up momentum and actually take bigger steps in those directions. And then there are multiple even more advanced methods for having adaptive learning rates. I mention them here very quickly because you might see them in papers. There's a method known as Adagrad, where the learning rate is actually scaled for each dimension, so for each weight and each bias it's going to be scaled by the square root of the cumulative sum of the squared gradients. So what I track is: I take my gradient vector at each step.
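The momentum update just described can be written in a couple of lines. This is a minimal sketch (plain, non-Nesterov momentum, with illustrative values of alpha and beta), not code from the talk.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, alpha=0.01, beta=0.9):
    """One SGD-with-momentum update: the descent direction is the current
    (mini-batch) gradient plus a fraction beta of the previous direction."""
    velocity = grad + beta * velocity
    return w - alpha * velocity, velocity

w = np.zeros(3)
v = np.zeros(3)
g = np.ones(3)
# Repeated agreeing gradients: the effective step grows as momentum builds.
w, v = sgd_momentum_step(w, g, v)  # velocity is 1
w, v = sgd_momentum_step(w, g, v)  # velocity is 1 + 0.9 = 1.9
```

When successive gradients point the same way, the velocity accumulates and the steps get larger, which is exactly the "picking up momentum" behavior described above.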
I do an elementwise square of all the dimensions of my gradient vector, and then I accumulate that in some variable that I'm noting gamma here. Then for my descent direction, I take the gradient and do an elementwise division by the square root of this cumulative sum of squared gradients. There's also RMSprop, which is essentially like Adagrad, but instead of a cumulative sum we do an exponential moving average: we take the previous value times some factor, plus one minus this factor times the current squared gradient. That's RMSprop. And then there's Adam, which is essentially a combination of RMSprop with momentum; it's more involved and I won't have time to describe it here, but that's another method that's actually implemented in these different software packages and that people seem to use with a lot of success. Finally, in terms of actually debugging your implementations: if you're lucky, you can build your neural network without difficulty using the current tools that are available in Torch or TensorFlow or Theano, but sometimes you actually have to implement the gradients for a new module, a new box in your flow graph, that isn't currently supported. If you do this, you should check that you've implemented your gradients correctly. One way of doing that is to compare the gradients computed by your code with a finite difference estimate. What you do is, for each parameter, you add some very small epsilon value, say 10 to the minus 6, and compute the output of your module; then you subtract the same thing but where you've subtracted the small quantity, and you divide by 2 epsilon. As epsilon converges to zero you actually get the partial derivative, but if it's just small, it's going to be an approximation.
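The Adagrad and RMSprop scalings described above can be sketched side by side. This is a minimal illustration, not the talk's code; the small epsilon in the denominator is a standard numerical safeguard not mentioned in the talk, and rho is the moving-average factor.

```python
import numpy as np

def adagrad_step(grad, gamma, eps=1e-8):
    """Adagrad: accumulate squared gradients in gamma, then divide the
    gradient elementwise by the square root of that cumulative sum."""
    gamma = gamma + grad ** 2
    return grad / (np.sqrt(gamma) + eps), gamma

def rmsprop_step(grad, gamma, rho=0.9, eps=1e-8):
    """RMSprop: same scaling, but gamma is an exponential moving average
    of squared gradients instead of a cumulative sum."""
    gamma = rho * gamma + (1 - rho) * grad ** 2
    return grad / (np.sqrt(gamma) + eps), gamma

g = np.array([1.0, 2.0])
# First Adagrad step: gamma = g**2, so each direction component is ~1,
# regardless of the gradient's raw magnitude per dimension.
d1, gam = adagrad_step(g, np.zeros(2))
```

Adam then combines the RMSprop-style scaling with a momentum-style moving average of the gradient itself, which is why it is described above as a combination of the two.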
And usually this finite difference estimate will be very close to a correct implementation of the real gradient. So you should definitely do that if you've implemented some of the gradients in your code yourself. Another useful thing to do is to run a very small experiment on a small data set before you run your full experiment on your complete data set. Use, say, 50 examples: take a random subset of 50 examples from your data set and just make sure that your code can overfit that data, can essentially classify it perfectly, given that it has enough capacity that you would think it should get it. If that's not the case, then there are a few things you might want to investigate. Maybe your initialization is such that the units are already saturated initially, so there's no actual optimization happening because the gradients on some of the weights are exactly zero; so you want to check your initialization. Maybe the gradients you implemented for your model are not properly implemented. Maybe you haven't normalized your input, which creates some instability, making it harder for stochastic gradient descent to work successfully. Maybe your learning rate is too large, and you should consider trying smaller learning rates; that's actually a pretty good way of getting some idea of the magnitude of the learning rate you should be using. And then, once you can actually overfit your small training set, you're ready to do a full experiment on a larger data set. That said, this is not a replacement for gradient checking. Backprop with stochastic gradient descent is a great algorithm that's very bug resistant: you will potentially see some learning happening even if some of your gradients are wrong or, say, exactly zero.
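The finite-difference gradient check described above can be sketched like this. This is a minimal illustration, not from the talk: `f` stands in for your module's scalar output, and the example checks a simple analytic gradient (for f(w) = sum(w**2), the gradient is 2w).

```python
import numpy as np

def finite_difference_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at w:
    (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps) for each parameter i."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

# Compare the numerical estimate against the analytic gradient 2w.
w = np.array([1.0, -2.0, 3.0])
numeric = finite_difference_grad(lambda v: np.sum(v ** 2), w)
analytic = 2 * w
```

In practice you would run this check on your custom module's gradient code and flag any parameter where the two disagree beyond a small tolerance.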
You know, if you're an engineer and you're implementing things, it's fun when code is somewhat bug resistant, but if you're actually doing science and trying to understand what's going on, that can be a complication. So do both gradient checking and a small experiment like that. All right. And so for the last few minutes, I'll actually try to motivate what you'll be learning quite a bit about in the next two days, that is, the specific case for deep learning. I've already told you that if I have a neural net with enough hidden units, theoretically I can represent pretty much any function, any classification function. So why would I want multiple layers? There are a few motivations behind this. The first one is taken directly from our own brains. We know that in the visual cortex, the light that hits our retina eventually goes through several regions, eventually reaching an area known as V1, where you have units, or neurons, that are essentially tuned to simple forms like edges. Then it goes on to V4, where the units are tuned to slightly more complex patterns, and then you reach AIT, where you actually have neurons that are specific to certain objects. The idea here is that perhaps that's also what we want in an artificial vision system. We'd like it, if it's detecting faces, to have a first layer that detects simple edges, then another layer that puts these edges together, detecting slightly more complex things like a nose or a mouth or eyes, and then eventually a layer that combines these more abstract units to get something even more abstract, like a complete face. There's also some theoretical justification for using multiple layers.
The early results were mostly based on studying boolean functions, functions whose input you can think of as a vector of just zeros and ones. You could show that if you had essentially a boolean neural network, a boolean circuit, and you restricted the number of layers of that circuit, then there are certain boolean functions that, to be represented exactly, would need an exponential number of units in each of those layers; whereas if you allowed yourself multiple layers, you could represent these functions more compactly. So that's another motivation: perhaps with more layers we can represent fairly complex functions in a more compact way. And then there's the reason that they just work. We've seen in the past few years great success in speech recognition, where deep learning has essentially revolutionized the field and everyone's using it, and the same thing for visual object recognition, where again deep learning is the method of choice for identifying objects in images. So then why are we doing this only recently? Why didn't we do deep learning way back when backprop was invented, which is essentially in the 1980s, and even before that? It turns out training deep neural networks is actually not that easy. There are a few hurdles one can be confronted with. I've already mentioned one issue, which is that some of the gradients might be vanishing as you go from the top layer to the bottom layer, because we keep multiplying by the derivative of the activation function. That makes training hard: it could be that the lower layers have very small gradients and are barely moving, barely exploring the space of correct features to learn for a given problem.
Sometimes that's the problem you find: you have a hard time just fitting your data, and you're essentially underfitting. Or it could be that, with deeper or bigger neural nets, we have more parameters, so perhaps sometimes we're actually overfitting. We're in a situation where the set of functions we can represent with a given neural net, represented by this gray area, does include the right function, but it's so large that, for a finite training set, the odds that I'm going to find the one that's close to the true classifying function, the real system I'd like to have, are going to be very small. In this case I'm essentially overfitting, and that might also be the situation we're in. And unfortunately, there are many situations where one problem or the other is observed, overfitting or underfitting. So in the field we've essentially developed tools for fighting both situations, and I'm going to rapidly touch on a few