Deep Reinforcement Learning (John Schulman, OpenAI)
PtAIh9KSnjo • 2016-09-27
Good morning, everyone. I'm going to talk about some of the core methods in deep reinforcement learning. The aim of this talk is as follows: first I'll give a brief introduction to what deep RL is and whether it might make sense to apply it to your problem. Then I'll talk about some of the core techniques: on the one hand we have policy gradient methods, and on the other hand we have methods that learn a Q-function, including Q-learning and SARSA. At the end I'll talk a little about the pros and cons of these different methods.

First, what is reinforcement learning? It's a branch of machine learning concerned with taking sequences of actions. It's often described in terms of an agent interacting with a previously unknown environment, trying to maximize some kind of cumulative reward: a reward function that we've defined, accumulated over time. Pretty much any task where you have a goal you want to achieve can be stated in these terms, so this is an extremely general formulation.

What's deep reinforcement learning? It's pretty simple: it's just reinforcement learning where you're using neural networks as function approximators. The interesting thing about reinforcement learning, in contrast to supervised learning, is that it's actually not totally obvious what you should use your neural network to approximate, and different kinds of algorithms approximate different things. One choice is to use the neural network to approximate your policy, which is how the agent chooses its actions. Another choice is to approximate the value functions, which measure how good or bad different states or actions are. Last, you can try to learn a model of the system, a dynamics model, which will make predictions about
next states and rewards.

Okay, I'll now give a few examples of places where you might apply reinforcement learning, and what the observations and actions would be. One example is robotics. Here you could imagine a robot where the observations are the camera images and the joint angles of the robot, the actions are the joint torques you're applying, and the reward depends on what you want the robot to do; this is something we, as the algorithm designers, get to define. The reward could be to stay balanced, to navigate to some target location, or something more abstract like "serve and protect humans."

Reinforcement learning has also been used in a lot of applications that have been practical in the past; I think robotics will be very practical in the future. For example, one area is inventory management, which is just one example of using reinforcement learning for a decision-making problem. You have to decide how much of every item to stock up on. Your observations are your current inventory levels, the actions are how much of each item you're going to purchase, and the reward is your profit. People in operations research, which is a subfield that studies this kind of problem, work on this a lot.

There are also a lot of machine learning problems where people have started to apply reinforcement learning techniques. One example is attention. The idea in attention is that you don't want to look at the whole input at once; you want to focus on just part of it. For example, with a large image you might want to crop out part of it and do detection on just that part. Here your observation is your current image window, the action is where to look, that is, where to crop the image, and the reward is whether you make a
classification error or not. So here the actions have to choose the right area of the image to look at so that you make the correct classification.

Reinforcement learning has also been used in structured prediction problems, which in the past weren't considered reinforcement learning problems, but it turns out that to solve them properly, it actually is a reinforcement learning problem. Take machine translation, for example: you get a sentence in the source language and you have to emit a sentence in the target language. Your observations are the source sentence, you're emitting one word at a time in the target language, and you have some reward function that looks at the whole output sentence and tells you how good your translation was. Since this reward is non-differentiable, you can't just differentiate through the whole thing and do gradient descent, so it turns out you can use a policy gradient method to optimize your translation system, and people have started to do that.

Those are just a few examples, not exhaustive at all. Now I want to say a little bit about how reinforcement learning fits into the picture of all the other types of machine learning problems. Previous courses in this series have talked about supervised learning and unsupervised learning, so how does reinforcement learning relate to them? How is it different? Let's first compare it to supervised learning. In supervised learning, first the environment samples an input-output pair (x, y) from some distribution rho. The agent makes a prediction yhat using its function f, and it receives some loss, which tells it whether it made the right prediction or the wrong prediction.
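As a minimal sketch of that supervised protocol: the environment samples (x, y), the agent predicts with its function f, and the loss is revealed. The toy distribution and linear classifier below are made up for illustration; they aren't from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_sample():
    """Environment samples (x, y) ~ rho (a made-up toy distribution)."""
    x = rng.normal(size=2)
    y = int(x[0] + x[1] > 0)  # the correct output is provided
    return x, y

def agent_predict(x, w):
    """The agent's prediction yhat = f(x); here f is a linear classifier."""
    return int(x @ w > 0)

def loss(y, y_hat):
    """In supervised learning we can inspect (and differentiate) the loss itself."""
    return float(y != y_hat)

w = np.array([1.0, 1.0])  # this agent happens to match the true rule
losses = [loss(y, agent_predict(x, w)) for x, y in (env_sample() for _ in range(100))]
print(np.mean(losses))
```

The key contrast with the settings that follow is that here the correct output y and the form of the loss are both visible to the learner.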
The interpretation is that the environment asks the agent a question and then tells it the right answer.

Contextual bandits make this problem a little harder by giving the learning agent less information. Now the environment samples an input x, but notice that there's no correct output associated with it. The agent takes an action, and receives some cost c sampled from a probability distribution P(c | x, a), and the agent doesn't know what this probability distribution is; that's what makes the problem hard. So the environment asks the agent a question, the agent answers, and the environment gives it a noisy score on the answer. This setting has a lot of applications: personalized recommendations is a big one, along with advertising. You have a customer and you know what they liked in the past, so you have to predict what they're going to like in the future and show them appropriate ads or links: which ad or book you want to advertise to them, what video you want to show them, and so on. The big difference between this and the supervised learning setting is that you don't have access to the loss function you're trying to optimize, so in particular you can't differentiate through it. We don't know the process that generates c, so we can't compute the gradient of the loss function and use that to tune the agent's parameters. That makes the problem a bit harder; you have to use a different kind of algorithm.

Lastly, reinforcement learning is almost the same as the contextual bandit setting, except now the environment is stateful. Instead of sampling the input from scratch at every time step from the same
distribution, the state evolves over time. There's a transition probability distribution, called P here, where the state x_t is conditioned on the previous state and the previous action. That makes the problem quite a bit harder, for a number of reasons. For one thing, the inputs you're getting depend on the actions you're taking, which makes it harder to develop a stable, reliable algorithm: as the agent starts to learn, it gets different inputs, and that can lead to all sorts of out-of-control behavior. It also means you have delayed effects: since the system is stateful, you might need to take a lot of actions to get into the right state, so you can't just act greedily at every time step; you have to think ahead, effectively.

To summarize, there are two differences. First, you don't have full analytic access to the function you're trying to optimize; you have to query it through interaction. Second, you're interacting with a stateful world, which means the input you get depends on your previous actions. If you take just the first of those differences between supervised learning and reinforcement learning, you get the contextual bandit setting, so that's sort of halfway in between.

I realize this audience probably has people with different interests. Some people are doing research and want to know what's the latest in the research world, and some people want to apply these machine learning techniques to practical applications; this part is for the latter group. If you have some problem where you think reinforcement learning might be relevant, and you're wondering whether you should apply it, I should first say that the answer might be no. It might be overkill, especially
deep reinforcement learning. This is a set of fairly new techniques that aren't going to work out of the box very well; they're not that well established at the moment, and they have a lot of knobs to tune. So it might be worth investigating some other methods first.

If your problem has a small number of parameters to optimize over, and you have a simulator you can run lots of experiments on, then derivative-free optimization methods are likely to be easier to get working than reinforcement learning. These methods treat your system as a black-box function: you put in a parameter vector and get back a noisy estimate of the score, and the algorithm optimizes over the parameters going into that black box. There's a variety of derivative-free optimization methods, but they're easier to understand than reinforcement learning, and they do kind of work out of the box.

A lot of problems can also be seen as contextual bandit problems, where the statefulness of the world isn't that relevant. For example, people look at advertising as a contextual bandit problem most of the time: you decide what ad to present to the user, and they either click on it or they don't. Really, the user is somewhat stateful, because if you show them a terrible ad they might just go and download an ad blocker, so your actions do have some repercussions, but often you can approximate it as a contextual bandit problem where there is no state. There's a better theoretical understanding of contextual bandit problems, and methods that
have some guarantees, so if your problem is a contextual bandit problem, you might want to use those kinds of algorithms instead. Lastly, the operations research field has been using these methods on real problems for a while. They have a set of methods, pretty much the basic algorithms, policy iteration and value iteration, together with well-developed ways of doing feature engineering for these problems, and those end up working pretty decently. These techniques are also worth considering instead of trying to throw a big neural network at it.

Now that I've talked about why not to use deep reinforcement learning, or what it's not good for, let me talk about some recent success stories in deep reinforcement learning, achievements that probably wouldn't have been possible using those other techniques. A few years ago there was a pretty influential result by Mnih et al. from DeepMind, where they used a deep Q-learning algorithm to play Atari games using the screen images as input. That's hard because you're trying to do different things in all these games, and some of them are kind of complicated, so it's pretty remarkable that the same simple algorithm can solve them all. Since then, people have also solved this domain using policy gradients and another algorithm called DAgger. Another big, groundbreaking result was beating a champion-level player at Go, also by DeepMind, using a combination of supervised learning from expert games, plus policy gradients to fine-tune the supervised learning policy, plus Monte Carlo tree search, plus value functions to make the search work better: a combination of techniques from reinforcement learning. In robotics, some of my
colleagues at Berkeley had some very nice results learning manipulation tasks in real time, using an algorithm called guided policy search on the PR2 robot. Some of my colleagues and I have been working on robotic locomotion using policy gradient methods. People have been working on locomotion for a while and have achieved pretty good results using highly engineered, domain-specific methods, but previously there hadn't been much success using general methods to solve it. And last, there have been some recent results playing 3D games using policy gradients; in fact, there was a contest I heard about a couple of days ago with the new VizDoom task, which is pretty nice, so you might want to check out VizDoom.

Okay, that's it for the high-level overview. Now I'm going to start getting into the actual formalism and the technical details. The basic object in the field of reinforcement learning is the Markov decision process (MDP). An MDP is defined by the following components: a state space, all the different states of the system; an action space, all the actions the agent can take; and a probability distribution P(r, s' | s, a), which determines the probability of the next state and reward, where r is the reward, s' is the next state, and s and a are the current state and action. It's a conditional probability distribution; sometimes people split the reward out into a separate reward function, but that's basically an equivalent formulation. Sometimes there are extra objects defined; we'll consider an initial state distribution, so the world starts out in a certain state. The typical optimization problem you want to solve, given this MDP, is to maximize expected cumulative reward, though there are various ways of defining that more precisely,
which I'll go into later.

There are various settings of reinforcement learning, each defining a slightly different optimization problem. The one we'll be most concerned with is called the episodic setting. Here the agent's experience is split up into a series of episodes of finite length. In each episode, we first sample the initial state of the world from some probability distribution mu, and then the agent keeps acting until the world ends up in some terminal state. To give examples of what terminal states might be like and how an episodic reinforcement learning problem might look: one case is when termination is good and you want to terminate the episode as fast as possible. If we imagine setting up a task with some kind of taxi robot that should get to its destination as fast as possible, the episode would be one trip, and the robot is trying to end the episode as quickly as it can. Another example is a waiter robot with a fixed-length shift, which has to do as well as possible during that shift: the episode has a fixed length, and the waiter has to maximize, say, tips or customer happiness. And you could imagine another kind of task where termination is bad and you want the episode to last as long as possible. You can view life as an example of that, but you could also imagine a walking robot, where you want it to walk as far as possible before it falls over. In this setting it's pretty easy to define the goal: we just want to maximize the expectation of the total reward per episode.

The last object we'll introduce here is the policy. The policy is just the function the agent uses to choose its actions. We have deterministic policies, denoted by pi, where the action is just some
function of the state, a = pi(s), and we also have stochastic policies, where the policy is a conditional probability distribution pi(a | s).

Let me make the setting of the episodic MDP a bit more precise. First we sample the initial state s_0 from the distribution mu. Then we sample the first action a_0 from the policy, then we sample the next state and reward from the transition probability distribution, and so on, until we reach a terminal state s_T. The quantity we care about is the sum of all the rewards, r_0 + r_1 + ... + r_{T-1}, and we want to maximize its expectation: eta(pi) is defined as the expected total reward of the policy pi. There's a picture that illustrates exactly the same thing; you can look at it as a graphical model.

Lastly, in the policy gradient section in particular, we're going to be interested in parameterized policies. Here we have a parameter vector theta that specifies exactly what the policy is. For example, the family of policies could be a neural network: you have a certain architecture, and theta specifies all the weights of the network. We could have a deterministic policy, of course, or a stochastic policy. If you're wondering what a policy would concretely look like, that is, how you use a neural network to represent your policy: you do exactly the same thing you would do if this were a classification or regression problem. The state is your input and the action is your output. If you have a discrete set of actions, you'd use a network that outputs a vector of probabilities over the different actions; this is exactly like a classifier. And if you have a continuous action space, your network would output the mean and the diagonal of a covariance matrix of a Gaussian distribution, which is just like doing regression. So you can use the same kinds of architectures you'd use in supervised learning.
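The policy parameterizations just described can be sketched in plain NumPy. The layer sizes and dimensions below are made up for illustration; the point is only that the discrete case is literally a classifier over actions, and the continuous case outputs Gaussian parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: 4-dim state, 32 hidden units, 3 discrete actions.
obs_dim, hid, n_actions = 4, 32, 3
theta = {
    "W1": rng.normal(scale=0.1, size=(hid, obs_dim)), "b1": np.zeros(hid),
    "W2": rng.normal(scale=0.1, size=(n_actions, hid)), "b2": np.zeros(n_actions),
}

def discrete_policy(s, theta):
    """pi(a|s): the network outputs a probability vector, exactly like a classifier."""
    h = np.tanh(theta["W1"] @ s + theta["b1"])
    return softmax(theta["W2"] @ h + theta["b2"])

def sample_action(s, theta):
    return rng.choice(n_actions, p=discrete_policy(s, theta))

# For a continuous action space, the network would instead output the mean and
# the log of the diagonal standard deviations of a Gaussian, and we'd sample:
def gaussian_policy_sample(mean, log_std):
    return mean + np.exp(log_std) * rng.normal(size=mean.shape)
```

The parameter vector theta from the talk corresponds to all the weight arrays together; the forward pass is identical to supervised classification or regression.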
Okay, that's it for the formalism of MDPs. Now I'm going to go into policy gradient methods, which are one broad and general class of reinforcement learning methods, and quite effective. To give a brief overview of the intuition: here capital R means the sum of rewards over the whole episode, and our optimization problem is to maximize the expectation of the total reward given our parameterized policy pi_theta. The way the algorithm works is that we collect a bunch of trajectories, that is, we run a bunch of episodes using our policy, and then we make the good trajectories more probable. Some of the trajectories were lucky and really good; in some of them the agent was unlucky and they were bad. The good ones, meaning the ones with high reward, are ones where the agent probably took good actions, so we want to increase the probability of the actions from those trajectories. The most basic version of policy gradient methods just tries to make the good trajectories more probable, without trying to figure out which were the good actions and which were the bad actions. Slightly more elaborate methods try to figure out which actions were good and which were bad, and then make the good actions more probable. Lastly, there's another class of methods that actually tries to push the actions toward better actions: they differentiate the loss function with respect to the actions. We're mostly going to talk about the first two here. Oh, there's a
question. Yeah, good question. We're maximizing over the policy; we're trying to find the best policy. But the policy is assumed to be parameterized, so there's some parameter vector theta that specifies the policy, and we just want to maximize with respect to theta. Any other questions? Okay.

There's a very fundamental concept called the score function gradient estimator, which underlies policy gradient methods. To introduce it, we're not going to talk about policies and RL at all; we'll just assume we have some expectation E_x[f(x)], where x is sampled from some parameterized probability distribution p(x | theta), and we want to compute the gradient of this expectation with respect to theta. There's a general formula that does this, and the way you derive it is to write the expectation as an integral, move some things around, swap the integral with the derivative, and turn it back into an expectation. What you get at the end is the bottom line: grad_theta E_x[f(x)] = E_x[f(x) grad_theta log p(x | theta)], the expectation of the function value times grad log probability. This is an unbiased estimator of the gradient, meaning that if we get enough samples, it'll converge to the right thing. The way you compute this estimator, meaning the way you get a noisy estimate of the gradient of the expectation, is to take one sample x from the distribution and compute f(x) times grad log probability. The only requirement for using this estimator is that we need to be able to compute the probability density analytically, and it needs to be differentiable with respect to theta. There's another way of deriving it using importance sampling: you write down the importance
sampling estimator for the expectation, then swap the derivative with the expectation, and you get the same thing.

Now let me give a little intuition about this estimator. f(x) measures how good the sample x is. Our gradient estimate from a single sample x_i is ghat = f(x_i) grad_theta log p(x_i | theta). If we move in the direction ghat, that pushes up the log probability of our sample x_i in proportion to how good it is: if we got a really good function value, we'll try to push up its log probability a lot, and if it was a bad function value, we won't push it up very much. Pretty simple intuition. The really nice thing is that this is valid even if f(x) is discontinuous, or if f(x) is unknown, meaning you don't get to differentiate it, you only get to see function values, or if the sample space is a discrete set, so x doesn't even have to be continuous. This is quite remarkable: you don't need to know exactly what function you're optimizing, you just have to be able to query it for function values. And it means this is a way of optimizing through a system that has non-differentiable pieces. For example, in robotic locomotion, one issue is that you have contact between the robot's foot and the ground: you make and break contact, and that causes a discontinuous change in the dynamics, which makes it really hard to use smooth optimization techniques to come up with the right behavior. But when you use this kind of gradient estimator, along with policy gradients, which I'm going to talk about very soon, you can optimize the whole system even though it has non-differentiable pieces in it.
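To see the estimator work numerically, here is a toy check (my own example, not from the talk): take x ~ N(theta, 1) and f(x) = x**2, treating f as a black box we can only evaluate. Analytically E[f(x)] = theta**2 + 1, so the true gradient is 2*theta, and the score function grad log p(x | theta) is (x - theta).

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.5  # parameter of the sampling distribution x ~ N(theta, 1)

def score_function_grad_estimate(theta, n_samples):
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)
    f = x ** 2                      # black-box function values (never differentiated)
    grad_log_p = x - theta          # d/dtheta log N(x | theta, 1)
    return np.mean(f * grad_log_p)  # Monte Carlo estimate of E[f(x) grad log p]

estimate = score_function_grad_estimate(theta, n_samples=200_000)
print(estimate, "vs analytic gradient", 2 * theta)
```

With enough samples the estimate converges to 2*theta = 3.0, even though the code only ever queried f for values, never its derivative.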
Okay, here's another little picture of what's going on. We have our function f(x), whose expectation we're trying to maximize, and our probability density p(x). We sample a bunch of values from the density, the blue dots on the x-axis, then look at the function values, and we try to push the probability density up at each sample in proportion to its function value. On the right side of the curve that means pushing the probability up really hard, and on the left side pushing it up softly. What's going to happen is that the probability density slides to the right, if you can imagine a sort of physical analogy there.

So that's the score function gradient estimator. It's a general technique that can be used in various machine learning problems; now we're going to apply it to the reinforcement learning setting. We take our random variable x to be a whole trajectory: the trajectory tau consists of state, action, reward, state, action, reward, and so on until the end of the episode. Now, to get the gradient of the expected reward, all we have to do is compute the grad log probability of the trajectory times the total reward. The probability of the trajectory sounds like a really unfriendly quantity, because there's a long, complicated process that generates the trajectory, with lots of time steps. But we can write out what this probability density is: it's just a product of probabilities. We have mu(s_0), which is our initial state distribution, and then at every time step we sample the action according to pi, and we
sample the next state and reward according to the dynamics model. Taking the log turns that product into a sum, and here's the cool part: everything that doesn't contain theta drops out. The thing is, there are parts of this probability distribution p(tau | theta) that we don't have access to: in reinforcement learning, we don't assume we know the dynamics model of the system, we only find out about it by doing episodes. But since the product turns into a sum, all the pieces we don't know, the log P terms and the log mu term, just drop out, so it doesn't matter. What we get in the end is a sum of log probabilities of actions, grad log pi(a_t | s_t). So our formula for the gradient of the expectation is just the expectation over trajectories of the total reward of the trajectory times the gradient of the sum of all the log probs. The interpretation is that we take our good trajectories and try to increase their probability in proportion to how good they are. You can think of this as similar to supervised learning, where we treat the trajectories with high rewards as positive examples: we use them to train the policy, basically treating their actions as positive examples.

Now we can improve this formula a little. That was the most basic version: it's an unbiased estimator of the policy gradient, so if we take the expression inside the expectation on the right-hand side and take one sample of it, it has the right mean, and if we get enough samples we'll recover the policy gradient. But we can also write down other formulas that have the same mean but lower variance, that is, better estimators for the policy gradient.
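The basic whole-trajectory estimator just described, the total reward R times the gradient of the summed action log-probs, can be sketched on a made-up toy problem (my own example: one state, two actions over three time steps, reward 1 for action 1 and 0 for action 0, with a softmax policy whose grad log-prob has the closed form onehot(a) minus the probability vector).

```python
import numpy as np

rng = np.random.default_rng(0)

T, n_actions = 3, 2  # hypothetical tiny episodic task

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_episode(theta):
    """Roll out the policy; return the actions taken and the total reward R."""
    actions, R = [], 0.0
    for _ in range(T):
        a = rng.choice(n_actions, p=softmax(theta))
        actions.append(a)
        R += float(a == 1)  # action 1 is the "good" action
    return actions, R

def policy_gradient_estimate(theta, n_episodes=10):
    """ghat = average over episodes of R * grad_theta sum_t log pi(a_t)."""
    g = np.zeros_like(theta)
    for _ in range(n_episodes):
        actions, R = run_episode(theta)
        for a in actions:
            # grad_theta log softmax(theta)[a] = onehot(a) - softmax(theta)
            g += R * (np.eye(n_actions)[a] - softmax(theta))
    return g / n_episodes

theta = np.zeros(n_actions)
for _ in range(500):
    theta += 0.1 * policy_gradient_estimate(theta)  # gradient ascent on E[R]
print(softmax(theta))  # probability mass shifts toward action 1
```

Note that R multiplies the grad log-prob of every action in the trajectory, good or bad; the variance-reduced formulas that follow in the talk refine exactly this point.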
And that's actually quite important, because the estimator from the previous slide is really bad when you have a large number of time steps, meaning it has really high variance. The first thing we can do is use the temporal structure of the problem. By the way, deriving this next bunch of formulas just takes some really straightforward manipulation where you move around expectations; I'm not going to go through all the math, I'll just say what the formulas are. We can repeat the same argument from the previous slide to derive the gradient estimator for a single reward term, ending up with that reward term times the gradient of the sum of log probs, and summing over reward terms we get a new formula where we're no longer multiplying the grad log prob of the whole trajectory by the sum of all rewards. Look at the bottom formula: now we have a sum over time of the grad log probability of the action at that time, times the sum of future rewards. In the formula from the previous slide we would have had all the rewards in that sum, but now we just have the future rewards, and that makes sense: an action can't affect the probability of previous rewards, so to figure out whether an action is good, we should only be looking at the future rewards. This is a slightly better formula than the one on the previous slide, meaning it has exactly the same mean, but the expression inside the expectation has lower variance.

We can further reduce the variance by introducing a baseline. We can take any function b that takes in a state and outputs a real number, and subtract it from our sum of future rewards, and we don't affect the mean of the estimator at all; we didn't change the expectation at all by introducing this
baseline. So for any choice of b this gives us an unbiased estimator. By the way, if you're not that familiar with the terminology of estimators: we have an expectation on the right-hand side of that formula, and the quantity inside that expectation is what's called the estimator. If we get a bunch of samples, we can compute an estimate of the thing on the left-hand side, which is what we care about. So when I say it's an unbiased estimator, that just means the equation is correct: the thing on the right-hand side equals the thing on the left-hand side. This works for any choice of baseline, and a near-optimal choice is to use the expected return, the expected sum of future rewards. The interpretation is that if we took an action, we only want to increase the probability of that action if it was a good action. How do we tell if it was good? The sum of rewards after the action should have been better than expected. So b(s) is the expected sum of rewards, and we're just taking the difference between the measured thing and the expected thing. Okay, that was a pretty key thing for variance reduction, and I'm going to introduce one last variance reduction technique. Actually, all three of these are really important: basically nothing's going to work, except on maybe really small-scale problems, unless you do these things. The last variance reduction technique is to use discounts. The discount factor ignores delayed effects between actions and rewards. What we're going to do here looks kind of like a hack, but there's an explanation for it. Instead of taking the sum of rewards, we're going to take a discounted sum of rewards, meaning we add an exponential factor gamma, so that when we're multiplying the
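Combining the two refinements just described (future rewards only, minus a baseline), the estimator being discussed can be written as:

```latex
\hat{g} \;=\; \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
       \left( \sum_{t'=t}^{T-1} r_{t'} \;-\; b(s_t) \right)
```

For any state-dependent b this has the same expectation as the original formula, since the baseline term multiplies a score function whose expectation is zero.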
grad log probability by some future reward, we multiply it by a quantity that decays with time. People typically use something like gamma = 0.99 or gamma = 0.95. If you use 0.99, that means after 100 time steps you're reducing the reward by a factor of 1/e, since 0.99^100 is about 1/e. So you're exponentially decaying the effect of the future rewards, and the intuition is that an action shouldn't affect rewards really far in the future: the assumption is that the system doesn't have really long-term memory, it sort of resets, or the effects aren't that far delayed, so you can just ignore the interaction between an action and a reward way off in the future. Now, instead of taking the baseline to be the expected sum of future rewards, we want a discounted sum, so we're measuring whether the action was better than expected according to this discounted sum. There's also a more general class of formulas that looks like the one I just wrote. The one at the top of the slide is pretty good, almost as good as anything you're going to do, to within a small constant factor. But the more general class of formulas looks like grad log probability times some quantity A-hat, which we call the advantage estimate. It has a more precise definition, which is how much better this action was than the average action taken by the policy, but informally it just means how much better the action was than expected. This formula makes a lot of sense, because we want to increase the probability of the good actions and decrease the probability of the bad ones, so we should
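As a concrete sketch of the quantities being described, here is how discounted returns and advantage estimates for one episode might be computed (the helper names are mine, not from the talk):

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage_estimates(rewards, baselines, gamma):
    """A_hat_t: how much better the outcome was than the baseline predicted."""
    return [g - b for g, b in zip(discounted_returns(rewards, gamma), baselines)]
```

For example, `discounted_returns([1.0, 0.0, 2.0], 0.5)` gives `[1.5, 1.0, 2.0]`: each entry is the reward at that step plus the discounted tail.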
increase it in proportion to the goodness of the action. Okay, just to summarize: I told you there's this gradient estimator, meaning there's an expression you can compute which gives you a noisy estimate of the policy gradient. How do you actually turn this into an algorithm? Here's what the algorithm looks like, and it's pretty much what you'd expect. You initialize your policy parameter and your baseline function. For each iteration, you execute the current policy to get a bunch of whole episodes, meaning whole trajectories. At each time step in each trajectory, you compute the return, meaning the sum of discounted rewards following that time step, and the advantage estimate, which is the sum of discounted rewards minus the baseline. Then you refit the baseline by trying to make the baseline function equal the returns, and you update the policy using a policy gradient estimate. So you're just doing SGD while updating the baseline as you go along. That's the vanilla policy gradient algorithm. It has been used to obtain some pretty good results, so it's not that bad of an algorithm, but there are several different directions in which it can be improved. One issue you run into is with step sizes. In supervised learning, step sizes aren't that big of a deal: maybe you take too big of a step, but that's okay, you'll fix it on the next update, and your current classifier, for example, doesn't affect what inputs you're getting. Even if your network is just thrashing around for a while because you're taking steps that are too big, that's not going to cause any problems. So in supervised learning step sizes aren't a big deal; you can just anneal
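The loop just described can be sketched end to end on a toy problem. This is my own minimal illustration, a two-armed one-step bandit standing in for a real environment, with a running-average scalar standing in for a fitted baseline:

```python
import math
import random

def softmax(theta):
    exps = [math.exp(t) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def vanilla_policy_gradient(arm_rewards, iters=500, lr=0.1, seed=0):
    """REINFORCE with a running-average baseline on a one-step bandit.

    Each 'episode' is a single action, so the return is just the reward
    and the advantage estimate is reward minus baseline."""
    rng = random.Random(seed)
    theta = [0.0 for _ in arm_rewards]   # policy parameters: one logit per arm
    baseline = 0.0                       # running average of returns
    for _ in range(iters):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]  # run the policy
        ret = arm_rewards[a]
        adv = ret - baseline             # advantage estimate
        baseline += 0.1 * (ret - baseline)   # "refit" the baseline
        # For a softmax policy, grad_theta_i log pi(a) = 1[i == a] - probs[i].
        for i in range(len(theta)):
            grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += lr * adv * grad_log_pi   # SGD ascent on the estimator
    return softmax(theta)

final_probs = vanilla_policy_gradient([0.0, 1.0])
```

After training, the policy should put most of its probability on the higher-reward arm. Real implementations replace the bandit with multi-step episodes, the logits with a neural network, and the running average with a fitted value function.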
them: you can start off with a large step size and anneal it down to zero, and that works pretty well. In reinforcement learning, if you take too big of a step, you might wreck your policy. Even if you don't actually change the network that much, so you don't lose all your nice features, you might change its behavior a little too much, and now it's going to do something totally different and visit a totally different part of state space. Since in reinforcement learning the system is stateful and your state distribution depends on your policy, that brings in a really different problem: after you take that step, the next batch of data you get was collected by the bad policy, and now you're never going to recover, because you just forgot everything. One way to fix this is to try to stop the policy from taking too big of a step. You can look at the KL divergence between the old policy and the new policy, before the update and after the update, and make sure the distributions aren't that different, so you're not taking too big of a step; it's kind of an obvious thing to do. My colleagues and I developed an algorithm called trust region policy optimization, which looks at the action distributions and tries to make sure the KL divergence isn't too large. This is very closely related to earlier natural policy gradient methods, which do something similar, but usually without setting it up as a hard constraint on the KL divergence. Another type of extension of policy gradient methods is to use value functions for more variance reduction: instead of just using them as a baseline, you can use them more aggressively and introduce some bias. I won't go into the details
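For discrete action spaces, the divergence being constrained here is easy to write down. This is a minimal sketch of the quantity only; the threshold value is my own illustrative choice, and TRPO itself solves a constrained optimization rather than merely checking the KL after the fact:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete action distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def step_ok(old_probs, new_probs, max_kl=0.01):
    """True if the updated policy's action distribution stayed close
    to the pre-update one."""
    return kl_divergence(old_probs, new_probs) <= max_kl
```

A big jump like `[0.5, 0.5] -> [0.9, 0.1]` has a KL of roughly 0.51, far above a typical trust-region threshold, so such an update would be rejected or shrunk.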
in this talk, but sometimes these are called actor-critic methods. There's also another type of approach, which I briefly touched on in an earlier slide, where instead of just trying to increase the probability of the good actions, you actually differentiate your loss with respect to the actions. This is like the reparameterization trick used for density modeling in unsupervised learning: instead of just increasing the probability of the good actions, you're trying to push the actions towards better actions. With both of these bullet points, you're potentially decreasing your variance a lot, but at the cost of increasing bias. That actually makes the algorithms a little harder to understand and to get working, because with high variance, if you just crank up the amount of data, you can always drive your variance down as much as you want, but with bias, no matter how much data you get, you're not going to get rid of it. If your gradient is pointing in the wrong direction, you're not going to learn anything. Okay, that's it for the policy gradient section of this talk. I wanted to show a quick video of some work my colleagues and I did on learning locomotion controllers with policy gradient methods, which I found pretty exciting when I saw it, so hopefully you find it interesting. What we've got here is a simulated humanoid robot in a realistic physics simulator called MuJoCo. It has a neural network policy which takes in the joint angles of the robot and a little bit of other kinematic information: joint velocities, and the positions of the different links of the robot. That's the input; it's pretty much the raw state of the robot, no clever feature
engineering there. The output is the joint torques, which are set 100 times a second, so we're just mapping from joint angles to joint torques. We define a reward function which is to move forward as fast as possible, so it gets a reward for moving forward, and the episode ends when its head goes below a certain height, meaning it fell over. That's basically the setup; there was a little bit of tweaking of the reward function, but nothing too extensive. You can see that first it just falls forward a lot of times, then slowly it starts to develop a half-decent-looking walk, and eventually it gets it down pretty well. At the very end it could just keep running indefinitely. I think it was actually stable in a strong sense, meaning I could leave it for 15 minutes and it wouldn't fall over; it would just keep going. Here's another robot model that we created without too much thought: we just decided to put a bunch of legs on this thing, so we don't even know how it's supposed to walk. We give it to the same algorithm and it figures out some kind of crazy way to walk. That's the nice thing about reinforcement learning: you don't even need to know how you want the task done. The physics are a little unrealistic here, but in this one we used a model similar to the one in the first demo and just give it a reward for having its head at a certain height, so there's a reward telling it to get its head up as high as possible, and it figures out how to get up off the ground. (Oh, I have low battery. Does anyone have a charger? Thanks a lot, you're a lifesaver.) Any questions about policy gradients before I move on to the next part? Yes, the question was: is the system time-invariant? Yes, it's assumed that the system is stationary,
and also that it doesn't change from one episode to the next. Of course, in some real-world problems that might not be the case, and I think that's also an interesting problem setting, where you have a non-stationary environment. The next question was: to learn a good baseline, do you need to know the dynamics of the system? No, you can just learn it by regression: you estimate the empirical returns and then fit a function to them. The next question: there's a discount factor which should cause the policy to disregard any effects that are delayed by more than about 100 time steps, so how does this guy still learn to stand up, even though that might take more than 100 time steps? You're right, and in fact I would say these methods aren't guaranteed to work well when the relevant delays exceed that horizon. Sometimes, often even, they work anyway, but there's no guarantee. I think there's actually something pretty fundamental missing in how to deal with really long time scales, and people have recently been thinking about hierarchical reinforcement learning, where you have different levels of description of the system: one level with a small time step, then successively larger time steps, which allows you to plan over much longer horizons. That's currently an active area of research, but I would say none of these methods are guaranteed to do anything reasonable if you have more than about 1/(1 - gamma) time steps between action and reward. Next question: in this kind of task, if you introduced terrain or something, could it cope? I think if you didn't train it to deal with terrain, then it might fail,
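The rule of thumb quoted here, that credit assignment beyond roughly 1/(1 - gamma) steps is effectively invisible, is easy to check numerically (my own quick illustration):

```python
gamma = 0.99
horizon = 1.0 / (1.0 - gamma)          # effective horizon: about 100 steps
weight_at_horizon = gamma ** round(horizon)
# A reward delayed by ~1/(1 - gamma) steps is weighted by roughly 1/e ~ 0.37,
# so the gradient signal from it is heavily attenuated.
```

This is the same 0.99^100 decay mentioned earlier in the talk, just viewed from the horizon side.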
although actually I don't think it would, because the funny thing is these policies are really robust: you train them with a stochastic policy, so there's a lot of noise being generated by the policy itself. In practice it's able to deal with the huge noise introduced by the policy, and as a result I found that if you change the dynamics parameters a little, it can usually still work. But there's no guarantee it'll do anything sensible if you give it something you didn't train it for. I think you probably could do the same kind of training with terrain; I didn't have any terrain, so I didn't try it, but that would be nice to try. Okay, I'm going to move on to the next part of the talk; feel free to find me afterwards if you have more questions. Now I'm going to talk about a different type of reinforcement learning algorithm. The previous kind of methods are distinguished by the fact that they explicitly represent a policy, which is the function that chooses your actions, and they try to optimize it with respect to the parameters of the policy. The nice thing about the policy gradient methods we just talked about is that you're optimizing the thing you care about, and you're optimizing it with gradient descent, so it's kind of easy to understand what's going on: if you're getting a proper gradient estimate and you take small enough steps, then you should be improving. Of course, you could still get stuck in a bad local minimum, but at least it's a local minimum, and you can use our understanding of optimization to figure out what's going on. This next class of methods is a little different, because they're not optimizing the policy directly; they're learning something else called a Q function, which measures
how good state-action pairs are. I'll say that more formally in a moment, but it's just measuring how good the actions are. These methods are able to exactly and efficiently solve MDPs in the setting where you have a finite number of states and actions, so they're the preferred methods for exact solutions in those settings. You can also apply them with continuous states and actions, using expressive function approximators like neural networks, but it's a little harder to understand what's going on in these methods, like when they're going to work and when they're not. Let me define the relevant quantities. The Q function is defined as the expected discounted sum of rewards when we condition on the first state and the first action, so we're conditioning on s_0 = s and a_0 = a, and acting under the policy pi. By convention I'm starting at time step zero; I could also have written r_t + r_{t+1} + r_{t+2} and so on, but since we're assuming the system is stationary it would be exactly the same, so just for ease of notation I'll always use times 0, 1, 2, 3, and so on. The Q function is just telling you how good the state-action pair is under your current policy. The value function, or state value function, usually called V, conditions only on the state: it tells you the expected return from that state. Lastly, the advantage function is the difference between the Q function and the state value function, meaning how much better that action is than what the policy would have done. We're not going to talk about advantage functions in this section, but this actually
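Written out, the three definitions just given are:

```latex
Q^{\pi}(s, a) = \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\; a_0 = a \right], \qquad
V^{\pi}(s) = \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s \right], \qquad
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
```

where the expectations are over trajectories generated by acting under pi.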
corresponds to the notion of advantage estimator we briefly men