Deep Reinforcement Learning (John Schulman, OpenAI)
PtAIh9KSnjo • 2016-09-27
Kind: captions
Language: en
Good morning everyone. I'm going to talk about some of the core methods in deep reinforcement learning. The aim of this talk is as follows: first I'll give a brief introduction to what deep RL is and whether it might make sense to apply it to your problem. Then I'll talk about some of the core techniques: on the one hand we have the policy gradient methods, and on the other hand we have methods that learn a Q-function, including Q-learning and SARSA. And at the end I'll talk a little about the pros and cons of these different methods.
So first, what is reinforcement learning? It's a branch of machine learning concerned with taking sequences of actions. It's often described in terms of an agent interacting with a previously unknown environment, trying to maximize some kind of cumulative reward: a reward function that we've defined, accumulated over time. Pretty much any kind of task where you have some goal you want to achieve can be stated in these terms, so this is an extremely general formulation.
What's deep reinforcement learning? It's pretty simple: it's just reinforcement learning where you're using neural networks as function approximators. The interesting thing about reinforcement learning, in contrast to supervised learning, is that it's not totally obvious what you should use your neural network to approximate, and different kinds of algorithms approximate different things. One choice is to use the neural network to approximate your policy, which is how the agent chooses its actions. Another choice is to approximate the value functions, which measure how good or bad different states or actions are. And last, you can try to learn a model of the system, a dynamics model, which makes predictions about next states and rewards.

I'll now give a few examples of different places where you might apply reinforcement learning, and what the observations and actions would be.
One example is robotics. Here you could imagine a robot where the observations are the camera images and the joint angles of the robot, the actions are the joint torques you're applying, and the reward depends on what you want the robot to do; this is something we, as the algorithm designers, get to define. The reward could be to stay balanced, to navigate to some target location, or something more abstract like to serve and protect humans.
Reinforcement learning has also been used in a lot of more practical applications, or at least applications that have been practical in the past; I think robotics will be very practical in the future. One area is inventory management, which is just one example of using reinforcement learning for a decision-making problem. You have to decide how much of every item to stock up on. Your observations would be your current inventory levels, the actions would be how much of each item you're going to purchase, and the reward is your profit. People in operations research, which is the subfield concerned with this, study this kind of problem a lot.

There are also a lot of machine learning problems where people have started to apply reinforcement learning techniques.
One example is attention. The idea in attention is that you don't want to look at the whole input at once; you want to focus on just part of it. For instance, with a large image you might want to crop out part of it and do detection on just that part. Here your observation would be your current image window, the action is where to look, that is, where to crop the image, and the reward is whether or not you make a classification error. So the agent has to choose the right area of the image to look at in order to make the correct classification.
Reinforcement learning has also been used in structured prediction problems, which in the past weren't considered reinforcement learning problems, but it turns out that to solve them properly, they really are reinforcement learning problems. Take machine translation, for example: you get a sentence in the source language and you have to emit a sentence in the target language. Your observations are the sentence in the source language, you're emitting one word at a time in the target language, and you have some reward function that looks at the whole sentence and tells you how good your translation was. Since this reward is non-differentiable, you can't just differentiate through the whole thing and do gradient descent, so it turns out you can use a policy gradient method to optimize your translation system, and people have started to do that.
Okay, those are just a few examples, not exhaustive at all, but I want to say a little bit about how reinforcement learning fits into the picture of all the other types of machine learning problems. Previous courses in this series have talked about supervised learning and unsupervised learning, so how does reinforcement learning relate to them? How is it different? Let's first compare it to supervised learning. In supervised learning, the environment first samples an input-output pair from some distribution rho. The agent makes a prediction y-hat using its function f, and it receives some loss which tells it whether it made the right prediction or the wrong one. So the interpretation is: the environment asks the agent a question, and then tells it the right answer.
Contextual bandits make this problem a little harder, in that they give the learning agent a bit less information. Now the environment samples an input, but notice that there's no correct output associated with it. Then the agent takes an action, and receives some cost c, sampled from a probability distribution P(c | x, a). The agent doesn't know what this probability distribution is; that's what makes the problem hard. So the environment asks the agent a question, the agent answers, and the environment gives it a noisy score on the answer.
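To make this interaction loop concrete, here is a minimal contextual-bandit sketch in Python. The two-context, two-action cost table, the noise level, and the epsilon-greedy rule are all made up for illustration; the talk doesn't prescribe a particular bandit algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem: 2 contexts, 2 actions. The true mean cost of each
# (context, action) pair is unknown to the agent, which only sees noisy
# sampled costs c ~ P(c | x, a) and keeps running averages.
true_mean_cost = np.array([[1.0, 0.2],   # context 0: action 1 is cheaper
                           [0.3, 0.9]])  # context 1: action 0 is cheaper
n_contexts, n_actions = true_mean_cost.shape
cost_sum = np.zeros((n_contexts, n_actions))
cost_cnt = np.zeros((n_contexts, n_actions))
eps = 0.1

for t in range(5000):
    x = rng.integers(n_contexts)                   # environment samples an input
    if rng.random() < eps or cost_cnt[x].min() == 0:
        a = int(rng.integers(n_actions))           # explore
    else:
        a = int(np.argmin(cost_sum[x] / cost_cnt[x]))  # exploit lowest estimated cost
    c = true_mean_cost[x, a] + 0.1 * rng.normal()  # noisy score on the answer
    cost_sum[x, a] += c
    cost_cnt[x, a] += 1

est = cost_sum / np.maximum(cost_cnt, 1)
print(np.argmin(est, axis=1))                      # learned best action per context
```

Since the agent never sees the distribution that generates c, it can only query it through interaction, which is exactly the restriction described above.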
This setting actually has a lot of applications: personalized recommendation is a big one, along with advertising. You have a customer and you know what they liked in the past, so you have to make a prediction about what they're going to like in the future, and show them appropriate ads or links: which book you want to advertise to them, which video you want to show them, and so on. The big difference between this and the supervised learning setting is that you don't have access to the loss function you're trying to optimize; in particular, you can't differentiate through it. We don't know the process that generates c, so we can't compute the gradient of the loss function and use it to tune the agent's parameters. That makes the problem a bit harder, or rather, you have to use a different kind of algorithm.
Lastly, reinforcement learning is almost the same as the contextual bandit setting, except now the environment is stateful. Instead of sampling the input from scratch every time step from the same distribution, the state evolves over time: you have some transition probability distribution, called P here, where the state x_t is conditioned on the previous state and the previous action. That makes the problem quite a bit harder, for a number of reasons. For one thing, the inputs you're getting depend on what actions you're taking, and that makes it harder to develop a stable, reliable algorithm, because as the agent starts to learn, it gets different inputs, which can lead to all sorts of out-of-control behavior. It also means you have delayed effects: since the system is stateful, you might need to take a lot of actions to get into the right state, so you can't just act greedily every time step; you have to think ahead, effectively.

To summarize, there are two differences. First, you don't have full analytic access to the function you're trying to optimize; you have to query it through interaction. Second, you're interacting with a stateful world, which means the input you get depends on your previous actions. And if you take just the first of those differences between supervised learning and reinforcement learning, you get the contextual bandit setting, so that's sort of halfway in between.
Okay, I realize this audience probably has people with different interests: some people are doing research and want to know what's the latest in the research world, and some people want to apply these machine learning techniques to practical applications. This slide is for the latter group. If you have some problem where you think reinforcement learning might be relevant, and you're wondering whether you should apply it, I should first say that the answer might be no: it might be overkill, especially deep reinforcement learning. These are fairly new techniques that won't work out of the box very well; they aren't that well established, and they have a lot of knobs to tune. So it might be worth investigating some other methods first.

If your problem has a small number of parameters you're trying to optimize over, and you have a simulator you can run lots of experiments on, then derivative-free optimization methods are likely to be better than reinforcement learning, or at least easier to get working. These methods take a black-box function: you put in a parameter vector and you get back a noisy estimate of the score, and the algorithm optimizes over the parameters being fed into that black box. There's a variety of different methods for derivative-free optimization, but they're easier to understand than reinforcement learning, and they do kind of work out of the box.
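As a sketch of this black-box setup, here is a small cross-entropy-method-style optimizer in Python: sample parameter vectors, score them with the noisy black box, keep the best few, and refit the sampling distribution. The quadratic objective and every hyperparameter are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy black box: put in a parameter vector, get back a noisy estimate of
# the score. (Toy objective, maximized at theta = [1, -2].)
def blackbox_score(theta):
    return -np.sum((theta - np.array([1.0, -2.0])) ** 2) + 0.05 * rng.normal()

# Cross-entropy method: sample candidates, keep the elite, refit the sampler.
mean, std = np.zeros(2), 2.0 * np.ones(2)
for it in range(50):
    cands = mean + std * rng.normal(size=(100, 2))
    scores = np.array([blackbox_score(c) for c in cands])
    elite = cands[np.argsort(scores)[-10:]]       # 10 highest-scoring candidates
    mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3

print(mean)  # should end up near [1, -2]
```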
A lot of problems can actually be seen as contextual bandit problems, where the statefulness of the world isn't that relevant. For example, people look at advertising as a contextual bandit problem most of the time, because you decide what ad to present the user with and then they either click on it or they don't. In reality the user is somewhat stateful, because if you show them a terrible ad they might just go and download an ad blocker, so your actions do have some repercussions, but often you can approximate it as a contextual bandit problem where there is no state. There's a better theoretical understanding of contextual bandit problems, and there are methods with some guarantees, so if your problem is a contextual bandit problem, you might want to use those kinds of algorithms instead.
And lastly, the operations research field has been using these methods for a while on real problems. They have a set of methods which are pretty much the basic algorithms, policy iteration and value iteration, along with well-developed ways of doing feature engineering for these problems that end up working pretty decently. Those techniques are also worth considering instead of trying to throw a big neural network at the problem.

Okay, now that I've talked about why not to use deep reinforcement learning, or what it's not good for, I'll talk about some recent success stories in deep reinforcement learning: achievements that probably wouldn't have been possible using these other techniques. A few years ago there was a pretty influential result by Mnih et al. from DeepMind, where they used a deep Q-learning algorithm to play Atari games using the screen images as input. That's hard because you're trying to do different things in all these games, and some of them are kind of complicated, so it's pretty remarkable that the same simple algorithm can solve them all. Since then, people have also solved this domain using policy gradients and another algorithm called DAGGER.
Another big, groundbreaking result was beating a champion-level player at Go, also by DeepMind, using a combination of supervised learning from expert games, plus policy gradients to fine-tune the supervised policy, plus Monte Carlo tree search, plus value functions to make the search work better: a combination of techniques including reinforcement learning. In robotics, some of my colleagues at Berkeley had some very nice results learning manipulation tasks in real time, using an algorithm called guided policy search on the PR2 robot. And some of my colleagues and I have been working on robotic locomotion using policy gradient methods. People have been working on locomotion for a while and have achieved pretty good results using highly engineered, domain-specific methods, but previously there hadn't been much success using general methods to solve it. And last, there have been some recent results playing 3D games using policy gradients; in fact, there was even a contest I heard about a couple of days ago with the new VizDoom task, which is pretty nice, so you might want to check out VizDoom.
Okay, that's it for the high-level overview part of this talk; now I'm going to get into the actual formalism and the technical details. The basic object in the field of reinforcement learning is the Markov decision process (MDP). The MDP is defined by the following components: you have a state space, which is all the different states of the system; the action space, which is all the actions the agent can take; and a probability distribution P(r, s' | s, a), which determines the probability of the next state and reward. Here r is the reward, s' is the next state, and s and a are the current state and action, so it's a conditional probability distribution. Sometimes people split this out into a separate reward function, but that's basically an equivalent formulation.

Sometimes there are some extra objects defined. We'll consider an initial state distribution, so the world starts out in a certain state. The typical optimization problem you want to solve, given this MDP, is to maximize expected cumulative reward, though there are various ways of defining that more precisely, which I'll go into later.
There are various different settings of reinforcement learning, where you define a slightly different optimization problem. The one we'll be most concerned with is called the episodic setting. Here the agent's experience is split up into a series of episodes, which have finite length. In each episode, we first sample the initial state of the world from some probability distribution mu, and then the agent keeps acting until the world ends up in some terminal state.

To give some examples of what terminal states might be like and what an episodic reinforcement learning problem might look like: one case is when termination is good and you want to terminate the episode as fast as possible. If we imagine setting up a task with some kind of taxi robot that should get to its destination as fast as possible, then the episode would be one trip, and the robot is trying to end the episode as fast as possible. Another example is a waiter robot with a fixed-length shift: the episode has a fixed length, and the waiter has to do as well as possible during that shift, say, maximize tips or customer happiness. And you could imagine another kind of task where termination is bad and you want the episode to last as long as possible. You can view life as an example of that, but you could also imagine a walking robot, where you want it to walk as far as possible before it falls over.

In this setting it's pretty easy to define what the goal is: we just want to maximize the expectation of the total reward per episode.
The last object we're going to introduce here is the policy. The policy is just the function the agent uses to choose its actions. We have deterministic policies, where the policy is denoted by pi and the action is just some function of the state, a = pi(s); and we also have stochastic policies, where the policy is a conditional probability distribution, a ~ pi(a | s).

Now let's make the setting of the episodic MDP a little more precise. First we sample the initial state s_0 from the distribution mu. Then we sample the first action a_0 from the policy, then we sample the next state and reward from the transition probability distribution, and so on, until we reach a terminal state s_T. The quantity we care about is the sum of all the rewards, r_0 + r_1 + ... + r_{T-1}, and we want to maximize it: eta(pi) is defined as the expected total reward of the policy pi. Here's a picture that illustrates exactly the same thing; you can look at it as a graphical model.
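The sampling process just described can be written as a short rollout loop. The three-state chain environment, its rewards, and the coin-flip policy below are made-up placeholders; the point is just the structure s_0 ~ mu, a_t ~ pi(. | s_t), (s_{t+1}, r_t) ~ P(. | s_t, a_t), repeated until a terminal state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain MDP: states 0, 1, 2; state 2 is terminal.
# Action 1 moves right and earns reward 1; action 0 stays with reward 0.
def sample_transition(s, a):
    return (s + 1, 1.0) if a == 1 else (s, 0.0)

def policy(s):                        # stochastic policy: move right half the time
    return int(rng.random() < 0.5)

def run_episode(max_steps=100):
    s, rewards = 0, []                # s_0 ~ mu (deterministic here)
    for _ in range(max_steps):
        a = policy(s)                 # a_t ~ pi(. | s_t)
        s, r = sample_transition(s, a)    # (s_{t+1}, r_t) ~ P(. | s_t, a_t)
        rewards.append(r)
        if s == 2:                    # terminal state s_T
            break
    return sum(rewards)               # R = r_0 + r_1 + ... + r_{T-1}

# eta(pi) = E[R]: estimate it by averaging returns over many episodes.
returns = [run_episode() for _ in range(2000)]
print(np.mean(returns))               # close to 2.0: two right-moves per episode
```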
Lastly, in the policy gradient section in particular, we're going to be interested in parameterized policies. Here we have a parameter vector theta which specifies exactly what the policy is. For example, the family of policies could just be a neural network: you have a certain neural network architecture, and theta specifies all the weights of the network. We could have a deterministic policy, of course, or a stochastic policy.

If you're wondering what a policy would concretely look like, that is, how you use a neural network to represent your policy: you do exactly the same thing you would do if this were a classification or regression problem. The state is your input and the action is your output. If you have a discrete action space, a discrete set of actions, then you would use a network that outputs a vector of probabilities, the probabilities of the different actions; this is exactly like a classifier. If you have a continuous action space, you would have your neural network output the mean and the diagonal of a covariance matrix of a Gaussian distribution; this is just like doing regression. So you can use the same kinds of architectures you'd use in supervised learning.

Okay, that's it for the formalism of MDPs.
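As a concrete illustration of the discrete-action case just described, here is a one-hidden-layer softmax policy in plain numpy. The layer sizes and the weight initialization are arbitrary choices for the sketch; theta corresponds to all four weight arrays taken together.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden, n_actions = 4, 32, 3   # arbitrary sizes for the sketch

# theta: all the weights of the network.
W1 = rng.normal(0.0, 0.1, (obs_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.1, (hidden, n_actions))
b2 = np.zeros(n_actions)

def policy_probs(s):
    """pi(. | s; theta): the forward pass of a classifier over actions."""
    h = np.tanh(s @ W1 + b1)
    logits = h @ W2 + b2
    z = np.exp(logits - logits.max())   # numerically stable softmax
    return z / z.sum()

s = rng.normal(size=obs_dim)            # some state observation
p = policy_probs(s)
a = int(rng.choice(n_actions, p=p))     # sample an action a ~ pi(. | s)
print(p, a)
```

For the continuous case you would instead output a mean and a per-dimension standard deviation and sample from the resulting Gaussian, just like a regression model with an added noise model.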
Now I'm going to go into policy gradient methods, which are one broad and general class of reinforcement learning methods, and quite effective. To give a brief overview, here's the intuition of what policy gradient methods are going to do. Here capital R means the sum of rewards of the whole episode. Our optimization problem is to maximize the expectation of the total reward, given our parameterized policy pi_theta. The intuition of how the algorithm works is that we collect a bunch of trajectories, that is, we just run a bunch of episodes using our policy, and then we want to make the good trajectories more probable. On some of the trajectories the agent was lucky and they were really good; on some the agent was unlucky and they were bad. On the good ones, meaning there was high reward, the agent probably took good actions, so we want to increase the probability of the actions from those trajectories.

The most basic version of policy gradient methods just tries to make the good trajectories more probable, without trying to figure out which actions were good and which were bad. Slightly more elaborate methods try to figure out which actions were good and which were bad, and then make the good actions more probable. And lastly, there's another class of methods which actually try to push the actions toward better actions: they differentiate the loss function with respect to the actions and push the actions toward better ones. We're mostly going to talk about the first two here.

Oh, there's a question.
Yeah, good question. Well, we're maximizing over the policy; we're trying to find the best policy. But here the policy is assumed to be parameterized, so there's some parameter vector theta that specifies the policy, and now we just want to maximize with respect to theta. Any other questions? Okay.
So there's a very fundamental concept called the score function gradient estimator, which underlies policy gradient methods. To introduce it, we're not going to talk about policies and RL at all; we're just going to assume we have some expectation E_x[f(x)], where x is sampled from some parameterized probability distribution p(x | theta). We want to compute the gradient of this expectation with respect to theta. There's a general formula that does this, and the way you derive it is: you write the expectation as an integral, then you move some things around, swapping the integral with the derivative, and turn it back into an expectation. What you get at the end is the bottom line, which says you take the expectation of the function value times grad log probability:

grad_theta E_x[f(x)] = E_x[f(x) grad_theta log p(x | theta)]

This is an unbiased estimator of the gradient, meaning that if we get enough samples it'll converge to the right thing. The way you compute this estimator, meaning the way you get a noisy estimate of the gradient of the expectation, is that you just take one sample x from the distribution, and then you compute f(x) times grad_theta log p(x | theta).

The only requirement for being able to use this estimator is that we need to be able to compute the probability density analytically, and we need to be able to differentiate it with respect to theta. There's another way of deriving it using importance sampling: you write down the importance sampling estimator for the expectation, then swap the derivative with the expectation, and you get the same thing.
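Here's a quick numerical check of the estimator on an assumed toy setup: x ~ N(theta, 1), for which grad_theta log p(x | theta) = x - theta, and f(x) = x^2. Then E[f(x)] = theta^2 + 1, so the true gradient is 2*theta and we can compare the Monte Carlo estimate against it.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.5
f = lambda x: x ** 2                      # f is only queried, never differentiated

x = rng.normal(theta, 1.0, size=400_000)  # samples from p(x | theta)
score = x - theta                         # grad_theta log p(x | theta) for N(theta, 1)
g_hat = np.mean(f(x) * score)             # score function gradient estimate

print(g_hat, 2 * theta)                   # Monte Carlo estimate vs. exact gradient
```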
little bit of intuition about this
estimator oops okay so f ofx is
measuring how good the sample X is um so
that means that so G hat here is our
gradient estimator meaning this is what
we get if we take one sample X subi and
we compute our estimator this is our
estimate of the gradient um so if we
move in Direction Gat um that pushes up
the log probability of our sample xabi
in proportion to how good it is so if we
have really good um if we got a really
good function value then we're going to
try to push up its log probability a lot
and if it was a bad function value then
we're not going to try to push it up
very much so it's pretty simple
intuition um the really nice thing is um
this is valid even if f ofx is
discontinuous or if f ofx is um unknown
meaning you only uh you don't get to
differentiate it you just get to see the
function values um or um the sample
space um is a discrete set so X doesn't
even have to be continuous um and this
is um quite uh this is quite remarkable
actually that you don't even need to
have access to the full um you don't
need to know exactly um what the
function is that you're optimizing you
just have to be able to query
um for the function value um and this
means this is a way of um being able to
differentiate um
functions through a system that has
non-differentiable pieces um so for
example in um in robotic Locomotion one
issue is that um you have contacts
between the robot's foot and the ground
and um contact you make and break
contact and that causes a discontinuous
change in the Dynamics um so that makes
it really hard to do smooth optimization
techniques to come up with the right
Behavior so when you use this kind of um
grading estimator along with policy
gradients which I'm going to uh talk
about very soon um you can actually just
uh differentiate you can optimize this
system um even though it has
differentiable pieces in
it
okay so
Okay, here's another little picture of what's going on. We have our function f(x), whose expectation we're trying to maximize, and we have our probability density p(x). We sample a bunch of values from the density; those are the blue dots on the x-axis. Then we look at the function values, and we try to push the probability distribution so that the probability goes up at these samples in proportion to the function value. Over on the right side of the curve, that means we're trying to push the probability value up really hard, and on the left side we're pushing it up softly. So what's going to happen is that the probability density slides to the right, if you imagine a sort of physical analogy there.

So that's the score function gradient estimator. It's a general technique, and it can be used in various machine learning problems.
Now we're going to apply it to the reinforcement learning setting, and we're going to take our random variable x to be a whole trajectory: the trajectory consists of state, action, reward, state, action, reward, and so on until the end of the episode. To get the gradient of the expected reward, all we have to do is compute the grad log probability of the trajectory times the total reward.

The probability of the trajectory sounds like a really unfriendly quantity, because there's a long, complicated process that generates the trajectory, with lots of time steps. But we can write out what this probability density is: it's just a product of probabilities. We've got mu(s_0), our initial state distribution, and then at every time step we sample the action according to pi and sample the next state and reward according to our dynamics model. The log turns that product into a sum, and here's the cool part: everything that doesn't contain theta drops out. The point is that there are parts of this probability distribution p(tau | theta) that we don't have access to: since this is reinforcement learning, we don't assume we know the dynamics model of the system; we just find out about it by doing episodes. But since the product turns into a sum, all the pieces we don't know, the log P terms and the log mu term, drop out when we differentiate with respect to theta, so it doesn't matter. What we get in the end is a sum of log probabilities of actions, grad log pi of action given state. So our formula for the gradient of the expectation is just the expectation over trajectories of the total reward of the trajectory times the gradient of the sum of all the log probs:

grad_theta E_tau[R] = E_tau[R grad_theta sum_t log pi(a_t | s_t, theta)]

The interpretation of this is that we take our good trajectories and try to increase their probability in proportion to how good they are. You can think of this as being similar to supervised learning, where we treat the good trajectories, the ones with high rewards, as positive examples in a supervised learning problem: we use those to train the policy, basically treating their actions as positive examples.
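Here is a minimal sketch of this most basic estimator, total reward times grad of the sum of log probs, used for gradient ascent on a made-up stateless problem: two actions, action 1 pays reward 1, action 0 pays 0, five steps per episode. The softmax parameterization, batch size, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(2)                       # logits of a softmax policy over 2 actions
T, lr = 5, 0.05

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

for it in range(300):
    grads = []
    for _ in range(20):                   # collect a batch of trajectories
        p = softmax(theta)
        acts = rng.choice(2, size=T, p=p)
        R = float(acts.sum())             # total episode reward: 1 per action-1 step
        # For a softmax policy, grad_theta log pi(a) = one_hot(a) - p.
        glogp = np.eye(2)[acts] - p       # shape (T, 2)
        grads.append(R * glogp.sum(axis=0))   # R * grad sum_t log pi(a_t)
    theta += lr * np.mean(grads, axis=0)  # gradient ascent on E[R]

print(softmax(theta))                     # probability mass should shift onto action 1
```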
Okay, now we can improve this formula a little bit. That was the most basic version: it's an unbiased estimator for the policy gradient, so if we take the expression inside the expectation on the right-hand side and take one sample of it, it has the right mean, and if we average enough samples we'll recover the policy gradient. But we can also write down other formulas that have the same mean but lower variance, that is, better estimators for the policy gradient. That's actually quite important, because the estimator from the previous slide is really bad when you have a large number of time steps, meaning it has really high variance.

The first thing we can do is use the temporal structure of the problem. By the way, deriving this next bunch of formulas just takes a bunch of really straightforward manipulation, where you move around expectations; I'm not going to go through all the math, but I'll just say what the formulas are.
argument from the previous slide to derive the gradient estimator for a single reward term, so we end up with that reward term times the gradient of a sum of log-probs. And just summing over that, we get a new formula, where we're no longer multiplying the grad log-prob of the whole trajectory by the sum of all rewards. So let's look at that bottom formula: now we have a sum over time of the grad log-probability of the action at that time, times the sum of future rewards. In the formula from the previous slide we would have had all the rewards in that sum, but now we just have the future rewards. That makes sense, because an action can't affect the probability of the previous rewards; to figure out whether an action was good, we should only be looking at the future rewards. So this is a slightly better formula than the one on the previous slide, meaning it has the exact same mean, except the expression inside the expectation has lower variance. And we can further reduce the variance by introducing a baseline. Now we can take any old function b, which takes in a state and outputs a real number, and we can subtract it from our sum of future rewards, and we haven't affected the mean of the estimator at all; we didn't change the expectation at all by introducing this baseline.
Baseline um so yeah for any choice of B
this gives us an unbiased estimator by
the way if you're not that familiar with
the terminology of estimators what I'm
saying is uh we have a um expectation um
on the right hand side of that for uh
formula uh and uh the quantity inside
that expectation is What's called the
estimator and um if we get a bunch of
samples uh then we can get an estimate
of um
of the thing on the left hand side which
is what we care
about so um so when I say it's an
unbiased estimator that just means that
well that just means that this equation
is correct meaning that the thing on the
right hand side equals the thing on the
left hand
side um so yeah this works for any
choice of Baseline and um a near optimal
choice is to use the expected return so
the expected sum of future
rewards and uh the interpretation of
that is um if we took an action we only
want to increase the probability of the
action if it was a good action um so how
do we tell if it was a good action well
the sum of rewards after that action
should have been better than expected um
so the B ofs is the expected sum of
rewards and we're just taking the
difference between the measured thing
and the expected
thing yeah
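To make that concrete, here's a small sketch of those two fixes, computing, for one sampled episode, the sum of future rewards at each time step and then subtracting a baseline. The numbers and the baseline values are made up for illustration, standing in for a learned b(s):

```python
def returns_to_go(rewards):
    """Sum of future rewards from each time step (inclusive)."""
    out = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return list(reversed(out))

# One sampled episode's rewards.
rewards = [1.0, 0.0, 2.0, 1.0]
rets = returns_to_go(rewards)    # sum of future rewards at each step

# Stand-in baseline values b(s_t); in practice a learned function of state.
baseline = [3.0, 2.5, 2.0, 1.0]
advantages = [g - b for g, b in zip(rets, baseline)]
```

Each grad-log-prob term at time t then gets weighted by `advantages[t]` rather than by the whole episode's total reward.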
Okay, that was a pretty key thing for variance reduction, and I'm going to introduce one last variance reduction technique. Actually, all three of these are really important: basically nothing's going to work, except for maybe really small-scale problems, unless you do these things. So the last variance reduction technique is to use discounts. The discount factor ignores delayed effects between actions and rewards. What we're going to do here looks kind of like a hack, but there's an explanation for it: instead of taking the sum of rewards, we're going to take a discounted sum of rewards, meaning we add this exponential factor gamma, so that when we're multiplying the grad log-probability by some future reward, we multiply it by a quantity that decays with time. People typically use something like gamma = 0.99 or gamma = 0.95. So if you used 0.99, that means after 100 time steps you're going to be reducing the reward by a factor of 1/e; you're exponentially decaying the effect of the future rewards. The intuition is that an action shouldn't affect rewards really far in the future: the assumption is that the system doesn't have really long-term memory, it sort of resets, or the effects aren't that far delayed, so you can just ignore the interaction between an action and a reward way, way in the future. That's the intuition. So now, instead of taking the baseline to be the expected sum of future rewards, we want to use a discounted sum, so now we're measuring whether the action was better than expected according to the discounted sum. And now there's a more general
sum um and now there's a more General
class of formulas that looks like the
one that I just wrote so this this one
that's on the top of the slide is pretty
good and um this is like almost as good
as anything you're going to do to within
a small constant Factor uh but um
there's there's a more General class of
formulas that um look like um grad log
probability times uh some quantity a hat
which we call the advantage estimate and
this is in general just going to be um
an estimate of um this is an it has a
more a precise definition which is how
much uh how like how much was this
action um better than the um average
action taken by the policy but in but
informally this just means how much
better was the action then
expected so and and this formula makes a
lot of sense because we we want to
increase the probability of the good
actions and de decrease the probability
of the bad ones so we should um we
should increase it in proportion to the
goodness of the
action. Okay, so just to summarize: I just told you there's this gradient estimator, meaning there's this expression you can compute which gives you a noisy estimate of the policy gradient. So how do you actually turn this into an algorithm? Here's what the algorithm looks like; it's pretty much what you'd expect. You initialize your policy parameter and your baseline function. For each iteration, you execute the current policy to get a bunch of whole episodes, meaning whole trajectories. For each time step in each trajectory, you compute the return, meaning the sum of discounted rewards following that time step, and the advantage estimate, which is the sum of discounted rewards minus the baseline. Then you refit the baseline by trying to make the baseline function equal the returns, and then you update the policy using a policy gradient estimator. So you're just doing SGD while updating the baseline as you go along. Yeah, so that's the vanilla policy gradient algorithm, and it's been used to obtain some pretty good results, so it's not that bad of an algorithm, but there are several different directions in which it can be
improved. So one issue that you run into is with step sizes. In supervised learning, step sizes aren't that big of a deal: maybe you take too big of a step, but that's okay, you'll fix it on the next update, and your current function, your current classifier for example, doesn't affect what inputs you're getting. So even if your network is just kind of thrashing around for a while because you're taking too-big steps, that's not going to cause any problems. So step sizes aren't that big of a deal; you can start off with a large step size and anneal it down to zero, and that works pretty well. In reinforcement learning, if you take too big of a step, you might wreck your policy. Even if you don't actually change the network that much, so you don't lose all your nice features, you might just change its behavior a little too much, and now it's going to do something totally different and visit a totally different part of state space. Since in reinforcement learning the system is stateful, and your state distribution depends on your policy, that brings in a really different problem: after you took that step, the next batch of data you're going to get was collected by the bad policy, and now you're never going to recover, because you just forgot everything.
Yeah, so one way to fix this is basically to try to stop the policy from taking too big of a step. You can look at the KL divergence between the old policy and the new policy, before and after the update, and make sure the distributions aren't that different, so you're not taking too big of a step; it's kind of an obvious thing to do. So my colleagues and I developed an algorithm called trust region policy optimization, which looks at the action distributions and tries to make sure the KL divergence isn't too large.
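As a sketch of the quantity being constrained, here's the KL divergence between two discrete action distributions at one state; the probabilities and the threshold are made up for illustration, and the actual TRPO update (a constrained optimization, with the KL averaged over visited states) is beyond this toy:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Action probabilities at one state, before and after an update (made up).
old_policy = [0.5, 0.3, 0.2]
new_policy = [0.48, 0.32, 0.2]
max_kl = 0.01  # illustrative trust-region size

# A small policy change gives a small KL, so this step would be acceptable.
step_ok = kl_divergence(old_policy, new_policy) <= max_kl
```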
This is very closely related to earlier natural policy gradient methods, which are doing something similar, but usually it's not set up as a hard constraint on the KL divergence. So another type
extension of policy gradient methods is
um to do more uh to use value uh value
functions to do um more variance
reduction um instead of just using them
as a baseline you can also um you can
use them more aggressively and introduce
some bias um so I won't go into the
details in this talk um but um sometimes
these are called actor critic methods
um there's also another type of approach
which I briefly touched on in the um
earlier slide um where instead of just
trying to increase the probability of
the good actions you actually
differentiate your loss with respect to
the actions um this is like the
reparameterization trick which is used
um for um like for density modeling and
unsupervised learning
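Here's a toy one-dimensional sketch of that idea (an invented example, not the method from any particular paper): with a reparameterized Gaussian action a = mu + sigma * eps and a known differentiable reward r(a) = -(a - 3)^2, you estimate the gradient of expected reward with respect to mu by differentiating through the action itself:

```python
import random

random.seed(0)
mu, sigma = 0.0, 0.5

def reward_grad(a):
    # d/da of the (assumed known, differentiable) reward r(a) = -(a - 3)**2
    return -2.0 * (a - 3.0)

# Reparameterization: a = mu + sigma * eps, so da/dmu = 1, and the gradient
# of E[r(a)] w.r.t. mu is estimated by averaging r'(a) over sampled eps.
n = 100_000
grad_mu = sum(reward_grad(mu + sigma * random.gauss(0, 1)) for _ in range(n)) / n
```

Analytically the gradient is -2 * (mu - 3) = 6 at mu = 0, so the Monte Carlo estimate should land near 6, with far lower variance than a score-function estimate of the same quantity.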
So here, instead of just increasing the probability of the good actions, you're trying to push the actions towards better actions. And I'd say for both of these bullet points, you're potentially decreasing your variance a lot, but at the cost of increasing bias, so it actually makes the algorithms a little harder to understand and to get working: with high variance, if you just crank up the amount of data, you can always drive your variance down as much as you want, but with bias, no matter how much data you get, you're not going to get rid of the bias, and if your gradient is pointing in the wrong direction, then you're not going to learn anything.
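Before moving on, here's a rough sketch of the core computation in the vanilla policy gradient loop summarized earlier: discounted returns per time step, used both as regression targets for the baseline and, minus the baseline, as weights on the grad-log-prob terms. The function names are mine, and the actual policy/SGD step is omitted:

```python
def discounted_returns(rewards, gamma=0.99):
    """Discounted sum of future rewards from each time step."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

def vanilla_pg_targets(episodes, baseline, gamma=0.99):
    """Per-episode returns (baseline regression targets) and advantages
    (weights on each grad-log-prob term in the policy update)."""
    all_returns, all_advantages = [], []
    for states, rewards in episodes:
        rets = discounted_returns(rewards, gamma)
        all_returns.append(rets)
        all_advantages.append([g - baseline(s) for s, g in zip(states, rets)])
    return all_returns, all_advantages

# Toy episode: integer "states", constant baseline, gamma = 0.5 for easy numbers.
episodes = [([0, 1, 2], [1.0, 1.0, 1.0])]
returns, advs = vanilla_pg_targets(episodes, baseline=lambda s: 1.0, gamma=0.5)
```

Each iteration you'd refit the baseline toward `returns` and take an SGD step on the policy weighted by `advs`.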
Okay, so that's it for the policy gradient section of this talk. I wanted to show a quick video of some work that my colleagues and I did on learning locomotion controllers with policy gradient methods, which I found pretty exciting when I saw it, so hopefully you find it interesting. So here what we've got is, let's see, okay, yeah, it's a simulated humanoid robot in a realistic physics simulator called MuJoCo.
policy uh which takes in um The Joint
angles of the robot and uh maybe some
and a little bit of other kinematic
information like joint it's got joint
velocities and also um positions of the
different um Links of the robot so
that's what the input is it's pretty
much the raw um state of the robot like
no clever feature engineering there and
um the output is going to be the joint
torqus which are set 100 times a second
so we're just mapping from joint angles
to joint
torqus and uh we Define a reward
function which is to move forward as
fast as possible so it gets a reward for
moving forward and um it gets a uh so um
the episode ends when it its head goes
below a certain height meaning it fell
over so that's basically the setup there
was a little bit of tweaking for the
reward function but um not too extensive
um
Whoops. Yeah, so you can see: first it just falls forward a lot of times, and then slowly it starts to develop a half-decent-looking walk, and eventually it gets it down pretty well. At the very end of this, it could just keep running indefinitely; I think it was actually stable in a strong sense, meaning I could just leave it for 15 minutes and it wouldn't fall over, it would just keep going. So here's another robot model that we created without too much thought; we just decided to put a bunch of legs on this thing, so we don't even know how it's supposed to walk. We just give it to the same algorithm, and it figures out some kind of crazy way to walk.
So that's the nice thing about reinforcement learning: you don't even need to know what you want it to do. I think the physics are a little unrealistic here, but here we used a similar model to the one in the first demo, except we just give it a reward for having its head at a certain height. So there's a reward telling it to get its head up as high as possible, and then it figures out how to get up off the ground. Oh, let's see, I have low battery; does anyone have a charger that I could... oh, thanks a lot, you're a lifesaver. Okay, any questions about
policy gradients before I move on to the next part? Oh yeah, so the question was: is the system time-invariant? Yes, that's assumed, that it's stationary. Oh right, and also that it doesn't change from one episode to the next; of course, in some real-world problems that might not be the case, so I think that's also an interesting problem setting, where you have a non-stationary environment. So the next question was: to learn a good baseline, do you need to know the dynamics of the system? No, you can just learn it by doing regression: you estimate the empirical returns, and then you do regression to try to fit a function to that.
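A minimal sketch of that answer, fitting a linear baseline to empirical returns by least squares, with no knowledge of the dynamics. One scalar state feature is assumed here for simplicity; in practice the baseline is usually a neural network fit the same way:

```python
def fit_linear_baseline(features, returns):
    """Least-squares fit of returns ~ w * feature + c (no dynamics needed)."""
    n = len(features)
    mx = sum(features) / n
    my = sum(returns) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(features, returns))
    var = sum((x - mx) ** 2 for x in features)
    w = cov / var
    c = my - w * mx
    return lambda x: w * x + c

# Empirical (feature, return) pairs gathered from rollouts (made-up numbers).
feats = [0.0, 1.0, 2.0, 3.0]
rets = [1.0, 3.0, 5.0, 7.0]  # exactly linear here, so the fit is exact
baseline = fit_linear_baseline(feats, rets)
```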
Yeah, so the question is: there's a discount factor which should cause the policy to disregard any effects that are delayed by more than about 100 time steps, so how does it still work that this guy learns how to stand up, even though that might take more than 100 time steps? Is that correct? Yeah, you're right, and in fact I would say that these methods aren't guaranteed to work well if you have more than about 100 time steps. Sometimes they work anyway, often they work anyway, but there's no guarantee. I think there's actually something pretty fundamental missing in how to deal with really long time scales, and people have recently been thinking about hierarchical reinforcement learning, where you have different levels of detail of the system: you might have one level of description with a small time step, and then successively larger time steps, and that allows you to plan over much longer horizons. That's currently an active area of research, but I would say that none of these methods are guaranteed to do anything reasonable if you have more than 1/(1 - gamma) time steps between action and reward. Oh yeah, so the question is: in this kind of task, if you introduced terrain or something, could it cope? I think if you didn't train it to deal with terrain, it might fail. Actually, I don't think it would fail, because the funny thing is that these policies are actually really robust: you train them with a stochastic policy, so there's a lot of noise being generated by the policy itself, and in practice it's able to deal with the huge noise introduced by the policy. As a result, I found that if you change the dynamics parameters a little, it can usually still work, but there's no guarantee that it'll do anything if you give it something you didn't train it for. I think you probably could do the same kind of training with terrain; I didn't have any terrain, so I didn't try it, but that would be nice to try. Okay, I'm going to move on to the next part of the talk; feel free, if you have more questions, to find me afterwards. Okay, so now I'm going to talk
about a different type of reinforcement learning algorithm. Okay,
so the previous kind of methods are distinguished by the fact that they explicitly represent a policy, which is the function that chooses your actions, and they try to optimize it with respect to the parameters of the policy. The nice thing about the policy gradient methods we just talked about is that you're optimizing the thing you care about, and you're optimizing it with gradient descent, so that makes it kind of easy to understand what's going on: if you're getting a proper gradient estimate and you take small enough steps, then you should be improving. Of course, you could still get stuck in a bad local minimum, but at least it's a local minimum, and you can use our understanding of optimization to figure out what's going on. This next class of methods is a little different, because they're not optimizing the policy directly; they're learning something else, called a Q function, which measures how good state-action pairs are. I'll say that more formally later, but it's just measuring how good the actions are.
And these methods are able to exactly solve MDPs efficiently in the setting where you have a finite number of states and actions, so they're the preferred methods for exactly solving MDPs in those settings. But you can also apply them with continuous states and actions, using expressive function approximators like neural networks; it's just a little harder to understand what's going on in these methods, like when they're going to work and when they're not going to work. So I'll define the relevant quantities here. The Q function is defined as the expected sum of rewards when we condition on the first state and the first action: we're conditioning on s0 = s, a0 = a, and the Q function is the expected discounted sum of rewards when we're acting under the policy pi.
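Written out, matching the verbal definition just given:

```latex
Q^{\pi}(s, a) \;=\; \mathbb{E}\!\left[\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \;\middle|\; s_0 = s,\ a_0 = a \,\right]
```

where the expectation is over trajectories with all actions after the first drawn from the policy pi.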
By convention, I'm starting out with time step zero. I could have also said we're taking r_t + r_{t+1} + r_{t+2} and so on, but since we're assuming the system is stationary, it should be exactly the same, so just by convention I'm always going to use times 0, 1, 2, 3 and so on, for ease of notation. So the Q function is just telling you how good this state-action pair is under your current policy. The value function, well, the state value function, usually called V, conditions only on the state; it's telling you how good that state is, what's the expected reward starting from that state. And lastly, the advantage function is the difference between the Q function and the state value function, meaning how much better is that action than what the policy would have done. We're not going to talk about advantage functions in this section, but this corresponds to the notion of advantage estimator we briefly mentioned.