MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL)

zR11FLZ-O9M • 2019-01-24

Kind: captions
Language: en
today I'd like to overview the exciting
field of deep reinforcement learning
introduce it and provide you some
of the basics I think it's one of the
most exciting fields in artificial
intelligence it's marrying the power and
the ability of deep neural networks to
represent and comprehend the world with
the ability to act on that understanding
on that representation taking as a whole
that's really what the creation of
intelligent beings is understand the
world and act and the exciting
breakthroughs that recently have
happened captivate our imagination about
what's possible and that's why this is
my favorite area of deep learning and
artificial intelligence in general and I
hope you feel the same so what is deep
reinforcement learning we've talked
about deep learning which is taking
samples of data being able to in a
supervised way compress encode the
representation of that data in a way that
you can reason about it and we take
that power and apply it to the world
where sequential decisions are to be
made so it's looking at problems and
formulations of tasks where an agent an
intelligent system has to make a
sequence of decisions and the decisions
that are made have an effect on the
world around the agent how do all of
us any intelligent being that is
tasked with operating in the world how
do they learn anything especially when
you know very little in the beginning
trial and error is the fundamental
process by which reinforcement learning
agents learn and the deep part of deep
reinforcement learning is neural
networks it's using the frameworks of
reinforcement learning where the neural
network is doing the representation of
the world based on which the actions are
made
and we have to take a step back when we
look at the types of learning sometimes
the terminology itself can confuse us as to
the fundamentals there's supervised
learning there's semi-supervised learning
there's unsupervised learning there's
reinforcement learning and there's this
feeling that supervised learning is
really the only one where you have to
perform the manual annotation where you
have to do the large-scale supervision
that's not the case every type of
machine learning is supervised learning
it's supervised by a loss function or a
function that tells you what's good and
what's bad you know even looking at our
own existence and how we humans figure
out what's good and bad there's all
kinds of sources direct and indirect by
which through our morals and ethics we figure
out what's good and bad the difference
between supervised and unsupervised and
reinforcement learning is the source of
that supervision what's implied when you
say unsupervised is that the cost of
human labor required to attain the
supervision is low but it's never
turtles all the way down it's turtles
and then there's a human at the bottom
there at some point there needs to be
human intervention human input to
provide what's good and what's bad and
this will arise in reinforcement
learning as well we have to remember that
because the challenges and the exciting
opportunities of reinforcement learning
lie in the fact of how do we get that
supervision in the most efficient way
possible but supervision nevertheless is
required for any system that has an
input and an output that's trying to
learn like a neural network does to
provide an output that's good it needs
somebody to say what's good and what's
bad for those curious about that there's
been a few books a couple written
throughout the last few centuries from
Socrates to Nietzsche I recommend the
latter especially so let's look at
supervised learning and reinforcement
learning I'd like to propose a way to
think about the difference
that is illustrative and useful when we
start talking about the techniques so
supervised learning is taking a bunch of
examples of data and learning from those
examples where a ground truth provides
you the compressed semantic meaning of
what's in that data and from those
examples one by one whether it's
sequences or single samples we learn
how to then take future such
samples and interpret them reinforcement
learning is teaching we teach an
agent through experience not by showing
a singular sample of a data set but by
putting them out into the world the
distinction there the essential element
of reinforcement learning then for us
now we'll talk about a bunch of
algorithms but the essential design step
is to provide the world in which to
experience the agent learns from the
world from the world it gets the
dynamics of that world the physics of
the world from that world it gets the
rewards what's good and bad and we as
designers of that agent do not just have
to do the algorithm we have to design
the world in which that agent is
trying to solve a task the design of the
world is the process of reinforcement
learning the design of examples the
annotation of examples is the world of
supervised learning and the essential
perhaps the most difficult element of
reinforcement learning is the reward the
good versus bad here a baby starts
walking across the room we want to
define success as a baby walking across
the room and reaching the destination
that's success and failure is the
inability to reach that destination
simple now reinforcement learning in
humans
the way we learn from very few
examples or appear to learn from very few
examples of trial and error is a mystery
a beautiful mystery full of open
questions it could be from the huge
amount of data 230 million years' worth
of bipedal data of mammals
walking the ability to
walk or 500 million years of the ability to
see having eyes so that's the
hardware side somehow genetically
encoded in us is the ability to
comprehend this world extremely
efficiently it could be through not the
hardware not the five hundred million
years but the the few minutes hours days
months maybe even years in the very
beginning when we're born the ability to learn
really quickly through observation to
aggregate that information filter all
the junk that you don't need and be able
to learn really quickly through
imitation learning through observation
for walking that might mean
observing others walk the idea there is
if there were no others around we would
never be able to learn the
fundamentals of this walking or not as
efficiently it's through observation and
then it could be the algorithm totally
not understood the algorithm that our
brain uses to learn the backpropagation
that's in artificial neural networks the
same kind of process is not understood in
the brain that could be the key so I
want you to think about that as we talk
about the very trivial by comparison
accomplishments in reinforcement
learning and how do we take the next
steps but it nevertheless is exciting to
have machines that learn how to act in
the world the process of learning for
those who have fallen in love with
artificial intelligence the process of
learning is thought of as intelligence
it's the ability to know very little and
through experience examples interaction
with the world in whatever medium
whether it's data or simulation so on
be able to form much richer and
interesting representations of that
world be able to act in that world
that's that's the dream
so let's look at this stack of
what it means to be an agent in this
world from the top the input to the bottom
the output there's an environment
we have to sense that environment we
have just a few tools us humans have
several sensory systems on cars you can
have lidar camera
stereo vision audio microphone
networking GPS IMU sensor so on whatever
robot you can think about there's a way
to sense that world and you have this
raw sensory data and then once you have
the raw sensory data you're tasked with
representing that data in such a way
that you can make sense of it as opposed
to all the raw sensors in the eye the
cones and so on that take in just a giant
stream of high bandwidth information we
have to be able to form higher
abstractions of features based on which
we can reason from edges to corners to
faces and so on that's exactly what deep
learning neural networks have stepped in
to be able to in an automated fashion
with as little human input as possible
be able to form higher-order
representations of that information then
there is the learning aspect
building on top of the greater
abstractions formed through the
representations be able to accomplish
something useful whether it's
discriminative tasks generative tasks
and so on based on the representation be
able to make sense of the data be able
to generate new data and so on from
sequence to sequence sequence to
sample sample to sequence and so
on and so forth to actions as we'll talk
about and then there is the ability to
aggregate all the information that has been
received in the past into the useful
information that's pertinent to the task
at hand it's the old thing it looks
like a duck quacks like a duck swims
like a duck three different data sets
I'm sure there's state-of-the-art
algorithms for the three image
classification audio recognition video
classification activity recognition so
on aggregating those three together is
still an open problem and that could be
the last piece again I want you to think
about as we think about reinforcement
learning agents how do we play how do we
transfer from the game of Atari to the
game of go to the game of dota to the
game of a robot navigating an uncertain
environment in the real world and once
you have that once you sense the raw
world once you have a representation of
that world then we need to act which is
provide actions within the constraints
of the world in such a way that we
believe can get us towards success the
promise and excitement of deep learning
is the part of the stack that converts
raw data into meaningful representations
the promise the dream of deep
reinforcement learning is going beyond and
building an agent that uses that
representation and acts to achieve success
in the world that's super exciting the
framework and the formulation of
reinforcement learning at its simplest
is that there's an environment and
there's an agent that acts in that
environment the agent senses the
environment by some observation
whether it's partial or complete
observation of the environment and it
gives the environment an action it acts
in that environment and through the
action the environment changes in some
way and then a new observation occurs
and then also as you provide the
action and make the observations you
receive a reward in most formulations
of this framework this entire
system has no memory the only
thing you need to be concerned about
is the state you came from the state you
arrived in and the reward received the
open question here is what can't be
modeled in this kind of way can we model
all of it
from human life to the game of go
can all this be modeled in this way and
is this a good way to formulate
the learning problem of robotic systems
in the real world in simulated worlds
those are the open questions the
environment could be fully observable or
partially observable like in poker
it could be single agent or multi agent
Atari versus driving like DeepTraffic
deterministic or stochastic static
versus dynamic static as in chess
dynamic again in driving and in most
real-world applications discrete
versus continuous discrete like games chess or
continuous like cart-pole balancing a pole
on a cart
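
The agent-environment loop described above (observe, act, receive a reward and a new observation, repeat until a terminal state) can be sketched in a few lines of Python. This is only an illustration under assumptions: `CoinFlipEnv`, its `reset`/`step` interface, and the constant policy are hypothetical names invented here, not anything shown in the lecture.

```python
import random
from typing import Callable, Tuple


class CoinFlipEnv:
    """Hypothetical toy environment: guess a biased coin for a fixed number of steps."""

    def __init__(self, horizon: int = 10, seed: int = 0) -> None:
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self) -> int:
        """Start a new episode and return the initial observation."""
        self.t = 0
        return 0  # this toy world has a single dummy observation

    def step(self, action: int) -> Tuple[int, float, bool]:
        """Apply an action; return (observation, reward, done)."""
        coin = 1 if self.rng.random() < 0.8 else 0  # stochastic environment
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon  # terminal state after `horizon` steps
        return 0, reward, done


def run_episode(env: CoinFlipEnv, policy: Callable[[int], int]) -> float:
    """Run one episode: sense, act, collect reward, until the terminal state."""
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)                  # the agent decides how to act
        obs, reward, done = env.step(action)  # the environment responds
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    print(run_episode(CoinFlipEnv(), policy=lambda obs: 1))
```

Note that the loop itself keeps no memory beyond the current observation, which matches the formulation above.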
the challenge for RL in real world
applications is that as a reminder
supervised learning is teaching by
example learning by example teaching
from our perspective reinforcement
learning is teaching by experience and
the way we provide experience to
reinforcement learning agents currently
for the most part is through simulation
or through highly constrained real-world
scenarios so the challenge is in the
fact that most of the successes are with
systems and environments that are simulated
so there's two ways to then close this
gap two directions of research and work
one is to improve the algorithms improve
the ability of the algorithms to
form policies that are transferable
across all kinds of domains including
the real world including especially
the real world so train in simulation
transfer to the real world
or we improve the simulation in such
a way that the fidelity of the
simulation increases to the
point where the gap between reality and
simulation is minimal to a degree
that things learned in simulation are
directly trivially transferable to
the real world
okay the major components of an RL agent
an agent operates based on a strategy
called the policy it sees the world it
makes a decision that's the policy it makes a
decision how to act sees the reward sees
a new state acts sees a reward
sees a new state and acts and this
repeats forever until a terminal state
the value function is the estimate of
how good a state is or how good a state
action pair is meaning taking an action
in a particular state how good is that
ability to evaluate that and then the
model different from the environment
from the perspective of the agent so the
environment has a model based on which
it operates and then the agent has a
representation its best understanding of
that model so the purpose for an RL
agent in this simply formulated
framework is to maximize reward the way
that the reward mathematically and
practically is talked about is with a
discounted framework so we discount
further and further future reward so the
reward that's farther into the future
means less to us in terms of
maximization than reward that's in the
near term and so why do we discount it
so first a lot of it is a math trick to
be able to prove certain aspects analyze
certain aspects of convergence and in
general in a more philosophical sense
because environments either are or can
be thought of as stochastic random
there's a degree of
uncertainty
which makes it difficult to really
estimate the reward that'll be
in the future because of the ripple
effect of the uncertainty let's look at
an example a simple one that helps us
understand policies rewards actions
there's a robot in the room there's 12
cells in which it can step it starts in
the bottom left it tries to get rewards
on the top right there's a plus
one it's a really good thing at the top
right it wants to get there by walking
around there's a negative 1 which is
really bad it wants to avoid that
square and the choice of actions is
up down left right four actions so you
could think of there being a negative
reward of 0.04 for each step so
there's a cost to each step and there's
a stochastic nature to this world
potentially we'll talk about both
deterministic and stochastic so in the
stochastic case when you choose the
action up with an 80% probability with
an 80% chance you move up but with a 10%
chance you move left and another 10% you move
right so that's the stochastic nature even
though you try to go up you might end up
in the blocks to the left or to the right
so for a deterministic world the optimal
policy here given that we always start
in the bottom left is really the shortest
path you know because
there's no stochasticity you're never
gonna screw up and just fall into the
hole the negative 1 hole so you just
compute the shortest path and walk along
that shortest path why shortest path
because every single step hurts there's
a negative reward to it 0.04
so the shortest path is the thing that
minimizes the total step cost the shortest path to
the plus 1 block ok let's look at
a stochastic world like I mentioned
80% up and then the other 20% split 10% to left
and right how does the policy change
well first of all we need to have we
need to have a plan for every single
block in the area because you might end
up there due to the stochasticity of
the world ok the basic
addition there is that we're trying to
avoid up the closer you get to the
negative one hole so just try to avoid
up because the stochastic nature of
up means that you might fall into the
hole with a 10% chance and given the
0.04 step cost you're
willing to take the long way home
in some cases in order to avoid that
possibility the negative one possibility
now let's look at a reward for each step
if it decreases to negative two it
really hurts to take every step then
again we go to the shortest path despite
the fact that there's a stochastic
nature in fact you don't really care
that you step into the negative one hole
because every step really hurts you just
want to get home and then you can play
with this reward structure right
instead of negative 2 or negative
0.04 you can look at negative 0.1 and you
can see immediately that the structure
of the policy changes so with a
higher value the higher negative reward
for each step immediately the urgency of the
agent increases versus the less urgency
the lower the negative reward and when
the reward flips so it's positive
every step is a positive so the entire
system which is actually quite common in
reinforcement learning the entire system
is full of positive rewards and so that
then the optimal policy becomes the
longest path it's grad school taking as
long as possible never reaching the
destination so what lessons do we draw
from robot in the room two things the
environment model the dynamics is just
there in the trivial example the
stochastic nature the difference between
80 percent 100 percent and 50 percent
the model of the world the environment
has a big impact on what the optimal
policy is
and the reward structure most
importantly the thing we can often
control more in our constructs of the
task we try to solve with reinforcement learning is
what is good and what is bad and how
bad is it and how good is it
the reward structure has a big impact and
that makes a complete change like
Robert Frost says a complete change in
the policy the choices the agent makes
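
To see how the step reward and the slip probability reshape the optimal policy in the robot-in-a-room example, here is a hedged value-iteration sketch. Several things are assumptions made for illustration and are not stated in the lecture: the grid is 3 rows by 4 columns with all 12 cells walkable, the negative 1 cell sits directly below the plus 1 cell, the discount `gamma` is 0.99, and the function and variable names are invented here.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]

ROWS, COLS = 3, 4
GOAL: State = (0, 3)  # top right, +1 (row 0 is the top row)
PIT: State = (1, 3)   # assumed location of the -1 cell
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}


def move(state: State, action: str) -> State:
    """Deterministic move; bumping into the boundary leaves the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else state


def transitions(state: State, action: str, p_intended: float) -> List[Tuple[float, State]]:
    """Intended move with p_intended, otherwise slip to either perpendicular side."""
    slip = (1.0 - p_intended) / 2.0
    side_a, side_b = SIDEWAYS[action]
    return [(p_intended, move(state, action)),
            (slip, move(state, side_a)),
            (slip, move(state, side_b))]


def value_iteration(step_reward: float = -0.04, p_intended: float = 0.8,
                    gamma: float = 0.99, sweeps: int = 200):
    """Return (state values, greedy policy) after a fixed number of sweeps."""
    terminal = {GOAL: 1.0, PIT: -1.0}
    states = [(r, c) for r in range(ROWS) for c in range(COLS)]
    V: Dict[State, float] = {s: terminal.get(s, 0.0) for s in states}

    def q(s: State, a: str) -> float:
        # Expected step reward plus discounted value of where we land.
        return sum(p * (step_reward + gamma * V[s2])
                   for p, s2 in transitions(s, a, p_intended))

    for _ in range(sweeps):
        for s in states:
            if s not in terminal:
                V[s] = max(q(s, a) for a in ACTIONS)
    policy = {s: max(ACTIONS, key=lambda a: q(s, a))
              for s in states if s not in terminal}
    return V, policy


if __name__ == "__main__":
    for reward in (-0.04, -2.0, 0.1):
        _, policy = value_iteration(step_reward=reward)
        print(reward, policy)
```

Running it with different `step_reward` and `p_intended` values is a quick way to reproduce the qualitative point above: the same world with a different reward structure yields a different policy.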
so when you formulate a reinforcement
learning framework as researchers as
students what you often do is you design
the environment you design the world in
which the system learns even when your
ultimate goal is the physical robot
there's still a lot of work
done in simulation so you design the world
the parameters of that world and you
also design the reward structure and it
can have transformative results slight
variations in those parameters lead to
huge differences in the
policy that's arrived at and of course the
example I've shown before I really love
is how changing the reward
structure might have unintended
consequences and those consequences for
real-world system can have obviously
highly detrimental costs that are more
than just a failed game of Atari so here
is a human performing the task
playing the game of CoastRunners racing
around the track and so when you
finish first and you finish fast you get
a lot of points and so it's natural to
then say okay let's do an RL agent and then
optimize this for those points and what
you find out in the game is that you
also get points by picking up the little
green turbo things and what the agent
figures out is that you can actually get
a lot more points even
by simply focusing on the green turbos
focusing on the green turbos just
rotating over and over slamming into the
wall on fire and everything just picking them
up especially because the ability to pick up
those turbos can avoid the terminal
state at the end of finishing the race
in fact finishing the race means you
stop collecting positive reward so you
never want to finish you just keep collecting turbos
and though that's a trivial example it's
not actually easy to find such examples
but they're out there of unintended
consequences that can have highly
negative detrimental effects when put in
the real world we'll talk a little
bit about robotics when you put robots four
wheeled ones like autonomous vehicles
into the real world and you have
objective functions that have to
navigate difficult intersections full of
pedestrians you have to form intent
models of those pedestrians here you see
cars asserting themselves through dense
intersections taking risks and
those risks that are taken by us humans
when we drive vehicles we have to then
encode that ability to take subtle risk
into AI based control algorithms
and perception then you have to think about
at the end of the day there's an
objective function and if that objective
function does not anticipate the green
turbos that are to be collected and that
results in some unintended
consequences it could have very negative
effects especially in situations that
involve human life that's the field of
AI safety and some of the folks we'll
talk about at DeepMind and OpenAI that
are doing incredible work in RL also
have groups that are working on AI
safety for a very good reason this is a
problem I believe that artificial
intelligence will define some of the most
impactful positive things in the 21st
century
but I also believe we are nowhere close
to solving some of the fundamental
problems of AI safety that we also need
to address as we develop
those algorithms okay examples of
reinforcement learning systems all of it
has to do with formulation of rewards
formulation of states and actions you
have the traditional the often used
benchmark of a cart balancing a pole
continuous so the action is the
horizontal force to the cart the goal is
to balance the pole so it stays on top of the
moving cart and the reward is one at
each time step if the pole is upright and
the state measured by the cart by the
agent is the pole angle angular speed
and of course self sensing of the cart
position and the horizontal velocity
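
As a small illustration of how the cart-pole task above is formulated as state, action and reward, here is a sketch. The field names and the two thresholds (`MAX_ANGLE`, `MAX_POSITION`) are assumptions chosen for illustration; the lecture only says the reward is one per time step while the pole is upright.

```python
import math
from dataclasses import dataclass


@dataclass
class CartPoleState:
    """The four quantities the agent senses, as described above."""
    cart_position: float          # horizontal position of the cart
    cart_velocity: float          # horizontal velocity of the cart
    pole_angle: float             # radians away from vertical
    pole_angular_velocity: float  # angular speed of the pole


# Assumed thresholds for what counts as "upright" and "on the track".
MAX_ANGLE = math.radians(12.0)
MAX_POSITION = 2.4


def is_upright(state: CartPoleState) -> bool:
    """True while the pole is near vertical and the cart is within bounds."""
    return abs(state.pole_angle) <= MAX_ANGLE and abs(state.cart_position) <= MAX_POSITION


def reward(state: CartPoleState) -> float:
    """Reward of one for every time step the pole is still upright."""
    return 1.0 if is_upright(state) else 0.0


if __name__ == "__main__":
    print(reward(CartPoleState(0.0, 0.0, 0.02, 0.0)))
```

The action (a horizontal force on the cart) is left out on purpose, since simulating the physics is beyond this sketch.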
another example here I didn't want to
include the video because it's really
disturbing but I do want to include the
slide because it's really important to
think about is by sensing the the raw
pixels learning and teaching an agent to
play a game of doom so the goal there is
to eliminate all opponents the state is
the raw game pixels the action is
up/down shoot reload and so on and the
positive reward is when an opponent is
eliminated and negative when the agent is
eliminated simple I added it here
because again on the topic of AI safety
we have to think about objective
functions and how that translates into
the world of not just autonomous
vehicles but things that even more
directly have harm like autonomous
weapon systems and we have a lecture on
this in the AGI series and on the
robotics platform object
manipulation and grasping objects
there's a few benchmarks there's a few
interesting applications learning the
problem of grabbing objects moving
objects manipulating objects rotating
and so on especially when those objects
have complicated shapes and
so the goal is to pick up an object
purely in the grasping objects
challenge the state is
visual vision based the raw
pixels of the objects the action is to
move the arm grasp the object pick it up
and obviously it's positive when the
pickup is successful the reason I'm
personally excited by this is because
it'll finally allow us to solve the
problem of the claw which has been
torturing me for many years
I don't know that's not at all why I'm
excited by it okay and then we have to
think about as we get a greater and
greater degree of application in the
real world with robotics
like cars the main focus of my
passion in terms of robotics is how do
we encode some of the things that us
humans encode how do we you know we have
to think about our own objective
function our own reward structure our
own model of the environment about which
we perceive and reason in order to
then encode machines that are doing the
same
and I believe autonomous driving is in
that category we have to ask questions of
ethics we have to ask questions of
risk value of human life value of
efficiency money and so on all these
fundamental ethical questions that an
autonomous vehicle unfortunately has to
solve before it becomes fully autonomous
so here are the key takeaways of the
real-world impact of reinforcement
learning agents on the deep learning
side okay these neural networks that
form higher representations the fun part is
the algorithms all the different
architectures the different
encoder decoder structures all the
attention self attention recurrent networks
LSTMs GRUs all the fun
architectures and the data
and the ability to leverage different data
sets in order to
perform discriminative tasks better
than you know MIT does better than Stanford
that kind of thing that's the fun
part the hard part is asking good
questions and collecting huge amounts of
data that's representative of the task
that's for real world impact not CVPR
publication real-world impact
a huge amount of data on the deep
reinforcement learning side the key
challenge the fun part again is the
algorithms how do we learn from data
some of the stuff I'll talk about today
the hard part is defining the
environment defining the action space
and the reward structure as I mentioned
this is the big challenge and the
hardest part is how to cross the gap
between simulation and the real world the
leaping lizard that's the hardest part
we don't even know how to solve that
transfer learning problem yet for the
real world the three types of
reinforcement learning there's countless
algorithms and there's a lot of ways to
taxonomize them but at the highest level
there's model-based and there's model
free model based algorithms learn the
model of the world so as you interact
with the world you construct your
estimate of how you believe the dynamics
of that world operates the nice thing
about doing that is once you have a
model or an estimate of a model you're
able to anticipate you're able to plan
into the future you're able to use the
model to in a branching way predict how
your actions will change the world
so you can plan far into the future this
is the mechanism by which you can
do chess in the simplest form
because in chess you don't even need to
learn the model the model is
given to you chess go and so on
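
The idea of using a model to predict, in a branching way, how actions change the world can be sketched as a depth-limited lookahead. This is an illustration under assumptions, not the method used by any chess or go program: `plan`, `line_world` and the reward at position 3 are hypothetical, and the model is assumed to be known and deterministic, as it is in board games.

```python
from typing import Callable, Optional, Sequence, Tuple

Model = Callable[[int, int], Tuple[int, float]]  # (state, action) -> (next_state, reward)


def plan(state: int, model: Model, actions: Sequence[int],
         depth: int, gamma: float = 1.0) -> Tuple[float, Optional[int]]:
    """Exhaustively unroll the model `depth` steps; return (best value, best first action)."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in actions:
        next_state, reward = model(state, action)      # ask the model, not the world
        future, _ = plan(next_state, model, actions, depth - 1, gamma)
        value = reward + gamma * future
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


def line_world(state: int, action: int) -> Tuple[int, float]:
    """Hypothetical known model: walk on the integers, reward 1 for landing on 3."""
    next_state = state + action
    return next_state, (1.0 if next_state == 3 else 0.0)


if __name__ == "__main__":
    print(plan(0, line_world, actions=(-1, 1), depth=3))
```

No interaction with a real environment happens inside `plan`, which is the sense in which a model lets the agent reason without experiencing every possibility.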
the most important way in which they're
different I think is the sample
efficiency is how many examples of data
are needed to be able to successfully
operate in the world and so model based
methods because they're constructing a
model if they can are extremely sample
efficient because once you have a model
you can do all kinds of reasoning that
doesn't require experiencing every
possibility of that model you can unroll
the model to see how the world changes
based on your actions value based
methods are ones that look to estimate
the quality of states the quality of
taking a certain action in a certain
state so they're called off policy
versus the last category that's on
policy what does it mean to be off
policy it means that
value based agents constantly update how
good it is to take an action in a state and they
have this model of that goodness of
taking an action in a state and they use
that to pick the optimal action they
don't directly learn a policy a strategy
of how to act they learn how good it is
to be in a state and use that goodness
information to then pick the best one
and then every once in a while flip a
coin in order to explore and then policy
based methods are ones that directly
learn a policy function so they take as
input the world a representation of
that world with neural networks and
output an action where the action is
stochastic so okay that's the range of
model-based value based and policy based
here's an image from OpenAI that I
really like I encourage you as we
further explore here to look up Spinning
Up in deep reinforcement learning from
OpenAI here's an image that taxonomizes
in the way that I described some
of the recent developments in RL so at
the very top the distinction between
model free RL and model-based RL in
model free RL which is what we'll focus
on today there is a distinction between
policy optimization so on policy methods
and q-learning
which is off policy methods policy
optimization is methods that directly
optimize the policy they directly
learn the policy in some way and then
q-learning off policy methods learn like
I mentioned the value of taking a
certain action in a state and from
that learned Q value are
able to
choose how to act in the world so let's
look at a few sample representative
approaches in this space let's start
with the one that really was
one of the first great breakthroughs
from Google DeepMind on the deep RL
side in solving Atari games DQN deep
Q learning networks deep Q
networks and let's take a step back and
think about what Q learning is
q-learning looks at the state action
value function Q that estimates
based on a particular policy or based on
an optimal policy how good is it to take
an action in this state the estimated
reward if I take an action in this state
and continue operating under an
optimal policy it gives you directly a
way to say amongst all the actions I
have which action should I take to
maximize the reward now in the beginning
you know nothing you know you don't have
this value estimation you don't have
this Q function so you have to learn
it and you learn it with a Bellman
equation of updating it you take your
current estimate and update it with the
reward you received after you take
an action here it's off policy and model
free you don't have to have any estimate
or knowledge of the world you don't have
to have any policy whatsoever all you're
doing is roaming about the world
collecting data when you took a certain
action here's the reward you received and
you're updating gradually this table
where the table has states on the
y-axis and actions on the x-axis and the
key part there is because you always
have an estimate of
the value of taking an
action you can always take the
optimal one but because you know very
little in the beginning
you have no way of knowing whether that optimal
is good or not so there's some
degree of exploration the fundamental
aspect of value based methods or any RL
methods like I said it's trial and
error is exploration so for value based
methods like q-learning
the way that's done is with the flip of
a coin epsilon greedy with a flip of a
coin
you can choose to just take a random
action and you slowly decrease epsilon
to zero as your agent learns more and
more and more so in the beginning you
explore a lot with epsilon of 1 and epsilon
of zero in the end when you're just
acting greedy based on your
understanding of the world as
represented by the q-value function for
non neural network approaches this is
simply a table this Q function is
a table like I said on the Y states X
actions and in each cell you have a
reward that's a discounted reward
that you estimate to be received there
and as you walk around with this Bellman
equation you can update that table but
it's a table nevertheless number of
states times number of actions now if
you look at any practical real-world
problem and an arcade game with raw
sensory input is a very crude first step
towards the real world so raw sensor
information this kind of value iteration
and updating a table is impractical
because here's a game of Breakout
if we look at four consecutive frames of
a game of Breakout the size of the
raw sensory input is 84 by 84 pixels
grayscale every pixel has 256 values
that's 256 to the power of 84
times 84 times 4 whatever it is it's
significantly larger than the number of atoms
in the universe so the size of this Q
table if we use the traditional approach
is intractable
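
The tabular approach described above (a states-by-actions table, a Bellman-style update, and epsilon-greedy exploration that decays from 1 to 0) can be sketched as follows. The five-state chain environment, the learning rate `alpha`, the discount `gamma`, the linear epsilon schedule and all names are assumptions made for illustration. The last helper estimates, in powers of ten, how many distinct inputs four 84 by 84 grayscale frames can take, which is why the table stops being practical.

```python
import math
import random
from typing import List, Tuple

N_STATES = 5      # hypothetical chain of states 0..4; state 4 is terminal
ACTIONS = (0, 1)  # 0 moves left, 1 moves right


def step(state: int, action: int) -> Tuple[int, float, bool]:
    """Toy dynamics: reward 1 only when the terminal state is reached."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes: int = 500, alpha: float = 0.1, gamma: float = 0.9,
               seed: int = 0, max_steps: int = 10_000) -> List[List[float]]:
    """Learn a Q table with epsilon-greedy exploration and a Bellman-style update."""
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for episode in range(episodes):
        # Explore a lot at first (epsilon = 1), act greedily at the end (epsilon = 0).
        epsilon = max(0.0, 1.0 - episode / (0.8 * episodes))
        state = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                      # flip of a coin
            else:
                action = max(ACTIONS, key=lambda a: Q[state][a])  # greedy choice
            next_state, reward, done = step(state, action)
            # Off-policy target: reward plus discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q


def breakout_log10_states() -> float:
    """log10 of 256 ** (84 * 84 * 4), the count of possible four-frame inputs."""
    return 84 * 84 * 4 * math.log10(256)


if __name__ == "__main__":
    for row in q_learning():
        print([round(v, 3) for v in row])
    print(f"about 10^{breakout_log10_states():.0f} possible inputs")
```

For the toy chain the table has 5 times 2 entries; for raw Breakout frames the exponent alone runs to tens of thousands, far beyond the roughly 10^80 atoms usually quoted for the observable universe.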
neural networks to the rescue deep RL
is RL plus neural networks where the neural
network is tasked with taking this in
value based methods taking this Q
table and learning a compressed
representation of it learning an
approximator for the function from state
and action to the value that's what we
previously talked about the ability the
powerful ability of neural networks to
form representations from extremely high
dimensional complex raw sensory
information so it's simple the framework
remains for the most part the same in
reinforcement learning
it's just that this Q function for
value based methods becomes a neural
network and becomes an approximator
where the hope is as you navigate the
world and you pick up new knowledge
through back propagating the
gradient of the loss function
you're able to form a good
representation of the optimal q function
so using your networks with you'll know
it's a good at which is function
approximator x' and that's DQ 1 deep Q
Network was used to have the initial
incredible nice results on our K games
where the input is the raw sensory
pixels with a few convolutional layers
fully connected layers and the output
is a value for each of a set of actions
the estimated value of taking that
action and then you choose the best
action and so this simple agent with
the neural network that estimates that Q
function very simple network is able to
achieve superhuman performance on many
of these arcade games that excited the
world because it's taking raw sensory
information with a pretty simple network
that doesn't in the beginning understand
any of the physics of the world any of
the dynamics of the environment and
through that intractable space the
intractable state space is able to learn
how to actually do pretty well the loss
function for DQN has two Q functions one
is the predicted Q value of taking an
action in a particular state and the
other is the target against which the
loss function is calculated which is
what is the value that you got once
you've actually taken that action and
the way you calculate that value is by
looking at the next step and choosing
the max so seeing if you take the best
action in the next state what is going
to be the Q function so there's two
estimators going on in terms of neural
networks there's two forward passes
here there's two Q's in this equation so
in traditional DQN that's done by a
single neural network with a few tricks
and in double DQN that's done by two
neural networks and I mentioned
tricks because with this and with most
of RL tricks tell a lot of the story a
lot of what makes
systems work is the details in games
and robotic systems in these cases the
two biggest tricks for DQN that will
reappear in a lot of value based
methods the first is experience replay
so think of
an agent that plays through these games
as also collecting memories you collect
this bank of memories that can then be
replayed the power of that one of the
central elements of what makes value
based methods attractive is that because
you're not directly estimating the
policy but are learning the quality of
taking an action in a particular state
you're able to then jump around through
your memory and replay different aspects
of that memory so train the network
through the historical data and then the
other trick simple is like I said the
loss function has two Q's so it's a
dragon chasing its own tail it's easy
for the loss function to become unstable
so the training does not converge so the
trick of fixing a target network is
taking one of the Q's and only updating
it every X steps every thousand steps
and so on and taking the same kind of
network it's just fixing it so for the
target network that defines the loss
function just keeping it fixed and only
updating it at regular intervals so
you're chasing a fixed target with the
loss function as opposed to a dynamic
one so you can solve a lot
of the Atari games with minimal effort
and come up with some creative solutions
here breakout after 10 minutes of
training on the left and after two hours
of training on the right is coming up
with some creative solutions again it's
pretty cool because this is raw pixels
right there's been a few years since
this breakthrough so we kind of take it
for granted but I'm still for the most
part captivated by just how beautiful it
is that from the raw sensory information
neural networks are able to learn to act
in a way that actually surpasses humans
in terms of creativity in terms of
actual raw performance it's really
exciting and games in simple form are
the cleanest way to demonstrate that and
the same kind of DQN network is able to
achieve superhuman performance on a
bunch of different games
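
The DQN ingredients described above (a bank of replayed memories, a loss with two Q's, and a target that is only refreshed every so many steps) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the published implementation: `online_q` and `target_q` are hypothetical callables that map a state to a list of action values, standing in for the two forward passes, and the names `ReplayBuffer`, `td_targets`, `dqn_loss`, and `maybe_sync` are made up for this sketch.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size bank of (state, action, reward, next_state, done) memories."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest memories drop off first

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling lets training jump around through past experience.
        return random.sample(list(self.memory), batch_size)


def td_targets(batch, target_q, gamma):
    """Target = reward + gamma * max over next actions of the target Q."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state))
        targets.append(reward + bootstrap)
    return targets


def dqn_loss(batch, online_q, target_q, gamma=0.99):
    """Mean squared error between the predicted Q(s, a) and the target."""
    targets = td_targets(batch, target_q, gamma)
    errors = [online_q(s)[a] - y for (s, a, _r, _s2, _d), y in zip(batch, targets)]
    return sum(e * e for e in errors) / len(errors)


def maybe_sync(step, online_params, target_params, every=1000):
    """Copy online parameters into the target only every `every` steps."""
    return copy.deepcopy(online_params) if step % every == 0 else target_params
```

In a real agent the loss would be minimized by backpropagation through the online network only; keeping `target_q` frozen between syncs is what makes the target fixed rather than moving.
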
there's improvements to this like
dueling DQN again the q function can be
decomposed which is useful into the
value estimate of being in that state
and what in future slides will be called
advantage so the advantage of taking an
action in that state the nice thing
about the advantage as a measure is that
it's a measure of the action quality
relative to the average action that
could be taken there so why advantage is
very useful versus sort of raw reward is
that if all the actions you have to take
are pretty good you want to know well
how much better one is in terms of
optimization that's a better measure for
choosing actions in a value-based sense
so when you have these two estimates you
have these two streams in the neural
network in the dueling DQN DDQN where
one estimates the value the other the
advantage and that dueling nature is
also useful because there are many
states in which the quality of the
actions is decoupled from the state so
in many states it doesn't matter which
action you take so you don't need to
learn all the different complexities all
the topology of different actions when
you're in a particular state and another
one is
prioritized experience replay like I
said experience replay is really key to
these algorithms and the thing that
sinks some of the policy optimization
methods and experience replay is
collecting different memories but if you
just sample randomly in those memories
the sampled experiences are really
affected by the frequency with which
those experiences occurred not their
importance so prioritized experience
replay assigns a priority a value based
on the magnitude of the temporal
difference learning error so the stuff
you have the most to learn from is given
a higher priority and therefore you get
to see through the experience replay
process that particular experience more
often okay moving on to
policy gradients this is on policy
versus q-learning off policy policy
gradient
is directly optimizing the policy where
the input is the raw pixels and the
policy network forms representations of
that environment space and as output
produces a stochastic estimate a
probability of the different actions
here in pong from the pixels a single
output produces the probability of
moving the paddle up so how policy
gradients vanilla policy gradient the
very basic one works is you
unroll the environment you play through
the environment here pong moving the
paddle up and down and so on collecting
no rewards and only collecting reward at
the very end based on whether you win or
lose every single action you're taking
along the way gets either punished or
rewarded based on whether it led to
victory or defeat this also is
remarkable that this works at all
because the credit assignment there I
mean every single thing you did along
the way is averaged out it's like
muddied it's the reason that policy
gradient methods are more inefficient
but it's still very surprising that it
works at all so the pros versus DQN the
value based methods is that if the world
is so messy that you can't learn a q
function the nice thing about policy
gradient because it's learning the
policy directly is that it will at least
learn a pretty good policy usually in
many cases with faster convergence it's
able to deal with stochastic policies so
value based methods cannot learn
stochastic policies and it's much more
naturally able to deal with continuous
actions the cons are it's inefficient
versus DQN it can become highly unstable
during the training process as we'll
talk about some solutions to this and
the credit assignment so if we look at
the chain of actions that lead to a
positive reward some might be awesome
actions some may be good actions some
might be terrible actions but that
doesn't matter as long as the
destination was good then every single
action along the way gets a positive
reinforcement that's the downside and
there's now improvements to
that advantage actor critic methods A2C
combining the best of value based
methods and policy based methods so
having two networks an actor which is
policy based and that's the one that
takes the actions samples the actions
from the policy network and the critic
that measures how good those actions are
and the critic is value based all right
so as opposed to in the policy update
the first equation there the reward
coming from the destination the reward
being from whether you won the game or
not every single step along the way you
now learn a Q value function Q s a state
and action using the critic network so
you're able to now learn about the
environment about evaluating your own
actions at every step so you're much
more sample efficient there's
asynchronous from DeepMind and
synchronous from OpenAI variants of the
advantage actor critic framework but
both are highly parallelizable the
difference with A3C the asynchronous one
is that every single agent you just
throw these agents operating in the
environment and they're learning they're
rolling out the games and getting the
reward they're updating the original
network asynchronously the global
network parameters asynchronously and as
a result they're also operating
constantly on outdated versions of that
network the OpenAI approach that fixes
this is that there's a coordinator
there's these rounds where all the
agents in parallel are rolling out the
episode but then the coordinator waits
for everybody to finish in order to make
the update to the global network and
then distributes the same parameters to
all the agents and so that means that
every iteration starts with the same
global parameters and that has really
nice properties in terms of convergence
and stability of the training process
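
To make the credit assignment contrast in the passage above concrete, here is a small Python sketch of the weight each action receives under the two schemes. It is an illustration under assumptions: the passage describes a critic that learns Q(s, a), whereas this sketch uses a state-value critic and the common one-step advantage estimate r + gamma * V(s') - V(s). The function names and the example numbers are hypothetical.

```python
def outcome_weights(num_steps, final_outcome):
    """Vanilla policy gradient as described for Pong: every action in the
    episode is reinforced by the same final outcome (+1 win, -1 loss)."""
    return [final_outcome] * num_steps


def advantage_weights(rewards, values, gamma=0.99):
    """One-step advantage estimate per step: r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds the critic's estimate for each visited state plus one final
    entry for the state after the last action (0.0 if the episode ended).
    """
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


# Hypothetical three-step episode that ends in a win.
rewards = [0.0, 0.0, 1.0]
critic_values = [0.5, 0.2, 0.9, 0.0]  # made-up critic estimates

print(outcome_weights(len(rewards), final_outcome=1.0))
print(advantage_weights(rewards, critic_values, gamma=1.0))
```

With the outcome-only scheme all three actions get the same positive weight. With the critic, the first action, which moved to a state the critic rates lower, gets a negative weight even though the episode was won.
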
okay from google deepmind the deep
deterministic policy gradient is
combining the ideas of DQN but dealing
with continuous action spaces so taking
the actor critic framework but instead
of picking a stochastic policy having
the actor operate in a stochastic
nature it's picking a deterministic
policy so it's always choosing the best
action but okay with that the problem
quite naturally is that when the policy
is now deterministic it's able to do
continuous action spaces but because
it's deterministic it's never exploring
so the way we inject exploration into
the system is by adding noise either
adding noise into the action space on
the output or adding noise into the
parameters of the network that then
create perturbations in the actions such
that the final result is that you try
different kinds of things and the scale
of the noise just like with epsilon
greedy in the exploration for DQN the
scale of the noise decreases as you
learn more and more so on the policy
optimization side
from OpenAI and others we'll do a
lecture just on this there's been a lot
of exciting work here the basic idea of
policy optimization with PPO and TRPO is
first of all we want to formulate
reinforcement learning as purely an
optimization problem and second of all
in policy optimization the actions you
take influence the rest of the
optimization process so you have to be
very careful about the actions you take
in particular you have to avoid taking
really bad actions where your
convergence the training performance in
general collapses so how do we do that
there's the line search methods which is
where gradient ascent or gradient
descent falls under which is how we
train deep neural networks you first
pick a direction of the gradient and
then pick the step size the problem with
that is that it can get you into trouble
here there's a nice visualization
walking along a ridge it can result in
you stepping off that ridge again the
collapsing of the training process the
performance the trust region is the
underlying idea here for the policy
optimization methods that first pick the
step size so they constrain in various
kinds of ways the magnitude of the
difference to the weights that's applied
and then pick the direction so placing a
much higher priority on not choosing bad
actions that can throw you off the
optimization path we should actually
stick to that path and finally on the
model-based
methods and we'll also talk about them
in the robotics side there's a lot of
interesting approaches now where deep
learning is starting to be used for
model-based methods when the model has
to be learned but of course when the
model doesn't have to be learned it's
given inherent to the game you know the
model like in go and chess and so on
AlphaZero has really done incredible
stuff so what is the model here so the
way that a lot of these games are
approached you know game of go it's
turn-based one person goes and then
another person goes and there's this
game tree at every point there's a set
of actions that could be taken and
quickly if you look at that game tree it
grows exponentially so it becomes huge
the game of go is the hugest of all
because the number of choices you have
is the largest and there's chess and
then you know it gets to checkers and
then tic-tac-toe and the degree at every
step increases or decreases based on the
game structure and so the task for a
neural network there is to learn the
quality of the board it's to learn which
boards which game positions are most
useful to explore and most likely to
result in a highly successful state so
that choice of what's good to explore
what branch is good to go down is where
we can have neural networks step in and
with AlphaGo it was pre-trained the
first success that beat the world
champion was pre-trained on expert games
then with AlphaGo Zero there was no
pre-training on expert systems so no
imitation learning it's just purely
through self play through suggesting
through playing itself new board
positions many of these systems use
Monte Carlo tree search and during the
search balancing exploitation and
exploration so going deep on promising
positions based on the estimation of the
neural network or with a flip of a coin
playing underplayed positions and so
this you can think of as an intuition of
looking at a board and estimating how
good that board is and also estimating
how likely that board is to lead to
victory down the end so estimating just
general quality and probability of
leading to victory then the next step
forward is AlphaZero using the same
similar architecture with MCTS Monte
Carlo tree search but applying it to
different games and competing against
other engines state-of-the-art engines
in go and shogi and chess and
outperforming them with very few steps
so
here's these model-based approaches
which are really extremely sample
efficient if you can construct such a
model and in robotics if you can learn
such a model it can be exceptionally
powerful here beating the engines which
are far superior to humans already
stockfish can destroy most humans on
earth at the game of chess the ability
through learning through estimating the
quality of a board to be able to defeat
these engines is incredible and the
exciting aspect here versus engines that
don't use neural networks is that it
really has to do with the number of
positions based on the neural network
you explore certain positions you
explore certain parts of the tree and if
you look at grandmasters human players
in chess they seem to explore very few
moves they have a really good neural
network at estimating which are the
likely branches which would provide
value to explore and on the other side
stockfish and so on are much more brute
force in their search and then AlphaZero
is a step towards the grandmaster
because the number of branches that need
to be explored is much much fewer a lot
of the work is done in the
representation formed by the neural
network it's just super exciting and
then it's able to outperform stockfish
in chess it's able to outperform Elmo in
shogi and itself in go or the previous
iterations of AlphaGo Zero and so on now
the
challenge here the sobering truth is
that the majority of real world
applications of agents that have to act
in this world perceive the world and act
in this world for the most part have no
RL involved so the action is not learned
they use neural networks to perceive
certain aspects of the world but
ultimately the action is not learned
from data that's true for most or all of
the autonomous vehicle companies
operating today and it's true for
robotic manipulation in industrial
robotics and any of the humanoid robots
that have to navigate in this world
under uncertain conditions all the work
from Boston Dynamics doesn't involve any
machine learning as far as we know now
that's beginning to change here with
ANYmal the recent development where
certain aspects of the control of
robotics could be learned
you're trying to learn more efficient
movement you're trying to learn more
robust movement on top of the other
controllers so it's quite exciting
through RL to be able to learn some of

Summary

The following is a comprehensive, structured summary of the Deep Reinforcement Learning (DRL) material, based on the transcript.

***

# **Complete Guide to Deep Reinforcement Learning: From Basic Theory to Real-World Applications**

### **Executive Summary**
This video covers the basic concepts and applications of **Deep Reinforcement Learning (DRL)**, an approach that combines *deep neural networks* with experience-based decision making. It covers the fundamental difference between *supervised* and *reinforcement* learning, the main components of DRL such as the *policy* and the *value function*, and a range of algorithms from Q-learning to actor-critic. It also highlights key challenges such as the gap between simulation and the real world and AI safety, along with success stories such as AlphaGo.

---

### **Key Takeaways**
*   **Definition of DRL:** A combination of deep learning's representation of the world with the ability to act through *trial and error* (reinforcement learning).
*   **Learning mechanism:** Unlike *supervised* learning, which learns from examples, RL learns from interaction with the environment (*experience*) in order to maximize *reward*.
*   **Main components:** Agent, environment, policy (strategy), value function (an estimate of how good a state is), and model (a representation of the environment).
*   **Main algorithms:** Divided into *model-free* (Q-learning/DQN, policy gradients) and *model-based* (AlphaGo/AlphaZero).
*   **Challenges:** The risk of unintended consequences (*reward hacking*), training instability, and the difficulty of transferring simulation results to the real world (the *sim-to-real gap*).
*   **The future:** Research opportunities are wide open for improving convergence, solving games that remain unsolved, and applying RL to autonomous robotics.

---

### **Detailed Breakdown**

#### **1. Introduction to Deep Reinforcement Learning (DRL)**
Deep Reinforcement Learning (DRL) is defined as the marriage of *deep neural networks*, which represent and comprehend the world, with the ability to act on that understanding. The core process is *sequential decision-making*, in which the agent's decisions affect the state of the world.
*   **Philosophy of learning:** All machine learning is ultimately supervised by a *loss function*, but the source of the supervision differs. RL teaches through experience and interaction rather than by showing data samples.
*   **Definition of intelligence:** A learning process that starts from very little knowledge and builds a rich representation through interaction.
*   **Agent architecture:** The flow runs from **Environment** -> **Raw Sensory Data** (high-dimensional input) -> **Representation** (abstraction by deep learning) -> **Learning** -> **Aggregation** -> **Action**.

#### **2. RL Framework and Components**
RL works within an agent and environment framework. The agent observes (partially or fully), acts, and receives a *reward*, and the environment changes.
*   **Types of environment:** *Fully or partially observable* (poker is partially observable), *single or multi-agent* (Atari vs. driving), *deterministic or stochastic*, and *discrete or continuous*.
*   **Main components of an agent:**
    *   **Policy:** The strategy that maps states to actions.
    *   **Value function:** An estimate of how good a state or action is going forward.
    *   **Model:** The agent's representation of the environment.
*   **Discounted reward:** Future rewards are valued less than *near-term* rewards, both for mathematical reasons (convergence) and because the environment is uncertain.
*   **"Robot in a room" case study:** An example of how the *step cost* and the nature of the environment (deterministic vs. stochastic) change the optimal policy. If the risk of falling into the pit is high, the agent takes the long way around; if the step cost is very high, the agent takes the risk of the shortest path.
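
The discounting and step-cost ideas in the list above can be illustrated with a short Python sketch. The step cost of -0.04, the discount of 0.9, and the two path lengths are hypothetical numbers chosen for illustration, not values from the lecture.

```python
def discounted_return(rewards, gamma):
    """Return the sum over t of gamma**t * rewards[t], accumulated from the end."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical numbers: the same +1 goal reached by a short or a long path,
# with a small cost charged on every step before the goal.
STEP_COST = -0.04
short_path = [STEP_COST] * 2 + [1.0]
long_path = [STEP_COST] * 6 + [1.0]

for name, path in (("short", short_path), ("long", long_path)):
    print(name, round(discounted_return(path, gamma=0.9), 3))
```

Both the discount and the per-step cost make the longer path less attractive, which is the lever the environment designer is pulling when changing the step cost.
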

#### **3. Environment Design, Risk, and AI Safety**
Researchers design the environment and the *reward* structure. Small changes to the parameters can produce very different policies.
*   **Unintended consequences:** The example is the game *Coast Runners*. The RL agent focuses on collecting the green points (*power-ups*) by going in circles and crashing into walls, and never finishes the race, because finishing the race ends the points. This shows the danger of an objective function that is not aligned with human goals.
*   **AI safety:** Crucial, especially for autonomous systems such as cars that interact with pedestrians. The objective function has to anticipate "exploiting" behavior by the agent.
*   **Example applications:** *Cart Pole* (balancing), *Doom* (pixel-based shooting), and *Object Manipulation* (grasping objects).

#### **4. *Model-Free* Methods: Q-Learning and Deep Q-Networks (DQN)**
In *model-free* methods, the agent does not need an explicit model of the environment.
*   **Categories:** Divided into *policy optimization* (on-policy) and *Q-learning* (off-policy).
*   **Q-learning:** Learns the value (*Q-value*) of taking a particular action in a particular state in order to maximize reward. It uses the Bellman equation to update the values.
*   **Exploration:** Done with *epsilon-greedy* (occasionally taking a random action).
*   **The table problem:** Traditional Q-learning uses a table, which cannot be applied to *raw sensory input* such as game pixels, where the state space is enormous.
*   **The DQN solution:** Uses a *neural network* as a *function approximator* in place of the Q table. DQN takes raw pixels as input and uses convolutional layers to reach superhuman performance on Atari games without any prior knowledge of the physics.
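
The tabular case in the list above (a states-by-actions table, a Bellman update, and epsilon-greedy exploration that decays from 1 toward 0) can be sketched on a toy problem. The five-state corridor, the learning rate, and the linear decay schedule are hypothetical choices for illustration only.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)  # hypothetical corridor; action 0 = left, 1 = right


def step(state, action):
    """Move along the corridor; reaching the right end pays 1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]  # states x actions table
    for episode in range(episodes):
        epsilon = 1.0 - episode / episodes  # decays linearly from 1 toward 0
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore: take a random action
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])  # act greedily
            next_state, reward, done = step(state, action)
            # Bellman target: reward plus discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


if __name__ == "__main__":
    table = train()
    for s, row in enumerate(table):
        print(s, [round(v, 3) for v in row])
```

The table here has only 5 x 2 entries; the same update is what becomes impractical once the state is a stack of raw frames.
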

#### **5. DQN Extensions and Policy Gradients**
*   **Dueling DQN:** An architecture that splits the estimate into *value* (how good the state is) and *advantage* (the relative quality of an action). Useful because in many states the choice of action matters little.
*   **Prioritized experience replay:** Prioritizes experiences with a high *error* so they are replayed more often, instead of sampling at random.
*   **Policy gradients:** An *on-policy* method that optimizes the policy directly (input -> action probabilities). Example: the game Pong. Every action is punished or rewarded based on the final outcome (win or loss).
    *   *Pros:* Works in messy worlds, converges faster, and handles continuous actions naturally.
    *   *Cons:* Inefficient (the *credit assignment* problem) and unstable to train.
*   **Actor-critic (A2C/A3C):** Combines *value-based* and *policy-based* methods. The "actor" takes the actions and the "critic" judges how good each action is (Q-value) at every step, which makes it more sample efficient.
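
For the first two bullets above, here is a minimal Python sketch of the two formulas involved. Both are assumptions drawn from the commonly published forms rather than from the lecture: the dueling streams are combined by subtracting the mean advantage, and replay probabilities are proportional to the absolute TD error raised to a power `alpha`. The default values of `alpha` and `eps` are illustrative.

```python
def dueling_q_values(state_value, advantages):
    """Combine the two streams: Q(s, a) = V(s) + (A(s, a) - mean over a of A(s, a))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + (a - mean_adv) for a in advantages]


def replay_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probability proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


# A state where the action barely matters: advantages are nearly equal,
# so every Q value sits close to the state value.
print(dueling_q_values(state_value=2.0, advantages=[0.01, -0.01, 0.0]))

# Memories with larger TD error are replayed more often.
print(replay_probabilities([0.1, 1.0, 0.0]))
```

The small `eps` keeps memories with zero error from never being sampled again.
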

#### **6. *Model-Based* Methods and AlphaGo**
*   **Model-based:** Learn a model of the environment or use a model that is already given (such as the rules of chess).
*   **AlphaGo and AlphaZero:** Use *Monte Carlo Tree Search* (MCTS) guided by a neural network that acts as "intuition" for judging the quality of a board and the probability of winning.
    *   AlphaGo Lee used *pre-training* on expert games.
    *   AlphaGo Zero and AlphaZero learn purely from *self-play*, with no prior human knowledge.
    *   AlphaZero beat the best chess engine (Stockfish) by exploring fewer but more accurate branches, similar to how a human grandmaster thinks.
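
One way a neural network can steer tree search toward fewer, better branches is a selection score that adds the mean value found so far to an exploration bonus weighted by the network's prior for the move. The sketch below uses a PUCT-style score of the kind commonly described for AlphaGo Zero; the exact form, the constant `c_puct`, and the data layout are assumptions for illustration, not details from the lecture.

```python
import math


def puct_score(prior, visits, value_sum, parent_visits, c_puct=1.5):
    """Mean value so far (exploitation) plus a prior-weighted bonus that
    shrinks as the move is visited more often (exploration)."""
    mean_value = value_sum / visits if visits > 0 else 0.0
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return mean_value + bonus


def select_move(children):
    """children maps move -> (prior, visits, value_sum); returns the move to explore."""
    parent_visits = sum(visits for _prior, visits, _value in children.values())
    return max(children, key=lambda m: puct_score(*children[m], parent_visits))


# Hypothetical node: move "a" is well explored, move "b" has never been tried
# but the network gives it a substantial prior.
node = {"a": (0.6, 10, 5.0), "b": (0.4, 0, 0.0)}
print(select_move(node))
```

Moves with a low prior and poor results are visited rarely, which is how the network prunes the tree without an explicit brute-force sweep.
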

#### **7. Real-World Applications and the Sim-to-Real Challenge**
*   **The reality of robotics:** Most real-world robots (autonomous cars, industrial robots) currently do **not** fully use RL for control actions because of the risk. However, change is underway, for example the use of RL for *long-term planning* in autonomous cars (Waymo) or for dynamic control of humanoid robots.
*   **The sim-to-real gap:** The biggest challenge is transferring what was trained in simulation to the real world. The solution is either to improve the *transferability* of the algorithms or to make the simulation as close to the real world as possible.

---

### **Conclusion and Closing Message**
Deep Reinforcement Learning has shown remarkable potential, from mastering complex board games to possible applications in autonomous robotics. Major challenges remain, however, especially bridging the gap between simulation and reality and ensuring AI safety so that the agent's objectives are aligned with human values.

**Call to action and next steps:**
For those interested in going deeper into this field, there are many open research opportunities, including:
1.  Improving existing approaches, especially in terms of convergence and performance.
2.  Focusing on tasks that remain unsolved (particular games that RL cannot yet beat).
3.  Proposing new problems that reinforcement learning has not yet been applied to.

The video closes with an invitation to an in-depth session on a related topic ("Deep Traffic") the following day.


file updated 2026-02-13 13:24:21 UTC