MIT 6.S094: Deep Reinforcement Learning for Motion Planning
QDzM8r3WgBw • 2017-01-22
Kind: captions
Language: en
All right, hello everybody. Welcome back; glad you came back. Today we unveil the first tutorial, the first project, code-named DeepTraffic, where your task is to solve the traffic problem using deep reinforcement learning. I'll talk about what's involved in designing a network there, how you submit your own network, and how you participate in the competition. As I said, the winner gets a very special prize, to be announced later.

What is machine learning?
There are several types. There's supervised learning, which, as I mentioned yesterday, is usually what's meant when you talk about machine learning and its successes. Supervised learning requires a data set where you know the ground truth: you know the inputs and the outputs, and you provide that to a machine learning algorithm in order to learn the mapping between the inputs and the outputs in such a way that it generalizes to further examples in the future.

Unsupervised learning is the other side, when you know absolutely nothing about the outputs, about the truth of the data you're working with. All you get is data, and you have to find an underlying structure, an underlying representation of the data, that's meaningful for accomplishing a certain task, whatever that is.

There's semi-supervised learning, where only part of the data, usually a very small amount, is labeled; ground truth is available for just a small fraction of it. Think of the images that are out there on the internet, and then think of ImageNet, a data set where every image is labeled. The size of that data set is a tiny subset of all the images available online. But that's the task we're dealing with as human beings, as people interested in doing machine learning: how to expand the part of our data that we know something confidently about.

Reinforcement learning sits somewhere in between. It's semi-supervised learning,
where there's an agent that has to exist in the world. That agent knows the inputs the world provides, but knows very little about that world except through occasional, time-delayed rewards. This is what it's like to be human; this is what life is about. You don't know what's good and bad. You kind of have to just live it, and every once in a while you find out that all that stuff you did last week was a pretty bad idea. That's reinforcement learning. It's semi-supervised in the sense that only a small subset of the data comes with some ground truth, some certainty, that you have to then extract knowledge from.

So first: at the core of anything that currently works in a practical sense, there has to be some ground truth, some truth we can hold on to as we try to generalize. That's supervised learning. Even in reinforcement learning, the only thing we can count on is the truth that comes in the form of a reward.
The standard supervised learning pipeline is this: you have some raw data, the inputs, and you have ground truth, the labels, the outputs that match those inputs. Then you run any kind of algorithm, whether a neural network or another preprocessing algorithm, that extracts features from that data set. Think of a picture of a face: the algorithm could extract the nose, the eyes, the corners of the eyes, the pupils, or even lower-level features in that image. After that, we insert those features into a model, a machine learning model, and we train that model. Then, whatever that algorithm is, as we pass each example through the training process, we evaluate: after seeing this one particular example, how much better are we at the task? As we repeat this loop, the model learns to perform better and better at generalizing from the raw data to the labels we have. And finally, you release that model into the wild to actually do prediction on data it has never seen before, data you don't know about, and the task there is to predict the labels.

Okay. So neural networks are what this class is about: one of the machine learning algorithms that has proven to be very successful.
The computational building block of a neural network is a neuron. A perceptron is a type of neuron, the original, old-school neuron, where the output is binary, a zero or a one; it's not real-valued. The process a perceptron goes through: it has multiple inputs and a single output. Each of the inputs has a weight on it, shown here on the left as 7.6 and 1.4. Those weights are applied to the inputs; in a perceptron, the inputs are ones or zeros, binary. The weighted inputs are summed together, a bias on the neuron is added on top, and then there's a threshold test: whether that summed value plus the bias is below or above a threshold. If it's above the threshold, the perceptron produces a one; if it's below, it produces a zero. Simple. It's one of the only things we understand confidently about neural networks; we can prove a lot of things about this neuron.
For example, what we know is that a neuron can approximate a NAND gate. A NAND gate is a logical operation, a logical function, that takes two inputs, A and B, here on the diagram on the left, and the table shows what that function is: when the inputs are 0,0 or a zero and a one in either order, the output is a one; otherwise it's a zero. The cool thing about a NAND gate is that it's a universal gate: any computer you have, the phone in your pocket today, can be built out of just NAND gates. It's functionally complete; you can build any logical function out of NAND gates if you stack them together in arbitrary ways. The problem with NAND gates and computers is that they're built from the bottom up; you have to design these circuits of NAND gates. So the cool thing here is that with a perceptron, we can learn this magical NAND gate; we can learn this
function. So let's go through how a perceptron can perform the NAND operation. Here are the four examples. If we put a weight of -2 on each of the inputs and a bias of 3 on the neuron, then we perform that same operation of summing the weights times the inputs, plus the bias. In the top left, when the inputs are both zeros, the sum plus the bias gives 3. That's a positive number, which means the output of the perceptron will be a one. In the top right, when the inputs are a zero and a one, the sum is still a positive number, and again it produces a one, and so on. When the inputs are both ones, the sum is (-2) + (-2) + 3 = -1, which is less than zero, so the output is a zero.
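The arithmetic above can be sketched as a tiny perceptron; this is a minimal illustration using the weights of -2 and bias of 3 from the slide, not the course's own code:

```python
def perceptron(inputs, weights, bias):
    """Classic perceptron: weighted sum plus bias, thresholded at zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

# NAND via a perceptron: weight -2 on each input, bias of 3.
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", perceptron([a, b], [-2, -2], 3))
```

Running this prints a one for every input pair except (1, 1), exactly the NAND truth table.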
While this is simple, it's really important to think about: it's the one basic computational truth you can hold on to as we talk about some of the magical things neural networks can do. Because if you compare a circuit of NAND gates and a circuit of neurons, the difference is this: while a circuit of neurons, which is what we think of as a neural network, can perform the same thing as a circuit of NAND gates, what it can also do is learn. It can learn the arbitrary logical functions that an arbitrary circuit of NAND gates can represent, but it doesn't require the human designer. It can evolve, if you
will. So one of the key drawbacks of the perceptron is that it's not very smooth in its output. As we change the weights on the inputs, change the bias, and tweak things a little bit, it's very easy to make the neuron output a zero instead of a one, or a one instead of a zero. So when we start stacking many of these together, it's hard to control the output of the thing as a whole.

Now, the essential step that makes a neural network work, that a circuit of perceptrons doesn't, is that the output is made smooth, made continuous, with an activation function. Instead of using a step function like a perceptron does, shown there on the left, we use some kind of smooth function,
such as a sigmoid, where the output changes gradually as you change the weights and the bias. This is a basic but critical step. And so learning, generally, is the process of adjusting those weights gradually and seeing what effect it has on the rest of the network: you just keep tweaking weights here and there and seeing how much closer you get to the ground truth, and if you get farther away, you adjust the weights in the opposite direction.
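The smooth neuron differs from the perceptron only in its activation; a small illustrative sketch (not from the lecture materials), reusing the NAND-style weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smooth_neuron(inputs, weights, bias):
    """Like a perceptron, but the hard step is replaced by a sigmoid,
    so the output changes gradually with the weights and bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# A small change in a weight now nudges the output instead of flipping it.
print(smooth_neuron([1, 1], [-2.0, -2.0], 3.0))   # sigmoid(-1.0)
print(smooth_neuron([1, 1], [-2.1, -2.0], 3.0))   # sigmoid(-1.1), slightly smaller
```

That gradual response is what makes gradient-based weight tweaking controllable.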
That's neural networks in a nutshell.

What we will mostly talk about today is feed-forward neural networks, on the left, going from inputs to outputs with no loops. There are also these amazing things called recurrent neural networks. They're amazing because they have memory, a memory of state: they remember the temporal dynamics of the data that went through them. But the painful thing is that they're really hard to train. Today we'll talk about feed-forward neural networks.
So let's look at an example of stacking a few of these neurons together. Consider the basic task, the now-famous task of classifying digits: you have an image of a handwritten number, and your task, given that image, is to say what number is in it. Now, what is an image? An image is a collection of pixels, in this case 28 by 28 pixels. That's a total of 784 numbers, each from 0 to 255. So on the left of the network, the size of the input, despite the diagram, is 784 neurons. That's the input.

Then comes the hidden layer. It's called the hidden layer because it has no direct interaction with the input or the output; it's simply a block in between. It's at the core of the computational power of neural networks: it's tasked with forming a representation of the data in such a way that it maps from the inputs to the outputs. In this case there are 15 neurons in the hidden layer.

There are 10 values on the output, corresponding to each of the digits. There are several ways you could build this kind of network, and this is the magic of neural networks: you can do it in a lot of ways. You only really need four outputs to represent the values 0 through 9, but in practice it seems that having 10 outputs works better.
How do these work? Whenever the input is a five, the output neuron in charge of the five gets really excited and outputs a value close to one (the outputs range from 0 to 1), and the other ones hopefully output values close to zero. And when they don't, we adjust the weights in such a way that the outputs get closer to zero or closer to one, depending on whether each is the correct neuron associated with the picture. We'll talk about the details of this training process more tomorrow, when it's more relevant.
What we've discussed just now is the forward pass through the network. It's the pass where you take the inputs, apply the weights, sum them together, add the bias, produce the output, and check which of the outputs produces the highest confidence for the number. Then, once those probabilities for each of the digits are provided, we determine the gradient that's used to punish or reward the weights that resulted in either the correct or the incorrect decisions. That's called backpropagation: we step backwards through the network, applying those punishments or rewards. Because of the smoothness of the activation functions, that is a mathematically efficient operation. That's where the GPUs step in.
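The forward pass just described (inputs, weights, sum, bias, activation, layer by layer) can be sketched for the 784-15-10 digit network. The layer sizes follow the lecture; the random weights here are mere placeholders for what training would produce:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 784 input pixels -> 15 hidden neurons -> 10 output neurons.
W1, b1 = rng.normal(scale=0.1, size=(15, 784)), np.zeros(15)
W2, b2 = rng.normal(scale=0.1, size=(10, 15)), np.zeros(10)

def forward(x):
    """Apply weights, sum, add bias, activate -- once per layer."""
    hidden = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ hidden + b2)

x = rng.random(784)             # a fake 28x28 image, flattened
out = forward(x)
print(out.shape, out.argmax())  # 10 confidences; index of the most excited neuron
```

Training would adjust W1, b1, W2, b2 so that the correct output neuron fires close to one.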
So for our example of digits, the ground truth for the number six looks like the following: in the slides, y(x) equals a 10-dimensional vector where only one entry, the sixth value, is a one; the rest are zeros. That's the ground truth that comes with the image. The loss function here, the basic loss function, is the squared error: y(x) is the ground truth and a is the output of the neural network resulting from the forward pass. So when you input that image of a six and the network outputs whatever it outputs, a 10-dimensional vector, the squared difference is summed over the inputs to produce the squared error. That's our loss function, the objective function: it's what's used to determine how much to reward or punish the backpropagated weights throughout the
network. And the basic operation of optimizing that loss function, of minimizing that loss function, is done with various variants of gradient descent. It's hopefully a somewhat smooth function, but it's highly nonlinear; this is why we can't prove much about neural networks. It's a high-dimensional, highly nonlinear function that's hopefully smooth enough that gradient descent can find its way to at least a good solution. And there has to be some stochastic element there, something that jumps around, to ensure it doesn't get stuck in a local minimum of this very complex function.
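The squared-error loss and a gradient-descent step can be sketched for a single sigmoid neuron; this is a toy illustration with made-up values, the real network applies the same idea to every weight via backpropagation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error between ground truth y and the neuron's output a."""
    a = sigmoid(w * x + b)
    return (y - a) ** 2

# Gradient descent on a single weight, using the analytic gradient:
# dC/dw = -2 (y - a) * a * (1 - a) * x   (chain rule through the sigmoid)
w, b, x, y, lr = 0.0, 0.0, 1.0, 1.0, 1.0
for _ in range(200):
    a = sigmoid(w * x + b)
    w -= lr * (-2 * (y - a) * a * (1 - a) * x)

print(loss(w, b, x, y))   # the loss shrinks toward zero as w grows
```

Each step nudges the weight against the gradient, which is exactly the "tweak and see if you got closer" process described above.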
Okay, that's supervised learning. There are inputs, there are outputs, there's ground truth. That's our comfort zone, because we're pretty confident we know what's going on: you have this data set, you train a network on it, you evaluate it, you write a paper and try to beat a previous paper. It's great. The problem is when you then use that neural network to create an intelligent system that you put out there in the world, and now that system is no longer working with your data set; it has to exist in a world that may be very different from the ground truth.

So the takeaway from supervised learning is that neural networks are great at memorization, but, in a sort of philosophical way, they might not be great at generalizing, at reasoning beyond the specific flavor of data set they were trained on. The hope for reinforcement learning is that we can extend the knowledge we gain in a supervised way to the huge world outside, where we don't have the ground truth of how to act, of how good or bad a certain state is. It's a kind of brute-force reasoning, and I'll talk about what I mean there, but it feels closer to reasoning, as opposed to memorization. That's a good way to think of supervised learning: as memorization. You're just studying for an exam, and as many of you know, that doesn't mean you're going to be successful in life just because you get an A.
And so a reinforcement learning agent, or just any agent, a human being or any machine existing in this world, can operate in the following way. From the perspective of the agent, it can execute an action, it can receive an observation resulting from that action in the form of a new state, and it can receive a reward or a punishment. You could break down our entire existence in this way. It's a simplistic view, but a convenient one on the computational side. And from the environment's side, the environment receives the action and emits the observation: your action changes the world, therefore the world has to change, and then tell you about it, and give you a reward or a punishment for it.
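The act, observe, get-rewarded loop just described can be sketched abstractly; `ToyEnv` here is a made-up stand-in for any world, with illustrative rewards:

```python
import random

class ToyEnv:
    """A made-up world: the agent walks a line and is rewarded at position 3."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        """Receive the agent's action, change the world, emit the new state
        and a reward (+1 at the goal, a slight -0.1 cost otherwise)."""
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else -0.1
        return self.state, reward

env = ToyEnv()
for t in range(5):
    action = random.choice([-1, 1])    # the agent executes an action...
    state, reward = env.step(action)   # ...and receives an observation and a reward
    print(t, action, state, reward)
```

This is the same interface shape popular RL toolkits use: the agent only ever sees `step`'s outputs, never the environment's internals.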
So let's look again at one of the most fascinating things; I'll try to convey why it's fascinating a little bit later on. This is the work of DeepMind on Atari. This is Atari Breakout, a game where a paddle has to move around. That's the world it exists in. The agent is a paddle, there's a bouncing ball, and your actions are move right and move left; you're trying to move in such a way that the ball doesn't get past you. And here is human-level performance from that agent. So what does this paddle have to do? It has to operate in this environment; it has to act: move left, move right. Each action changes the state of the world. This may seem obvious, but moving right changes, visually, the state of the world; in fact, what we're watching now on the slides is the world changing before your eyes for this little guy. And it gets rewards or punishments. Rewards it gets in the form of points, racking up in the top left of the video; and when the ball gets past the paddle, it gets punished by "dying," quote unquote. That's the number of lives it has left, going from five to four to three, down to zero.
And so the goal is to select, at any one moment, the action that maximizes future reward, without any knowledge of what reward is in the greater sense of the word. All you have is the instantaneous reward or punishment, the instantaneous response of the world to your actions.

This can be modeled as a Markov decision process. A Markov decision process is a mathematically convenient construct: it has no memory. All you get is a state that you're currently in; you perform an action, you get a reward, and you find yourself in a new state, and that repeats over and over. You start from state zero, you go to state one, you once again perform an action, get a reward, go to the next state. That's the formulation we're operating in. When you're in a certain state, you have no memory of what happened two states ago; everything operates instantaneously.
So what are the major components of a reinforcement learning agent? There's a policy: that's the function, broadly defined, of the agent's behavior. It includes the knowledge, for any given state, of which action I will take, with some probability. There's a value function: how good each state, and each action in any particular state, is. And there's a model. Now, this is a subtle thing that is actually the biggest problem with everything you'll see today: the model is how we represent the environment. What you'll see today are some amazing things that neural networks can achieve on a relatively simplistic model of the world, and the question is whether that can extend to the real world, where human lives are at stake, in the case of driving.
So let's look at a simplistic world: a robot in a room. You start at the bottom left; your goal is to get to the top right. Your possible actions are going up, down, left, and right. Now, this world could be deterministic, which means when you choose to go up, you actually go up; or it could be non-deterministic, as human life is: when you go up, sometimes you go right. In this case, if you choose to go up, you move up 80% of the time, you move left 10% of the time, and you move right 10% of the time. When you get to the top right cell you get a reward of +1; when you get to the cell at (4,2), just below it, you get -1, you get punished; and every time you take a step, you get a slight punishment of -0.04.

Okay, so the question is: if you start at the bottom left, is this a good solution, a good policy by which to exist in this world? It is, if the world is deterministic: whenever you choose to go up, you go up; when you choose to go right, you go right. But if the actions are stochastic, that's not the case. With what I described previously, probability 0.8 of going up and probability 0.1 each of going left and right, this is the optimal policy.

Now, if we punish every single step with a -2 instead of -0.04, so every time you take a step it hurts, you're going to try to get to a positive cell as quickly as possible. And that's what this policy says: I'll walk through the -1 if I have to, as long as I stop getting the -2. Now, if the reward for each step is -0.1, you might choose to go around that -1 cell, a slight detour to avoid the pain. And you might take an even longer detour as the reward for each step goes up, or the punishment goes down, I guess. And then, if there's an actual positive reward for every step you take, you'll never go to the finish line; you'll just wander the world. We saw that with the Coast Runners boat yesterday, the boat that chose not to finish the race because it was having too much fun getting points in the middle.
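The stochastic "go up" action (up 80% of the time, a 10% slip to each side) can be sketched directly; the probabilities come from the slide, while the sampling code itself is just an illustration:

```python
import random

random.seed(0)

# Intended action -> list of (probability, actual action).
# "Up" succeeds 80% of the time; 10% each you slip left or right.
SLIP = {"up": [(0.8, "up"), (0.1, "left"), (0.1, "right")]}

def sample_action(intended):
    """Sample the action that actually happens in the stochastic world."""
    r, cumulative = random.random(), 0.0
    for p, actual in SLIP[intended]:
        cumulative += p
        if r < cumulative:
            return actual
    return SLIP[intended][-1][1]

counts = {"up": 0, "left": 0, "right": 0}
for _ in range(10_000):
    counts[sample_action("up")] += 1
print(counts)   # roughly 8000 / 1000 / 1000
```

This slip is exactly why a policy that is optimal in the deterministic world can be a bad one here.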
So let's look at the world this agent is operating in through its value function. That value function depends on reward, reward that comes in the future, and that future reward is discounted: because the world is stochastic, we can't expect the reward to come to us exactly the way we hope, based on the policy, based on the way we choose to act. So there's a gamma there that, as the reward sits farther and farther into the future, scales that reward down, diminishing the impact of that future reward on your evaluation of the current state. And so your goal is to develop a strategy that maximizes the discounted future reward, this discounted sum.
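The discounted sum can be computed directly; gamma and the reward sequence below are illustrative values, not from the lecture:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t: rewards farther in the future count for less."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0]                 # a reward arriving two steps from now
print(discounted_return(rewards, 0.9))    # 0.9^2 * 1 ≈ 0.81
print(discounted_return(rewards, 1.0))    # undiscounted: the full 1.0
```

With gamma below one, the same +1 is worth less the longer you have to wait for it, which is what pushes the agent toward sooner rewards.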
In reinforcement learning, there are a lot of approaches for coming up with a good policy, a near-optimal or optimal policy, and there's a lot of fun math there. You can try to construct a model that optimizes some estimate of this world. You can, in a Monte Carlo way, just simulate that world and see how it unrolls, and as it unrolls, try to compute the optimal policy. Or, what we'll talk about today: Q-learning. It's an off-policy approach, where the policy is estimated as we go along. The policy is represented as a Q-function. The Q-function, shown there on the left (I apologize for the equations; I lied, there will be some equations), takes as input a state at time t, s_t, and an action you choose to take in that state, a_t, and your goal in that state is to choose the action that maximizes the expected reward. What Q-learning does, and I'll describe the process, is approximate, through experience, the optimal Q-function: the optimal function that tells you how to act in any state of the world. You just have to live it. You have to simulate this world, you have to move about it, you have to explore, in order to see every possible state, try every different action, get rewarded, get punished, and figure out what the optimal thing to do is.
That's done using this Bellman equation. On the left, the output is the new estimate, the Q-function estimate for that state and action, and this is the update rule at the core of Q-learning. You take the old estimate and, weighted by the learning rate alpha, between 0 and 1, update the evaluation of that state based on the new reward you received at that time. So you've arrived in a certain state s_t, you try an action, you get a certain reward, and you update your estimate of that state-action pair based on this rule. When the learning rate is zero, you don't learn: when alpha is zero, you never change your world view based on the new incoming evidence. When alpha is one, you change your evaluation every time, based entirely on the new evidence.

And that's the key ingredient of reinforcement learning: first you explore, then you exploit. First you explore in a non-greedy way, and then you get greedy: you figure out what's good for you, and you keep doing it. So if you want to learn an Atari game, first you try every single action in every state; you screw up, get punished, get rewarded, and eventually you figure out what's actually the right thing to do, and you just keep doing it. That's how you win against the greatest human players in the world in a game of Go, for example, as we'll talk about.
And the way you do that is to have an epsilon-greedy policy: with probability 1 - epsilon you perform the optimal, greedy action, and with probability epsilon you perform a random action, the random action being exploration. So as epsilon goes down from 1 to 0 over time, you explore less and less.
So the algorithm here is really simple. On the bottom of the slide is the algorithm version, the pseudocode version, of the Bellman equation update. You initialize your estimate of the state-action pairs arbitrarily, to random numbers. Now, this is an important point: when you start playing, or living, or doing whatever you're doing with reinforcement learning, or driving, you have no preconceived notion of what's good and bad. It's random, or however you choose to initialize it, and the fact that it learns anything is amazing. I want you to remember that; that's one of the amazing things about Q-learning at all, and then about the deep neural network version of Q-learning.

The algorithm repeats the following steps. You step into the world and observe an initial state. You select an action a: if you're exploring, that will be a random action; if you're greedily pursuing the best action you can, it will be the action that maximizes the Q-function. You observe a reward after you take the action, and the new state you find yourself in. Then you update your estimate of the previous state you were in, having taken that action, using that Bellman equation update. And you repeat this over and over.

And so, there on the bottom of the slide is a summary of life.
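The steps above can be sketched as tabular Q-learning with an epsilon-greedy policy, applying the update Q(s,a) ← Q(s,a) + α [r + γ max over a′ of Q(s′,a′) − Q(s,a)]. The chain world, constants, and epsilon schedule here are all made-up illustrations, not the DeepTraffic setup:

```python
import random

random.seed(0)

N_STATES, ACTIONS = 5, [0, 1]          # actions: 0 = left, 1 = right
GOAL = N_STATES - 1
ALPHA, GAMMA, EPISODES = 0.5, 0.9, 500

def step(state, action):
    """Toy chain world: reach the rightmost state for reward +1."""
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0)

# Initialize the state-action estimates arbitrarily, to random numbers.
Q = {(s, a): random.random() for s in range(N_STATES) for a in ACTIONS}

for episode in range(EPISODES):
    s = 0
    epsilon = 1.0 - episode / EPISODES             # explore first, exploit later
    for _ in range(20):
        if random.random() < epsilon:
            a = random.choice(ACTIONS)             # explore: random action
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])   # exploit: greedy action
        s2, r = step(s, a)
        # Bellman update (goal is terminal, so no future value there).
        target = r + (0.0 if s2 == GOAL else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        if s2 == GOAL:
            break
        s = s2

# After training, the greedy policy heads right from every non-goal state.
print([max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)])
```

Despite the random initialization, the table converges: Q for "right" near the goal approaches 1, and earlier states settle near gamma raised to the distance remaining.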
Yes, the Q-function... yes, yes. The question was: is the Q-function a single value? And yes, it's just a single continuous value.
The question was: how do you model the world? So let's start with this very simplistic world of the Atari paddle. You could model it as a paddle that can move left and right, some blocks, and the physics of the ball. That requires a lot of expert knowledge about that particular game; you sit there handcrafting the model, and that's hard to do even for a simplistic game. The other approach you could take is to look at this world the way humans do, visually: take the model in as a set of pixels. The model is all the pixels of the world. You know nothing about paddles or balls or physics or colors or points; they're just pixels coming in. That seems like a ridiculous model of the world, but it seems to work for Atari, and it seems to work for human beings. When you're born, there's light coming into your eyes, and, as far as we know, you don't come with an instruction manual: there are people in the world, there are good guys and bad guys, this is how you walk. No. All you get is light, sound, and the other senses, and you get to learn about everything. Every single thing you think of as the way you model the world is a learned representation, and we'll talk about how a neural network does that: it learns to represent the world. But if we have to hand-model the world, it's an impossible task. That's the answer to the question: if we have to hand-model the world, then that world had better be a simplistic one.
Yeah, that's a great question. The question was about the robustness of this model: what if the way you represent the world is even slightly different from the way you thought that world is? That's not well studied, as far as I'm aware. I mean, it's already amazing that, given a certain input, a certain model of the world, you can learn anything at all. The question, and it's an important one, and we'll talk a little bit about it, not about the world model but about the reward function: if the reward function is slightly different, if the real reward function of life, or of driving, or of Coast Runners, is different from what you expected it to be, what's the negative there? Yeah, it could be huge.

So, there was another question? Or no, never mind.
Yep, sorry, can you ask that again? Yes, you can change it over time. The question was: do you change the alpha value over time? And you certainly should change the alpha value over time. Yeah, anything else?
The question was about the complex interplay of the epsilon schedule with the Q-learning update. That's 100% hand-tuned, fine-tuned to the particular learning problem. The larger the number of states in the world, and the larger the number of actions, the longer you have to wait before you decrease epsilon to zero. But you have to play with it; it's one of the parameters you have to play with, unfortunately, and there are quite a few of them, which is why you can't just drop a reinforcement learning agent into the world.

Oh, the effect in that sense? No, no, it's just a coin flip. If epsilon is 0.5, half the time you're going to take a random action. There's nothing specific about it; it's not like you take the best action, and then with some probability the second best, and so on. I mean, you could certainly do that, but in the simple formulation that works, you just take a random action, because you don't want a preconceived notion of which action is good to try when you're exploring. The whole point is that you try crazy stuff, if it's a simulation.
Okay, so, good question: representation matters. This is the question of how we represent the world. We can think of this world of Breakout, for example, this Atari game, as a paddle that moves left and right and the exact positions of the different things it can hit, and construct this complex, expert-driven model that has to be fine-tuned to this particular problem. But in practice, the more complex this model gets, the worse that Bellman equation update does: trying to construct a Q-function for every single combination of states and actions becomes too difficult, because that function is too sparse and huge. Instead, you can look at this world in a general way, the way human beings would: as a collection of pixels, visually. This game is a collection of 84 by 84 pixels, an RGB image. And then you look at not just the current image but the temporal trajectory of those images: if there's a ball moving, you want to know about that movement, so you look at four images, the current image and three images back. Say they're grayscale with 256 gray levels. The size of the Q-table that the Q-value function has to learn is then 256 to the power of 84 times 84 times 4. Whatever that number is, it's certainly larger than the number of atoms in the universe. That's a large number, and you would have to run the simulation long enough to touch, at least a few times, most of the states in that Q-table.
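The size of that table is easy to check; this is just arithmetic on the numbers from the lecture (84×84 pixels, 4 frames, 256 gray levels):

```python
import math

pixels = 84 * 84 * 4        # four stacked 84x84 grayscale frames
states = 256 ** pixels      # 256 gray levels per pixel

# Count decimal digits via log10 (the integer itself has ~28k-digit exponent).
digits = int(math.log10(states)) + 1
print(digits)   # tens of thousands of digits; atoms in the observable universe: ~10^80
```

A state space with tens of thousands of digits in its size cannot be tabulated, which is exactly why a function approximator has to replace the table.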
As Elon Musk says, we may live in a simulation; you might have to run another universe just to compute the Q-function in this case. So that's where deep learning steps in: instead of modeling the world as a Q-table, you try to learn that function. And so, the takeaway from supervised learning, if you remember, is that it's good at memorizing; we're good at memorizing data. The hope for reinforcement learning, with Q-learning, is that we can extend the occasional rewards we get to generalize over the actions we take in that world leading up to those rewards. And the hope for deep learning is that we can move this reinforcement learning system into a world that can be defined arbitrarily: one that can include all the pixels of an Atari game, or all the pixels sensed by a drone or a robot or a car. It still needs a formalized definition of that world, but that's much easier to do when you're able to take in sensors, like an
image. So, deep Q-learning, the deep version: instead of learning a Q-table, we try to estimate that Q-function, to learn it using machine learning. It tries to learn the parameters of this huge, complex function. And the way we do that is with a neural network, the same kind I showed that learned to map from an image of a digit to a classification of that image into a number: the same kind of network is used to take in a state and an action and produce a Q-value.
Now, here's the amazing thing: without knowing anything in the beginning (as I said with the Q-table, it's initialized randomly), the Q-function represented through this deep network knows nothing at the start. All it knows are the rewards you get in the simulated world for a particular game, so you have to play time and time again and see the rewards you get for every single iteration of the game. In the beginning it knows nothing, and it's able to learn to play better than human beings. This is a DeepMind paper, "Playing Atari with Deep Reinforcement Learning," from 2013. It's one of the key things that got everybody excited about deep learning and artificial intelligence: using a convolutional neural network, which I'll talk about tomorrow, but it's a vanilla network like any other I talked about earlier today, a regular network that takes the raw pixels, as I said, and estimates the Q-function from those raw pixels, and it's able to play many of those games better than a human being.
being and the loss function that I
mentioned previously
so again very uh vanilla loss function
very simple objective function the the
first one you'll probably implement we
have a tutorial in tensor
flow squared error so we take this
Bellman
equation where the estimate is Q the Q
function estimate of state and action is
the maximum reward you get for taking
any of the
actions uh that takes you to any of the
future
States and
you try to take that action observe that
the result of that action and if the
target is
different that your learned Target the
learned what the function is learned is
the expected reward in that case is
different than what you actually
got you adjust it you adjust the weights
on the network
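That squared-error objective over the Bellman target can be sketched in a few lines. This is not the course's actual code; the discount factor and the stand-in Q values below are assumptions for illustration:

```python
GAMMA = 0.99  # assumed discount factor

def td_target(reward, next_q_values, done):
    """Bellman target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + GAMMA * max(next_q_values)

def squared_error_loss(q_estimate, reward, next_q_values, done):
    """Squared difference between the Bellman target and the network's estimate."""
    return (td_target(reward, next_q_values, done) - q_estimate) ** 2

# Tiny worked example: current estimate 1.0, reward 1.0, best next-state value 2.0
loss = squared_error_loss(1.0, 1.0, [0.5, 2.0], done=False)
print(round(loss, 4))  # (1 + 0.99 * 2 - 1)^2 = 1.98^2 = 3.9204
```

In the real system the `q_estimate` and `next_q_values` come from forward passes through the network, and the loss gradient is what gets backpropagated.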
And this is exactly the process by which we learn how to exist in this pixel world: you're mapping states and actions to a Q value. The algorithm is as follows; this is how we train it. We're given a transition (s, a, r, s'): s, the current state; a, the action taken in that state; r, the reward you get; and s', the state you find yourself in. We replace the basic update rule in the previous pseudocode: we take a forward pass through the network given that state s and look at the predicted Q value of that action. We then do another forward pass through the network and see what we actually get, and if we're totally off, we backpropagate through the weights in a way that will make less of that mistake next time. And you repeat this process. This is a simulation; you're learning against yourself.
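Action selection during that loop is epsilon-greedy, the same exploration rule as in the tabular case. A minimal sketch, where the action count and the decay length are illustrative assumptions, not the lecture's values:

```python
import random

NUM_ACTIONS = 4        # illustrative action count
EPSILON_START = 1.0    # begin by exploring almost always
EPSILON_END = 0.0      # end by exploiting almost always
DECAY_STEPS = 10_000   # anneal linearly over this many steps (assumed)

def epsilon_at(step):
    """Linearly anneal epsilon from EPSILON_START toward EPSILON_END."""
    frac = min(step / DECAY_STEPS, 1.0)
    return EPSILON_START + frac * (EPSILON_END - EPSILON_START)

def select_action(q_values, step):
    """With probability epsilon pick a random action, else the best-valued one."""
    if random.random() < epsilon_at(step):
        return random.randrange(NUM_ACTIONS)
    return max(range(NUM_ACTIONS), key=lambda a: q_values[a])

print(epsilon_at(0))       # 1.0
print(epsilon_at(5_000))   # 0.5
print(epsilon_at(20_000))  # 0.0
```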
And again, the same rule applies here, exploration versus exploitation: you start out with an epsilon of one, mostly exploring, and then you move towards an epsilon of zero. With Atari Breakout, this is the DeepMind paper result: training epochs on the x-axis, and on the y-axis the average action value and the average reward per episode. I'll show why it's kind of an amazing result, but it's messy, because there are a lot of tricks involved. It's not just putting in a bunch of pixels of a game and getting an agent that knows how to win at that game; there's a lot of pre-processing and playing with the data required. Which is unfortunate, because the truth is messier than the hope.
hope but one of the critical tricks
needed is called experience
replay so as opposed to letting an agent
so you're learning this big Network that
learn that tries to build a model of
what's good to do in the world and
what's
not and you're learning as you
go so with with experience replay you're
keeping a track of all the things you
did and every every once in a while you
look back into your memory and pull out
some of those old experiences the old
good old times and train on those again
as opposed to letting the
agent run itself into some local Optima
where it tries to learn a very subtle
aspect of the game that actually in the
global sense doesn't get you farther to
winning the game very much like
life so here's the algorithm deep Q
learning algorithm
cedo
code we initialize the replay memory
again there's this is a this little
trick that's
required is keeping a track of stuff
that's happened in the past we
initialize the action value function Q
with random weights and observe initial
State again same thing select an action
with a probability Epsilon
explore otherwise choose the best one
based on the estimate provided by the
neural network and then carry out the
action observe the reward and store that
experience in the replay
memory and then sample random transition
from replay
memory so uh with a certain probability
you bring those old times back to get
yourself out of the local
Minima and then you train the Cure the Q
network using the the difference
between your what you actually got and
your
estimate you repeat this process over
and
over so here's what you can do after 10
minutes of training, on the left; that's very little training. What you get is a paddle that learns hardly anything and just keeps dying; if you look, it goes from five to four to three to two to one, and those are the number of lives left. Then, after two hours of training on a single GPU, it learns to win: not die, rack up points, and keep the ball from getting past the paddle, which is great. That's human-level performance, really better than some humans, but it still dies sometimes, so it's very human-level. And then after four hours it does something really amazing: it figures out how to win at the game in a very lazy way, which is to drill a hole through the blocks up to the top and get the ball stuck up there, and then the ball does all the hard work for you. That minimizes the probability of the ball getting past your paddle, because it's just stuck in the blocks up top. That might be something you wouldn't even figure out to do yourself.

I need to pause here to clearly explain what's happening. The input to this algorithm is just the pixels of the game; it's the same thing that human beings take in through visual perception. And under this constrained definition of what is a reward and a punishment, it's able to learn to get a high reward. That's general artificial intelligence, a very small example of it, but it's general, it's general-purpose. It knows nothing about games, it knows nothing about paddles or physics; it's just taking in the sensory input of the game. And they did the same thing for a bunch of different games in Atari.
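The whole loop described above (replay memory, epsilon-greedy action selection, squared-TD-error updates) can be compressed into a toy sketch. The linear "network", the one-state environment, and every hyperparameter below are illustrative stand-ins, not DeepMind's setup:

```python
import random

random.seed(0)          # reproducibility for this sketch
GAMMA = 0.99            # assumed discount factor
EPSILON = 0.1           # fixed exploration rate, for brevity
LEARNING_RATE = 0.01
NUM_ACTIONS = 2
STATE_DIM = 4

# Stand-in "network": one linear weight vector per action.
weights = [[0.0] * STATE_DIM for _ in range(NUM_ACTIONS)]

def q_values(state):
    """Forward pass: Q(s, a) for every action."""
    return [sum(w * x for w, x in zip(weights[a], state))
            for a in range(NUM_ACTIONS)]

def train_step(state, action, reward, next_state, done):
    """One gradient step on the squared TD error."""
    target = reward if done else reward + GAMMA * max(q_values(next_state))
    error = target - q_values(state)[action]
    for i in range(STATE_DIM):
        weights[action][i] += LEARNING_RATE * error * state[i]

replay = []                    # experience replay memory
state = [1.0] * STATE_DIM      # toy single-state world
for step in range(300):
    if random.random() < EPSILON:          # explore
        action = random.randrange(NUM_ACTIONS)
    else:                                  # exploit
        qs = q_values(state)
        action = qs.index(max(qs))
    reward = 1.0 if action == 1 else 0.0   # action 1 always pays off here
    next_state = state
    replay.append((state, action, reward, next_state, False))
    train_step(*random.choice(replay))     # replay an old transition
    state = next_state

qs = q_values(state)
print(qs[1] >= qs[0])  # True: the rewarding action's value wins out
```

Even this toy version shows the structure: play, store the transition, sample an old one, and nudge the weights toward the Bellman target.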
What's shown here in this plot, on the x-axis, is a bunch of different games from Atari, and on the y-axis is a percentile, where 100% is about the best that human beings can do, meaning it's the score that human beings would get. Everything to the left of that line is far exceeding human-level performance, and everything below is on par with or worse than human-level performance. So it can learn so many of these: boxing, pinball, all of these games, and it doesn't know anything about any of the individual games; it's just taking in pixels. It's just as if you put a human being behind any of these games and asked them to learn to beat the game. And there have been a lot of improvements on this algorithm recently. Yes, a question? No, nope, there's no... so the
question was: do they customize the model for a particular game? And no; you could, of course, but the point is that it doesn't need to be customized for the game. But the important thing is that it's still only Atari games, right? So the question is whether this is transferable to driving; perhaps not, right? You play one step of the game: you take an action in a state, and then you observe the result. It's simulation; I mean, that's really one of the biggest problems here: you require the simulation in order to get the ground truth. Yes? So that's a great question,
or a comment. The comment was that, for a lot of these situations, the reward function might not change at all depending on your actions; the rewards are really, most of the time, delayed 10, 20, 30 steps down the line, which is why it's amazing that this works at all. It's learning locally, and through that process of simulation, hundreds of thousands of runs through the game, it's able to learn what to do now such that it gets a reward later. If you just pause and look at the math of it, it's very simple math; and then look at the result, and it's incredible.

So there have been a lot of improvements. This one is called the General Reinforcement Learning Architecture, or Gorila. The cool thing about this, in the simulated world at least, is that you can run deep reinforcement learning in a distributed way: you can do the simulation in a distributed way, and you can do the learning in a distributed way. You can generate experiences, which is what this diagram shows, either from human beings or from simulation. So, for example, the way that AlphaGo, the DeepMind team, beat the game of Go is that they learned both from expert games and by playing against itself. So you can do the experience generation in a distributed way, and you can do the learning in a distributed way, so you can scale. And in this particular case, Gorila has achieved a better result than the DQN network that's part of their Nature paper. Okay, so let me now get
to driving for a second here. Where can reinforcement learning step in and help? This is back to the open question I asked yesterday: is driving closer to chess or to everyday conversation? Chess meaning it can be formalized in a simplistic way: we could think about it as an obstacle-avoidance problem, and once the obstacle avoidance is solved, you just navigate that constrained space; you choose to move left, you choose to move right in a lane, you choose to speed up or slow down. Well, if it's a game like chess, which we'll assume for today (as opposed to tomorrow), we're going to go with the one on the left, and we're going to look at Deep Traffic. Here's the game,
a simulation where the goal is to achieve the highest average speed you can on a seven-lane highway full of cars. As a side note, a requirement for students is that they have to follow the tutorial, which I'll link at the end of this presentation, and what they have to do is build a network that achieves a speed of 65 miles per hour. There is a leaderboard, and you get to submit the model you come up with, with a simple click of a button; all of this runs in the browser, which is another amazing thing. And then you immediately, or relatively soon, make your way up the leaderboard. So let's zoom in:
what is this two-dimensional world of traffic? What does it look like to the intelligent system? We discretize that world into a grid, shown here on the left; that's the representation of the state. There are seven lanes, and every single lane is broken up into blocks spatially. If there's a car in a block (the length of a car is about three of those grid blocks), then that grid cell is seen as occupied. And the red car is you; that's the thing the intelligent agent is running in. On the left is the current speed of the red car (it actually says MIT on top), and then you also have a count of how many cars you've passed; and if your network sucks, that number is going to end up negative.
You can also change, with a drop-down, the simulation speed, from normal on the left to fast on the right. Fast speeds up the replay of the simulation; the one on the left, normal, feels a little more like real driving. There's a drop-down for different display options; the default is none, in terms of stuff shown on the road. Then there is the learning input: while the whole space is discretized, you can choose what your car sees; you can choose how far ahead it sees, how far behind, and how far to the left and right it sees. And so by choosing to visualize the learning input, you get to see what you set that input to be. Then there is the safety
be then there is the safety
system this is a system that protects
you from
yourself the way we've made this
game is that it operates under something
similar if you have some intelligence if
you drive and you have um adaptive
cruise control in in your car it
operates in the same way it when it gets
close to the car in front it slows down
for you and it doesn't let you run the
car to the left of you to the right of
you off the road so it constrains the
movement capabilities of your car in
such a way that you don't hit
anybody because then it would have to
simulate collisions and it would just be
a
mess so it protects you from that and so
you can choose to visualize that quote
unquote safety system
with the visualization box and then you
can also choose to visualize the full
map this is the full occupancy map that
you get if you would like
to provide as input to the
network now that input for every single
grid that it's a number it's not just a
01 whether there's a car in there it's
the maximum speed limit which is 80
miles per hour don't don't go don't get
Crazy 80 M hour this a speed
limit that block when it's empty is set
to the 85 miles uh 80 M hour and when
it's occupied it's set to the number
that's uh the speed of the
car and then the blocks that you the red
car is occupying is set to the number to
a very large number much higher than the
speed limit
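That state encoding can be sketched as follows. The grid dimensions, the "self" marker value of 200, and the helper names here are illustrative assumptions, not DeepTraffic's actual source:

```python
SPEED_LIMIT = 80.0   # mph; empty cells carry the speed limit
SELF_MARKER = 200.0  # a value well above the speed limit marks your own car

def build_grid(lanes, blocks, cars, ego_cells):
    """Occupancy grid: speed limit where empty, car speed where occupied,
    and a large marker value on the cells under the red (ego) car."""
    grid = [[SPEED_LIMIT] * blocks for _ in range(lanes)]
    for (lane, block), speed in cars.items():
        grid[lane][block] = speed
    for lane, block in ego_cells:
        grid[lane][block] = SELF_MARKER
    return grid

# 7 lanes, 10 blocks; one other car doing 55 mph; the ego car spans 3 blocks
cars = {(2, 4): 55.0}
ego = [(3, 7), (3, 8), (3, 9)]
grid = build_grid(7, 10, cars, ego)
print(grid[2][4])  # 55.0  (occupied cell carries that car's speed)
print(grid[0][0])  # 80.0  (empty cell carries the speed limit)
print(grid[3][8])  # 200.0 (ego cells carry the large marker)
```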
So, the safety system: shown in red are the parts of the grid that your car can't move into. Question? What's that? Yes; the question was, what was the third option I just mentioned, and it's you, the red car itself. The blocks underneath that car are set to a really high number; it's the way for the learning algorithm to know that those blocks are special. So the safety system shows red here if the car can't move into those blocks: when it lights up red in front, it means the car can't speed up anymore, and when the blocks to the left or to the right light up red, that means you can't change lanes to the left or right. On the right of the slide, you're free to go, free to do whatever you want; that's what it indicates when all the blocks are yellow: the safety system says you're free to choose any of the five actions. And the five actions are: move left, move right, stay in place, accelerate, or slow down.

Those actions are what's produced by what's called here the brain. The brain takes in the current state as input, along with the last reward, and uses that reward to train the network through the backward function; that's backpropagation. Then you ask the brain, given the current state, to give you the next action, with a forward pass, the forward function. You don't need to know the operation of these functions in particular; this is not something you need to worry about, but if you want, you can customize this learning step. By the way, what I'm describing now is just a few lines of code right there in the browser that you can change, and immediately, well, with the press of a button, it changes the simulation or the design of the network. You don't need any special hardware, you don't need to do anything special, and the tutorial cleanly outlines exactly all of these steps. But it's kind of amazing that you can design a deep neural network, one that's part of the reinforcement learning agent, so a deep Q-learning agent, right there in the
browser. You can choose the lanesSide variable, which controls how many lanes to the side you see: when that value is zero, you only look forward; when that value is one, you have one lane to the left and one lane to the right. It's really the radius of your perception system. patchesAhead is how far ahead you look, and patchesBehind is how far behind you look. So, for example, here lanesSide equals two; that means it looks two lanes to the left and two to the right, and obviously, if two to the right is off-road, it provides a value of zero in those blocks. If we set patchesBehind to ten, it looks ten patches back, where patch one starts from the front of the car.
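The number of grid cells the network sees follows directly from those three variables. Here is the straightforward count (the variable names mirror the ones just described; the formula is the plain lanes-times-patches product, not necessarily DeepTraffic's exact input-size computation):

```python
def num_input_cells(lanes_side, patches_ahead, patches_behind):
    """Visible lanes times visible patches: the ego lane plus lanes_side
    lanes on each side, and patches_ahead + patches_behind rows."""
    lanes_visible = 2 * lanes_side + 1
    patches_visible = patches_ahead + patches_behind
    return lanes_visible * patches_visible

print(num_input_cells(lanes_side=0, patches_ahead=1, patches_behind=0))    # 1
print(num_input_cells(lanes_side=2, patches_ahead=10, patches_behind=10))  # 100
```

Widening the perception radius grows the input quickly, so there is a real trade-off between how much the agent sees and how hard the network is to train.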
The scoring for the evaluation, for the competition, is your average speed over a predefined period of time. The method we use to collect that speed is: we run the agent for 10 runs, about 30 simulated minutes of game each, and take the median speed of the 10 runs; that's the score. This is done server-side. And given that this code has recently gotten some publicity online, this might be a dangerous thing to say, but there's no cheating possible: the scoring is done server-side, and while this is JavaScript running in the browser, it's hopefully sandboxed so that you can't do anything tricky. But we dare you to try. You can try it locally to get an estimate; there's a button that says evaluate, and it gives you a score right back of how well you're doing with the current network. That button
is Start Evaluation Run: you press the button, it shows a progress bar, and it gives you the average speed. There's a code box where you modify all the variables I mentioned (the tutorial describes this in detail), and then, once you're ready, once you've modified a few things, you can press Apply Code. It restarts: it kills all the training that you've done up to this point, resets it, and starts the training again. So save often; there's a save button. The training is done on a separate thread, in Web Workers, which are exciting things that allow JavaScript to run, amazingly, on multiple CPU cores in a parallel way. So the training is done a lot faster than real time, a thousand frames a second, a thousand movement steps a second. This is all in JavaScript, and the network gets shipped to the main simulation from time to time as the training goes
on so 