Transcript
U1toUkZw6VI • MIT 6.S094: Convolutional Neural Networks for End-to-End Learning of the Driving Task
Kind: captions
Language: en
All right, welcome back everyone. Sound okay? All right. So yesterday we started to talk about neural networks; today we'll continue with neural networks that work with images, convolutional networks, and see how those types of networks can help us drive a car. If we have time, we'll cover a simple illustrative case study of detecting traffic lights, the problem of detecting green, yellow, red. If we can't teach our neural networks to do that, we're in trouble, but it's a good, clear, illustrative case study of a three-class classification
problem. Okay, next there's DeepTesla, here looped over and over in a very short GIF. This is actually running live on a website right now; we'll show it towards the end of the lecture. Once again, just like DeepTraffic, this is a neural network that learns to steer a vehicle based on video of the forward roadway, and once again it does all of that in the browser using JavaScript, so you'll be able to train your own network to drive using real-world data. I'll explain how. We will also have a tutorial and code, briefly described today at the end of the lecture if there's time, on how to do the same thing in TensorFlow. So if you want to build a network that's bigger and deeper, and you want to utilize GPUs to train that network, you don't want to do it in your browser; you do it offline using TensorFlow, with a powerful GPU on your computer, and we'll explain how to do that. Computer
vision: yesterday we talked about vanilla machine learning, where the size of the input is small; for the most part, the number of neurons in the case of a neural network is on the order of 10, 100, 1,000. When you think of images, images are a collection of pixels. One of the most iconic images from computer vision, in the bottom left there, is Lena. I encourage you to Google it and figure out the story behind that image; it's quite shocking, as I found out recently. So once again, computer vision
is these days dominated by data-driven approaches, by machine learning, where all of the same methods that are used on other types of data are used on images. The input is just a collection of pixels, and pixels are numbers, discrete values from 0 to 255. So, exactly as we've talked about previously, we can think of images in the same exact way: it's just numbers. And we can do the same kinds of things. We can do supervised learning, where you have an input image and an output label; the input image here is a picture of a woman, and the label might be "woman". Unsupervised learning, same thing; we'll look at that briefly as well: clustering images into categories. Again, semi-supervised and reinforcement learning: in fact, the Atari games I talked about yesterday do some pre-processing on the images; they're doing computer vision, they're using convolutional neural networks, as we'll discuss today.

And the pipeline for supervised learning is again the same. There's raw data in the form of images, and there are labels on those images. A machine learning algorithm performs feature extraction, trains given the inputs and outputs, the images and the labels of those images, constructs a model, and then we test that model and get a metric: an accuracy. Accuracy is the term most often used to describe how well a model performs; it's a percentage. I apologize for the constant presence of cats throughout this course; I assure you this course is about driving, not
cats. But images are numbers. We take it for granted: as human beings we're really good at converting visual perception into semantics. We see this image and we know it's a cat, but a computer only sees numbers: RGB values for a color image, three values for every single pixel, each from 0 to 255. And so, given that image, we can think of two problems: one is regression and the other is classification. Regression is when, given an image, we want to produce a real-valued output back. So if we have an image of the forward roadway, we want to produce a value for the steering wheel angle. And if you have an algorithm that's really smart, it can take any image of the forward roadway and produce the perfectly correct steering angle that drives the car safely across the United States. We'll talk about how to do that, and where that
fails. Classification is when the input again is an image and the output is a class label, a discrete class label. Underneath it, though, is often still a regression problem: what's produced is a probability that this particular image belongs to a particular category, and we use a threshold to chop off the outputs associated with low probabilities, take the labels associated with the high probabilities, and convert that into a discrete classification.
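That thresholding step can be sketched in a few lines; the class names and raw scores below are made up for illustration, not from any actual network in the lecture:

```python
import numpy as np

def softmax(logits):
    """Turn raw network outputs into probabilities that sum to one."""
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()

# Hypothetical raw scores for three classes
labels = ["cat", "dog", "mug"]
probs = softmax(np.array([2.0, 1.0, -1.0]))

# Chop off low-probability outputs; keep only the confident labels
threshold = 0.5
predictions = [l for l, p in zip(labels, probs) if p > threshold]
print(predictions)  # only the high-probability label survives the threshold
```
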
Computer vision is hard; it bears saying again. We take it for granted: as human beings we're really good at dealing with all of these problems. There's viewpoint variation: the object looks totally different, in terms of the numbers behind the image, in terms of the pixels, when viewed from a different angle. There's scale: objects, when you're standing far away from them or up close, are totally different sizes. We're good at detecting that they're different sizes but still the same object; that's still a really hard problem, because those sizes can vary drastically. We talked about occlusions and deformations, with cats: a well-understood problem. There's background clutter: you have to separate the object of interest from the background, and given the three-dimensional structure of our world, there's often a lot of stuff going on in the background. There's intra-class variation that's often greater than inter-class variation, meaning objects of the same type often have more variation among themselves than against the objects you're trying to separate them from. And there's the hard one for driving: illumination. Light is the way we perceive things, the reflection of light off a surface, and the source of that light changes the way an object appears; we have to be robust to all of
that. So the image classification pipeline is the same as I mentioned. There are categories (it's a classification problem), so there are categories of cat, dog, mug, hat, and you have a bunch of image examples of each of those categories. The input is just those images paired with the category, and you train to estimate a function that maps from the images to the categories. For all of that you need data, a lot of it. Unfortunately, while there's a growing number of datasets, they're still relatively small; we get excited that there are millions of images, but they're not billions or trillions of images.

And these are the datasets you will see most often if you read the academic literature. MNIST, the one that's been beaten to death, and the one we'll use as well in this course, is a dataset of handwritten digits where the categories are zero to nine. ImageNet, one of the largest fully labeled image datasets in the world, has images with a hierarchy of categories from WordNet; what you see there is a labeling of which images, associated with which words, are present in the dataset. CIFAR-10 and CIFAR-100 are tiny images used to prove, in a very efficient and quick way, offhand, that the algorithm you're trying to publish or trying to impress the world with works well; it's a small dataset, and CIFAR-10 means there are 10 categories. And Places is a dataset of natural scenes: woods, nature, city, and so on.

So let's look at CIFAR-10. It's a dataset of 10 categories, airplane, automobile, bird, cat, and so on, shown there with sample images as the rows. And so let's build a classifier that's able to take images from one of these 10 categories and tell us what is shown in the image. How do we do that? Once again, all the algorithm sees is numbers. So at the very core we have to have an operator for comparing two images: given an image, if I want to say whether it's a cat or a dog, I want to compare it to images of cats and compare it to images of dogs and see which matches better. So there has to be a comparative
operator. Okay, so one way to do that is to take the absolute difference between the two images, pixel by pixel: take the absolute value of the difference between each individual pixel, shown on the bottom of the slide for a 4x4 image, and then sum that pixel-wise absolute difference into a single number. If the images are totally different pixel-wise, that will be a high number; if it's the same image, the number will be zero. That's called the L1 distance; the particular choice doesn't matter much. When we speak of distance, we usually mean the L2 distance.
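That pixel-wise comparison is only a few lines of numpy; the two 4x4 "images" below are made-up values in the 0-255 range, not the slide's actual numbers:

```python
import numpy as np

def l1_distance(img_a, img_b):
    """Sum of pixel-wise absolute differences (L1 distance)."""
    return int(np.abs(img_a.astype(int) - img_b.astype(int)).sum())

def l2_distance(img_a, img_b):
    """Euclidean (L2) distance between the two images."""
    diff = img_a.astype(int) - img_b.astype(int)
    return float(np.sqrt((diff ** 2).sum()))

# Two toy 4x4 grayscale images, values 0..255
a = np.array([[56, 32, 10, 18], [90, 23, 128, 133],
              [24, 26, 178, 200], [2, 0, 255, 220]], dtype=np.uint8)
b = np.array([[10, 20, 24, 17], [8, 10, 89, 100],
              [12, 16, 178, 170], [4, 32, 233, 112]], dtype=np.uint8)

print(l1_distance(a, a))  # 0: identical images
print(l1_distance(a, b))  # a large number: the images differ
```
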
And so we can build a classifier that just uses this operator to compare a query image to every single image in the dataset, and pick the category that's closest under this comparative operator. I have a picture of a cat; I look through the dataset, find the image that's closest to this picture, and say that's the category this picture belongs to. If we just flipped a coin and randomly picked which category an image belongs to, the accuracy would on average be 10%; it's random. The accuracy we achieve with our brilliant image-difference algorithm, which just goes through the dataset and finds the closest image, is 38%, which is pretty good: way above 10%. You can think of this operation, looking through the dataset and finding the closest image, as what's called k-nearest neighbors, where k in this case is one, meaning you find the one closest neighbor to the image you're asking a question about and accept the label from that image.
image you could do the same thing
increasing
k k increasing K to two means you take
the two nearest neighbors you find the
two closest in terms of pixel wise image
difference images to this particular
query
image and find which category do those
belong
to what's shown up top on the left is
the data uh the data set we're working
with red green blue
what's shown in the middle is the one
nearest neighbor classifier
meaning this is how you segment the
entire space of different things that
you can
compare and if a point falls into any of
these regions it will be immediately
associated with the nearest neighbor
algorithm to belong to that image to to
that
region with uh F neighbors there's IM
immediately an issue
the issue is that there's white regions
there's tide Breakers where your five
closest
neighbors are from various categories so
it's unclear what you belong
to so if we this is a good example of
parameter tuning you have one parameter
K and you have to your task as a machine
as a teacher machine learning you have
to teach this algorithm how to do your
learning for
you is to figure out that parameter
that's called parameter tuning or
hyperparameter tuning as it's called in
neural
networks and so on the bottom right of
the slide is on the x axis is K as we
increase it from zero
to
100 and on the y- AIS is classification
accuracy it turns out that that the best
k for this data set is seven seven years
neighbors with that we get a performance
of
30% human level
performance and I should say that the
way we get that number, as with a lot of the machine learning pipeline process, is that you separate the data: there's a part of the dataset you use for training and another part you use for testing. You're not allowed to touch the testing part; that's cheating. You construct your model of the world on the training dataset, and you use what's called cross-validation, where you take a small part of the training data, shown as fold five there in yellow, leave that part out of the training, and use it for hyperparameter tuning: as you train, you figure out with that yellow part, fold five, how well you're doing; then you choose a different fold and see how well you're doing, and keep playing with parameters, never touching the test part. And when you're ready, you run the algorithm on the test data to see how well you really do, how well it really generalizes.
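The fold-rotation procedure can be sketched like this. It reuses a tiny L1-distance nearest-neighbor classifier on invented one-dimensional toy data (not the CIFAR-10 pipeline itself) to pick k; the function names are mine:

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest in L1 distance."""
    dists = np.abs(train_x - query).reshape(len(train_x), -1).sum(axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def best_k_by_cross_validation(images, labels, k_values, n_folds=5):
    """Hold out one fold at a time as the validation set, average the
    validation accuracy for each k, and return the best k.
    The test set is never touched during this search."""
    folds_x = np.array_split(images, n_folds)
    folds_y = np.array_split(labels, n_folds)
    scores = {}
    for k in k_values:
        accs = []
        for i in range(n_folds):
            # fold i is validation; the remaining folds are training data
            train_x = np.concatenate([f for j, f in enumerate(folds_x) if j != i])
            train_y = np.concatenate([f for j, f in enumerate(folds_y) if j != i])
            preds = [knn_predict(train_x, train_y, q, k) for q in folds_x[i]]
            accs.append(np.mean([p == t for p, t in zip(preds, folds_y[i])]))
        scores[k] = np.mean(accs)
    return max(scores, key=scores.get)

# Toy data: 1-D "images" labeled by whether the value is below 10
images = np.arange(20, dtype=float).reshape(-1, 1)
labels = np.array([0] * 10 + [1] * 10)
best_k = best_k_by_cross_validation(images, labels, k_values=[1, 3, 5])
print(best_k)
```
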
Yes, question? The question was: is there any good way to determine, any good intuition for, what a good k is? There are general rules of thumb for different datasets, but usually you just have to run through it: grid search, brute force.

Yes, question? Good question.
Yes. The question was: is each pixel one number or three numbers? For the majority of computer vision throughout its history, grayscale images were used, so it was one number; but RGB is three numbers, and sometimes there's a depth value too, so it's four numbers. If you have a stereo-vision camera that gives you the depth information of the pixels, that's a fourth number, and if you stack two images together it could be six. In general, everything we work with will be three numbers per pixel. And yes, the question was whether, for the absolute-value difference, it's just one number per pixel: exactly right; in that case those were grayscale images, so they're not RGB
images. So, you know, this algorithm is pretty good if we optimize its hyperparameters: choosing k of 7 seems to work well for this particular CIFAR-10 dataset, and we get 30% accuracy. That's impressive, higher than 10%. Human beings perform at slightly above 94% accuracy on CIFAR-10: given one of these tiny images (I should clarify, it's like a little icon), a human being is able to accurately determine one of the 10 categories 94% of the time. And the current state-of-the-art convolutional neural network is at 95.4% accuracy. Believe it or not, it's a heated battle, but the critical fact here is that it has recently surpassed humans, and certainly surpassed the k-nearest-neighbors
algorithm. So how does this work? Let's briefly look back. It all still boils down to this little guy, the neuron, which sums the weighted inputs, adds a bias, and produces an output based on a smooth activation function.

Yes, question? Sorry; the question was: you take a picture of a cat, so you know it's a cat, but that's not encoded anywhere. Right, you have to write that down somewhere: you have to write, as a caption, "this is my cat." And the unfortunate thing, given the internet and how witty it is, is that you can't trust the captions on images, because maybe you're just being clever and it's not a cat at all; it's a dog dressed as a cat.
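Coming back to the neuron, that sum-bias-activation unit can be sketched in a few lines; the weights, bias, and inputs below are arbitrary numbers for illustration:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single neuron: weighted sum of inputs plus a bias,
    squashed by a smooth (sigmoid) activation into (0, 1)."""
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))

out = neuron(np.array([0.5, -0.2]), np.array([0.8, 0.4]), 0.1)
print(out)  # a value strictly between 0 and 1
```
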
Yes, question? Sorry, CNNs do better than what? Right: the question was whether convolutional neural networks generally do better than nearest neighbors. There are very few problems on which neural networks don't do better. They almost always do better, except when you have almost no data; so you need data. And convolutional neural networks aren't some special magical thing: they're just neural networks with some cheating up front that I'll explain, some tricks to reduce the size and make them capable of dealing with
images. So again, in the case we looked at, classifying an image of a number: as opposed to doing any fancy convolutional tricks, we just take the entire 28x28-pixel image, that's 784 pixels, as the input. That's 784 neurons on the input, 15 neurons in the hidden layer, and 10 neurons in the output. Everything we'll talk about has the same exact structure, nothing fancy. There's a forward pass through the network, where you take an input image and produce an output classification, and there's a backward pass through the network, through backpropagation, where you adjust the weights when your prediction doesn't match the ground-truth output. And learning just boils down to optimization: it's just optimizing a smooth, differentiable function, defined as the loss function, which is usually as simple as the squared error between the true output and the one you actually got.
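That loop of forward pass, squared-error loss against the ground truth, and backward pass adjusting the weights can be sketched for a single linear neuron; the data point and learning rate here are made up:

```python
import numpy as np

def loss(y_true, y_pred):
    """Squared-error loss between ground truth and prediction."""
    return 0.5 * (y_true - y_pred) ** 2

def training_step(x, y_true, w, b, lr=0.1):
    """One forward pass and one gradient-descent weight update."""
    y_pred = np.dot(w, x) + b        # forward pass
    grad = y_pred - y_true           # dLoss/dy_pred for squared error
    w = w - lr * grad * x            # chain rule: dLoss/dw = grad * x
    b = b - lr * grad                # dLoss/db = grad
    return w, b

w, b = np.zeros(2), 0.0
x, y = np.array([1.0, 2.0]), 3.0
for _ in range(100):
    w, b = training_step(x, y, w, b)
print(np.dot(w, x) + b)  # the prediction approaches the target of 3.0
```
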
So what's the difference? What are convolutional neural networks? Convolutional neural networks take inputs that have some spatial consistency, some spatial meaning in them, like images. There are other things you can think of: along the dimension of time, you can input an audio signal into a convolutional network. And so for every single convolutional layer, the input is a 3D volume and the output is a 3D volume. (I'm simplifying, because you could call it 4D too, but it's 3D.) There's height, width, and depth. So for an image, the height and the width are the height and the width of the image, and then the depth for a grayscale image is one, for an RGB image is three; for a ten-frame video of grayscale images, the depth is 10. It's just a volume, a three-dimensional matrix of numbers. And the only thing a convolutional layer does is take a 3D volume as input, produce a 3D volume as output, with some smooth function operating on the sums of the inputs, with parameters that you tune, that you try to optimize. That's it. These are Lego pieces that you stack together, in the same way as we talked about
before. So what are the types of layers that a convolutional network has? There's the input: for example, a color image of 32x32 will be a volume of 32x32x3. The convolutional layer takes advantage of the spatial relationships of the input neurons. A neuron in a convolutional layer is the same exact neuron as in a fully connected network, the regular network we talked about before, but it has a narrower receptive field; it's more focused. The inputs to a neuron in the convolutional layer come from a specific region of the previous layer, and the parameters are shared: you can think of this as a filter, because you slide it across the entire image, and those parameters are shared. So, thinking about two layers, as opposed to connecting every single pixel in the first layer to every single neuron in the following layer, you only connect the neurons in the input layer that are close to a given neuron in the output layer, and then you enforce the weights to be tied together spatially. What that results in is a filter: every single slice of the output you can think of as a filter that gets excited, for example, by a particular kind of edge. When it sees that edge in the image, it gets excited, whether in the top left of the image, the top right, the bottom left, or the bottom right. The assumption there is that a feature that's powerful for detecting a cat is just as important no matter where in the image it is. This allows you to cut away a huge number of connections between neurons, but it still boils down, on the right, to a neuron that sums a collection of inputs and applies weights to them.
The tunable spatial arrangement of the output volume relative to the input volume is controlled by three things. First, the number of filters: for every single "filter" you get an extra channel on the output. So if the input (let's talk about the very first layer) is 32x32x3, an RGB image of 32x32, and the number of filters is 10, then the resulting depth, the number of stacked channels in the output, will be 10. Second, the stride, which is the step size of the filter that you slide along the image; oftentimes that's just one, and it directly reduces the spatial size, the width and the height, of the output image. And third, there's a convenient thing that's often done: padding the image on the outsides with zeros, so that the input and the output have the same height and width.
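Those three hyperparameters determine the output volume through the standard formula (W - F + 2P)/S + 1 along each spatial dimension; a small sketch:

```python
def conv_output_size(width, height, field, stride, pad, n_filters):
    """Output volume of a conv layer: (W - F + 2P) / S + 1 along each
    spatial dimension; the output depth equals the number of filters."""
    out_w = (width - field + 2 * pad) // stride + 1
    out_h = (height - field + 2 * pad) // stride + 1
    return out_w, out_h, n_filters

# A 32x32x3 input, 5x5 filters, stride 1, zero-padding 2, 10 filters:
print(conv_output_size(32, 32, field=5, stride=1, pad=2, n_filters=10))
# -> (32, 32, 10): padding preserves width and height, depth becomes 10
```
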
So this is a visualization of convolution, and I encourage you, maybe offline, to think about what's happening; crudely, it's similar to the way human vision works, if there are any experts in the audience. The input here on the left is a collection of numbers, 0, 1, 2, and there are two filters, shown as W0 and W1. Those filters, shown in red, are the different weights applied, and each filter has a depth, just like the input, a depth of three, so there are three of them in each column. And so you slide that filter along the image, keeping the weights the same; this is the sharing of the weights. For your first filter, you pick the weights (this is an optimization problem) in such a way that it fires, it gets excited, for useful features and doesn't fire for features that aren't useful; and the second filter likewise fires for useful features and produces a signal on the output: a positive number means there's a strong feature in that region, a negative number means there isn't. But the filter is the same as it slides, and this allows for a drastic reduction in the parameters, so you can deal with inputs that are, for example, a thousand-by-thousand-pixel image, or video. There's a really powerful concept there: the spatial sharing of weights means there's spatial invariance in the features you're detecting. It allows you to learn from arbitrary images: you don't have to be concerned about pre-processing the images in some clever way; you just give it the raw image. There's another
operation: pooling. It's a way to reduce the size of the layers by, in this case, max pooling: taking a collection of outputs and summarizing that collection of pixels, such that the output of the pooling operation is much smaller than the input. The justification is that you don't need a high-resolution localization of exactly which pixel is important in the image: you don't need to know exactly which pixel is associated with the cat's ear or the cat's face, as long as you roughly know it's around that part. And that reduces a lot of the complexity in the operations.
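A minimal sketch of 2x2 max pooling on a made-up 4x4 feature map:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value in each window,
    halving the width and height of the layer."""
    h, w = feature_map.shape
    out = np.zeros((h // stride, w // stride))
    for i in range(0, h - size + 1, stride):
        for j in range(0, w - size + 1, stride):
            out[i // stride, j // stride] = feature_map[i:i + size, j:j + size].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool(x))
# [[6. 8.]
#  [3. 4.]]
```
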
Yes, question? The question was: when is it too much pooling; when do you stop pooling? So, pooling is a very crude operation. One thing you need to know is that it doesn't have any learnable parameters, so you can't learn anything clever about pooling: you're just picking, in this case with max pooling, the largest number. So you're reducing the resolution, and you're losing a lot of information. There's an argument that you're not losing that much information, as long as you're not pooling the entire image into a single value, and you're gaining training efficiency and memory: you're reducing the size of the network. So it's definitely a thing that people debate, and it's a parameter that you play with to see what works for you. Okay.
So how does this thing look as a whole, a convolutional neural network? The input is an image. There's usually a convolutional layer, then a pooling operation, another convolutional layer, another pooling operation, and so on. At the very end, if the task is classification, after the stack of convolutional layers and pooling layers, there are several fully connected layers: you go from the spatial convolutional operations to fully connecting every single neuron in a layer to the following layer, and you do this so that by the end you have a collection of neurons, each one associated with a particular class. So in what we looked at yesterday, where the input is an image of a number, 0 through 9, the output here would be 10 neurons. You boil down that image with a collection of convolutional layers, with one or two or three fully connected layers at the end, all leading to 10 neurons; each of those neurons' job is to get fired up when it sees its particular number, and to produce a low probability for the other ones. And this kind of process is how you get the 95-percent-level accuracy on the CIFAR-10 problem.
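That conv-pool-fully-connected stack can be sketched in plain numpy. The weights here are random and untrained (in a real network, backpropagation would learn the filters and the fully connected weights), and the 8x8 input is made up:

```python
import numpy as np

def conv2d(image, filt):
    """Valid convolution of a 2-D image with one filter, plus ReLU."""
    fh, fw = filt.shape
    oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (image[i:i + fh, j:j + fw] * filt).sum()
    return np.maximum(out, 0)

def max_pool(x, s=2):
    """2x2 max pooling, halving width and height."""
    h, w = (x.shape[0] // s) * s, (x.shape[1] // s) * s
    return x[:h, :w].reshape(h // s, s, w // s, s).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# conv -> pool for each filter, flatten, fully connected -> 10 classes
rng = np.random.default_rng(0)
image = rng.random((8, 8))                                     # toy input
filters = [rng.standard_normal((3, 3)) for _ in range(4)]      # 4 filters
feature_maps = [max_pool(conv2d(image, f)) for f in filters]   # 4 maps, 3x3
features = np.concatenate([m.ravel() for m in feature_maps])   # flatten: 36
W = rng.standard_normal((10, features.size))                   # fully connected
probs = softmax(W @ features)                                  # one per class
print(probs.sum())  # the class probabilities sum to 1
```
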
This here is the ImageNet dataset I mentioned; this is how you take this image of a leopard, or of a container ship, and produce a probability that it is a container ship or a leopard. Also shown there are the outputs for the other nearest classes, in terms of their confidence. Now, you can use the same
exact operation by chopping off the fully connected layers at the end: as opposed to mapping from an image to a prediction of what's contained in the image, you map from the image to another image, and you can train that output image to be one that gets excited spatially, meaning it gives you a value close to one for areas of the image that contain the object of interest, and a low number for areas of the image that are unlikely to contain it. And so from this you can go, on the left, from an original image of a woman on a horse to a segmented image, knowing where the woman is, where the horse is, and where the background is. The same process can be done for
detecting objects: you can segment the scene into a bunch of candidates for interesting objects, then go through those candidates one by one and perform the same kind of classification as in the previous step, where the input is an image and the output is a classification. And through this process of hopping around an image, you can figure out exactly the best way to segment the cow out of the image. It's called object detection. Okay, so how can
these magical convolutional neural networks help us with driving? This is a video of the forward roadway, from a dataset that we'll look at, that we've collected from a Tesla. But first, let me look at driving briefly: the general driving task, from the human perspective. On average, an American driver in the United States drives 10,000 miles a year, a little more for rural, a little less for urban. There are about 30,000 fatal crashes and 32,000-plus, sometimes as high as 38,000, fatalities a year; this includes car occupants, pedestrians, bicyclists, and motorcycle
riders. This may be a surprising fact, but in a class on self-driving cars we should remember it: ignore the 59.9%, that's "other"; the most popular cars in the United States are pickup trucks, the Ford F-series, the Chevy Silverado, the Ram. It's an important point that we're still married to wanting to be in control. And so one of the interesting cars that we look at, the car that the dataset we provide to the class is collected from, is a Tesla. It's the one that sits at the intersection of the Ford F-150 and the cute little Google self-driving car on the right. It's fast, it gives you a feeling of control, but it can also drive itself hundreds of miles on the highway if need be. It allows you to press a button, and the car takes over. It's a fascinating trade-off, transferring control from the human to the car: it's a transfer of trust, and it's a chance for us to study the psychology of human beings as they relate to machines, at 60-plus miles an
hour. In case you're not aware, a little summary of human beings: we're distracted. We would like to text, use the smartphone, watch videos, groom, talk to passengers, eat, drink. Texting: 169 billion texts were sent in the US every single month in 2014. On average, five seconds are spent with our eyes off the road while texting. Five seconds. That's the opportunity for automation to step in. More than that, there's what NHTSA refers to as the four D's: drunk, drugged, distracted, and drowsy, each one an opportunity for automation to step in. Drunk driving stands to benefit significantly from automation,
perhaps. So, the miles: let's look at the miles, the data there. Three trillion, that's 3 million million, miles are driven every year. And Tesla Autopilot, our case study for this class, has driven in full Autopilot mode, driving by itself, 300 million miles as of December. As for fatalities: for human-controlled vehicles it's about one per 90 million miles, so about 30-plus thousand fatalities a year, and currently, under Tesla Autopilot, there's one fatality. There are a lot of ways you can tear that statistic apart, but it's one to think about: already, perhaps, automation results in safer driving.
The thing is, we don't understand automation, because we don't have the data. We don't have the data on the forward-roadway video, we don't have the data on the driver, and we just don't have that many cars on the road today that drive themselves. So we need a lot of data. We'll provide some of it to you in the class, and as part of our research at MIT we're collecting huge amounts of it, of cars driving themselves, and in collecting that data is how we get to understanding. So, talking about the data, and what we will be training our algorithms on: here is a Tesla Model S and Model X. We have instrumented 17 of them, have collected over 5,000 hours and 70,000 miles, and I'll talk about the cameras that we put in
them. We're collecting video of the forward roadway. This is a highlight of a trip from Boston to Florida by one of the people driving a Tesla. What's also shown, in blue, is the amount of time that Autopilot was engaged: currently zero minutes, and then it grows and grows, for prolonged periods of time; for hundreds of miles, people engage Autopilot. Out of 1.3 billion miles driven in a Tesla, 300 million are on Autopilot; you do the math, whatever that is, roughly 25%. So we are collecting data of the forward roadway and of the driver; we have two cameras on the driver. What we're providing with the class, for privacy considerations, is epochs of time of the forward roadway. The cameras used to record are your regular webcam, the workhorse of the computer vision community, the C920, with some special lenses on top. Now, what's special about these webcams? Nothing that costs 70 bucks can be that good, right? What's special about them is that they do on-board compression, allowing you to collect huge amounts of data and use reasonably sized storage capacity to store that data and train your algorithms
on. So what, on the self-driving side, do we have to work with? How do we build a self-driving car? There are the sensors looking outside: radar, lidar, vision, audio, all helping you detect the objects in the external environment, localize yourself, and so on. And there are the sensors facing inside: a visible-light camera, audio again, and an infrared camera to help detect pupils. So we can decompose the self-driving car task into four steps. Localization: answering "where am I?" Scene understanding: using the texture and information of the scene around you to interpret the identity of the different objects in the scene, and the semantic meaning of those objects and their movement. Movement planning: once you've figured all that out, found all the pedestrians, found all the other cars, how do I navigate through this maze, a clutter of objects, in a safe and legal way? And driver state: how do I detect, using video of the driver or other information, their emotional state or their distraction level?

Yes, question? Yes, that's a real-time figure from lidar. Lidar is the sensor that provides you the 3D point cloud of the external scene. Lidar is a technology used by most folks working on self-driving cars to give you a strong ground truth of the objects; it's probably the best sensor we have for getting 3D information, the least noisy 3D information, about the external environment.
Question: so Autopilot is always changing? One of the most amazing things about this vehicle is that updates to Autopilot come in the form of software, so the amount of time it's available changes; it has become more conservative with time. But this is one of the earlier versions, and the second line, in yellow, shows how often Autopilot was available but not turned on. The total driving time was 10 hours; Autopilot was available for 7 hours and was engaged for an hour. This particular person is a responsible, or rather a more cautious, driver, because what you see is that it's raining and Autopilot is still available, but
the comment was that you shouldn't trust
that one fatality number as an
indication of safety because the drivers
elect to only engage the system when
they are uh when it's safe to do so it's
a totally open uh there's a lot bigger
Arguments for that number than just that
one the the the the question
is whether that's a bad thing so maybe
we can trust human beings to engage You
know despite the poorly filmed YouTube
videos despite the hype in the media
you're still a human being riding a 60
mes an hour in a metal box with your
life in the line you won't engage the
system unless you know it's completely
safe unless you've built up a
relationship with it it's not all the
stuff you see where a person gets in the
back back of a Tesla and starts sleeping
or is playing chess or whatever that's
all for YouTube the reality is when it's
just you in the car it's still your life
on the line and so you're going to do
the responsible thing unless perhaps
you're a teenager and so on but that
never changes no matter what you're
in
So the question was: what do you need to see or sense about the external environment to be able to successfully drive? Do you need lane markings? What are the landmarks based on which you do the localization and the navigation? And that depends on the sensors. So a Google self-driving car in sunny California depends on lidar to map the environment in a high-resolution way, in order to be able to localize itself based on lidar. Now, I don't know the details of exactly where lidar fails, but it's not good with rain, it's not good with snow, it's not good when the environment is changing. What snow does is it changes the visual appearance, the reflective texture, of the surfaces around us. Human beings are still able to figure stuff out, but a car that's relying heavily on lidar won't be able to localize itself using the landmarks it previously detected, because they look different now with snow. Computer vision can help us with lanes, or with following a car. The two landmarks that we use to stay in the lane are the car in front of you and the lane markings on either side. That's the nice thing about our roadways: they're designed for human eyes. So you can use computer vision for lanes, and for cars in front, to follow them. And there's radar, which is a crude but reliable source of distance information that allows you to not collide with metal objects. So all of that together, depending on what you want to rely on more, gives you a lot of information. The question is, when the messy complexity of real life occurs, how reliable will it be, in the urban environment and so on? So,
on so
localization how can deep learning help
so first first let's just
quick summary of visual odometry it's
using a monocular or stereo input of
video
images to determine your orientation in
the
world the orientation in this case of a
vehicle to in the in the frame of the
world and all you have to work with is a
video of the forward worldway
and with stereo you get a little extra
information of how far away different
objects
are. And this is where one of our speakers on Friday will talk about his expertise: SLAM, simultaneous localization and mapping. This is a very well studied and well understood problem of detecting unique features in the external scene and localizing yourself based on the trajectory of those unique features. When the number of features is high enough, it becomes an optimization problem: you know this particular lane marking moved a little bit from frame to frame, so you can track that information and fuse everything together in order to estimate your trajectory through three-dimensional space. You also have other sensors to
help you out. You have GPS, which is pretty accurate, not perfect but pretty accurate; it's another signal to help you localize yourself. You also have an IMU: the accelerometer tells you your acceleration, and from the gyroscope and the accelerometer you have the six-degrees-of-freedom information about how the moving object, the car, is navigating through
space. So you could do that the old-school way, with optimization, given a unique set of features like SIFT features. With stereo input, that involves undistorting and rectifying the images. You have two images, and from the two images you compute the depth map: for every single pixel, your best estimate of the depth of that pixel, its three-dimensional position relative to the camera. That's where you compute what's called the disparity map, from which you get the distance. Then you detect unique, interesting features in the scene; SIFT is a popular algorithm for detecting unique features. And then, over time, you track those features, and that tracking is what allows you, through vision alone, to get information about your trajectory through three-dimensional space. You estimate that trajectory.
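To make the disparity step concrete, here is a minimal sum-of-absolute-differences block-matching sketch in NumPy. This is an illustration, not the pipeline used in the lecture: real systems (e.g. OpenCV's StereoBM) add rectification, subpixel refinement, and filtering. Depth then follows from the pinhole stereo relation depth = focal_length * baseline / disparity.

```python
import numpy as np

def disparity_sad(left, right, max_disp=16, block=5):
    """Brute-force block matching: for each pixel in the left image,
    find the horizontal shift into the right image that minimizes the
    sum of absolute differences (SAD) over a small block."""
    h, w = left.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y-half:y+half+1, x-half:x+half+1]
            costs = [np.abs(patch - right[y-half:y+half+1,
                                          x-d-half:x-d+half+1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = int(np.argmin(costs))
    return disp

def depth_from_disparity(disp, focal_px, baseline_m):
    """Pinhole stereo: z = f * B / d (infinite depth at zero disparity)."""
    with np.errstate(divide="ignore"):
        return np.where(disp > 0, focal_px * baseline_m / disp, np.inf)
```

Shifting one image horizontally by a known amount and running `disparity_sad` recovers that shift in the interior of the frame, which is the basic sanity check for any stereo matcher.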
There are a lot of assumptions, assumptions that bodies are rigid. So if a large object passes right in front of you, you have to figure out what that was; you have to figure out which objects in the scene are mobile and which are stationary. Or you can cheat, which is what we'll talk about, and do it using neural networks, end to end. Now, what does "end to end" mean? This will come up a bunch of times throughout this class and today. End to end means, and I refer to it as cheating because it takes away a lot of the hard work of hand-engineering features, that you take the raw input of whatever sensors, in this case stereo input, a sequence of image pairs coming from a stereo vision camera, and the output is an estimate of your trajectory through space. So as opposed to doing the hard work of SLAM, of detecting unique features, of localizing yourself, of tracking those features and figuring out what your trajectory is, you simply train the network with some ground truth that you have from a more accurate sensor, like lidar: you train it on a set of inputs that are stereo vision inputs, where the output is the trajectory through space. You have a separate convolutional neural network for the velocity and for the orientation. And this works pretty well; unfortunately, not quite well enough, and John Leonard will talk about that. SLAM is one of the places where deep learning has not been able to outperform the previous approaches. Where deep learning really helps is the scene understanding part: interpreting the objects in the scene, detecting the various parts of the scene, segmenting them, and, with optical flow, determining their
movement. So previous approaches for detecting objects, like the traffic signal classification and detection that we have the TensorFlow tutorial for, would use Haar-like features or other types of features that are hand-engineered from the images. Now we can use convolutional networks to replace the extraction of those features.
And there's a TensorFlow implementation of SegNet, which takes the exact same kind of neural network that I talked about; it's the same thing. The beauty is that you just apply similar types of networks to different problems, and depending on the complexity of the problem, you can get quite amazing performance. In this case the network is fully convolutional, meaning the input is an image, a single monocular image, and the output is also an image: a segmented image where the colors indicate your best pixel-by-pixel estimate of what object is in that part of the scene. This is not using any stereo information, it's not using any temporal information, so it's processing every single frame separately, and it's able to separate the road from the trees, from the pedestrians, from the other cars, and so on.
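For those curious, SegNet's signature mechanism is that its decoder upsamples using the indices remembered from the encoder's max-pooling, so boundaries land back where they came from. Here is a toy NumPy sketch of that pool/unpool pair, my own illustration of the idea rather than the authors' code; the real network wraps this around stacks of convolutions.

```python
import numpy as np

def maxpool2x2(x):
    """2x2 max-pool that also returns, for each output cell, which of
    the four input positions held the max (SegNet stores these)."""
    h, w = x.shape
    out = np.zeros((h // 2, w // 2), dtype=x.dtype)
    idx = np.zeros((h // 2, w // 2), dtype=np.int64)
    for i in range(h // 2):
        for j in range(w // 2):
            block = x[2*i:2*i+2, 2*j:2*j+2]
            k = int(np.argmax(block))      # 0..3, row-major within block
            out[i, j] = block.flat[k]
            idx[i, j] = k
    return out, idx

def unpool2x2(y, idx):
    """SegNet-style unpooling: place each value back at the position
    recorded during pooling; the other three cells stay zero."""
    h, w = y.shape
    x = np.zeros((h * 2, w * 2), dtype=y.dtype)
    for i in range(h):
        for j in range(w):
            k = idx[i, j]
            x[2*i + k // 2, 2*j + k % 2] = y[i, j]
    return x
```

The point of carrying the indices, rather than naively upsampling, is that the decoder can restore activations to their exact pre-pooling locations, which sharpens the per-pixel label map.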
This is intended to lie on top of a radar- or lidar-type technology that's giving you the three-dimensional (or stereo vision) information about the scene; you're sort of painting that scene with the identity of the objects that are in it, your best estimate of it. This is
something I'll talk about tomorrow: recurrent neural networks. We can use recurrent neural networks, which work with temporal data, to process video and also to process audio. In this case, shown on the bottom is a spectrogram of audio for a wet road and for a dry road. You can look at that spectrogram as an image and process it in a temporal way using recurrent networks: just slide across it and keep feeding it to the network. And it does incredibly well on the simple tasks, certainly on dry road versus wet road.
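To be concrete about what a spectrogram is: it's just a sequence of windowed FFT magnitudes, which is why you can treat it as an image and slide a network across it in time. A minimal NumPy sketch, for illustration only; a real pipeline would use a proper STFT (e.g. from scipy.signal) plus log scaling and mel filters.

```python
import numpy as np

def spectrogram(signal, win=256, hop=128):
    """Slide a Hann window across the signal and take the FFT magnitude
    of each frame. Rows are frequency bins, columns are time frames."""
    window = np.hanning(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames, axis=1)   # shape: (win//2 + 1, n_frames)

# A pure tone shows up as a single bright horizontal line, i.e. one
# frequency bin dominating every time frame.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
```

A 1 kHz tone sampled at 8 kHz with a 256-sample window lands exactly in bin 1000 * 256 / 8000 = 32, which is an easy way to check the axes are oriented the way you think.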
This is a subtle but very important task, and there are many like it: knowing the texture, the quality, the characteristics of the road, wetness being a critical one. When it's not raining but the road is still wet, that information is very important. Okay, so for movement planning, the same kind of approach applies. On the right is work from one of our other speakers, Sertac Karaman. The same approach we're using to solve deep traffic through friendly competition is the same one we can use for what Chris Gerdes does with his race cars: planning trajectories in high-speed movement along complex curves. So we can solve that control problem using optimization, or we can use reinforcement learning, running tens of millions, hundreds of millions of times through the simulation of taking that curve, and learning which trajectory both optimizes the speed at which you take the turn and the safety of the vehicle. Exactly the same thing that you're using for
deep traffic. And for driver state, this is what we'll talk about next week: all the fun face stuff, eyes, face, emotion. We have video of the driver, video of the driver's body, video of the driver's face. On the left is one of the TAs in his younger days; still looks the same, there he is. So in that particular case you're doing one of the easier problems, which is detecting where the head and the eyes are positioned, the head and eye pose, in order to determine what's called the gaze of the driver, where the driver is looking, the glance. And so, shown here, and we'll talk about these problems from left to right: on the left, in green, are the easier problems, and in red are the harder ones, from the computer vision aspect. On the left are body pose and head pose; the larger the object, the easier it and its orientation are to detect. And then there's pupil diameter: detecting the pupil, its characteristics, its position, its size. And there are microsaccades, things that happen at one-millisecond frequency, the tremors of the eye. All of this is important information for determining the state of the driver; some of it is possible with computer vision, some is not. This is something that we'll talk about, I think on Thursday: the detection of where the driver is looking. So this is a bunch of the cameras that we have in a Tesla; this is Dan driving a Tesla, and we're detecting exactly which of six regions he's looking at. We've converted it into a classification problem: left, right, rear-view mirror, instrument cluster, center stack, or forward roadway. So we have to determine, out of those six categories, which direction the driver is looking. This is what matters for driving: we don't care about the exact x-y-z position of where the driver is looking, we care whether they're looking at the road or not. Are they looking at the cell phone in their lap, or are they looking at the forward roadway? And we'll be able to answer that pretty effectively using convolutional neural
networks. You can also look at emotion, using CNNs to extract it, again converting the complex world of emotion into a binary problem: frustrated versus satisfied. This is a video of drivers interacting with a voice navigation system; if you've ever used one, you know it may be a source of frustration for folks. And this is self-reported. Driver emotion is one of the hard problems. If you work in what's called affective computing, the field of studying emotion from the computational side, you know that the annotation side of emotion is a really challenging one. Getting the ground truth: well, okay, this guy is smiling, so can I label that as happy? He's frowning, does that mean he's sad? Most affective computing folks do just that. In this case, with self-report, we asked people how frustrated they were on a scale of 1 to 10. Dan, up top, reported a one, so not frustrated; he's satisfied with the interaction. The other driver reported a nine; he was very frustrated with the interaction. Now, what you notice is that there's a very cold, stoic look on Dan's face, which here is an indication of happiness, and in the case of frustration the driver is smiling. So this is a good reminder that we can't trust our own human instincts in engineering features and engineering the ground truth. We have to trust the data, trust the ground truth that we believe is the closest reflection of the actual semantics of what's going on in the scene. Okay, so end-to-end driving: getting to the project and the tutorial. If driving is like a conversation (and thank you to someone for clarifying that this is the Arc de Triomphe in Paris in this video), if driving is like a natural language conversation, then we can think of end-to-end driving as skipping the entire Turing test component and treating it as natural language generation. So what we do is take the external sensors as input and output the control of the vehicle, and the magic happens in the middle: we replace that entire step with a neural network. The TA told me not to include this image because it's the cheesiest he's ever seen. I apologize. Thank you, thank you. I regret
nothing. So this is to show our path to self-driving cars, but it's to explain a point: we have a large dataset of ground truth. If we were to formulate the driving task as simply taking external images and producing steering, acceleration, and braking commands, then we have a lot of ground truth. We have a large number of drivers on the road every day, driving, and therefore collecting our ground truth for us, because they're an interested party in producing the steering commands that keep them alive. Therefore, if we were to record that data, it becomes ground truth. So if it's possible to learn this, what we can do is collect data from manually controlled vehicles and use that data to train an algorithm to control a self-driving
vehicle. Okay, so one of the first groups that did this is NVIDIA, where they actually train on an external image, the image of the forward roadway, and a simple vanilla convolutional neural network, which I'll briefly outline, takes an image in and produces a steering command out. They're able, to some degree, to successfully learn to navigate basic turns and curves, and even stop or make sharp turns at a T-intersection. So this network is simple: there's input on the bottom, output up top. The input is a 66x200-pixel RGB image, shown on the left as the raw input; you crop it a little bit and resize it down to 66x200. That's what we have in the code as well, in the two versions of the code we provide for you, both the one that runs in the browser and the one in TensorFlow. It has a few convolutional layers, a few fully connected layers, and an output. This is a regression network: it's producing not a classification of cat versus dog, it's producing a steering command, how to turn the steering wheel. That's it; the rest is magic. And we train it on human input. What we have here as a project is
an implementation of this system in ConvNetJS that runs in your browser. This is the tutorial to follow and the project to take on. So, unlike the deep traffic game, this is reality: this is real input from real vehicles. You can go to this link. The demo went wonderfully yesterday, so let's see, maybe we'll be two for two. There's a tutorial, and then the actual simulation is on the DeepTeslaJS page. I apologize, everyone's going there now, aren't they? Does it work on a phone? It does. Great.
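Going back to the NVIDIA architecture for a moment: with valid (no-padding) convolutions, each layer's output size is floor((in - kernel) / stride) + 1, so the whole network's dimensions can be sanity-checked with a few lines of arithmetic. This follows the layer list from NVIDIA's paper; note the paper quotes 1164 neurons after flattening, while 1x18x64 strictly gives 1152, so treat the numbers as illustrative.

```python
def conv_out(size, kernel, stride):
    # 'valid' convolution output size: floor((size - kernel) / stride) + 1
    return (size - kernel) // stride + 1

h, w = 66, 200                         # input image: 66x200 RGB
layers = [(5, 2)] * 3 + [(3, 1)] * 2   # three 5x5/stride-2, two 3x3/stride-1
for kernel, stride in layers:
    h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
print(h, w, h * w * 64)                # final feature map and flattened size
```

Working through it: 66x200 shrinks to 31x98, 14x47, 5x22, 3x20, and finally 1x18, and with 64 channels that flattens into the first fully connected layer before the 100-50-10-1 stack that ends in the single steering-angle neuron.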
Again, similar structure: up top is the visualization of the loss function as the network is learning, and it's always training. Next is the input for the layout of the network: there's the specification of the input, 200x66; there's a convolutional layer; there's a pooling layer; and the output is a regression layer, a single neuron. This is a tiny version of the NVIDIA architecture. And then you can visualize the operation of this network on real video. The actual wheel value produced by the driver or by the autopilot system is in blue, and the output of the network is in white. What's indicated in green is the cropping of the image that is then resized to produce the 66x200 input to the network. So, once again, amazingly, this is running in your browser, training on real-world video. So you can get in your car today, record video, and maybe teach a neural network to drive like you. We have the code in ConvNetJS and TensorFlow to do that, and a
tutorial. Well, let me briefly describe some of the work here. The input to the network is a single image (this is for DeepTeslaJS), and the output is a steering wheel value between -20 and 20; that's in degrees. We record, like I said, thousands of hours, but we provide publicly ten video clips of highway driving from a Tesla; half are driven by autopilot, half are driven by a human. The wheel values are extracted from a perfectly synchronized CAN bus: we're collecting all of the messages from CAN, which contain the steering wheel value, and that's synchronized with the video. We crop out the window, the green one I mentioned, and then provide that as the input to the
network. So this is a slight difference from deep traffic, with the red car weaving through traffic, because here there's the messy reality of real-world lighting conditions, and your task, for the most part in this simple steering task, is to stay inside the lane, to stay within the lane markings, and, in an end-to-end way, learn to do just that. So ConvNetJS is a JavaScript implementation of CNNs, of convolutional networks. It supports fairly arbitrary networks; I mean, all neural networks are simple, but because it runs in JavaScript it's not utilizing the GPU, so the larger the network, the more it's going to be weighed down computationally. Now, unlike deep traffic, this isn't a competition, but if you are a student registered for the course, you still do have to submit the code, you still have to submit your own car as part of the class. Yes, the
question. So the question was about the amount of data that's needed: is there a general rule of thumb for the amount of data needed for a particular task, in driving for example? It's a good question. Generally, like I said, neural networks are good memorizers, so you have to have every case you're interested in represented in the training set, as much as possible. That means, in general, if you want to classify the difference between cats and dogs, you want to have at least 1,000 cats and 1,000 dogs, and then you do really well. The problem with driving is twofold. One is that most of the time driving looks the same, and the stuff you really care about is when driving looks different: it's all the edge cases. What we're not good at with neural networks is generalizing from the common case to the edge cases, to the outliers. So just because you can stay on the highway for thousands of hours successfully doesn't mean you can avoid a crash when somebody runs in front of you on the road. And the other part with driving is that the accuracy you have to achieve is really high. For cat versus dog, a life doesn't depend on your error; for your ability to steer a car inside of a lane, you'd better be very close to 100%
accurate. There's a box for designing the network; there's a visualization of the metrics measuring the performance of the network as it trains; there's a layer visualization of what features the network is extracting at every convolutional layer and every fully connected layer; there's the ability to restart the training, and to visualize the network performing on real video. There's the input layer, the convolutional layers, the video visualization.
An interesting tidbit: on the bottom right is a barcode that Will has ingeniously designed. How do I clearly explain why this is so cool? It's a way to synchronize multiple streams of data together through the video itself. It's very easy, for those who have worked with multimodal data, where there are several streams of data, for them to become unsynchronized, especially when a big component of training a neural network is shuffling the data: you have to shuffle the data in clever ways, so you're not overfitting to any one little aspect of the video, and yet keep the data perfectly synchronized. So what he did, instead of doing the hard work of connecting the steering wheel data and the video, is actually putting the steering wheel value on top of the video as a barcode.
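The idea, as I understand it, is simple: stamp the steering value into a strip of black-and-white cells on each frame, so the video and the signal can never drift apart no matter how frames are shuffled. Here's a toy sketch of the trick in NumPy, my own illustration rather than Will's actual code.

```python
import numpy as np

def stamp_wheel(frame, value, bits=16, cell=4):
    """Encode a signed integer (e.g. a steering value) as a row of
    black/white barcode cells along the bottom edge of a frame."""
    coded = frame.copy()
    unsigned = value + (1 << (bits - 1))   # shift to non-negative range
    for b in range(bits):
        bit = (unsigned >> b) & 1
        coded[-cell:, b*cell:(b+1)*cell] = 255 * bit
    return coded

def read_wheel(frame, bits=16, cell=4):
    """Recover the value by thresholding each barcode cell."""
    unsigned = 0
    for b in range(bits):
        patch = frame[-cell:, b*cell:(b+1)*cell]
        if patch.mean() > 127:
            unsigned |= 1 << b
    return unsigned - (1 << (bits - 1))
```

Because the label rides inside the pixels, any frame-level shuffling, cropping of clips, or re-encoding that preserves the strip keeps the pair intact by construction.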
The final result is that you can watch the network operate, and over time it learns more and more to steer correctly. I'll fly through this a little bit in the interest of time and just summarize some of the things that you can play with in terms of tutorials, and then let you guys go. This is the same kind of process, end-to-end driving, with TensorFlow. We have code available on GitHub, just put up on my GitHub under deep Tesla, that takes in a single video, or an arbitrary number of videos, trains on them, and produces a visualization that compares the actual steering wheel and the predicted steering wheel. When the predicted steering wheel agrees with the human driver or with the autopilot system, it lights up green, and when it disagrees, it lights up red, hopefully not too
often. Again, this is some of the details of how exactly that's done in TensorFlow. This is a vanilla convolutional neural network: you specify a bunch of layers, convolutional layers, a fully connected layer; you train the model, iterating over batches of images; and you run the model over a test set of images and get this result.
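The training loop being described follows the standard pattern; stripped of the TensorFlow specifics, it's just shuffle, batch, forward pass, loss, gradient step. A framework-free sketch, with a linear model standing in for the CNN and synthetic data standing in for (image, steering angle) pairs, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake regression data with a known linear ground truth.
true_w = rng.normal(size=20)
X = rng.normal(size=(500, 20))
y = X @ true_w

w = np.zeros(20)                      # model parameters
lr, batch_size = 0.05, 32
for epoch in range(200):
    order = rng.permutation(len(X))   # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        pred = X[idx] @ w
        grad = X[idx].T @ (pred - y[idx]) / len(idx)   # d(MSE)/dw
        w -= lr * grad

mse = np.mean((X @ w - y) ** 2)       # final training error
```

Swapping the linear model for a convolutional network and the mean-squared-error gradient for backpropagation gives you, structurally, the same loop the TensorFlow code runs.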
We also have a tutorial, an IPython notebook, up. This is perhaps the best way to get started with convolutional networks in terms of our class: it looks at the simplest image classification problem, traffic light classification. So we have these images of traffic lights, and we did the hard work of detecting them for you. Now you have to build a convolutional network that figures out the concept of color and gets excited when it sees red, yellow, or green.
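As a point of comparison for that assignment, here's a trivial hand-engineered baseline, a nearest-color-prototype classifier in NumPy. The prototype RGB values below are my own rough guesses, not from the course materials; the assignment's whole point is that a CNN learns the concept of color from data instead of having it hand-coded like this.

```python
import numpy as np

CLASSES = ["red", "yellow", "green"]
# Rough RGB prototypes for a lit lamp (illustrative values only).
PROTOTYPES = np.array([[200, 40, 40],    # red
                       [210, 190, 40],   # yellow
                       [40, 200, 60]])   # green

def classify_light(crop):
    """Average the brightest ~10% of pixels (the lit lamp) and pick
    the class whose prototype color is nearest in RGB space."""
    pixels = crop.reshape(-1, 3).astype(np.float64)
    top = np.argsort(pixels.sum(axis=1))[-max(1, len(pixels) // 10):]
    mean = pixels[top].mean(axis=0)
    return CLASSES[int(np.argmin(((PROTOTYPES - mean) ** 2).sum(axis=1)))]
```

A baseline like this breaks down quickly under real-world lighting, glare, and color cast, which is exactly the gap the learned three-class classifier is meant to close.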
If anyone has questions, I welcome those; you can stay after class if you have any concerns with Docker, with TensorFlow, or with how to win deep traffic. Just stay after class, or come by Friday, 5 to 7. See you guys tomorrow.