Transcript
1L0TKZQcUtA • MIT 6.S094: Introduction to Deep Learning and Self-Driving Cars
Kind: captions
Language: en
All right, hello everybody. Hopefully you can hear me well. Yes? Great. So welcome to course 6.S094, Deep Learning for Self-Driving Cars. We will introduce to you the methods of deep learning, of deep neural networks, using the guiding case study of building self-driving cars. My name is Lex Fridman. You get to listen to me for the majority of these lectures, and I am part of an amazing team, with some brilliant TAs: Dan Brown (you guys want to stand up? okay, they're in the front row), William Angell, Spencer Dodd, and, all the way in the back, the smartest and the tallest person I know, Benedikt Jenik. So what you see there on the left of the slide is a visualization of one of the two projects, one of the two simulations, games, that we'll get to go through. We use it as a way to teach you about deep reinforcement learning, but also as a way to excite you, by challenging you to compete against others, if you wish, to win a special prize yet to be announced, a super-secret
prize. So you can reach me and the TAs at deepcars@mit.edu if you have any questions about the tutorials, about the lectures, about anything at all. The website, cars.mit.edu, has the lecture content, code, tutorials. Like today: the lecture slides for today are already up in PDF form. The slides themselves, if you want to see them, just email me, but they're over a gigabyte in size, because they're very heavy in videos, so I'm just posting the PDFs. And there will be lecture videos available a few days after each lecture is given. Speaking of which, there is a camera in the back; this is being videotaped and recorded, but for the most part the camera is just on the speaker, so you shouldn't have to worry. If that kind of thing worries you, then you could sit on the periphery of the classroom, or maybe I suggest sunglasses and a fake mustache; that would
be a good idea. There is a competition for the game that you see on the left; I'll describe exactly what's involved. In order to get credit for the course, you have to design a neural network that drives the car just above the speed limit, 65 miles per hour, but if you want to win, you need to go a little faster than that.
So, who is this class for? You may be new to programming, new to machine learning, new to robotics, or you're an expert in those fields but want to go back to the basics. What you will learn is an overview of deep reinforcement learning, convolutional neural networks, recurrent neural networks, and how these methods can help improve each of the components of autonomous driving: perception (visual perception), localization, mapping, control, planning, and the detection of driver
state. Okay, two projects. Code name DeepTraffic is the first one. In this particular formulation of it, there are seven lanes. It's a top view; it looks like a game, but I assure you, it's very serious. The agent in red, the car in red, is being controlled by a neural network, and we'll explain how you can control and design the various aspects, the various parameters, of this neural network. And it learns in the browser: we're using ConvNetJS, which is a library programmed by Andrej Karpathy in JavaScript. So, amazingly, we live in a world where you can train a neural network in your browser in a matter of minutes, and we'll talk about how to do that. The reason we did this is so that there are very few requirements to get you up and started with your own networks. So, in order to complete this project for the course, you don't need anything except a Chrome browser, and to win the competition, you don't need anything except the Chrome browser.
browser the second project code named
deep
Tesla or
Tesla is using uh data from a Tesla
vehicle of the forward roadway and using
an learning taking the image the and
putting it into convolutional neural
networks that directly Maps a regressor
that maps to a steering angle so all it
takes is a single image and it predicts
a steering angle for the car and you we
have data for the car itself and you get
to build a neural network that tries to
do better tries to steer better or at
least as good as the
car. Okay, let's get started with the question, with the thing that we understand so poorly at this time, because it's so shrouded in mystery, but it fascinates many of us, and that's the question of what is intelligence. This is from a March 1996 Time Magazine, and the question "Can machines think?" is answered right below: they already do. So what, if anything, is special about the human mind? It's a good question for 1996, a good question for 2016, '17, now, and the future. And there are two
ways to ask that question. One is the special-purpose version: can an artificial intelligence system achieve a well-defined, formally defined, finite set of goals? And this little diagram is from a book that got me into artificial intelligence as a bright-eyed high school student, Artificial Intelligence: A Modern Approach. This is a beautifully simple diagram of an intelligent system: it exists in an environment; it has a set of sensors that do the perception; it takes those sensors in, does something magical (there's a question mark there), and, with a set of effectors, acts in the world, manipulates objects in that world. And
so, the special-purpose version: under this formulation, as long as the environment is formally defined, well defined, as long as the set of goals is well defined, as long as the set of actions, the sensors, and the way the perception carries itself out are well defined, we have good algorithms, which we'll talk about, that can optimize for those goals. The question is: if we inch along this path, will we get closer to the general formulation, to the general-purpose version of what artificial intelligence is? Can it achieve a poorly defined, unconstrained set of goals, with an unconstrained, poorly defined set of actions, and unconstrained, poorly defined utility functions, rewards? This is what human life is about; this is what we do pretty well most days: exist in an undefined, full-of-uncertainty
world. So, okay, we can separate tasks into three different categories. Formal tasks: this is the easiest. It doesn't seem so, it didn't seem so at the birth of artificial intelligence, but that's in fact true if you think about it. The easiest is the formal tasks: playing board games, theorem proving, all the kinds of mathematical logic problems that can be formally defined. Then there are the expert tasks. This is where a lot of the exciting breakthroughs have been happening, where machine learning methods, data-driven methods, can help aid or improve on the performance of human experts: medical diagnosis, hardware design, scheduling. And then there is the thing that we take for granted, the "trivial" thing, the thing that we do so easily every day when we wake up in the morning: the mundane tasks of everyday speech, of written language, of visual perception, of walking (which, as we'll talk about in today's lecture, is a fascinatingly difficult task), of object
manipulation. So the question that we're asking here, before we talk about deep learning, before we talk about the specific methods, is: we really want to dig in and try to see, what is it about driving? How difficult is driving? Is it more like chess, which you see on the left there, where we can formally define a set of lanes, a set of actions, and formulate it as, you know, five actions: you can change a lane, you can avoid obstacles; you can formally define an obstacle; you can formally define the rules of the road? Or is there something about driving similar to natural language, to everyday conversation, that requires a much higher degree of reasoning, of communication, of learning, of existing in this underactuated space? Is it a lot more than just left lane, right lane, speed up, slow down?
So let's look at it as a chess game. Here are the chess pieces: what are the sensors we get to work with on an autonomous vehicle? (We'll get a lot more in depth on this, especially with the guest speakers who built many of these.) There are the range sensors, radar and lidar, that give you information about obstacles in the environment, that help localize the obstacles in the environment. There's the visible-light camera and stereo vision, which give you texture information, which help you figure out not just where the obstacles are but what they are, help to classify them, help to understand their subtle movements. Then there is the information about the vehicle itself, about the trajectory and the movement of the vehicle, that comes from the GPS and IMU sensors. And there is the rich state of the vehicle itself, what it is doing, what all the individual systems are doing, that comes from the CAN
network. And there is one of the less studied, but fascinating to us on the research side, sensors: audio. The sounds of the road provide rich context: a wet road, the sound of a road when it has stopped raining but is still wet, the sound of a screeching tire, honking. These are all fascinating signals as well. And a focus of the research in our group, the thing that's really under-investigated, is the internal-facing sensors: driver sensing, the state of the driver. Where are they looking? Are they sleepy? What's their emotional state? Are they in the seat at all? That comes from the visual information and the audio
information. More than that, here are the tasks, if you were to break into modules the task of what it means to build a self-driving vehicle. First, you want to know where you are: where am I? Localization and mapping. You want to map the external environment, figure out where all the different obstacles, all the entities, are, and use that estimate of the environment to then figure out where I am, where the robot is. Then there's scene understanding: understanding not just the positional aspects of the external environment and the dynamics of it, but also what those entities are. Is it a car? Is it a pedestrian? Is it a bird? There's movement planning: once you have, to the best of your abilities, figured out your position and the position of other entities in this world, there's figuring out a trajectory through that world. And finally, once you've figured out how to move safely and effectively through that world, there is figuring out what the human on board is doing, because, as I will talk about, the path to a self-driving vehicle (and that is, hence, our focus on Tesla) may go through semi-autonomous vehicles, where the vehicle must not only drive itself but effectively hand over control from the car to the human and
back. Okay, a quick history. Well, there's a lot of fun stuff from the '80s and '90s, but the big breakthroughs came in the second DARPA Grand Challenge, with Stanford's Stanley, when they won the competition, one of five cars that finished. This was an incredible accomplishment: in a desert race, a fully autonomous vehicle was able to complete the race in record time. Then the DARPA Urban Challenge in 2007, where the task was no longer to race through the desert but through an urban environment, and CMU's Boss, with GM, won that race. And a lot of that work led directly into the acceptance, and the large, major industry players taking on the challenge of building these vehicles: Google, now Waymo, with its self-driving car; Tesla with its Autopilot system, and now the Autopilot 2 system; Uber with its testing in Pittsburgh. Then there are many other companies, including the company of one of the speakers for this course, nuTonomy, that are driving the wonderful streets of
Boston. Okay, so let's take a step back. If we think about the accomplishments in the DARPA challenges, and if we look at the accomplishments of the Google self-driving car, it essentially boils the world down into a chess game: it uses incredibly accurate sensors to build a three-dimensional map of the world, localize itself effectively in that world, and move about that world in a very well-defined way. Now, the open question is: what if driving is more like a conversation, like a natural-language conversation? How hard is it to pass the Turing test? The Turing test, in the popular current formulation, is: can a computer be mistaken for a human being more than 30% of the time? When a human is talking behind a veil, having a conversation with either a computer or a human, can they mistake the other side of that conversation for being a human, when it's in fact a computer?
And the way you would build a natural-language system that successfully passes the Turing test is: there's the natural language processing part, to enable it to communicate successfully, so, to generate language and interpret language; then you represent knowledge, the state of the conversation transferred over time; and the last piece, and this is the hard piece, is the automated reasoning. Reasoning: can we teach machine learning methods to reason? That is something that will propagate through our discussion, because, as I will talk about, the various methods, the various deep learning methods, neural networks, are good at learning from data, but there is not yet a good mechanism for reasoning. Now, reasoning could be just something that we tell ourselves we do, to feel special, to feel like we're better than machines. Reasoning may be simply something as simple as learning from data, and we just need a larger network. Or there could be a totally different mechanism required. And we'll talk about the possibilities there.
Yes? [Inaudible question about whether the video is from the US.] No, it's very difficult to find these kinds of situations in the United States. So, the question was, for this video, is it in the United States or not. I believe it's in, um, Tokyo. So India, a few European countries, are much more towards the direction of natural language versus chess. In the United States, generally speaking, we follow rules more concretely; the quality of the roads is better, the markings on the roads are better, so there are fewer requirements there. [Comment: these cars are driving on the left side.] I see, okay, yeah, you're right, it is. But it's certainly not the United States; I'm pretty sure. I spent quite a bit of Googling trying to find this in the United States, and it's
difficult. So, let's talk about the recent breakthroughs in machine learning. What is at the core of those breakthroughs is neural networks, which have been around for a long time, and I will talk about what has changed, what are the cool new things, what hasn't changed, and what its possibilities are. But first: a neuron, crudely, is a computational building block of the brain. I know there are a few neuroscience folks here; this is hardly a model, it is mostly an inspiration. And
so the human neuron has inspired the artificial neuron, the computational building block of a neural network, of an artificial neural network. To give you some context: these neurons, for both artificial and human brains, are interconnected. In the human brain, there are, I believe, about 10,000 outgoing connections from every neuron, on average, and they're interconnected to each other. The largest current artificial neural network, as far as I'm aware, has 10 billion of those connections, synapses. Our human brain, to the best estimate that I'm aware of, has 10,000 times that: 100 to 1,000 trillion
synapses. Now, what is an artificial neuron, this building block of a neural network? It takes a set of inputs; it puts a weight on each of those inputs and sums them together; there is a bias value that sits on each neuron; and an activation function takes as input that sum plus the bias and squishes it to produce a zero-to-one signal. This allows a single neuron to take a few inputs and produce an output, a classification, for example, a 0 or a 1. And, as we'll talk about, it can serve as a linear classifier: it can learn to draw a line, like what's seen here, between the blue dots and the yellow dots. And that's exactly what we'll do in the IPython notebook that I'll talk
about. But the basic algorithm is: you initialize the weights on the inputs and you compute the output; you perform the operation I just talked about, sum up and compute the output; and if the output does not match the ground truth, the expected output, the output that it should produce, the weights are punished, adjusted, accordingly (we'll talk through a little bit of the math of that). And this process is repeated until the perceptron does not make any more mistakes.
networks there's several and I'll talk
about
them
uh one on the mathematical side is the
universality of neural networks with
just a single layer if we stack them
together a single hidden
layer the inputs on the left the outputs
on the right and in the middle there's a
single hidden layer it can closely
approximate any function any
function so this is an incredible
property that with a single layer any
function you could think
of
that you know you can think of driving
as a function it takes an input
the the world outside as
output the control of the vehicle there
exists in your own network out there
that can drive
perfectly it's a fascinating
mathematical
fact so we can think of this then these
functions as special-purpose functions, special-purpose intelligence. You can take as input, say, the number of bedrooms, the square feet, the type of neighborhood; those are the three inputs. The network passes those values through to the hidden layer, and then, one more step, it produces the final price estimate for the house, for the residence. And we can teach a network to do this pretty well in a supervised way. This is supervised learning: you provide a lot of examples where you know the number of bedrooms, the square feet, the type of neighborhood, and you also know the final price of the house, of the residence, and then, as I'll talk about, through a process of backpropagation, you can teach these networks to make this prediction pretty well.
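The supervised setup just described can be caricatured in a few lines. A hedged sketch: a single linear layer stands in for the whole network (real training would backpropagate through the hidden layer too), and the bedroom/square-footage/neighborhood numbers and prices below are made up for illustration:

```python
def predict(x, w, b):
    # Output as a weighted sum of the input features plus a bias
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train(examples, lr=0.01, epochs=20000):
    # Supervised learning: repeatedly nudge the weights down the gradient
    # of the squared error between the prediction and the known ground truth
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in examples:
            err = predict(x, w, b) - y
            for i in range(n):
                w[i] -= lr * err * x[i]
            b -= lr * err
    return w, b

# (bedrooms, square feet in thousands, neighborhood score) -> price in $1000s
examples = [((2, 1.0, 3), 230), ((3, 1.5, 3), 330), ((4, 2.0, 8), 480)]
```

After training, the model reproduces the prices it was shown; generalizing to houses it has not seen is the whole point, and the hard part.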
Now, some of the exciting breakthroughs recently have been in general-purpose intelligence. This is from Andrej Karpathy, who's now at OpenAI. I would like to take a moment here to try to explain how amazing this is. This is a game of Pong. If you're not familiar with Pong: there are two paddles, and you're trying to bounce the ball back in such a way that prevents the other guy from bouncing the ball back at you. The artificial intelligence agent is on the right, in green, and up top is the score, 8 to 1.
Now, this takes about three days to train on a regular computer. What is this network doing? It's called a policy network. The input is the raw pixels (they're slightly processed, and you take the difference between two frames, but it's basically the raw pixel information); there are a few hidden layers; and the output is a single probability of moving up. That's it; that's the whole system. And what it's doing is, it learns. You don't know, at any one moment, what the right thing to do is; is it to move up, is it to move down? You only know what the right thing to do is by the fact that eventually you win or lose the game. So this is the amazing thing here: there's no supervised learning; there's no universal fact about any one state being good or bad, or any one action being good or bad in any state. But you punish or reward every single action you took, every single action for the entire game, based on the result. So, no matter what you did, if you won the game (the end justifies the means), every action you took, every action-state pair, gets rewarded; if you lost the game, it gets punished. And with this process, with only 200,000 games, where the system just simulates the games, it can learn to beat the computer. This system knows nothing about Pong, nothing about games. This is general intelligence, except for the fact that it's just a game of Pong. And I will talk about how this can be extended further, and why this is so promising, and why this is also something we should proceed with caution on.
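The training scheme just described, sample an action from the network's single output probability, then spread the end-of-game result back over every action taken, can be sketched as follows (illustrative Python; the discounting is an assumption borrowed from the standard policy-gradient recipe, not something stated in the lecture):

```python
import random

def sample_action(p_up):
    # The policy network outputs one number: the probability of moving up.
    # The action is sampled from that probability.
    return "up" if random.random() < p_up else "down"

def discounted_returns(rewards, gamma=0.99):
    # rewards is zero everywhere except the last step: +1 for a win,
    # -1 for a loss. Sweep backwards so every state-action pair in the
    # episode shares in the final outcome.
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out
```

Every action in a won game then gets a positive learning signal (the end justifies the means), and every action in a lost game a negative one; the network's weights are moved to make rewarded actions more probable.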
So, again: there's a set of actions you take, up or down, based on the output of the network; there's a threshold, so given the probability of moving up, you move up or down based on the output of the network; and you have a set of states, and every single state-action pair is rewarded if there's a win, and punished if there's a loss. When you go home, think about how amazing that is, and if you don't understand why that's amazing, spend some time on it.
It's incredible. Sure, sure thing. The question was: what is supervised learning, what is unsupervised learning, what's the difference? So, supervised learning: when people talk about machine learning, they mean supervised learning most of the time. Supervised learning is learning from data; it's learning from example. You have a set of inputs and a set of outputs that you know are correct, what are called ground truth, and you need those examples, a large amount of them, to train any of the machine learning algorithms to then generalize to future examples. There's actually a third one, called reinforcement learning, where the ground truth is sparse: the information about whether something is good or not, the ground truth, only comes every once in a while, at the end of the game, not every single frame. And unsupervised learning is when you have no information about which outputs are correct or incorrect. The excitement of the deep learning community is about unsupervised learning, but it has achieved no major breakthroughs at this point. I'll talk about what the future of deep learning is, and a lot of the people that are working in the field are excited by it, but right now every interesting accomplishment has to do with supervised learning. [Question:] So I guess the green one, right, and the brown one is just a heuristic solution, like it looks at the velocity; so basically the reinforcement learning here is learning from somebody who has certain rules, and how can it be guaranteed that it would generalize to somebody else, for example?
So, the question was: the green paddle learns to play this game successfully against this specific brown paddle, which is operating under specific kinds of rules; how do we know it can generalize to other games, other things? And it can't. But the mechanism by which it learns generalizes. So as long as you let it play, as long as you let it play in whatever world you want it to succeed in, long enough, it will use the same approach to learn to succeed in that world. The problem is, this works for worlds you can simulate well. Unfortunately, one of the big challenges of neural networks is that they're not currently efficient learners; we need a lot of data to learn anything. Human beings often need just one example, and they learn very efficiently from that one example. And, again, I'll talk about that as well; it's a good question. So, the drawbacks of neural networks. If you think about the way a human being would approach this game, this game of Pong, they would only need a simple set of instructions: you're in control of a paddle, and you can move it up and down, and your task is to bounce the ball past the other player, who is controlled by AI. Now, the human being, they may not win the game, but they would immediately understand the game, and would be able to successfully play it well enough to pretty quickly learn to beat the game. But they need to have a concept of
control, what it means to control a paddle; they need to have a concept of a paddle; they need to have a concept of moving up and down, and of a ball, and of bouncing; they have to have at least a loose concept of real-world physics that they can then project onto this two-dimensional world. All of these concepts are concepts that you come to the table with; that's knowledge. And the way you transfer that knowledge from your previous experience, from childhood to now, when you come to this game, that is something called reasoning, whatever reasoning means. And the question is whether, through this same kind of process, you can see the entire world as a game of Pong, and reasoning is simply the ability to simulate that game in your mind and learn very efficiently, much more efficiently than 200,000 iterations.
The other challenge of deep neural networks, and machine learning broadly, is that you need big data; they are inefficient learners, as I said. That data also needs to be supervised data: you need to have ground truth, which is very costly. Annotation, a human being looking at a particular image, for example, and labeling it as a cat or a dog, whatever object is in the image: that's very costly.
And particularly for neural networks, there are a lot of parameters to tune; there are a lot of hyperparameters. You need to figure out the network structure first: how does this network look, how many layers, how many hidden nodes, what type of activation function on each node. There are a lot of hyperparameters there. And then, once you've built your network, there are parameters for how you teach that network: learning rate, loss function, mini-batch size, number of training iterations, gradient update smoothing, and even selecting the optimizer with which you solve the various differential equations involved. It's a topic of many research papers; certainly it's rich enough for research papers, but it's also really challenging. It means that you can't just plop a network down and have it solve the problem generally. And defining a good loss function, or, in the case of Pong or games, a good reward function, is difficult.
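To make concrete just how many knobs there are, here is a hypothetical configuration for a small network; every name and value below is made up for illustration, not taken from the course projects:

```python
# Architecture hyperparameters: chosen before training even starts
architecture = {
    "hidden_layers": [64, 64],   # how many layers, how many hidden nodes
    "activation": "relu",        # activation function on each node
}

# Training hyperparameters: how you teach the network
training = {
    "learning_rate": 1e-3,
    "loss": "mean_squared_error",
    "batch_size": 32,            # mini-batch size
    "iterations": 10000,         # number of training iterations
    "optimizer": "adam",         # gradient update / smoothing scheme
}
```

Each of these interacts with the others, which is why tuning them is research-paper material rather than a checklist.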
So here's a game; this is a recent result from OpenAI, teaching a network to play the game of Coast Runners. The goal of Coast Runners is: you're in a boat, and the task is to go around a track and successfully complete a race against the other people you're racing against. Now, this network is an optimal one, and what it has figured out is that, actually, in the game, it gets a lot of points for collecting certain objects along the path. So what you see is, it's figured out to go in a circle and collect those green turbo things; what it's figured out is that you don't need to complete the race to earn the reward. And despite being on fire and hitting the wall and going through this whole process, it has actually achieved at least a local optimum, given the reward function of maximizing the number of points. And so it's figured out a way to earn a higher reward while ignoring the implied bigger-picture goal of finishing the race, which we as humans understand much better. This raises, for self-driving cars, ethical questions, besides other questions. You can watch this for hours, and it will do that for hours, and that's the point: it's hard to teach, it's hard to encode, to formally define, a utility function under which an intelligent system needs to operate. And that's made obvious even in a simple game. Yep,
question? So, the question was: what's an example of a local optimum for an autonomous car, similar to the Coast Runners case; what would be the example in the real world for an autonomous vehicle? It's a touchy subject, but it would certainly have to involve the choices we make under near-crashes and crashes, the choices a car makes about whom to avoid. For example, if there's a crash imminent and there's no way you can stop to prevent the crash, do you keep the driver safe, or do you keep the other people safe? And there has to be some... even if you don't choose to acknowledge it, even if it's only in the data and the learning that you do, there's an implied reward function there, and we need to be aware of what that reward function is, because it may find something, and until you actually see it, we won't know; once we see it, we'll realize that, oh, that was a bad design. And that's the scary thing: it's hard to know ahead of time what that is. So, the recent breakthroughs of deep learning
came from several factors. First is the compute: Moore's law. CPUs are getting faster, 100 times faster every decade. Then there are GPUs, and the ability to train neural networks on GPUs. And now ASICs have created a lot of capabilities in terms of energy efficiency, in terms of being able to train larger networks more efficiently. There is larger data: first of all, in the 21st century there's digitized data, there are larger data sets of digital data, and now that data is becoming more organized; not just vaguely available data out there on the internet, but actual organized data sets, like ImageNet, and certainly for natural language there are large data sets. There are the algorithmic innovations: backprop (backpropagation), convolutional neural networks, LSTMs, all these different architectures for dealing with specific types of domains and tasks. A huge one is infrastructure, on the software and the hardware side: there's Git, the ability to share software in an open-source way; there are pieces of software that make robotics and machine learning easier, like ROS and TensorFlow; there's Amazon Mechanical Turk, which allows for efficient, cheap annotation of large-scale data sets; there's AWS, in-the-cloud hosting of machine learning, hosting the data and the compute. And then there's the financial backing of large companies: Google, Facebook, Amazon. But, really, nothing has changed;
there really has not been any significant methodological breakthrough. We're using the same things: convolutional neural networks have been around since the '90s; neural networks have been around since the '60s. There have been a few improvements, but in terms of methodology, the compute has really been the workhorse, and the hope is that the ability to get a hundredfold improvement every decade holds promise. The question is whether that reasoning thing I talked about is achievable with nothing more than a larger network; that is the open question. So, some terms for deep learning.
First of all, deep learning is a PR term for neural networks; it is a term for utilizing deep neural networks, neural networks that have many layers. It's a symbolic term for the newly gained capabilities that compute has brought us, that training on GPUs has brought us. Deep learning is a subset of machine learning; there are many other methods that are still effective. The terms that will come up in this class: first of all, multi-layer perceptrons, deep neural networks, recurrent neural networks, LSTMs (long short-term memory networks), CNNs or ConvNets (convolutional neural networks), deep belief networks; and the operations that will come up are convolution, pooling, activation functions, and backpropagation.
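Two of the operations just listed, convolution and pooling, are simple enough to sketch directly (a 1-D toy version, my own illustration; real ConvNets apply this in 2-D over images, and what deep learning calls "convolution" is, strictly, cross-correlation):

```python
def conv1d(signal, kernel):
    # Slide the kernel along the signal, taking a weighted sum at each offset
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool(signal, size=2):
    # Downsample: keep only the maximum of each non-overlapping window
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]

def relu(z):
    # A common activation function: zero out negative values
    return max(0.0, z)
```

Stacking these, convolve, activate, pool, repeat, is the basic rhythm of a convolutional neural network.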
so the question was what is the purpose
of the different layers inur Network
what does it mean to have one
configuration versus
another so a neural
network having several
layers it's the only thing you have an
understanding of is the inputs and the
outputs you don't have a good
understanding about what each layer
does they're mysterious things neural
networks. So I'll talk about how, with every layer, it forms a higher-level, higher-order representation of the input. So it's not like the first
layer does localization the second layer
does path planning the third layer uh
does navigation how you get from here to
Florida or maybe it
does but we don't
know. So we're beginning to visualize neural networks for simple tasks, like for ImageNet, classifying
cats versus dogs we can tell what is the
thing that the first layer does the
second layer the Third layer and we'll
look at that but for driving when you as
the input provide just the images and
the output the steering it's still
unclear what you
learn, partially because we don't have neural networks that drive successfully yet. [Question: are the neural layers fed in, or does it eventually generate them on its own over time?] So the question was: does a neural network generate layers over time, does it grow? That's one of the challenges: a neural network is predefined; the architecture, the number of nodes, the number of layers, that's all fixed. Unlike the human brain, where neurons die and are born all the time, a neural network is pre-specified; that's it, that's all you get, and if you want to change that, you have to change the architecture and then retrain everything. So it's fixed.
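A minimal sketch of that fixed-architecture idea, assuming nothing beyond NumPy (the layer sizes and weights here are made up for illustration): the network's shape is chosen up front, and changing it means re-creating the weights and retraining.

```python
import numpy as np

rng = np.random.default_rng(0)

# The architecture is fixed up front: 4 inputs -> 8 hidden units -> 3 outputs.
layer_sizes = [4, 8, 3]

# Weights for each layer are created once, for this specific architecture.
weights = [rng.standard_normal((m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]

def forward(x):
    """Pass an input vector through the fixed stack of layers."""
    for w in weights:
        x = np.tanh(x @ w)  # linear map followed by a nonlinearity
    return x

out = forward(np.ones(4))
print(out.shape)  # the output size is dictated by the fixed architecture
```

Growing the network here would mean editing `layer_sizes`, rebuilding `weights`, and training from scratch, which is exactly the limitation described above.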
so what I encourage you is to proceed
with caution because there's this
feeling when you first teach a network
with very little effort how to do some
amazing task like classify a
face versus non-face, or your face
versus other faces or cats versus dogs
it's an incredible
feeling, and then there's definitely this feeling that you're an expert. But what you realize is that you don't actually
understand how it works and getting it
to perform well for more generalized
tasks for larger scale data sets for
more useful applications requires a lot
of hyperparameter tuning figuring out
how to tweak little things here and
there and still in the end you don't
understand why it works so damn
well. So deep learning, these deep neural network architectures, is representation learning. This is the difference from traditional machine learning methods. For example, take the task of classifying an image: here, the input to the network is on the bottom, the output is up at top, and the input is a single image of a person, in this case,
and so the input
specifically is all of the pixels in
that image RGB the different colors of
the pixels in the
image. And over time, what a network does is build a multi-resolution representation of this data: the first layer learns the concept of edges, for example;
the second layer starts to learn
composition of those edges corners
Contours then it starts to learn about
object parts, and finally actually provides a label for the entities that
are in the
input and this is the difference between
traditional machine learning methods
where the concepts like edges and
corners and Contours are manually
pre-specified by human beings, human experts,
particular
domain and representation
matters
because figuring out a line for the Cartesian coordinates of this particular
data set where you want to design a
machine Learning System that tells the
difference between green triangles and
blue
circles is difficult there's no line
that separates them
cleanly and if you were to ask a human
being a human expert in the field to try
to draw that
line they would probably do a PhD on it
and still not
succeed. But a neural network can automatically figure out how to remap that input into polar coordinates, where
the representation is such that it's an
easily linearly separable data
set and so deep learning is a subset of
representation learning is a subset of
machine learning and a key subset of
artificial
intelligence. Now, because of this, because of its ability to compute an arbitrary number of features that are at the core of the representation: if you were trying to detect a cat in an image, you're not specifying 215 specific features of cat ears and whiskers and so on that a human expert would specify; you allow a neural network to discover tens of thousands of such features. Maybe for cats you are an expert, but for a lot of objects you may never be able to sufficiently provide the features which would successfully be used for identifying the object.
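That coordinate-remapping idea can be made concrete with a small NumPy sketch on synthetic data (the two-ring data set here is made up for illustration): two classes that no straight line separates in Cartesian coordinates become separable by a single threshold on the radius once remapped to polar coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: class 0 on a small ring, class 1 on a large ring.
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)]) + rng.normal(0, 0.1, 200)
x = radii * np.cos(angles)
y = radii * np.sin(angles)
labels = np.concatenate([np.zeros(100), np.ones(100)])

# In Cartesian coordinates (x, y) the classes are concentric rings:
# no single straight line separates them.
# Remap to polar coordinates: the radius r = sqrt(x^2 + y^2).
r = np.sqrt(x**2 + y**2)

# In the new representation, one threshold on r separates the classes.
predictions = (r > 2.0).astype(float)
accuracy = np.mean(predictions == labels)
print(accuracy)
```

A deep network learns remappings like this automatically, layer by layer, rather than having a human hand-pick the polar transform.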
and so this kind of representation
learning one is easy in the sense that
all you have to provide is inputs and
outputs all you need to provide is a
data set that you care about without
hand engineering
features; and two, because of its ability to construct arbitrarily sized representations, deep neural networks are hungry for data: the more data we give them, the more they're able to learn about this particular data
set so let's look at some
applications
first some cool things that deep neural
networks have been able to accomplish up
to this point let me go through them
first the basic
one: AlexNet. ImageNet is a famous data set, and there's a competition of classification and localization, where the task is, given an image, identify the five most likely things in that image, and the most likely one, and you have to do so correctly. So on the right there's an image of a leopard, and you have to correctly classify that it is in fact a
leopard so they're able to do this
pretty well given a specific
image determine that it's a leopard
And what's shown here on the x-axis is years; on the y-axis, error in classification. Starting from 2012 on the left, with AlexNet, through today, the error has decreased from 16%, and 40% before then with traditional methods, to below 4%. So, human-level performance: if I were to give you this picture of a leopard, for 4% of those pictures of leopards you would not say it's a leopard; that's human-level performance. So for the first time, in 2015, convolutional neural networks outperformed human beings. That in itself is incredible; that's something that seemed impossible, and now, because it's done, is not as impressive,
but I just want to get to why that's so
impressive
because computer vision is hard now we
as human beings have evolved visual
perception over millions of years
hundreds of millions of
years so we take it for granted but
computer vision is really hard visual
perception is really hard there is
illumination variability: the only way we tell anything is from the shading, the reflection of light from a surface, so it could be the same object with, in terms of pixels, drastically different looking
shapes and we still know it's the same
object. There is pose variability, and occlusions. Probably my favorite caption for a figure in an academic paper is "deformable and truncated cat." These are pictures; you know, cats are famously deformable, they take a lot of different shapes; arbitrary poses are possible. So computer vision needs to know it's still the same object, still the same class of objects, given all the variability in pose and the occlusions; they're a huge
problem we still know it's an object we
still know it's a cat even when parts of
it are not visible and sometimes large
parts of it are not
visible. And then there's all the intra-class variability: all of these on the top two rows are cats, and many of them look drastically different; the bottom two rows are dogs, which also look drastically different; and yet some of the dogs look like cats, and some of the cats look like dogs. As human beings we're pretty good at telling the difference, and we want computer vision to do better
than that. It's hard. So how is this done? This is done with convolutional networks, the input to which is a raw image: here is an input on the left, of a number three, and, as I'll talk about, that image is passed through convolutional layers, which maintain spatial information, and on the output they predict, in this case, which number is shown in the image: 0, 1, 2, through 9. And everybody is using the same kind of network to determine exactly that: input is an image, output is a number, and in the case of, you know, the probability that it's a leopard, what is that number. Then there's segmentation,
built on top of these convolutional
networks, where you chop off the end and convolutionalize the network, so that the output is a heat map: instead of a detector for a cat, you can get a cat heat map, where the neurons on the output get spatially excited in the parts of the image that contain a tabby cat.
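A toy version of that heat-map idea in NumPy, where the "detector" is just a hand-made template rather than a trained network: sliding it over the image produces a response map that is spatially excited exactly where the pattern occurs.

```python
import numpy as np

# A 6x6 "image" containing a 2x2 bright blob at rows 1-2, cols 3-4.
image = np.zeros((6, 6))
image[1:3, 3:5] = 1.0

# A hand-made 2x2 "detector" that responds to bright 2x2 patches.
template = np.ones((2, 2))

# Slide the detector over every position (a valid cross-correlation)
# and record the response at each location: this is the heat map.
h = image.shape[0] - template.shape[0] + 1
w = image.shape[1] - template.shape[1] + 1
heat = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        heat[i, j] = np.sum(image[i:i+2, j:j+2] * template)

# The heat map peaks exactly where the blob is.
peak = tuple(int(v) for v in np.unravel_index(np.argmax(heat), heat.shape))
print(peak)  # (1, 3)
```

A convolutional layer does the same sliding computation, except the filter weights are learned from data instead of being written by hand.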
and this kind of process can be used to
segment the image into different objects
a horse. So the original input on the left is a woman on a horse, and the output is a fully segmented image, knowing where's the woman and where's the horse. And this kind of process can be
used for object detection which is the
task of detecting an object in an
image. Now, the traditional method with convolutional neural networks, and in general in computer vision, is the sliding-window approach, where you have a detector, like the leopard detector, that you slide across the image to find where in that image the leopard is. The segmenting approach, the R-CNN approach, is to efficiently segment the image in such a way that it can propose different parts of the image that are likely to contain a leopard, or in this case a cowboy, and that drastically reduces the computational requirements of the object detection
task. And these networks, currently one of the best for the ImageNet localization task, are the deep residual networks: they're deep. VGG-19 is one of the famous ones, VGGNet; you're starting to get above 20 layers in many cases; 34 layers is the ResNet one. So the lesson there is: the deeper you go, the more representational power you have, and the higher the accuracy; but you need more
data. Other applications: colorization of images. So this, again: input is a single image and output is a single image, so you can take black-and-white video from an old film and recolor it, and all you need to do to train that network, in a supervised way, is provide modern films and convert them to grayscale. So now you have arbitrarily sized data sets of grayscale-to-color pairs, and you're able, with very little effort on top of it, to successfully, well, somewhat successfully, recolor
images. Again, Google Translate does image translation in this way, image to image. It first perceives, here in German, I believe it was German, correct me if I'm wrong, "dark chocolate" written on a box; so it can take this image, detect the different letters, convert them to text, translate the text, and then, using the image-to-image mapping, map the translated letters back onto the box. And you can do this in real time, on
video. So what we've talked about up to this point, on the left, are vanilla neural networks, convolutional neural networks, that map a single input to a single output: a single image to a number, a single image to another image. Then there are recurrent neural networks; this is the more general formulation, mapping a sequence of images, or a sequence of words, or a sequence of any kind, to another sequence. And these networks are able to do incredible things with natural language, with video, with any time-series data. For example, we can convert typed text to handwritten text: here we type in, and you can do this online, "deep learning for self-driving cars," and it will use an arbitrary handwriting style to generate the words "deep learning for self-driving cars." This is done using recurrent neural
networks. We can also train char-RNNs, as they're called, character-level recurrent neural networks, that train on an arbitrary text data set and learn to generate text one character at a time. So there is no preconceived syntactic or semantic structure provided to the network; it learns that
structure. So, for example, you can train it on Wikipedia articles, like in this case, and it's able to successfully generate not only text that makes some kind of grammatical sense, at least, but also to keep perfect syntactic structure for Wikipedia, for Markdown editing, for LaTeX editing, and so on. This text says "naturalism and decision for the majority of Arab countries' capitalide," whatever that means, "was grounded by the Irish language by John Clair," and so on: these are sentences that, if you didn't know better, might sound
correct. And it does so, let me pause, one character at a time: these aren't words being generated, this is one character at a time. You start with the beginning three letters, "nat," and you generate, completely without knowing it, the word "naturalism." This is incredible. You can do
this: you can start a sentence and let the network complete that sentence. So, for example, if you start the sentence with "life is" or "life is about," it will complete it with a lot of fun things: "life is about the weather," "life is about kids," "life is about the true love of Mr. Mom," "life is about the truth now." And the last two are from Geoffrey Hinton: if you start with "the meaning of life," it can complete that with "the meaning of life is literary recognition," maybe true for some of us here, publish or perish; and "the meaning of life is the tradition of ancient human reproduction," also true for some of us here, I'm sure.
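The one-character-at-a-time mechanics can be sketched in NumPy. This is the sampling loop of a character-level RNN with random, untrained weights (the vocabulary and layer sizes are made up), so the output is gibberish, but the loop is the same one a trained char-RNN uses: each step one-hot encodes the previous character, updates a hidden state, and samples the next character from a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("abcdefghijklmnopqrstuvwxyz ")
V, H = len(vocab), 16  # vocabulary size, hidden state size

# Untrained, random weights: input-to-hidden, hidden-to-hidden, hidden-to-output.
Wxh = rng.standard_normal((V, H)) * 0.1
Whh = rng.standard_normal((H, H)) * 0.1
Why = rng.standard_normal((H, V)) * 0.1

def sample(seed_char, n):
    """Generate n characters, one at a time, each conditioned on the last."""
    h = np.zeros(H)
    idx = vocab.index(seed_char)
    out = [seed_char]
    for _ in range(n):
        x = np.zeros(V)
        x[idx] = 1.0                      # one-hot encode the previous character
        h = np.tanh(x @ Wxh + h @ Whh)    # recurrent hidden-state update
        logits = h @ Why
        p = np.exp(logits - logits.max())
        p /= p.sum()                      # softmax over the vocabulary
        idx = int(rng.choice(V, p=p))     # sample the next character
        out.append(vocab[idx])
    return "".join(out)

text = sample("n", 20)
print(text)
```

Training would adjust `Wxh`, `Whh`, and `Why` by backpropagation through time so that the sampled characters start to form real words; the generation loop itself stays unchanged.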
Okay, so what else can you do? This has been very exciting recently: image caption generation. Image caption generation is important for large data sets of images, where we want to be able to determine what's going on inside those images, especially for search: if you want to find a man sitting on a couch with a dog, you type it into Google and it's able to find that.
So here, shown in black text, "a man sitting on a couch with a dog" is generated by the system; "a man sitting in a chair with a dog in his lap" is generated by a human observer. And again, these annotations are done by detecting the different objects in the scene: segmenting the scene; detecting, on the right, that there's a woman, a crowd, a cat, a camera, holding, you know, purple, all of these words are being detected; then syntactically correct sentences are generated, a lot of them; and then you rank which sentence is the most likely. And in this way you can generate very accurate labeling of the images, captions for the images. And you
could do the same kind of process
for image question
answering. You can ask about quantity: how many chairs are there? You can ask about location: where are the ripe bananas? You can ask about the type of object: what is the object on the chair? It's a pillow. And these are, again, using recurrent neural networks. You can do the same thing with video caption generation, video description generation: looking at a sequence of images, as opposed to just a single
image, what is the action going on in this situation? This is a difficult task, and there's a lot of work in this area now. On the left are correct descriptions: "a man is doing stunts on his bike," "a herd of zebras are walking in a field." And on the right, "a small bus running into a building": you know, it's talking about relevant entities, but it's just an incorrect description. "A man is cutting a piece of a pair of a paper": the words are correct, perhaps, so you're close, but
no cigar. So one of the interesting things you can do with recurrent networks: think about the way we human beings look at images; we only have a small fovea with which we focus in the scene. So right now your periphery is very distorted: if you're looking at the slides, or you're looking at me, that's the only thing that's in focus; the majority of everything else is out of focus. So we can use the same kind of concept to try to teach a neural network to steer around the image, both for perception and for generation of those images. This is important, first, on the general-artificial-intelligence point of
general artificial intelligence point of
it being just
fascinating that we can SEL selectively
steer our attention but also it's
important for things like drones that
have to fly at high speeds in an
environment where at 300 plus frames a
second you have to make decisions so you
can't possibly localize yourself or
perceive the world around yourself
successfully if you have to interpret
the entire scene. So what you can do is steer: for example, shown here is reading house numbers by steering around an image. You can do the same task for reading and for writing: so, on the left, reading numbers, here on the MNIST data set; but you can also selectively steer a network around an image to generate that image, starting with a blurred image first and then getting higher and higher resolution as the steering goes
on. Work here at MIT is able to map video to audio: so, hitting stuff with a drumstick, silent video, and it's able to generate the sound that the drumstick hitting that particular object makes; so you can get texture information from that
impact. So here is a video of a human soccer player playing soccer, and a state-of-the-art machine playing soccer, and, well, let me give him some time to build up.
[Laughter]
Okay. So, soccer: we take this for granted, but walking is hard, object manipulation is hard; soccer is much harder than chess, for us to do. On your phone now you can have a chess engine that beats the best players in the
world. And you have to internalize that, because, and this is a painful video, but the question is: where does driving fall? Is it closer to chess, or is it closer to soccer? For those incredibly brilliant engineers who worked on the most recent DARPA challenge, this is a very painful video to watch; I apologize. This is a video from the DARPA challenge, of robots struggling with basic object manipulation and walking tasks; it's mostly a fully autonomous navigation task. Maybe I'll just let this play for a few moments, to let it sink in how difficult this task
is: of balancing, of planning in an underactuated way, where you don't have full control of everything, when there is a delta between your perception of what you think the world is and what the reality is. So there, the robot was trying to turn an object that wasn't there. And this is an MIT entry that actually, I believe, got points for this, because it got into that
area. But, as a lot of the teams talked about, the hardest part: so, one of the things the robot had to do was get into a car, drive it, and get out of the car; and there were a few other manipulation tasks: they had to walk on unsteady ground, they had to drill a hole through a wall, all these tasks. And what a lot of teams said is that the hardest task of all of them was getting out of the car. So it's not getting into the car; it's this very task that you saw now, the robot getting out of the car. These are things we take for
granted so in our evaluation of what is
difficult about driving we have to
remember that some of those things we
may take for granted in the same kind of
way that we take walking for
granted this is this is this is
Mor's
Paradox with Hans morac from
CMU let me just quickly read that quote
and quote in the large highly evolved
sensory and motor portions of the human
brain is billions of years of experience
about the nature of the world and how to
survive in it so this is data this is
Big Data
billions of years and abstract thought
which is reasoning the stuff we think is
intelligence is perhaps less than
100,000
years uh of data old we haven't yet
mastered it and so sorry I'm inserting
my own statements in the middle of a
quote but
uh we it hasn't it's been it's been very
recent that we've learned how to think
and so so we respect it perhaps
more than the things we take for granted
like walking and visual perception and
so on but those may be strictly a matter
of
data data and training time and network
size. So walking is hard; the question is, how hard is driving? And that's an important question, because the margin of error is small. There is one fatality per 100 million miles; that's the rate at which people die in car crashes. One fatality per 100 million miles: that's a 0.000001% margin of error. Through all the time you spend on the road, that is the error you get. We're impressed with ImageNet being able to classify a leopard, a cat, or a dog at above human-level performance, but this is the margin of error we get with driving. And we have to be able to
deal with snow, with heavy rain, with big open parking lots, with parking garages, with pedestrians who behave irresponsibly, as rarely as that happens, or just unpredictably, again, especially in Boston; reflections; and, one of the things you don't think about, the lighting variations that blind the cameras.
[Question] Yeah, the question was whether that
number changes if you look at just crashes, so fatalities per crash, or crashes per mile. So one of the big things is that cars have gotten really good at crashing without hurting anybody, so the number of crashes is much, much larger than the number of fatalities, which is a great thing: we've built safer cars. But still, you know, even one fatality is too many. So this
is one thing the Google self-driving car team is quite open about: their performance since hitting public roads. This is from a report that shows the number of times the driver disengaged: the car gives up control and asks the driver to take control back, or the driver takes control back by force, meaning they're unhappy with a decision the car was making, or it was putting the car, or pedestrians, or other cars, in unsafe situations. And if you look over time, from 2014 to 2015, there's been a total of 341 times, on beautiful San Francisco roads, and I say that seriously because the weather conditions are great there, 341 times that the driver had to elect to take control
back. So it's a work in progress. And let me give you something to think about here; this, with neural networks, is a big open question: the question of robustness. This is an amazing paper, I encourage people to read it, and there are a couple of papers around this topic: "Deep Neural Networks Are Easily Fooled."
So here are eight images where, given to a convolutional neural network as input, the network, with higher than 99.6% confidence, says that the image on the top left, for example, is a robin; next to it, a cheetah; then an armadillo, a panda, an electric guitar, a baseball, a starfish, a king penguin. All of these
things are obviously not in the images
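The papers find such images by optimizing against the network itself. As a toy illustration of the underlying idea, here is a gradient-sign perturbation on a made-up linear classifier (not the actual networks from the paper): a small, structured distortion of the input, chosen by following the model's own gradient, flips the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy linear classifier: predict "cat" when w . x > 0, "not cat" otherwise.
w = rng.standard_normal(256)
x = rng.standard_normal(256)
if w @ x < 0:
    x = -x  # make sure the clean input is classified as "cat"

# The gradient of the score w . x with respect to the input is just w, so
# stepping against sign(w) lowers the score as fast as possible per unit
# of max-norm change: a small, targeted distortion of every pixel.
epsilon = 0.5
x_adv = x - epsilon * np.sign(w)

clean_score = float(w @ x)
adv_score = float(w @ x_adv)
print(clean_score > 0, adv_score < 0)
```

Each component of the input moves by at most `epsilon`, yet the prediction flips, which is the same qualitative effect as the barely visible distortions described next.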
So networks can be fooled with noise. More importantly, more practically for the real world: adding just a little bit of noise distortion to the image can force the network to produce a totally wrong prediction. So here is an example: there are three columns, the correctly classified image, the slight addition of distortion, and the resulting prediction, an ostrich, for all three images on the left, and a prediction of an ostrich for all three images on the right. This
ability to fool networks easily brings up an important point, and that point is that there has been a lot of excitement about neural networks throughout their history, and a lot of excitement about artificial intelligence throughout its history, and not grounding that excitement in the reality of the real challenges has resulted in crashes, in AI winters, when funding dried up and people became hopeless about the possibilities of artificial intelligence. So here's a 1958 New York
Times article that said the Navy revealed the embryo of an electronic computer today. This is when the first perceptron, which I talked about, was implemented in hardware by Frank Rosenblatt. It took a 400-pixel image as input and provided a single output; weights were encoded in hardware potentiometers, and weights were updated with electric motors. Now, the New York Times wrote: the Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself, and be conscious of its existence. Dr. Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory, Buffalo, said perceptrons might be fired to the planets as mechanical space explorers. This might seem ridiculous, but
this was the general opinion of the time. And as we know now, perceptrons cannot even separate a nonlinear function; they're just linear classifiers. And so this led to two major AI winters, in the '70s and in the late '80s and early '90s. The Lighthill report in 1973, for the UK government, said that in no part of the field have the discoveries made so far produced the major impact that was promised. So if the hype builds beyond the capabilities of our research, reports like this will come, and they have the possibility of creating another AI winter. So I want to
pair the optimism some of the cool
things we'll talk about in this
class with the reality of the challenges
ahead of
us. The focus of the research community, some of the key players in deep learning: what are the things that are next for deep learning, the five-year vision? We want to run on smaller, cheaper mobile devices. We want to explore more in the space of unsupervised learning, as I mentioned, and reinforcement learning. We want to do things that explore the space of video more with recurrent neural networks, like being able to summarize videos or generate short videos. One of the big efforts, especially in the companies dealing with large data, is multimodal learning: learning from multiple data sets with multiple sources of data. And lastly, making money from these
technologies: despite the excitement, there has for the most part been an inability to make serious money from some of the more interesting parts of deep learning. And while I got made fun of by the TAs for including this slide, because it's shown in so many sort of business-type lectures, it is true that we're at the peak of a hype cycle, and we have to make sure that, given the large amount of hype and excitement there is, we proceed with
caution. One example of that, let me mention: we already talked about spoofing the cameras, spoofing the cameras with a little bit of noise. If you think about it, self-driving vehicles operate with a set of sensors, and they rely on those sensors to accurately capture information about the world. Now, what happens not only when the world itself produces noisy visual information, but when somebody actively tries to spoof that data? One of the fascinating things that has recently been done is spoofing of lidar. Lidar is a range sensor that gives a 3D point cloud of the objects in the external environment, and you're able to successfully do a replay attack, where you have the car see people and other cars around it when there's actually nothing around it, in the same way that you can spoof a camera to see things that are not there, a neural
network. So let me run through some of the libraries that we'll work with, and that are out there that you might work with if you proceed with deep learning. TensorFlow: the most popular one these days; it's heavily backed and developed by Google; it has primarily a Python interface, and it is very good at operating on multiple GPUs. There's Keras, and also TFLearn and TF-Slim, which are libraries that operate on top of TensorFlow and offer slightly easier, slightly more user-friendly interfaces, to get up and
running
Torch: if you're interested in getting in at the lower level, tweaking the different parameters of neural networks, creating your own architectures, Torch is excellent for that, with its own Lua interface; Lua is a programming language; and it's heavily backed by Facebook. There's the old-school Theano, which is what I started on, and what a lot of people early on in deep learning started on; it's one of the first libraries that came with GPU support; it definitely encourages lower-level tinkering, and it has a Python interface. And many of these, if not all, rely on Nvidia's library for doing some of the low-level computations involved with training these neural networks on Nvidia
GPUs. MXNet: heavily supported by Amazon, and they've recently officially announced that, with AWS, they're going to be all-in on MXNet. Neon: recently bought by Intel, the company behind it started out as a manufacturer of neural network chips, which is really exciting, and it performs exceptionally well. Caffe: started in Berkeley, and was also very popular at Google before TensorFlow came out; it's primarily designed for computer vision with ConvNets, but has now expanded to all other domains. There is CNTK, as it used to be known, now called the Microsoft Cognitive Toolkit, though nobody calls it that as far as I'm aware; it has multi-GPU support and its own BrainScript custom language, as well as other
interfaces. And what we'll get to play around with in this class is, amazingly, deep learning in the browser. Our favorite is ConvNetJS, what you'll use, built by Andrej Karpathy, from Stanford, now at OpenAI. It's good for explaining the basic concepts of neural networks, it's fun to play around with, and all you need is a browser, so very few requirements. It can't leverage GPUs, unfortunately, but for a lot of the things that we're doing you don't need GPUs; you'll be able to train a network with very little, and relatively efficiently, without the need for GPUs. It has full support for CNNs, RNNs, and even deep reinforcement
learning. Keras.js, which seems incredible; we tried to use it for this class; it didn't happen. It has GPU support, so it runs in the browser with GPU support, with WebGL, or however it works, magically. But we're able to accomplish everything we need without the use of GPUs. It's incredible to live in a day and age when, as I'll show in the tutorials, it literally takes just a few minutes to get started building your own neural network that classifies
images, and a lot of these libraries are friendly in that way. So, all the references mentioned in this presentation are available at this link, and the slides are available there as well. In the interest of time, let me wrap up: thank you so much for coming in today, and tomorrow I'll explain the reinforcement learning game and the actual competition and how you can win it. Thanks very much, guys.