Thank you everyone for braving the cold and the snow to be here. This is 6.S094: Deep Learning for Self-Driving Cars. It's a course where we cover two topics: deep learning, a set of techniques that have taken a leap in the last decade in our understanding of what artificial intelligence systems are capable of doing, and self-driving cars, systems that can take these techniques and integrate them in a meaningful, profound way into our daily lives, in a way that transforms society. That's why both of these topics are extremely important and extremely exciting.

My name is Lex Fridman, and I'm joined by an amazing team of engineers in Jack Terwilliger, Julia Kindelsberger, Dan Brown, Michael Glazer, Li Ding, Spencer Dodd, and Benedikt Jenik, among many others. We build autonomous vehicles here at MIT, not just ones that perceive and move about the environment, but ones that interact, communicate, and earn the trust and understanding of the human beings inside the car, the drivers and the passengers, and the human beings outside the car, the pedestrians, other drivers, and cyclists.

The website for this course is selfdrivingcars.mit.edu. If you have questions, email deepcars@mit.edu; the Slack is deep-mit. For registered MIT students: you have to register on the website and, by midnight Friday, January 19th, build a neural network that achieves a speed of 65 miles per hour in the new DeepTraffic 2.0 and submit it to the competition. It's much harder and much more interesting than last year's, for those of you who participated. There are three competitions in this class: DeepTraffic, SegFuse, and DeepCrash. There are guest speakers coming from Waymo, Google, Tesla, and from those starting new autonomous vehicle startups, Voyage, nuTonomy, and Aurora, some of them coming straight from CES. And we have shirts: for those of you who braved the snow, and continue to do so, toward the end of the class there will be free shirts. Yes, I said "free" and "shirts" in the same sentence. You should be here.

Okay, first, the DeepTraffic competition. There are a lot of updates, and we'll cover those on Wednesday. It's a deep reinforcement learning competition. Last year we received over 18,000 submissions. This year we're going bigger: not only can you control one car with a neural network, you can control up to ten. This is multi-agent deep reinforcement learning. This is super cool.

Second, SegFuse, a dynamic driving scene segmentation competition. You're given the raw video, the kinematics of the vehicle (the movement of the vehicle), and the state-of-the-art segmentation; for the training set you're given ground truth labels: pixel-level labels, scene segmentation, and optical flow. With those pieces of data, your task is to try to perform better than the state of the art in image-based segmentation. Why is this critical, fascinating, and an open research problem? Because robots that act in this world, in physical space, must not only use these deep learning methods to interpret the spatial, visual characteristics of a scene; they must also interpret, understand, and track the temporal dynamics of the scene. This competition is about temporal propagation of information, not just scene segmentation: you must understand the scene in space and time.

And finally, DeepCrash, where we use deep reinforcement learning, slamming cars thousands of times here at MIT, at the gym. You're given data on a thousand runs of a car that, knowing nothing, uses a monocular camera as its single input,
driving at over 30 miles an hour through a scene. It has very little control authority and very little capability to localize itself; it must act very quickly. You're given a thousand runs to learn from; we'll discuss this in the coming weeks. Everyone's submission is evaluated in simulation, but the top four submissions we put head to head at the gym, and until a winner is declared, we keep slamming cars at 30 miles an hour. DeepCrash.

Also on the website from last year, and on GitHub, is DeepTesla, which uses the large-scale naturalistic driving dataset we have to train a neural network to do end-to-end steering: it takes in monocular video of the forward roadway and produces steering commands for the car.

Lectures: today we'll talk about deep learning; tomorrow, autonomous vehicles. Deep reinforcement learning is on Wednesday; driving scene understanding, so segmentation, is Thursday. On Friday we have Sacha Arnoud, the director of engineering at Waymo. Waymo is one of the companies that is truly taking huge strides in fully autonomous vehicles; they're taking the fully L4/L5 autonomous vehicle approach, and since he's also the head of perception for them, it's fascinating to learn what kinds of problems they're facing and what kind of approach they're taking. We have Emilio Frazzoli; one of last year's speakers, Sertac Karaman, said Emilio is the smartest person he knows. Emilio Frazzoli is the CTO of nuTonomy, an autonomous vehicle company that was just acquired by Delphi for a large sum of money, and they're doing a lot of incredible work in Singapore and here in Boston. Next Wednesday we're going to talk about the topic of our research, my personal fascination: deep learning for driver state sensing, understanding the human, perceiving everything about the human being inside the car and outside the car. One talk I'm really excited about is Oliver Cameron on Thursday. He is now the CEO of the autonomous vehicle startup Voyage; he was previously the director of the self-driving car program at Udacity. He will talk about how to start a self-driving car company; for those of you MIT folks and entrepreneurs who want to start one yourself, he'll tell you exactly how. It's super cool. And then Sterling Anderson, who was previously the director of the Tesla Autopilot team and is now a co-founder of Aurora, the self-driving car startup I mentioned that has now partnered with NVIDIA and many others.

So, why self-driving cars? This class is about applying data-driven learning methods to the problem of autonomous vehicles. Why are self-driving cars a fascinating and interesting problem space? Quite possibly, in my opinion, this is the first wide-reaching and profound integration of personal robots in society. Wide-reaching because there are one billion cars on the road; even a fraction of that will change the face of transportation and how we move about this world. Profound, and this is an important point that's not always understood, because there's an intimate connection between a human and a vehicle when there's a direct transfer of control: a transfer of control that takes his or her life into the hands of an artificial intelligence system. I show a few quick clips here; you can Google "first time with Tesla Autopilot" on YouTube and watch people perform that transfer of control. There's something magical about a human and a robot working together that will transform what artificial intelligence is
in the 21st century. And this particular autonomous system, this AI system of self-driving cars, operates at such a scale, and its life-critical nature is so profound, that it will truly test the capabilities of AI. There is a personal connection here, and I will argue throughout these lectures that we cannot escape considering the human being. The autonomous vehicle must not only perceive and control its movement through the environment; it must also perceive everything about the human driver and the passengers, and interact, communicate, and build trust with that driver. Because in my view, as I will argue throughout this course, an autonomous vehicle is more of a personal robot than it is a perfect perception-control system, because perfect perception and control in this world full of humans is extremely difficult, and full autonomy could be two, three, four decades away. Autonomous vehicles are going to be flawed. They're going to have flaws, and we have to design systems that effectively transfer control to human beings when they can't handle the situation. And that transfer of control is fascinating for AI, because perception of obstacles and obstacle avoidance is the easy problem, the safe problem. Going 30 miles an hour, navigating through the streets of Boston, is easy. It's when you have to get to work and you're late, or you're sick of the person in front of you, so you want to pull into the opposing lane and speed up: that's human nature, and we can't escape it. Our artificial intelligence systems can't escape human nature; they must work with it.

What's shown here is one of the algorithms we'll talk about next week for cognitive load, where 3D convolutional neural networks take in the raw eye region, the blinking and the pupil movement, to determine the cognitive load of the driver. We'll see how we can detect everything about the driver: where they're looking, emotion, cognitive load, body pose estimation, drowsiness.

The movement toward full autonomy is so difficult that, I would argue, it almost requires human-level intelligence. The two-, three-, four-decade-out journey for artificial intelligence researchers to achieve full autonomy will require solving some of the fundamental problems of creating intelligence, and that's something we'll discuss in much more depth, and with a broader view, in two weeks in the Artificial General Intelligence course, where we have Andrej Karpathy from Tesla, Ray Kurzweil, and Marc Raibert from Boston Dynamics, who asked for the dimensions of this room because he's bringing robots. Nothing else was told to me; it'll be a surprise.

So that is why I argue for a human-centered artificial intelligence approach, where every algorithm we design considers the human. For the autonomous vehicle, on the left, the perception, scene understanding, and control problem, as we'll explore through the competitions and the assignments of this course, can handle 90 percent, and an increasing share, of the cases. But it's the remaining 10, 1, 0.1 percent of cases, as we get better and better, that we're not able to handle through these methods, and that's where perceiving the human is really important. This is the video from last year of the Arc de Triomphe (thank you; I didn't know its name last year, I know it now). That is one of millions of cases where human-to-human interaction, not the basic perception-control problem, is the dominant driver.

So why deep learning in this space? Because deep learning is a set of methods that do well with a lot of
data. To solve these problems where human life is at stake, we have to have techniques that learn from data, that learn from real-world data. This is the fundamental reality of artificial intelligence systems that operate in the real world: they must learn from real-world data, whether that's on the left, the perception and control side, or on the right, the human side: perception of the human, and communication, interaction, and collaboration with the human, human-robot interaction.

Okay, so what is deep learning? It's a set of techniques. If you allow me the definition of intelligence as the ability to accomplish complex goals, then I would argue that a definition of understanding, maybe reasoning, is the ability to turn complex information into simple, useful, actionable information, and that is what deep learning does. Deep learning is representation learning, or feature learning if you will. It's able to take raw, complicated information that's hard to do anything with and construct hierarchical representations of that information, to be able to do something interesting with it. It is the branch of artificial intelligence most capable of, and most focused on, this task of forming representations from data. Whether it's supervised or unsupervised, whether it's with the help of humans or not, it's able to find structure in the data such that you can extract simple, useful, actionable information.

On the left, from Ian Goodfellow's book, is the basic example of image classification. The input is the image on the bottom, the raw pixels, and as we go up the stack, as we go up the layers, higher and higher-order representations are formed: from edges to contours to corners to object parts, and finally the full object, the semantic classification of what's in the image. This is representation learning. A favorite example of mine is one from four centuries ago: our place in the universe, and representing that place relative to Earth or relative to the Sun. On the left is our current belief; on the right is the one that was widely held until a few centuries ago. Representation matters, because what's on the right is much more complicated than what's on the left.

You can think of a simple case, where the task is to draw a line that separates green triangles and blue circles. In the Cartesian coordinate space on the left, the task is very difficult, impossible to do well; on the right, in polar coordinates, it's trivial. This transformation is exactly what we need to learn. This is representation learning. You can take a similar task of having to draw a line that separates the blue curve and the red curve. On the left, if we draw a straight line, there's no way to do it with zero error, with 100 percent accuracy; shown is our best attempt. But what we can do with deep learning, with a single-hidden-layer network, is warp the topology, the mapping of the space, shown in the middle, in such a way that a straight line can be drawn to separate the blue curve and the red curve. Learning the function in the middle is what we're able to achieve with deep learning: taking raw, complicated information and making it simple, actionable, useful.
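To make the coordinate-transform idea concrete, here is a tiny sketch of my own (not from the lecture materials): two classes arranged in concentric rings cannot be separated by a straight line in Cartesian coordinates, but after a hand-coded map to polar coordinates, a single threshold on the radius separates them perfectly. This is exactly the kind of transformation a learned hidden layer can discover on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes: an inner ring (class 0) and an outer ring (class 1).
# No straight line in (x, y) separates them.
n = 200
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
radius = np.where(np.arange(n) < n // 2, 1.0, 3.0) + rng.normal(0.0, 0.1, n)
labels = (np.arange(n) >= n // 2).astype(int)
x, y = radius * np.cos(theta), radius * np.sin(theta)

# Hand-coded "representation learning": map to polar coordinates.
r = np.sqrt(x**2 + y**2)

# In the new representation, a single threshold (a line) separates the classes.
predictions = (r > 2.0).astype(int)
print(f"accuracy of a linear rule in polar coordinates: {(predictions == labels).mean():.2f}")
```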
The point is that this ability to learn from raw sensory information means we can do a lot more with a lot more data. Deep learning gets better with more data, and that's important for real-world applications, where edge cases are everything.

This is us driving with two perception-control systems. One is a Tesla vehicle with the Autopilot version 1 system, which uses a monocular camera to perceive the external environment and produce control decisions; the other is our own neural network, running on a Jetson TX2, taking in the same monocular camera feed and producing control decisions. The two systems argue, and when they disagree, they raise a flag to say that this is an edge case that needs human intervention. Covering such edge cases using machine learning is the main problem of artificial intelligence applied to the real world; it is the main problem to solve.

Okay, so what are neural networks? They're inspired, very loosely (and I'll discuss the key differences between our own brains and artificial neural networks, because there are a lot of insights in those differences), by biological neural networks. Here is a simulation of a thalamocortical brain network, which is only 3 million neurons and 476 million synapses; the full human brain is a lot more than that: a hundred billion neurons, 1,000 trillion synapses. There's inspirational music with this one that I didn't realize was here; it should make you think about artificial neural networks. Let's just let it play.

The human neural network is a hundred billion neurons, 1,000 trillion synapses. One of the state-of-the-art artificial neural networks, ResNet-152, has about 60 million synapses. That's a difference of about seven orders of magnitude: human brains have roughly ten million times more synapses than artificial neural networks, plus or minus an order of magnitude depending on the network.

So what's the difference between a biological neuron and an artificial neuron? Topology: the human brain has no layers; artificial neural networks are stacked in layers and fixed for the most part. There is chaos, very little structure, in how neurons in the human brain are connected; a neuron is often connected to 10,000-plus other neurons, so the number of synapses feeding into an individual neuron is huge. Asynchrony: the human brain works asynchronously; artificial neural networks work synchronously. Learning algorithm: for artificial neural networks the only one, the best one, is backpropagation, and we don't know how human brains learn. Processing speed: this is one of the only advantages we have; artificial neurons are faster, but biological neurons are also extremely power-efficient. And there is a division into two stages, training and testing, with artificial neural networks; biological neural networks, as you're sitting here today, are always learning. The only profound similarity, the inspiring one, the captivating one, is that both are distributed computation at scale. There is an emergent aspect to neural networks: the basic element of computation, a neuron, is extremely simple, but when connected together, beautiful, amazing, powerful approximators can be formed.

A neural network is built up from these computational units. There's a set of edges with weights on them; the weights are multiplied by the input signal; a bias is added; and a nonlinear activation function determines whether the neuron gets activated or not, as visualized here. These neurons can be combined in a number of ways: they can form a feed-forward neural network, or they can feed back into themselves, to have state, memory, in recurrent neural networks.
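In code, that single computational unit is just a weighted sum, a bias, and a nonlinearity. A minimal sketch of my own, with made-up weights and inputs:

```python
import numpy as np

def sigmoid(z):
    """Nonlinear activation: squashes the pre-activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, plus a bias, through an activation."""
    return sigmoid(np.dot(weights, inputs) + bias)

# Made-up example: three inputs, three weights, one bias.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.5])
b = 0.1
print(neuron(x, w, b))  # a single activation value in (0, 1)
```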
The ones on the left are the most successful for most applications, in computer vision especially; the ones on the right are very popular whenever temporal dynamics, time series of any kind, are involved. In fact, the ones on the right are much closer to the way our human brains work than the ones on the left, but that's also why they're really hard to train.

One beautiful aspect of this emergent power of multiple neurons connected together is the universality property: with a single hidden layer, these networks can learn to approximate any function. That's an important property to be aware of, because the limits here are not in the power of the networks; the limits are in the methods by which we construct and train them.

What kinds of machine learning, of deep learning, are there? We can separate them into two categories: memorizers, the approaches that essentially memorize patterns in the data, and approaches that, we can loosely say, are beginning to reason, to generalize over the data with minimal human input. Shown in blue are the quote-unquote teachers: how much human input is needed to make each method successful. For supervised learning, which is where most of deep learning's successes come from, most of the data is annotated by human beings; the human is at the core of the success. Most of the data used for training needs to be annotated by humans, with some additional successes coming from augmentation methods that extend the data on which these networks are trained. Then there are the semi-supervised, reinforcement, and unsupervised learning methods that we'll talk about later in the course. That's where, we hope, the near-term successes are, and the unsupervised learning approaches are where the true excitement about the possibilities of artificial intelligence lies: being able to make sense of our world with minimal input from humans.

So we can think of two kinds of deep learning impact. One is special-purpose intelligence: taking a problem, formalizing it, collecting enough data on it, and being able to solve a particular case that provides value. Of particular interest here is a network that estimates apartment costs in the Boston area: you give it the number of bedrooms, the square footage, and the neighborhood, and it provides as output the estimated cost. On the right is actual data on apartment costs; we're standing in an area that runs over three thousand dollars for a studio apartment, and some of you may be feeling that pain.

And then there's general-purpose intelligence, or something that feels like it's approaching general-purpose intelligence, which is reinforcement and unsupervised learning. Here, with Andrej Karpathy's Pong from Pixels, is a system that takes in an 80-by-80-pixel image and, with no other information, is able to win at this game. No information except a sequence of images, raw sensory information, the same kind of low-level data human beings take in through visual, audio, and touch senses, and it's able to learn to win. It's a very simplistic, artificially constructed world, but nevertheless a world where no features are engineered by hand: only raw sensory information is used to win, with very sparse, minimal human input. We'll talk about that on Wednesday with deep reinforcement learning.
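Circling back to the special-purpose apartment example: concretely, such an estimator is just a small feed-forward regression network. A minimal sketch, entirely illustrative, with made-up feature encodings and layer sizes:

```python
import torch
import torch.nn as nn

# Hypothetical features: [num_bedrooms, square_feet, neighborhood_id].
# A real model would one-hot encode the neighborhood; this is just a sketch.
model = nn.Sequential(
    nn.Linear(3, 16),   # 3 input features -> 16 hidden units
    nn.ReLU(),
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 1),   # single output: estimated monthly cost
)

listing = torch.tensor([[1.0, 550.0, 7.0]])  # a made-up studio-sized listing
print(model(listing))  # meaningless until the network is trained on labeled data
```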
For now, though, we'll focus on supervised learning, where there is input data, there is a learning system we're trying to train, and there is a correct output labeled by human beings. That's the general training process for a neural network: input data, labels, and the training of that model, so that in the testing stage, given new input data it has never seen before, it's tasked with producing guesses and is evaluated on them. For autonomous vehicles, that means being released, either in simulation or in the real world, to operate.

How do neural networks learn? In the training stage there's a forward pass, taking the input data and producing a prediction; then, given that there's ground truth in the training stage, we have a measure of error based on a loss function, which punishes the synapses, the connections, the parameters that were involved in making that wrong prediction, and backpropagates the error through those weights. We'll discuss that in a little more detail in a bit.

So what can we do with deep learning? You can do one-to-one mapping, and the input can be anything: a number, a vector of numbers, a sequence of numbers, a sequence of vectors of numbers. Anything you can think of, from images to video to audio to text, can be represented this way, and the output likewise can be a single number, or images, video, text, audio. One-to-one mapping; one-to-many; many-to-one; many-to-many; and many-to-many with different starting points for the data, asynchronous.

Some quick terms that will come up. Deep learning is the same as neural networks, really deep, large neural networks; it's the subset of machine learning that has been extremely successful in the past decade. Multilayer perceptron, deep neural network, recurrent neural network, long short-term memory network (LSTM), convolutional neural network, and deep belief networks: all of these will come up in the slides. And there are specific operations, layers, within these networks: convolution, pooling, activation, and backpropagation, a concept we'll discuss in this class.

Activation functions: there are a lot of variants. In the left column is the activation function, with the input on the x-axis and the output on the y-axis. For the sigmoid function, in case the font is too small: the output is not centered at zero, and it suffers from vanishing gradients. The tanh function is centered at zero, but it still suffers from vanishing gradients. Vanishing gradients means that when the input is very low or very high, the derivative of the function, as you see in the right column, is very low, so the learning is very slow. ReLU is also not zero-centered, but it does not suffer from vanishing gradients.

Backpropagation is the process of learning. It's the way we compute the loss function, in the bottom right of the slide (taking the actual output of the network from a forward pass, subtracting the ground truth, squaring, dividing by two), and use that loss to construct a gradient, to backpropagate the error to the weights that were responsible for making either a correct or an incorrect decision. So the subtasks are: there's a forward pass; there's a backward pass; and a fraction of the gradient is subtracted from each weight. That's it. That process is modular, local to each individual neuron, which is why we're able to distribute it, to parallelize it across a GPU.
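To make that concrete, here is a minimal forward and backward pass for a single sigmoid neuron with the squared-error loss described above, L = (ŷ − y)²/2, written out by hand. This is my own sketch with made-up numbers; real frameworks compute these gradients automatically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])   # made-up inputs
y = 1.0                     # ground-truth label
w = np.array([0.3, 0.3])    # initial weights
b = 0.0                     # initial bias
lr = 0.5                    # learning rate: the fraction of the gradient we subtract

for step in range(100):
    # Forward pass: prediction and loss L = (y_hat - y)^2 / 2.
    y_hat = sigmoid(np.dot(w, x) + b)
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: chain rule, local to this neuron.
    dL_dz = (y_hat - y) * y_hat * (1.0 - y_hat)  # sigmoid derivative is small when
                                                 # saturated: vanishing gradients
    w -= lr * dL_dz * x   # subtract a fraction of the gradient from each weight
    b -= lr * dL_dz

print(f"final loss: {loss:.5f}")
```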
So, learning: these computational units are extremely simple, and they're extremely simple to correct when they make an error, when they're part of a larger network that makes an error. It all boils down to an optimization problem, where the objective, the utility function, is the loss function, and the goal is to minimize it. We have to update the parameters, the weights and the biases, to decrease that loss function, and that loss function is highly nonlinear.

Depending on the activation functions, different properties, different issues arise. There are vanishing gradients for sigmoid, where learning can be slow. There are dying ReLUs, where the derivative is exactly zero for inputs less than zero; there are solutions to this, like leaky ReLUs, and a bunch of details you may discover when you try to win the DeepTraffic competition. But for the most part these are the main activation functions, and it's the choice of the neural network designer which one works best. There are saddle points; all the problems from numerical nonlinear optimization arise here. It's hard to break symmetry, and stochastic gradient descent, without any tricks, can take a very long time to arrive at the minimum.

One of the biggest problems in all of machine learning, and certainly deep learning, is overfitting. You can think of the blue dots in the plot here as the data to which we want to fit a curve; we want to design a learning system that approximates the regression of this data. In green is a sine curve: simple, and it fits well. Then there's a ninth-degree polynomial, which fits even better in terms of training error, but it clearly overfits: on other data it has not yet seen, it's likely to produce a high error. It's overfitting the training set. This is a big problem for small datasets, and we have to fix it with regularization.

Regularization is a set of methodologies that prevent overfitting, that prevent learning the training data so well that we're unable to generalize to the testing stage. The main symptom of overfitting is that the error keeps decreasing on the training set but increases on the test set. There are a lot of techniques in traditional machine learning that deal with this, cross-validation and so on, but because of the cost of training neural networks, it's traditional to use what's called a validation set: you create a subset of the training data that you keep away, for which you have the ground truth, and use it as a representative of the test set. You perform early stopping, or more realistically just save a checkpoint often, to see how the performance on the validation set changes as training evolves, and you can stop when the performance on the validation set gets a lot worse: it means you're overtraining on the training set. In practice, of course, we run training much longer and see which snapshot, which checkpoint, of the network performs best.

Dropout is another very powerful regularization technique, where we randomly remove some of the nodes in the network, along with their incoming and outgoing edges. What that really looks like is a probability of keeping a node, and in many deep learning frameworks today it comes as a dropout layer. It's essentially a probability, usually greater than 0.5, that a node will be kept; for the input layer the keep probability should be much higher, or, more effectively, what works well there is just adding noise. What's the point? You want to create enough diversity in the training process that what is learned generalizes to the test set, as you'll see with the DeepTraffic competition.
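Here is a minimal sketch of how dropout appears in practice, with illustrative layer sizes. One caution: PyTorch's nn.Dropout takes the probability of dropping a unit, whereas the lecture describes the probability of keeping one, so a keep probability of 0.6 means p=0.4:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.4),  # each hidden unit is dropped with probability 0.4 (kept with 0.6)
    nn.Linear(64, 2),
)

x = torch.randn(8, 32)  # made-up batch

model.train()           # dropout is active during training...
print(model(x).shape)

model.eval()            # ...and disabled automatically at test time
print(model(x).shape)
```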
There are also the L2 and L1 penalties, weight decay, a weight penalty, where weights that grow too large are penalized. The L2 penalty keeps the weights small unless the error derivative is huge; it produces smoother models, and when there are two similar inputs, it prefers to distribute the weight, putting half on each, as opposed to putting all the weight on one of the edges, which makes the network more robust. The L1 penalty has the one benefit that really large weights are allowed to stay: it allows a few weights to remain very large. These are the regularization techniques, and I wanted to mention them because they're useful for some of the competitions in this course. I also recommend you go to the TensorFlow playground, playground.tensorflow.org, to play with some of these parameters: online, in the browser, you can play with different inputs, different features, different numbers of layers, and different regularization techniques, and build your intuition about classification and regression problems on different input datasets.
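Before moving on, a quick sketch of how the L2 penalty shows up in code. In most frameworks it's a single argument on the optimizer; here, PyTorch's weight_decay, which is roughly equivalent to adding a λ·Σw² term that shrinks the weights at every step (the 1e-4 is just an illustrative value):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# weight_decay applies an L2 penalty on the weights at every update step.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(16, 10), torch.randn(16, 1)  # made-up batch
loss = nn.MSELoss()(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()  # gradient step plus the L2 shrinkage on the weights
```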
So what changed? Why, after many decades, and two winters, are neural networks now again dominating the artificial intelligence community? CPUs, GPUs, ASICs: computational power has skyrocketed, from Moore's law to GPUs. There are huge datasets, including ImageNet and others. There is research: backpropagation in the '80s, convolutional neural networks, LSTMs; there have been a lot of interesting breakthroughs in how to design these architectures and build them so they're efficiently trainable on GPUs. There is the software infrastructure, from being able to share data and code on GitHub, to being able to implement neural networks as a stack of layers, as opposed to implementing everything from scratch, with TensorFlow, PyTorch, and other deep learning frameworks. And there's huge financial backing from Google, Facebook, and so on.

To understand why deep learning works so well and where its limitations are, we need to understand where our own intuition comes from about what is hard and what is easy. The important thing about computer vision, which is a lot of what this course is about, even in its deep reinforcement learning formulation, is that visual perception, for us human beings, formed 540 million years ago. That's 540 million years' worth of data. Abstract thought formed only about a hundred thousand years ago: several orders of magnitude less data. So predictions that seem trivial to us human beings can be completely challenging for, and gotten completely wrong by, neural networks. Here on the left is a prediction of a dog; with a little bit of distortion, of noise, added to the image, producing the image on the right, the network confidently, with 99-plus percent confidence, predicts that it's an ostrich.

And there are all these problems to deal with, whether in computer vision, text, or audio data; all of this variation arises in vision. Illumination variability: the set of pixels, the raw numbers, look completely different depending on the lighting conditions, and lighting variability is the biggest problem in driving. Pose variability: objects need to be learned from every different perspective. I'll discuss this when we get to sensing the driver: most of the deep learning work done on the face, on the human, is done on the frontal or semi-frontal face, and there's very little work on the full 360-degree pose variability a human being can take on. And intra-class variability: for the classification problem, for the detection problem, there are a lot of different kinds of cats, dogs, cars, bicyclists, pedestrians.

So that brings us to object classification, and I'd like to take you through where deep learning has taken big strides over the past several years, leading up to 2018. Object classification is when you take a single image and you have to say the one class that's most likely to be in that image. The most famous variant of that is the ImageNet challenge. ImageNet is a dataset of 14 million images with 21,000 categories. For, say, the category of fruit, there's a total of 188,000 images of fruit, and there are 1,200 images of Granny Smith apples; that gives you a sense of what we're talking about here. This has been the source of a lot of interesting breakthroughs in deep learning, and a lot of the excitement.

The first big successful network in deep learning, at least the one that became famous, was AlexNet, in 2012, which took a significant leap in performance on the ImageNet challenge. It was one of the first neural networks successfully trained on the GPU, and it achieved an incredible performance boost over the previous year. The challenge is: given a single image, you have five guesses, and one of them has to be correct. The question of human annotation often comes up: how do you know the ground truth? Human-level performance is 5.1 percent error on this task. The way the annotation for ImageNet is performed, there's a Google search where you pull images already labeled for you, and then the annotation that humans on Mechanical Turk perform is just binary: is this a cat or not a cat? They're not tasked with performing very high-resolution semantic labeling of the image.

So, from 2012 with AlexNet to today, and the big transition in 2018 of the ImageNet challenge leaving Stanford and going to Kaggle, which is sort of a monumental step: in 2015, with the ResNet network, was the first time that human-level performance was exceeded. I think this is a very important marker of where deep learning is, for what I would argue is a toy example, despite the fact that it's 14 million images. We're developing state-of-the-art techniques here, and the next stage, now that we're exceeding human-level performance on this task, is to take these methods into the real world, to perform scene perception, to perform driver state perception. In 2016 and 2017, CUImage and then SENet won; SENet has a unique new addition to the previous formulations and achieved an error of 2.25 percent on the ImageNet classification challenge. It's an incredible result.

Okay, so you have this image classification architecture that takes in a single image, passes it through convolution, pooling, convolution, and, at the end, fully connected layers, and performs a classification task or a regression task. You can swap out those final layers to perform other kinds of tasks, including image captioning with recurrent neural networks and so on, or localization of bounding boxes.
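Here is a sketch of what "five guesses" looks like in practice with a modern pretrained network. This is illustrative: it assumes torchvision's bundled ImageNet weights for ResNet-50 and a hypothetical local image file, dog.jpg:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing: resize, crop, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(pretrained=True)  # weights trained on ImageNet
model.eval()

image = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # hypothetical input image
with torch.no_grad():
    probs = torch.softmax(model(image), dim=1)

# The top-5 rule: the prediction counts as correct if any of these is the true label.
top5 = torch.topk(probs, k=5)
print(top5.indices)  # the five guessed class indices
print(top5.values)   # and their confidences
```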
Or you can do fully convolutional networks, which we'll talk about on Thursday: you take an image as input and produce an image as output, where the output image, in this case, is a segmentation, where color indicates the category of the object. It's pixel-level segmentation: every single pixel in the image is assigned a class, a category that the pixel belongs to. This is the kind of output that's overlaid on top of the other sensory information coming from the car in order to perceive the external environment.

You can continue to extract information from images this way, producing image-to-image mappings, for example to colorize images, going from grayscale to color. Or you can use that kind of heat-map information to localize objects in the image: as opposed to just classifying that this is an image of a cow, R-CNN, Fast R-CNN, Faster R-CNN, and a lot of other localization networks let you propose candidates for where exactly the cow is located in the image, and thereby perform object detection, not just object classification.

2017 brought a lot of cool applications of these architectures. One is background removal, again mapping from image to image: the ability to remove the background from selfies, from pictures of human or human-like faces; the reference, with some incredible animations, is at the bottom of the slide, and the slides are now available online. Pix2PixHD: there's been a lot of work on GANs, generative adversarial networks, and in driving in particular, GANs have been used to generate examples from source data. In the case of Pix2PixHD, that means taking coarse, pixel-level semantic labels and producing photorealistic, high-definition images of the forward roadway. This is an exciting possibility for self-driving cars: to generate a variety of cases to learn from, to augment the data, to change the way different roads look, the road conditions, the way vehicles, cyclists, and pedestrians look.

Then we can move on to recurrent neural networks. Everything I've talked about so far was one-to-one mapping, from image to image or image to number, but recurrent neural networks work with sequences. We can use sequences to generate handwriting; to generate text captions from an image, based on the localization of the various detections in that image; and to do video description generation: taking a video and combining convolutional neural networks with recurrent neural networks, using the convolutional network to extract features frame to frame, and feeding those extracted features into the RNN to generate a description of what's going on in the video. There are a lot of exciting approaches for autonomous systems, especially drones, where the time to make a decision is short; same with an RC car traveling 30 miles an hour. Attention mechanisms, for steering the attention of the network, have been very popular for localization tasks, and for limiting how much of the image, how many pixels, need to be considered in the classification task. We can model the way a human being looks around an image to interpret it and have the network do the same, and we can use that kind of steering to draw images as well.
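As a rough skeleton of that CNN-plus-RNN pattern for video, entirely illustrative: a real description system needs a trained vocabulary, a decoding loop, and usually attention. Here, a CNN backbone extracts per-frame features, and an LSTM consumes the resulting sequence:

```python
import torch
import torch.nn as nn
from torchvision import models

class VideoDescriber(nn.Module):
    """Sketch: a CNN extracts per-frame features; an LSTM models the sequence."""

    def __init__(self, vocab_size=1000, hidden=256):
        super().__init__()
        cnn = models.resnet18(pretrained=False)                # frame feature extractor
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])   # drop the classifier head
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.to_vocab = nn.Linear(hidden, vocab_size)          # word logits per time step

    def forward(self, frames):                  # frames: (batch, time, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))  # run the CNN on every frame
        feats = feats.view(b, t, 512)           # back to a sequence of feature vectors
        out, _ = self.rnn(feats)                # temporal modeling over the sequence
        return self.to_vocab(out)               # (batch, time, vocab) logits

video = torch.randn(2, 8, 3, 224, 224)          # made-up batch: 2 clips, 8 frames each
print(VideoDescriber()(video).shape)            # torch.Size([2, 8, 1000])
```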
Finally, the big breakthroughs came from approaches like Pong from Pixels: reinforcement learning on raw sensory data, the deep RL methods we'll talk about on Wednesday. I'm really excited that the underlying methodology of DeepTraffic and DeepCrash uses neural networks as the approximators inside reinforcement learning approaches.

AlphaGo, in 2016, achieved a monumental task, one that, when I first started in artificial intelligence, I was told was impossible for a system to accomplish: winning at the game of Go against the top human players in the world. However, that method was trained on human expert positions: the AlphaGo system was trained on previous games played by human experts. In an incredible accomplishment, AlphaGo Zero, in 2017, was able to beat AlphaGo and many of its variants by playing itself, starting from zero information: no knowledge of human experts, no games, no training data, very little human input. What's more, it was able to generate moves that were surprising to human experts. I think it was Einstein who said that the key mark of intelligence is imagination. I think it's beautiful to see an artificial intelligence system come up with something that truly surprises human experts.

For the gambling junkies: DeepStack and a few other variants were used in 2017 to win at heads-up poker, again an incredible result that I was always told would be impossible for any machine learning method to achieve. It was able to beat a professional player, and several competitors have come along since. We've yet to be able to win in a setting with multiple players. For those of you not familiar, heads-up poker is one-on-one, a much smaller, easier space to solve; there are a lot more human-to-human dynamics going on when there are multiple players. But that's the task for 2018.

And the drawbacks: this is one of my favorite videos (I show it often) of Coast Runners, for these deep reinforcement learning approaches. The definition of the reward function controls how the actual system behaves, and this will be extremely important for us with autonomous vehicles. Here, the boat is tasked with gaining the highest number of points, and it figures out that it does not need to race, which is the whole point of the game, in order to gain points; instead it picks up the green circles that regenerate over and over. This is counterintuitive behavior that you would not expect when you first design the reward function, and this is a very formal, simple system. It is nevertheless extremely difficult to come up with a reward function that makes a system operate the way you expect it to. Very applicable to autonomous vehicles.

And of course, on the perception side, as I mentioned with the ostrich and the dog: with a little bit of noise, state-of-the-art neural networks predict, with 99.6 percent confidence, that the noise shown up top is a robin, a cheetah, an armadillo, a lesser panda. These are outputs from actual state-of-the-art neural networks, taking in noise and producing a confident prediction. This should tune our intuition: the spatial, visual characteristics of an image do not necessarily convey to the network the level of hierarchy necessary to function in this world the way we do. In a similar way to the dog and the ostrich, a network, with a little bit of noise added, can confidently make the wrong prediction, thinking a school bus is an ostrich and a speaker is an ostrich.
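Such adversarial examples are easy to produce once you have gradient access to the network. Here is a minimal sketch of one standard technique, the fast gradient sign method (not necessarily the method behind the examples in the slides): it nudges each pixel a tiny amount in whatever direction increases the loss.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real photo
true_label = torch.tensor([207])                        # hypothetical correct class

# Forward pass and loss with respect to the true label.
loss = nn.CrossEntropyLoss()(model(image), true_label)
loss.backward()

# Fast gradient sign method: a small step in the direction that increases the loss.
epsilon = 0.007  # a perturbation barely visible to a human
adversarial = (image + epsilon * image.grad.sign()).clamp(0.0, 1.0)

# The perturbed image often receives a different, confidently wrong prediction.
print(model(image).argmax(dim=1), model(adversarial).argmax(dim=1))
```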
They're easily fooled; but then, not really, because they perform well the task that they were trained to do. We have to make sure we keep our intuition calibrated to the way machines learn, not the way humans have learned over the 540 million years of data we've gained through evolution.

The current challenges we're taking on. First, transfer learning: there's a lot of success in transfer learning between domains that are very close to each other, image classification from one domain to the next; there's a lot of value in forming representations of the way scenes look in order to then do scene segmentation, in the driving case for example. But we're not able to make any bigger leaps in transfer learning. The biggest challenge for deep learning is to generalize, to generalize across domains. It lacks the ability to reason in the way we defined understanding previously: the ability to turn complex information into simple, useful information, to handle domain-specific, complicated sensory information that doesn't relate to the initial training set. That's the open challenge for deep learning: train on very little data, and then go and reason and operate in the real world.

Right now, neural networks are very inefficient. They require big data. They require supervised data, which means the cost of human input; they're not fully automated. Despite the fact that the feature learning, the big breakthrough, is performed automatically, you still have to do a lot of design of the actual architecture of the network, and all the hyperparameter tuning needs to be performed with human input; perhaps a slightly more educated kind of human input, from former PhD students, postdocs, and faculty, is required to tune these hyperparameters, but human input is still necessary. These systems cannot be left alone, for the most part. Defining the reward, as we saw with Coast Runners, is extremely difficult for systems that operate in the real world. Transparency: neural networks are currently black boxes, for the most part. Except through a few successful visualization methods that visualize different aspects of the activations, they're not able to reveal to us humans why they work or where they fail. There's a philosophical question here for autonomous vehicles: perhaps we as human beings won't care, if the system works well enough. But I would argue that it will be a long time before systems work well enough that we don't care; we will care, and we'll have to work together with these systems, and that's where transparency, communication, and collaboration are critical. And edge cases: it's all about edge cases in robotics, in autonomous vehicles. 99.9 percent of driving is really boring; it's the same thing, especially highway driving, traffic driving. The obstacle avoidance, the car following, the lane centering: all these problems are trivial. It's the edge cases, the trillions of edge cases, that need to be generalized over on a very small amount of training data.

So again, I return to why deep learning. I mentioned a bunch of challenges, and this is an opportunity: an opportunity to come up with techniques that operate successfully in this world. I hope the competitions we present in this class, in the autonomous vehicle domain, will give you some insight and an opportunity to apply these methods to cases that are open research problems: semantic segmentation for external perception, control of the vehicle in DeepTraffic, control of the vehicle in underactuated, high-speed conditions in DeepCrash, and driver state perception.
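On the transfer learning point above, where current successes come from nearby domains, the standard concrete recipe is to reuse a pretrained backbone and retrain only a new final layer. A minimal sketch, illustrative only, assuming a new two-class target task:

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on ImageNet classification.
model = models.resnet18(pretrained=True)

# Freeze the learned representation...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the final layer for a new, nearby task (here: 2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head is trained; the transferred features stay fixed.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

x = torch.randn(4, 3, 224, 224)   # made-up batch of images
y = torch.tensor([0, 1, 1, 0])    # made-up labels
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optimizer.step()
```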
So with that, I wanted to introduce deep learning to you today, before we get to the fun of autonomous vehicles tomorrow. We'd like to thank NVIDIA, Google, Autoliv, Toyota, and, at the risk of setting off people's phones, Amazon Alexa Auto. But truly, I would like to say that I've been humbled over the past year by the thousands of messages we've received, by the attention, by the 18,000 competition entries, by the many brilliant people across the world, not just here at MIT, that I got a chance to interact with. I hope we go bigger and do some impressive stuff in 2018. Thank you very much, and tomorrow is self-driving cars. [Applause]