All right, so the human side of AI: how do we turn the camera back in on the human? We've been talking about perception, how to detect cats and dogs, pedestrians, lanes, how to steer a vehicle based on the external environment. The thing that's really fascinating and severely understudied is the human side. You talk about the Tesla: we have cameras in 17 Teslas driving around Cambridge, because Tesla is one of the only vehicles allowing you to experience, in a real way on the road, the interaction between the human and the machine. And the thing that we don't have, that deep learning needs on the human side of semi-autonomous and fully autonomous vehicles, is video of drivers. That's what we're collecting, and that's what my work is in: looking at billions of video frames of human beings driving 60 miles an hour plus on the highway in their semi-autonomous Teslas. What are the things that we want to know about the human? If we were a deep learning therapist and we tried to break apart the different things we can detect from this raw set of pixels, we can look here, from green to red, at the different computer vision detection problems. Green means it's less challenging: it's feasible even under poor lighting conditions, variable pose, a noisy environment, poor resolution. Red means it's really hard no matter what you do. That starts on the left with face detection and body pose, some of the best studied and easier computer vision problems, for which we have huge datasets, and ends with microsaccades, the slight tremors of the eye that happen at a rate of a thousand times a second. All right, first: why do we even care about the human in the car? One is trust. To build trust, the car needs to have some awareness of the biological thing it's carrying inside, the human inside.
You kind of assume the car knows about you, because you're sitting there controlling it. But if you think about it, almost every single car on the road today has no sensors with which it's perceiving you. Some cars have a pressure sensor on the steering wheel, and some kind of sensor detecting that you're sitting in the seat. That's the only thing it knows about you. That's it. So how is this same car that's driving 70 miles an hour on the highway autonomously supposed to build trust with you if it doesn't perceive you? That's one of the critical things here. If there's something I'm constantly advocating, it's that we should have a driver-facing camera in every car, despite the privacy concerns. You have a camera on your phone and you don't have as much of a privacy concern there; and despite the privacy concerns, the safety benefits are huge and the trust benefits are huge. So let's start with the easy one: detecting body pose. Why do we care? There's seat belt design. There are crash test dummies, which are used to design the passive safety systems in our cars, and they make certain assumptions about body shapes: male, female, child body shapes. But they also make assumptions about the position of your body in the seat. They have the optimal position, the position they assume you take. The reality is, in a Tesla, when the car is driving itself, the variability goes up. If you remember the deformable cat, you start doing a little bit more of that: you start to reach back into the back seat, into your purse or your bag for your cell phone, these kinds of things. And that's when the crashes happen. We need to know how often that happens; the car needs to know that you're in that position, and that's critical for that very serious moment when the actual crash happens. How do you do this? This is a deep learning class, right? So: deep learning to the rescue.
Whenever you have these kinds of tasks of detecting, for example, body pose, you're detecting points: points on the shoulders, points on the head, five or ten points along the arms, the skeleton. How do you do that? You have a CNN, a convolutional neural network, that takes this input image and gives as output, it's a regressor, an x, y position of whatever you're looking for: the left shoulder, the right shoulder. Then you have a cascade of regressors that give you all these points, the shoulders, the arms, and so on. And then, through time, on every single frame you make that prediction and you optimize. You can make certain assumptions about physics: your arm can't be in one place in one frame and be over here in the next frame; it moves smoothly through space. Under those constraints you can minimize the temporal error from frame to frame. Or you can just dump all the frames in together, as if they were different channels. Like RGB is three channels, you can think of channels in time: you dump all those frames together into what are called 3D convolutional neural networks, and then you estimate the body pose in all the frames at once. There are some datasets for sports, and we're building our own. I don't know who that guy is. Let's fly through this a little bit. Next is what's called gaze classification. Gaze is another word for glance, right? It's a classification problem. Here's one of the TAs for this class, again not here because he's married and had to be home; I know where his priorities are at. This is on camera; he should be here. This is what we're recording in the Tesla, this is a Tesla vehicle. In the bottom right there's a blue icon that lights up, automatically detected, if it's operating under Autopilot; that means the car is currently driving itself. There are five cameras: one on the forward roadway, one on the instrument cluster, one on
the center stack, one on the steering wheel, and one on his face. Then it's a classification problem: you dump the raw pixels into a convolutional neural network with six classes, predicting where the person is looking: forward roadway, left, right, center stack, instrument cluster, rearview mirror. You give it millions of frames for every class. Simple. And it does incredibly well at predicting where the driver is looking. The process is the same for the majority of the driver state problems that have to do with the face, and the face has so much information: where you're looking, emotion, drowsiness, different degrees of frustration. I'll fly through those as well, but the process is the same. There's some pre-processing, because this is in-the-wild data: there's a lot of crazy light going on, there's noise, there's vibration from the vehicle. So first you have to do video stabilization; you have to remove all that vibration and noise as best as you can. There are a lot of algorithms for that, non-neural-network algorithms, boring, but they work, for removing the noise and the effects of sudden light variations and the vibrations of the vehicle. There's automated calibration: you have to estimate the frame of the camera, the position of the camera, and estimate the identity of the person you're looking at. The more you can specialize the network to the identity of the person, and the identity of the car the person is riding in, the better the performance on the different driver state classifications. So you personalize the network: you have a background model that works on everyone, and then you specialize each individual network to that one individual. This is transfer learning. Then there is face frontalization, a fancy name for the fact that no matter where they're looking, you want to transform that face so the eyes and the nose are in the exact same position in the image.
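The classification step at the end of that glance pipeline comes down to a six-way softmax over whatever scores the network produces. A minimal sketch in plain Python, not the actual system: the class names follow the list above, the network itself is omitted, and the logits are stand-ins.

```python
import math

# Six glance regions, as listed in the lecture.
GLANCE_CLASSES = ["forward", "left", "right", "center_stack",
                  "instrument_cluster", "rearview_mirror"]

def softmax(logits):
    """Numerically stable softmax over the six glance-region scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_glance(logits):
    """Map six raw scores (e.g. from a CNN) to a glance label and probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return GLANCE_CLASSES[best], probs[best]
```

With hypothetical logits favoring the first class, `classify_glance([4.0, 1.0, 0.5, 0.2, 0.1, 0.0])` returns the label `"forward"` with its probability.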
That way, if you want to look at the eyes and study the subtle movement of the eyes, the subtle blinking, the dynamics of the eyelid, the velocity of the eyelid, it's always in the same place, so you can really focus in and remove the effects of any other motion of the head. And then, this is the beauty of deep learning, right? There is some pre-processing, because this is real-world data, but you just dump the raw pixels in and predict whatever you need. What do you need? One is emotion. We had a study where people used a crappy and a good voice-based navigation system. The crappy one got them really frustrated, and they self-reported whether it was a frustrating experience or not on a scale of 1 to 10, so that gives us ground truth. We had a bunch of people use this system and rate themselves as frustrated or not, and then we can train a neural network to predict: is this person frustrated or not? I think we've seen a video of that. It turns out smiling is a strong indication of frustration. You can also predict drowsiness in this way, gaze estimation in this way, and cognitive load, which I'll briefly look at. The process is all the same: you detect the face, you find the landmark points on the face for the face alignment and face frontalization, and then you dump the raw pixels in for classification, step five. You can use SVMs there, or you can use what everyone uses now, convolutional neural networks. The one part where CNNs still struggle to compete is the alignment problem. This is where I talked about the cascaded regressors: finding the landmarks on the eyebrows, the nose, the jawline, the mouth. There are certain constraints there, and algorithms that can utilize those constraints effectively can often perform better than end-to-end regressors that don't have any concept of what a face is shaped like. And there are huge datasets, and we're part of the awesome community that's building those datasets for
face alignment. Okay, so this is again the TA, in his younger form. This is a live, in-the-car, real-time system predicting where they're looking. This is taking slow steps toward the exciting direction that machine learning is headed, which is unsupervised learning. The less you have to have humans look through the data and annotate it, the more power these machine learning algorithms get, right? Currently, supervised learning is what's needed: you need human beings to label a cat and label a dog. But what if a human being only has to label 1%, or one-tenth of a percent, of a dataset, only the hard cases? The machine can come to the human and be like: I don't know what I'm looking at in these pictures. Because of partial occlusions, whether it's your own arm or light conditions, we're not good at dealing with occlusions. We're not good with crazy light drowning out the image; this is what the Google self-driving car actually struggled with when they were trying to use their vision sensors. Moving out of frame, all kinds of occlusions, these are really hard for computer vision algorithms, and in those cases we want the machine to step in and pass the image on to the human, like: help me out with this. And the other part is the corner cases. In driving, for example, 90-plus percent of the time all you're doing is staring forward at the roadway in the same way. That's where the machine shines, that's where automated annotation shines, because it has seen that face for hundreds of millions of frames already in that exact position; it can do all the hard work of annotation for you. It's in the transitions away from those positions that it needs a little bit of help, just to make sure that this person really did start looking away from the road to the rearview mirror, and you bring those frames up. So, using optical flow, putting the optical flow into the convolutional neural network, you use that to predict when something has changed.
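The machine-versus-human split described above comes down to a confidence threshold: frames the model is sure about get auto-labeled, and the hard cases go to a person. A toy sketch, with a hypothetical threshold value rather than whatever the real pipeline uses:

```python
def route_frames(confidences, threshold=0.95):
    """Split frames between machine auto-annotation and human review.

    confidences: per-frame confidence (max class probability) from the model.
    threshold: hypothetical cutoff; confident frames are auto-labeled, while
    the hard cases (occlusion, glare, glance transitions) go to a human.
    Returns (machine_frame_indices, human_frame_indices).
    """
    machine, human = [], []
    for i, c in enumerate(confidences):
        (machine if c >= threshold else human).append(i)
    return machine, human
```

For example, `route_frames([0.99, 0.5, 0.97, 0.2])` sends frames 0 and 2 to the machine and frames 1 and 3 to the human, so the human only ever sees the small uncertain fraction.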
When something has changed, you bring that frame up for annotation. All of this is to build a giant, billions-of-frames annotated dataset of ground truth on which to train your driver state algorithms. And in this way you can control the tradeoff: on the x-axis is the fraction of frames that a human has to annotate, 0% on the left, 10% on the right, and then there's the accuracy tradeoff. The more the human annotates, the higher the accuracy, approaching 100%. But you can still do pretty well: for this gaze classification task, with an 84-fold, almost two orders of magnitude, reduction in human annotation. This is the future of machine learning, and hopefully one day, no human annotation. The result is millions of images like this, video frames, same thing. Driver frustration: this is what I was talking about. The frustrated driver is the one on the bottom, so a lot of movement of the eyebrows and a lot of smiling, and that's true subject after subject. And the satisfied driver, we don't say happy, the satisfied driver is cold and stoic, and that's true subject after subject, because driving is a boring experience and you want it to stay that way. Yes, question? Great question, absolutely. So these are cars owned by MIT; there is somebody in the back. [Audience] But then my emotions, whether I'm happy, might have nothing to do with my driving experience. So the comment was: my emotions might have nothing to do with the driving experience. Yes, and let me continue that comment: with your emotions, you're often an actor on a stage for others. When you're alone you might not express emotion; you're often really expressing emotion for others. Your frustration, the "oh, what the heck," that's for the passenger. And that's absolutely right. So, one of the cool things we're doing: as I said, we now have over a billion video frames from the Tesla, and we're
collecting huge amounts of data in the Tesla. And emotion is a complex thing, right? In this case, we knew the ground truth of how frustrated they were. In naturalistic data, when it's just people driving around, we don't know how they're really feeling at the moment; we're not asking them to enter into an app how they're feeling right now. But we do know certain things. We know that people sing a lot. That has to be a paper at some point; it's awesome, people love singing. That doesn't happen in this kind of data, because there's somebody sitting in the car, and I think the expression of frustration is affected the same way. Yes? So the comment is that the solo dataset is probably going to be very different from a non-solo dataset, with a passenger, and that's very true. The tricky thing about driving, and this is why it's a huge challenge for self-driving cars, for the external-facing sensors and for the internal-facing sensors analyzing human behavior, is that 99.9% of driving is the same thing. It's really boring. So finding the interesting bits is actually pretty complicated. That has to do with emotion: singing is easy to find, because we can track the mouth pretty well, so whenever you're talking or singing we can find that, but finding subtle expressions of emotion is hard when you're solo. And cognitive load, that's a fascinating thing. It's similar to emotion but a little more concrete, in the sense that there's a lot of good science on ways to measure cognitive load, cognitive workload, how occupied your mind is; mental workload is another term used. The window to the cognitive workload soul is the eyes. So first of all, the eyes move in two major ways. Well, they move in a lot of ways, but two major ways. One is saccades: these are ballistic movements, they jump around whenever you look around the room.
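Ballistic jumps like this are usually picked out of gaze recordings by a simple velocity rule: anything moving faster than some cutoff is a saccade, anything slower is fixation or a smooth movement. A toy sketch, where the 30 deg/s cutoff and the 500 Hz sampling rate are made-up illustrative values, not the lecture's pipeline:

```python
def classify_eye_movement(angles, dt=0.002, threshold=30.0):
    """Label each inter-sample interval as 'saccade' or 'slow'.

    angles: gaze angle in degrees, one value per sample.
    dt: sample period in seconds (0.002 s = 500 Hz, hypothetical).
    threshold: angular velocity in deg/s above which we call it a
    saccade (hypothetical value for illustration).
    """
    labels = []
    for a, b in zip(angles, angles[1:]):
        velocity = abs(b - a) / dt  # deg/s between consecutive samples
        labels.append("saccade" if velocity > threshold else "slow")
    return labels
```

A 5-degree jump between two samples at 500 Hz is an angular velocity of 2500 deg/s, so it gets labeled a saccade, while tiny drifts stay below the cutoff.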
When you read, your eyes are jumping around. And if you just follow this bottle with your eyes, your eyes are actually going to move smoothly: smooth pursuit. Somebody actually just told me today that this probably has to do with our hunting background as animals; I don't know how that helps, frogs track flies really well, so, I don't know, anyway. The point is there are smooth pursuit movements, where the eyes move smoothly, and those are all indications of certain aspects of cognitive load. And then there are very subtle movements which are almost imperceptible for computer vision, and these are microsaccades, tremors of the eye. Here is work from Bill Freeman magnifying those subtle movements; these are taken at 500 frames a second. And for cognitive load: when the pupil, that black dot in the middle of the eye, just in case we don't know what a pupil is, gets larger, that's an indicator of high cognitive load. But it also gets larger when the light is dim, so there's this complex interplay. We can't rely, in the wild, outside in the car or just in general outdoors, on using pupil size, even though pupil size has been used effectively in the lab to measure cognitive load. It can't be reliably used in the car. And the same with blinks: when there's higher cognitive load, your blink rate decreases and your blink duration shortens. Okay, I think I'm just repeating the same thing over and over, but you can imagine how we can predict cognitive load, right? We extract video of the eye. Here is the primary eye of the person the system is observing; it happens to be the same TA once again. We take a sequence of 90 images, so that's 6 seconds at 15 frames a second, and we dump that into a 3D convolutional network. That means it's 90 grayscale channels, not 90 separate frames.
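Treating the 90 frames as one (time, height, width) volume means a 3D convolution slides a small spatiotemporal kernel over all three axes at once. A minimal NumPy sketch of the "valid" case; deep-learning frameworks do the same thing, just batched and with learned kernels, and the frame sizes here are toy stand-ins:

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Minimal 'valid' 3D convolution (really cross-correlation, as in
    deep-learning frameworks) over a (time, height, width) frame stack."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# Stand-in for 6 s of grayscale eye video at 15 fps: one (90, H, W) volume.
frames = np.random.rand(90, 8, 8)
# A hand-made kernel that responds to change over time (a temporal edge).
temporal_edge = np.zeros((2, 3, 3))
temporal_edge[0] = 1.0
temporal_edge[1] = -1.0
features = conv3d_valid(frames, temporal_edge)  # shape (89, 6, 6)
```

On a static scene (identical frames) this particular kernel outputs all zeros, which is exactly the "nothing changed" signal; motion of the eyelid or iris lights it up.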
And then the prediction is one of three classes of cognitive load: low cognitive load, medium cognitive load, and high cognitive load. There's ground truth for that, because we had over 500 different people do different tasks of various cognitive load. After some frontalization, again, where no matter where the person is looking, the image of the face is transposed in such a way that the corners of the eyes always remain in the same position, we find the eye: active appearance models find 39 points on the eyelids and the iris, and four points on the pupil. We put all of that into a 3D CNN model: the registered eye-image sequence on the left, the 3D CNN model in the middle, the cognitive load prediction on the right. This code, by the way, is freely available online. All you have to do is dump in a webcam video stream; the CNN runs faster than real time and predicts cognitive load. It's the same process as detecting the identity of the face, the same process as detecting where the driver is looking, the same process as detecting emotion, and all of those require very little hyperparameter tuning of the convolutional neural networks. They only require huge amounts of data. And why do we care about detecting what the driver is doing? I think Eric has mentioned this. Oh man, this is the comeback of the slide; let's criticize it for being a very cheesy slide. On the path toward full automation, we're likely to take gradual steps. Okay, enough of that; this is better. And especially given today, with our new president: this is pickup truck country, this is manually controlled vehicle country, for quite a while. We like control, and control being given to somebody else, to the machine, will be a gradual process, a gradual process of that machine earning trust. And through that process, the machine, like the Tesla, like the BMW, the Mercedes, the Volvo, that's now playing
with these ideas, is going to need to see what the human is doing. And to see what the human is doing: we have billions of miles of forward-facing data; what we need is billions of miles of driver-facing data as well. We're in the process of collecting that, and this is a pitch for automakers and everybody to buy cars that have a driver-facing camera. Let me sort of close. I said we need a lot of data, but I think through this class, and through your own research, you'll find that we're in the very early stages of discovering the power of deep learning. For example, as Yann LeCun said recently, it seems that the deeper the network, the better the results, in a lot of really important cases, even though the data is not increasing. So why does a deeper network give better results? This is a mysterious thing we don't understand. There are these hundreds of millions of parameters, and from them is emerging some kind of structure, some kind of representation of the knowledge that we're giving it. One of my favorite examples of this emergent concept is Conway's Game of Life. Those of you who know what this is will probably criticize me for it being as cheesy as the stairway slide, but I think it's actually such a simple and brilliant example of how, like a neuron in a neural network, a really simple computational unit can produce incredible power when you just combine a lot of them in a network. This is called a cellular automaton. Every single cell is operating under a simple rule. You can think of it as a cell living and dying: it's filled in black when it's alive and white when it's dead. If it's alive and has two or three live neighbors, it survives to the next time step; otherwise it dies. And if it's dead and has exactly three live neighbors, it comes back to life. That's a simple rule.
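That rule fits in a few lines of code. A sketch of one update step, using a sparse set of live (row, column) cells and the standard survive-on-2-or-3, born-on-3 rule the lecture describes:

```python
from collections import Counter

def life_step(alive):
    """One update of Conway's Game of Life on a set of live (row, col) cells.

    A live cell with two or three live neighbors survives; a dead cell
    with exactly three live neighbors comes to life; everything else dies.
    """
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in alive
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    return {cell for cell, n in neighbor_counts.items()
            if n == 3 or (n == 2 and cell in alive)}
```

Iterating `life_step` on a small seed is enough to watch the patterns the slide shows: a horizontal "blinker" of three cells flips to vertical and back, and larger seeds produce the complex behavior under discussion.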
Whatever cell you look at, all it's doing is operating under this very local rule, same as a neuron, or the way we're currently training neural networks: we're optimizing over a local gradient, the same kind of local rules. And what happens if you run this system operating under these really local rules is what you get on the right. Again, you have to go home, hopefully no drugs involved, and open up your mind and see how amazing that is. Because what happens is that a local computational unit that knows very little about the world somehow produces really complex emergent patterns, and we don't understand why. In fact, under different rules, incredible patterns emerge, and it feels like living creatures communicating when you just watch it. Not these examples, this is the original, the ones that get complex and interesting; but even in these examples, the complex geometric patterns that emerge are incredible, and we don't understand why. Same with neural networks: we don't understand why, and we need to, in order to see how these networks will be able to reason. Okay, so what's next? I encourage you to read the Deep Learning book; it's available online at deeplearningbook.org. There's a ton of amazing papers coming out every day on arXiv. I'll put these links up, but there are a lot of good collections of strong papers, lists of papers. There is the literally awesome list, the Awesome Deep Learning Papers list on GitHub; it calls itself awesome, and it happens to be awesome. And there are a lot of blogs that are just amazing; that's how I recommend you learn machine learning, on blogs. And if you're interested in the application of deep learning in the automotive space, you can come do research in our group; just email me. Anyway, we have three winners: Jeffrey Hugh, Michael Gump, and... are you here? Yes? Hey, how do you say your
name? No, that's not your name? All right. Oh, I see, you're here. So he achieved a stunning speed. This was kind of incredible: I didn't know what kind of speed we were going to be able to achieve. I thought 73 was unbeatable, because we played with it for a while and couldn't reach 73. We designed a deterministic algorithm, meaning it's cheating, and that cheating algorithm got 74, I believe. And folks have come up with algorithms that have beaten 73 and even that 74. So this is really incredible. And the other two guys, all three of you, get a free term of the Udacity self-driving car engineer nanodegree. Thanks to those guys for giving that award and bringing their army of brilliant people; they have people who are obsessed with self-driving cars, and we've received over 2,000 submissions for this competition, a lot of them from those guys, and they're just brilliant. It's really exciting to have such a big community of deep learning folks working in this field. So this, for the rest of eternity, well, we're going to change this up a little bit, but this is actually the three winning neural networks running side by side. You can see the number of cars passed there: first place is on the left, then second place and third place. And in fact, third place is almost winning... wait, no, second place is winning currently. That just tells you the random nature of competition: sometimes you win, sometimes you lose. The actual evaluation process runs through a lot of iterations and takes the median evaluation. With that, let me thank... well, wait, wait, there was a question about the winning networks. Yeah, so all three winners wrote me a note about how their networks work. I did not read those notes, which tells you how crazy this has been. I'll
post their winning networks online, and I encourage you to continue competing and continue submitting networks. This will run for a while, and we're working on a journal paper for this game; we're trying to find the optimal solutions. Okay, so this is the first time I've ever taught a class, and obviously the first time teaching this class, so thank you so much for being a part of it. Thank you. Thank you to Eric. If you didn't get a shirt, please come down and get a shirt; just write your email on the index note. Thank you.