MIT 6.S094: Deep Learning for Human Sensing
Today we will talk about how to apply the methods of deep learning to sensing and understanding the human being. The focus will be on computer vision: the visual aspects of a human being.
of course we humans express ourselves
visually but also through audio voice
and through text beautiful poetry and
novels and so on we're not going to
touch those today we're just going to
focus on computer vision how we can use
computer vision to extract useful
actionable information from images and video of human beings, in particular in
the context of the car so what are the
requirements for successfully applying
deep learning methods in the real world
so when we're talking about human
sensing we're not talking about a basic
face recognition of celebrity images
we're talking about using computer
vision deep learning methods to create
systems that operate in the real world
and in order for them to operate in the real world there are several requirements. Some sound simple; some are much harder than they sound. First, and the most important (these are ordered roughly from most to least critical), is data. Data is everything: real-world data. We need a lot
of real world data to form the data set
on which these supervised learning
methods can be trained I'll say this
over and over throughout the day today
data is everything that means data
collection is the hardest part and the
most important part we'll talk about how
that data collection is carried out here
in our group at MIT all the different
ways to capture human beings in the
driving context in the road user context
pedestrians, cyclists. But it starts and ends with data: the fun stuff is the algorithms, but the data is what makes it all work. Real-world data. Okay, then once you have the data... okay, data isn't everything, I lied, because you have to actually annotate it. So what do we mean by data? There's raw data: video, audio, lidar, all the types of sensors we'll talk about to capture real-world road user interaction. You have to reduce that into meaningful, representative cases of what happens in the real world. In driving, 99% of the time driving looks the same; it's the 1%, the interesting cases, that we're interested in, and what we want is to train learning algorithms on that 1%. So we have to collect 100 percent, we have to collect all the data, and then figure out automated and semi-automated ways to find the pieces of that data that can be used to train neural networks and that are representative of the general kinds of things that happen in this world. Efficient annotation: annotation
isn't just about drawing bounding boxes
on images of cats annotation tooling is
key to unlocking real world performance
systems that successfully solve some
problem accomplish some goal in real
world data that means designing
annotation tools for a particular task
Annotation tools that are used for glance classification, for determining where drivers are looking, are very different from annotation tools used for body pose estimation, which are very different from the tooling that we use for SegFuse (the competition for this class, where we're investing thousands of dollars) to annotate full scene segmentation, where every pixel is colored. There needs to be tooling for each one of those elements, and they're key. That's an HCI question, that's a design question; there's no deep learning, there's no robotics in that question. It's: how do we leverage human computation, the human brain, to more effectively label images such that we can train neural networks on them?
Hardware: in order to train these networks, in order to parse the data we collect (and we'll talk about this, we now have over five billion images of driving data), you can't do it on a single machine. You have to do large-scale distributed compute and large-scale distributed storage. And finally, the stuff that's the most exciting, that this class and many classes and much of the literature are focused on, is the algorithms: the deep learning algorithms, the machine learning algorithms, the algorithms that learn from data. Of course that's really exciting and important, but what we find time and time again in real-world systems is that, as long as these algorithms learn from data, the data is what's much more important. Of course it's nice for the algorithms to be calibration-free, meaning they self-calibrate; we don't need to have the sensors in the exact same position every time. That's a very nice feature: the robustness of the system is then generalizable across multiple vehicles and multiple scenarios. And one of the key things that comes up time and time again, and we'll mention today, is that a lot of the algorithms developed in deep learning for computer vision are focused on single images. Now, the real world happens in both space and time, and we have to have algorithms that capture the visual characteristics but also look at the sequence of images, the sequence of those visual characteristics that form the temporal dynamics, the physics of this world. So it's nice when those algorithms are able to capture the physics of the scene.
The big takeaway, if you leave with anything today, unfortunately, is that the painful, boring stuff of collecting data, of cleaning that data, of annotating that data in order to create successful systems, is much more important than good algorithms or great algorithms. It's important to have good algorithms, as long as you have neural networks that learn from that data. Okay, so today I'd like to talk about human imperfections; about the various detection problems (pedestrian detection, body pose, glance, emotion, cognitive load estimation) that we can use to help those humans as they operate in the driving context; and finally to continue with the idea that fully autonomous vehicles, as some of our guest speakers have spoken about and Sterling Anderson will speak about tomorrow, are really far away, and that humans will be an integral part of operating and cooperating with these AI systems. I will continue on that line of thought to try to motivate why we need to approach the autonomous vehicle, the self-driving car paradigm, in a human-centered way.
First, before we talk about human imperfections, let's just pause and acknowledge that humans are amazing. We're actually really good at a lot of things. It's sometimes sort of fun to talk about how terrible we are as drivers, how distracted we are, how irrational we are, but we're actually really damn good at driving. Here's a video of the soccer player Messi, the best soccer player in the world, obviously, and a state-of-the-art robot on the right attempting the same thing. Well, it's not really playing, but I assure you the American Ninja Warrior athlete Kacy is far superior to the DARPA humanoid robotics systems shown on the right. Okay, so continuing the line of thought that humans are amazing: there was a record high in 2016 in the United States; after many years it crossed the forty thousand fatalities mark, more than forty thousand people died in car crashes in the United States. But that's over three point two trillion miles traveled, so that's one fatality per eighty million miles, a one in 625 chance of dying in a car crash in your lifetime. Interesting side fact for anyone in the United States: folks who live in Massachusetts are the least likely to die in a car crash, Montana the most likely, so for everyone who thinks Boston drivers are terrible, maybe that adds some perspective. Here's a visualization of Waze data across a period of a day, showing you the lifeblood of the city, the traffic flow of the city, people getting from A to B at a mass scale, and surviving doing it. Okay, humans are amazing, but they're
also flawed. Texting and other sources of distraction with a smartphone; eating; the secondary tasks of talking to other passengers, grooming, reading, using the navigation system, yes, sometimes watching video, and adjusting the radio. 3,000 people were killed and over 400,000 were injured in motor vehicle crashes involving distraction in 2014. Distraction is a very serious issue for safety. Texting: every day more and more people text, smartphones are proliferating through our society; 170 billion text messages are sent in the United States every month. That's in 2014; you can only imagine what it is today. Eyes off road for five seconds: that is the average time your eyes are off the road while texting. Five seconds. If you're traveling 55 miles an hour, in those five seconds that's enough time to cover the length of a football field. So you're blindfolded, you're not looking at the road, and in five seconds, the average time of texting, you're covering an entire football field. So many things can happen in that moment of time.
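As a quick back-of-the-envelope check of that claim (a sketch in Python, using the numbers from the lecture):

```python
# Distance covered during a 5-second glance away from the road at 55 mph.
speed_mph = 55
glance_s = 5

feet_per_mile = 5280
speed_fps = speed_mph * feet_per_mile / 3600   # ~80.7 ft/s
distance_ft = speed_fps * glance_s             # ~403 ft

football_field_ft = 360  # end zone to end zone
print(f"{distance_ft:.0f} ft covered, ~{distance_ft / football_field_ft:.1f} football fields")
```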
That's distraction. Drunk driving: 31% of traffic fatalities involve a drunk driver. Drugged driving: 23% of nighttime drivers tested positive for illegal, prescription, or over-the-counter drugs. Distracted driving, as I said, is a huge safety risk. Drowsy driving, people driving tired: nearly three percent of all traffic fatalities involve a drowsy driver. If you are uncomfortable with videos that involve risk, I urge you to look away. These are videos collected by AAA from a very large-scale naturalistic driving data set of teenagers, capturing clips of teenagers being distracted on their smartphones.
[Music]
Once you take in the problem we're up against, in the context of human imperfections, we have to ask ourselves about the human-centered approach to autonomous systems, autonomous vehicles that are using artificial intelligence to aid the driving task: do we want to go, as I mentioned a couple of lectures ago, the human-centered way or the full autonomy way? The tempting path is towards full
autonomy, where we remove this imperfect, flawed human from the picture altogether and focus on the robotics problem of perception and control and planning and driving policy. Or do we work together, human and machine, to improve safety, to alleviate distraction, to bring driver attention back to the road, and use artificial intelligence to increase safety through collaboration, through human-robot interaction, versus removing the human completely from the picture? As I've mentioned, as Sterling will certainly talk about tomorrow, and rightfully so, and as Emilio talked about on Tuesday, the full-autonomy, L4 way is grounded in literature, it's grounded in common sense in some sense: you can count on the natural flaws of human beings, to over-trust, to misbehave, to be irrational about their risk estimates, to result in improper use of the technology, and
the public perception of what drivers do
and semi autonomous vehicles they begin
to over trust the moment the system
works well they begin to over trust they
begin to do stuff they're not supposed
to be doing in the car taking it for
granted a recent video that somebody
posted this is a common sort of more
practical concern that people have is
while the traditional ways to ensure the
physical engagement of the driver is by
saying they should touch the wheel the
the steering wheel every once in a while
and of course there's ways to buy
the need to touch the steering wheel
some people hang objects like I can off
of the steering wheel in this case
brilliantly I have to say they shove an
orange into the into the wheel to make
the touch sensor fire and therefore be
able to take their hands off the
autopilot and that that kind of idea
makes us believe that there's no way
that you know humans will find a way to
misuse this technology however I believe
that that's not giving the technology
enough credit artificial intelligence
systems if are they're able to perceive
the human being are also able to work
with the human being and that's what I'd
like to talk about today teaching cars
to perceive the human being and it all
starts with data it's all about data as
I mentioned, data is everything in these real-world systems. With the MIT naturalistic driving data set of 25 vehicles, of which 21 are equipped with Tesla Autopilot, we instrument the cars; this is what we do for data collection. Two cameras on the driver: one camera on the face, capturing high-definition video of the face, which is where we get the glance classification, the emotion recognition, the cognitive load, everything coming from the face; and another camera, a fisheye, looking at the body of the driver, from which comes the body pose estimation, hands-on-wheel, activity recognition. And then one video camera looking out, for the full scene segmentation, for all the scene perception tasks. Everything is recorded and synchronized together, with GPS, with audio, with all the CAN data coming from the car, on a single device. Synchronization of this data is critical. So that's one trip in the data, and there are thousands like it, traveling hundreds of miles, sometimes hundreds of miles under automated control in Autopilot. That's the data.
Again, as I said, data is everything, and from this data we can both gain an understanding of what people do, which is really important for understanding how successful autonomy can be deployed in the real world, and design algorithms, for training the deep neural networks to perform the perception tasks better. Twenty-five vehicles, 21 of them Teslas: Model S, Model X, and now Model 3. Over a thousand miles collected a day, every single day; we have thousands of miles in the Boston, Massachusetts area, driving around, all of that video being recorded, now over five billion video frames. There are several ways to look at autonomy. One of the big ones is safety; that's what everybody talks about, how do we make these things safe. But the other one is enjoyment: do people actually want to use it? We can create a perfectly safe system, we can create it right now, we've had it forever, before we even had cars: a car that never moves is a perfectly safe system. Well, not perfectly, but almost. But it doesn't provide a service that's valuable, it doesn't provide an enjoyable driving experience. So okay, what about slow-moving vehicles? That's an open question. The reality is, with these Tesla vehicles and L2 systems doing automated driving, people are driving 33% of their miles using Tesla Autopilot. What does that mean? That means people are getting value from it; a large fraction of their driving is done in an automated way.
That's value, that's enjoyment. The glance classification algorithm we'll talk about today is one example of what we use to understand what's in this data, shown with the bar graphs there in red and blue: red is during manual driving, blue is during Autopilot driving. We look at glance classification regions, where drivers are looking on-road and off-road, and whether that distribution changes between automated and manual driving. With these glance classification methods we can determine that there's not much difference, at least until you dig into the details, which we haven't done; in the aggregate there's not a significant difference. That means people are getting value, enjoying using these technologies, but they're staying, if not attentive, at least physically engaged. When your eyes are on the road you might not be attentive, but your body is positioned in such a way, your head looking at the forward roadway, that you're at the very least physically in position to be alert and to take in the forward roadway. So they're using it, and they don't over-trust it, and that I think is the sweet spot that human-robot interaction needs to achieve: the human gaining, through experience, through exploration, through trial and error, an understanding of the limitations of the system, to a degree that over-trust doesn't occur. That seems to be happening in this system, and using the computer vision methods I'll talk about, we can continue to explore how that can be achieved in other systems, when the fraction of automated driving increases from 30% to 40% to 50% and so on.
It's all about the data, and I'll harp on this again. The algorithms are interesting; I will mention them, of course. It's the same convolutional neural networks, the same networks that take in raw pixels and extract features of interest. It's 3D convolutional neural networks that take in sequences of images and extract the temporal dynamics along with the visual characteristics of the individual images. It's RNNs and LSTMs that use convolutional neural networks to extract features and look at the dynamics in the images over time. These are pretty basic architectures, the same kinds of deep neural network architectures, but they rely fundamentally and deeply on the data, on real-world data. So let's start where, on the human sensing side, it perhaps all began, which is pedestrian detection, decades ago. To put it in context, pedestrian detection is shown here from left to right: on the left, in green, are the easier human sensing tasks, tasks of sensing some aspect of a human being. Pedestrian detection, which is detecting the full body of a human being in an image or video, is one of the easier computer vision tasks. On the right, in red, microsaccades, the tremors of the eye, measuring pupil diameter, measuring cognitive load, the fine blink dynamics of the eye, the velocity of the blink, micro-glances, and eye pose are much harder problems.
So if you think about body pose estimation, pedestrian detection, face classification, detection, recognition, head pose estimation: all those are easier tasks. Anything that starts getting smaller, looking at the eye, anything that starts getting fine-grained, is much more difficult. So we start at the easiest, pedestrian detection, and it has the usual challenges of all of computer vision that we've talked about: the various styles of appearance, the intra-class variation; the different possible articulations of our bodies, superseded only perhaps by cats, but we humans are pretty flexible as well; and the presence of occlusion, from the accessories that we wear, to self-occlusion, to occluding each other. Crowded scenes have a lot of humans in them and they occlude each other, and therefore being able to disambiguate, to figure out each individual pedestrian, is a very
challenging problem. So how do people approach this problem? Well, you need to extract features from raw pixels, whether that was Haar cascades, HOG, or CNNs through the decades. The sliding window approach was used because pedestrians can be small in an image or big, so there's the problem of scale: you use a sliding window to detect where the pedestrian is. You have a classifier that's given a single image, such as this, and you take that classifier and slide it across the image to find where all the pedestrians in the scene are. You can use non-neural-network methods or convolutional neural networks for that classifier; either way it's extremely inefficient. Then along came R-CNN, Fast R-CNN, Faster R-CNN. These are networks that, as opposed to doing a complete sliding window approach, are much more intelligent, clever, about generating the candidates to consider. As opposed to considering every possible position and scale of a window, they generate a small subset of candidates that are more likely, and finally use a CNN to classify, for those candidates, whether there's a pedestrian or not, whether there's an object of interest or not, a face or not, and use non-maximum suppression, because there are overlapping bounding boxes, to figure out the most likely bounding box around this pedestrian, around this object. That's R-CNN, and there are a lot of variants now. Mask R-CNN is really the state-of-the-art localization network; on top of the bounding box it also performs segmentation. There's VoxelNet, which does three-dimensional localization in lidar data, in point clouds, so it's not just in images but in 3D, but it's all kind of grounded in the R-CNN framework.
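As a side note, the non-maximum suppression step mentioned above is easy to sketch; here is a minimal, generic greedy version in Python with NumPy (not the exact implementation inside any particular detector):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS over [x1, y1, x2, y2] boxes: keep the highest-scoring box,
    drop remaining boxes that overlap it too much, and repeat."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the winning box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Keep only boxes whose overlap with the winner is below the threshold.
        order = order[1:][iou < iou_threshold]
    return keep
```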
Okay, data. So we have large-scale data collection going on here in Cambridge; if you've seen cameras and lidar at various intersections around MIT, we're part of that. For example, here's one of the intersections we're collecting at about 10 hours a day, instrumenting it with the various sensors I'll mention; we see about 12,000 pedestrians a day cross that particular intersection. Using 4K cameras, stereo vision cameras, 360 cameras (now the Insta360, which is an 8K 360 camera), GoPros, and lidar of various sizes, the 64-channel and the 16-channel, and recording. This is where the data comes from: this is from the 360 video, this is from the lidar data of the same intersection, this is from the 4K camcorders pointing at a different intersection, capturing the entire 360 view with the vehicles approaching and the pedestrians making crossing decisions. This is about understanding the negotiation, the nonverbal negotiation, that pedestrians perform in choosing to cross or not, especially when they're jaywalking, and everybody jaywalks; if you're familiar with this particular intersection, there are more jaywalkers than non-jaywalkers. It's a fascinating one. And so we record everything about the driver and everything about the pedestrians. Again, this is where R-CNN comes in: you do bounding box detection of the pedestrians, and of the vehicles as well, which allows you to convert this raw data into hours of pedestrian crossing decisions and begin to interpret them.
That's pedestrian detection, bounding boxes. Body pose estimation is the more difficult task. Body pose estimation is finding the joints: the hands, the elbows, the shoulders, the hips, knees, feet; the landmark points in the image, the x, y positions that mark those joints. That's body pose estimation. So why is that important in driving? For example, it's important to determine the vertical position, the alignment, of the driver. Seatbelt and airbag testing is always performed with a dummy in the frontal position, the standard dummy position. With greater degrees of automation comes more capability and flexibility for the driver to get misaligned from that standard dummy position, and so body pose, or at least upper body pose, estimation allows you to determine how often drivers get out of line from the standard position, the general movement, and then you can look at hands-on-wheel, smartphone detection, activity recognition, and help add context to glance estimation, which we'll talk about.
Some of the more traditional methods were sequential: detecting first the head, and then stepping down, detecting the shoulders, the elbows, the hands, the feet. The holistic view, which has been a very powerful, successful way to do multi-person pose estimation, performs a regression that detects body parts from the entire image. It's not sequentially stitching bodies together; it's detecting the left elbow, the right elbow, the hands individually, performing that detection and then stitching everything together afterwards, allowing you to deal with the crazy deformations of the body that happen, the occlusions and so on, because you don't need all the joints to be visible. And there's a cascade of pose regressors, meaning convolutional neural networks that take a raw image and produce an x, y position as their estimate of each individual joint: the input is an image, the output is an estimate of a joint, elbow, shoulder, whatever, one of several landmarks. Then you can build on top of that, where each subsequent estimation zooms in on that particular area and performs a finer and finer grained estimation of the exact position of the joint, repeating it over and over and over. So through this process we can do part detection in a multi-person scene, a scene that contains multiple people. We can detect the head, the neck, the hands, the elbows, shown in the various images on the right, without an understanding of who the head, the elbows, the hands belong to. It's just performing a detection without trying to do individual person detection first. Then the next step is connecting, with part affinity fields, those parts together: so first you detect individual parts, then you connect them together, and then through bipartite matching you determine which individual each body part most likely belongs to. So you kind of stitch the different people together in the scene after the detection is performed with the CNN.
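The "stitching" step can be illustrated with a tiny bipartite-matching sketch; this is just the assignment idea with made-up affinity numbers, not the actual part-affinity-field computation (which scores each candidate connection by integrating the predicted affinity field along the limb):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical example: affinity scores between candidate neck detections and
# candidate left-shoulder detections in one frame (values are illustrative).
affinity = np.array([
    [0.90, 0.10, 0.05],   # neck 0 vs shoulders 0..2
    [0.20, 0.75, 0.15],   # neck 1 vs shoulders 0..2
])

# Hungarian algorithm maximizes total affinity (minimize negative affinity).
neck_idx, shoulder_idx = linear_sum_assignment(-affinity)
for n, s in zip(neck_idx, shoulder_idx):
    if affinity[n, s] > 0.3:   # discard weak, likely spurious connections
        print(f"neck {n} <-> left shoulder {s} (affinity {affinity[n, s]:.2f})")
```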
We use this approach for detecting the upper body specifically: the shoulders, the neck, and the head, eyes, nose, ears. That is used to determine the position of the driver relative to the standard dummy position. For example, looking at 30-minute periods of Autopilot driving, we can put time on the x-axis and, on the y-axis, the position of the neck point I pointed out in the previous slide, the midpoint between the two shoulders, over time relative to where it began. This is the slouching, the sinking into the seat. Allowing the car to know that information, and allowing us, or the designers of safety systems, to know it, is really important.
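For intuition, the slouching measure just described might be computed along these lines (a hypothetical sketch with synthetic numbers; the real analysis runs on the per-frame pose estimates):

```python
import numpy as np

# Track vertical drift of the neck keypoint relative to where it started,
# as a simple slouching signal. `neck_y` stands in for the per-frame pixel
# y-coordinate of the neck point from the body-pose estimator.
fps = 30
neck_y = np.random.normal(loc=240, scale=2, size=30 * 60 * fps)  # fake 30 min
neck_y[30_000:] += 15   # pretend the driver sinks into the seat halfway through

drift = neck_y - neck_y[:10 * fps].mean()                 # vs. first 10 seconds
smoothed = np.convolve(drift, np.ones(fps) / fps, mode="same")  # 1-second average
slouching = smoothed > 10                                  # pixel threshold (made up)
print(f"slouched for {slouching.mean() * 100:.1f}% of the session")
```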
We can use the same body pose algorithm from the perspective of the vehicle looking out: as opposed to just plain pedestrian detection, using body pose estimation. Again, here in Kendall Square, vehicles crossing, observing pedestrians making crossing decisions, and performing body pose estimation, which allows you to generate visualizations like this and gain understanding like this. On the x-axis is time; on the top plot, in blue, is the speed of the ego vehicle, the vehicle from which the camera is observing the scene; and on the bottom, in green, going up and down, is a binary value: zero when the pedestrian is not looking at the car, one when the pedestrian is looking at the car. So we can look at thousands of episodes like this, crossing decisions, nonverbal communication decisions, and determine, using body pose estimation, the dynamics of this nonverbal negotiation. Here, just nearby, by the Media Lab crossing, a pedestrian approaches; we can see in green as the pedestrian glances at the car, looks away, glances at the car, looks away. Fascinating glance behavior happens; interestingly, most people look away before they cross. Same thing here. This is just an example, and we have thousands of these. Body pose estimation allows you to get this fine-grained information about pedestrian glance behavior, pedestrian body behavior, hesitation.
Glance classification. One of the most important things in driving is determining where drivers are looking. If there's any sensing that I advocate for, that has the most impact in the driving context, it is for the car to know where the driver is looking, even at the very crude region level of: is the driver looking on-road or off-road? That's what we mean by glance classification. It's not the standard gaze estimation problem of determining, in x, y, z, where the eye pose and the head pose combine to point; no, this is classifying two regions, on-road and off-road, or six regions: on-road, off-road left, off-road right, center stack, rearview mirror, and instrument cluster. So it's region-based glance classification, not the geometric gaze estimation problem. Why is that important? It allows you to address it as a machine learning problem. It's a subtle but critical point: every problem we try to solve in human sensing, in driver sensing, has to be learnable from data; otherwise it's not amenable to application in the real world. We can't design systems in the lab that are deployed without learning if they involve a human. It's possible to do SLAM, localization, by having really good sensors and doing localization using those sensors without much learning. It's not possible to design systems that deal with lighting variability and the full variability of human behavior without being able to learn. Gaze estimation, the geometric approach, finds the landmarks in the face and from those landmarks determines the geometry, the orientation of the head and the orientation of the eyes; there's no learning there, outside of actually training the systems to detect the different landmarks. If we convert this into a glance classification problem, shown here, glance classification is taking the raw video stream and determining, in post (so humans are annotating this video), which region the driver is looking at. That we're able to do by converting the problem into a simple variant of classification: on-road, off-road, left, right. The same can be done for pedestrians: left, forward, right. We can annotate regions of where they are looking and, using that kind of classification approach, determine: are they looking at the cars or not, are they looking away, are they looking at their smartphone, without doing the 3D gaze estimation. Again, it's a subtle point, but think about it: if you wanted to estimate exactly where they're looking, you need that ground truth, and you don't have that ground truth. In real-world data there's no way to get the information about exactly where people were looking; you're only inferring. So you have to convert it into a region-based classification problem in order to be able to train neural networks on it.
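To make the "just a classification problem" framing concrete, here is a minimal sketch in PyTorch of a six-region glance classifier over raw face crops; the architecture and input size are placeholders, not the network actually used in the study:

```python
import torch
import torch.nn as nn

GLANCE_REGIONS = ["road", "left", "right", "center_stack",
                  "instrument_cluster", "rearview_mirror"]

class GlanceNet(nn.Module):
    """Raw face crop in, one of six glance regions out."""
    def __init__(self, num_classes=len(GLANCE_REGIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):                  # x: (batch, 3, H, W) face crops
        h = self.features(x).flatten(1)
        return self.classifier(h)          # logits over the six regions

model = GlanceNet()
logits = model(torch.randn(1, 3, 128, 128))
print(GLANCE_REGIONS[logits.argmax(dim=1).item()])
```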
The pipeline is the same. The source video here is the face, the 30-frames-a-second video coming in of the driver's face, of the human face. There is some degree of calibration required: you have to determine approximately where the sensor is that's taking in the image, especially for the glance classification task, because it's region-based; you need to be able to estimate where the forward roadway is, where the camera frame is relative to the world frame. Then video stabilization and face frontalization, all the basic processing that removes the vibration, the noise, that removes the physical movement of the head, that removes the shaking of the car, in order to be able to determine things about eye movement and blink dynamics. And finally, with the neural networks, there is nothing left except taking in the raw video of the face for the glance classification task, and of the eye for the cognitive load task. Raw pixels: that's the input to these networks, and the output is whatever the training data is, and we'll mention each one. So whether that's cognitive load, glance, emotion, drowsiness, the input is the raw pixels and the output is whatever you have data for. Data is everything here. The face alignment problem, which is the traditional geometric approach, is designing algorithms that can accurately detect the individual landmarks in the face and from those estimate the geometry of the head pose. For the classification version, we perform the same kind of face detection and alignment to determine where the head is, but once we have that, we pass in just the raw pixels and perform classification on them, as opposed to doing the estimation. It's classification, allowing you to perform what's shown there on the bottom: real-time classification of where the driver is looking: road, left, right, center stack, instrument cluster, and rearview mirror. As I mentioned, annotation
tooling is key. So we have a total of 5 billion video frames, one and a half billion of the face; that would take tens of millions of dollars to annotate fully just for glance classification. So we have to figure out what to annotate in order to train neural networks to perform this task, and what we annotate is the things the network is not confident about: the moments of high lighting variation, the partial occlusions from the light, the self-occlusions, the moving out of frame, the out-of-frame occlusions, all the difficult cases, going from frame to frame to frame here. In this pipeline, going from the top to the bottom, whenever the classification has low confidence we pass it to the human. It's simple: we rely on the human only when the classifier is not confident, and the fundamental trade-off in all of these systems is what accuracy we're willing to put up with. Here, in red and blue: red is a human choice or decision, blue is a machine task. In red we select the video we want to classify; in blue the neural network performs the face detection task, localizes the camera, choosing what the angle of the camera is, and provides a trade-off between accuracy and the percentage of frames it can annotate. So certainly, if a neural network annotated glance for the entire data set, it would achieve an accuracy, in the case of glance classification, in the low 90% range on the six-class task. Now, if you want a higher accuracy than that, it will only be able to achieve it for a smaller fraction of frames; that's the choice. And then a human has to go in and perform the annotation of the frames that the algorithm was not confident about, and it repeats over and over: the algorithm is then trained on the frames that were annotated by the human, and this process repeats on the remaining frames until everything is annotated.
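Schematically, the loop just described looks something like this (a sketch, not the actual MIT pipeline; the model, training routine, and human labeling step are passed in as stand-ins rather than any real API):

```python
def annotate_dataset(frames, model, train, human_label,
                     confidence_threshold=0.95, batch_size=1000):
    """Classifier annotates what it is confident about; humans annotate a
    batch of the rest; the model is retrained on the human labels; repeat."""
    machine_labeled, human_labeled = [], []
    remaining = list(frames)
    while remaining:
        uncertain = []
        for frame in remaining:
            label, confidence = model.predict(frame)
            if confidence >= confidence_threshold:
                machine_labeled.append((frame, label))    # trusted machine label
            else:
                uncertain.append(frame)                    # needs a human
        if not uncertain:
            break
        batch, remaining = uncertain[:batch_size], uncertain[batch_size:]
        human_labeled.extend((f, human_label(f)) for f in batch)
        model = train(model, human_labeled)                # retrain, then repeat
    return machine_labeled + human_labeled, model
```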
Yes, yes, absolutely. The question was: do you ever observe that the classifier is highly confident about the incorrect class? Yep. And the follow-up question was: well, then how do you deal with that, how do you account for the fact that highly confident predictions can be highly wrong? Yeah, false positives, false positives that you're really confident in. At least in our experience, there's no good answer for that except more and more training data on the things you're not confident about; that usually seems to generalize over the cases we don't encounter. For obvious large categories of data where you're really confident about the wrong thing, usually some degree of human annotation fixes most problems. Annotating the low-confidence part of the data solves most of the incorrect cases, but of course that's not always true in the general case; you can imagine a lot of scenarios where that's not true. For example, one thing we always do is, for each individual person, annotate a large amount of the data manually no matter what, so we make sure the neural network has seen that person in the various ways their face looks: with glasses, with different hair, with different lighting variation. So we manually annotate that, and over time we allow the machine to do more and more of the work.
So what results, in the glance classification case, is that you can do real-time classification, you can give the car information about whether the driver is looking on-road or off-road. This is critical information for the car to understand, and I want to pause for a second to realize that when you're driving a car, for those of you who have driven any kind of car with any kind of automation, it has no idea about what you're up to at all. It doesn't have any information about the driver except whether they're touching the steering wheel or not. More and more now, with the GM Super Cruise vehicle, and with Tesla having added a driver-facing camera, manufacturers have slowly started to think about moving towards perceiving the driver, but most vehicles on the road today have no knowledge of the driver. This knowledge is almost common sense and trivial for the car to have; it's common sense how important this information is, where the driver is looking. That's the glance classification problem, and again, emphasizing that we've converted it: there have been three decades of work on gaze estimation, and gaze estimation is doing head pose estimation, the geometric orientation of the head, combined with the orientation of the eyes, and using that combined information to determine where the person is looking. We convert that into a classification problem. The standard gaze estimation definition is not a machine learning problem; classification is a machine learning problem. This transformation is key.
Emotion. Human emotion is a fascinating thing. So, the same kind of pipeline: stabilization, cleaning of the data, raw pixels in, and then the classification is emotion. The problem with emotion, if I may speak as an expert human (not an expert in emotion, just an expert at being human), is that there are a lot of ways to taxonomize emotion, to categorize emotion, to define emotion, whether that's the primary emotions of the Parrott scale, love, joy, surprise, anger, sadness, fear; there are a lot of ways to mix those together, to break those apart into hierarchical taxonomies. The way we think about it in the driving context, at least, is that there is a general emotion recognition task, sort of how we think about primary emotions, detecting the broad categories of emotion, of joy and anger, of disgust and surprise; and then there is application-specific emotion recognition, where you're using the facial expressions, all the various ways we can deform our face to communicate information, to answer a specific question about the interaction of the driver. So first, for the general case, these are the building blocks. There are countless ways of deforming the face that we use to communicate with each other; there are 42 individual facial muscles that can be used to form those expressions. One of our favorite tools to work with is the Affectiva SDK. Their task, the general emotion recognition task, is taking in raw pixels and determining categories of emotion, the very subtleties of that emotion, in the general case producing a classification of anger, disgust, fear, surprise, and so on. Essentially what these algorithms are doing, whether they're using deep neural networks or not, whether they're using face alignment to do the landmark detection and then tracking those landmarks over time to detect the facial actions, is mapping the expressions, the various component expressions we can make with our eyebrows or our nose and mouth and eyes, to the emotion. I'd like to highlight one, because I think it's an illustrative one. For joy, an expression of joy is smiling, so there's an increased likelihood that you observe a smiling expression on the face when joy is experienced, or vice versa: if there's an increased probability of a smile, there's an increased probability that the emotion of joy is being experienced. And when joy is experienced there's a decreased likelihood of brow raising and brow furrowing. So if you see a smile, that's a plus for joy; if you see a brow raise or a brow furrow, that's a minus for joy. That's the general emotion recognition task; that's been well studied, that's sort of the core of the affective computing movement, at least from the visual, computer vision perspective. For the application-specific perspective, which we're really focused on, again data is everything: what are you annotating?
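As a toy illustration of that evidence weighting (illustrative weights only, not Affectiva's actual model):

```python
# A smile is positive evidence for joy; brow raise / brow furrow are negative.
def joy_score(expressions):
    weights = {"smile": +1.0, "brow_raise": -0.5, "brow_furrow": -0.5}
    return sum(weights.get(name, 0.0) * p for name, p in expressions.items())

# Expression probabilities for one frame (made-up numbers).
frame = {"smile": 0.8, "brow_furrow": 0.1, "brow_raise": 0.0}
print(f"joy evidence: {joy_score(frame):+.2f}")
```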
Here we have a large-scale data set of drivers interacting with a voice-based navigation system. They're tasked, in various vehicles, with entering a navigation destination, so they're talking to their GPS using their voice. Depending on the vehicle, depending on the system, in most cases this is an incredibly frustrating experience. So we have them perform this task, and then the annotation is self-report: after the task they say, on a scale of 1 to 10, how frustrating the experience was. What you see on top are the expressions detected for a satisfied person, someone who reported a 10 on satisfaction, so a 1 on the frustration scale, perfectly satisfied with the voice-based interaction; on the bottom, a frustrated person, reporting a 9 on the frustration scale. And the strongest feature, the strongest expression (remember, joy, smile): smile was the strongest indicator of frustration for all our subjects. That was the strongest expression; smile was the thing that was always there for frustration. There were other things, various frowning that followed, and shaking the head and so on, but smiles were there. So that shows you the kind of clean difference between the general emotion recognition task and the application-specific one. Here, perhaps they enjoyed an absurd moment of joy at the frustration they were experiencing; you can get philosophical about it, but the practical reality is that they were frustrated with the experience and were using the 42 muscles of the face to make expressions, and we do classification of frustrated or not. The data does the work, not the algorithms; it's the annotation. A quick
mention for the AGI class next week, the artificial general intelligence class: one of the competitions we're doing is, we have a JavaScript face that's trained with a neural network to form various expressions to communicate with the observer. So we're interested in creating emotion, which is a nice mirror coupling of the emotion recognition problem. It's going to be super cool. Cognitive load: we're starting to get to the eyes. Cognitive load is the degree to which a human being is accessing their memory, or, loosely put, how hard they're working in their mind to recollect something, to think about something. That's cognitive load. And to do a quick pause: the eyes are the window to cognitive load, the eyes are the window to the mind. There are
different ways the eyes move. There are the pupils, the black part of the eye; they can expand and contract based on various factors, including the lighting variations in the scene, but they also expand and contract based on cognitive load. That's a strong signal. The eyes can also move around: there are ballistic movements, saccades, when we look around and the eyes jump around the scene; and they can also do something called smooth pursuit, where, connecting to our animal past, if you see a delicious meal flying by or running by, your eyes can follow it perfectly, not jumping around. So when we read a book our eyes are using saccadic movements, where they jump around, and with smooth pursuit the eye is moving perfectly smoothly. Those are the kinds of movements we have to work with, and cognitive load can be detected by looking at various factors of the eye: the blink dynamics, the eye movement, and the pupil diameter. The problem is, in the real world, with real-world data and lighting variations, everything goes out the window in terms of using pupil diameter, which is the standard non-contact way to measure cognitive load in the lab, where you can control lighting conditions and use infrared cameras. When you can't, all that goes out the window and all you have is the blink dynamics and the eye movement.
So, neural networks to the rescue: 3D convolutional neural networks. In this case we take sequences of images of the eye through time and use 3D convolutions as opposed to 2D convolutions. On the left is everything we've talked about previous to this, 2D convolutions, where the convolution filter operates on the x, y 2D image and every channel is operated on by the filter individually, separately. 3D convolutions convolve across multiple images, across multiple channels, and are therefore able to learn the dynamics of the scene through time as well, not just spatially.
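A minimal sketch of that difference in PyTorch (the shapes here are illustrative, roughly matching the six seconds of eye video at 15 fps mentioned below, not the exact network used):

```python
import torch
import torch.nn as nn

# (batch, channels, frames, H, W): ~6 s of grayscale eye crops at 15 fps.
clip = torch.randn(1, 1, 90, 64, 64)

# A 2D convolution sees one image at a time; a 3D convolution also slides over
# the time dimension, so its output mixes information across frames.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(3, 3, 3), padding=1)
out = conv3d(clip)
print(out.shape)   # torch.Size([1, 8, 90, 64, 64])
```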
And data, data is everything. For cognitive load we have, in this case, 92 drivers. So how do we perform the cognitive load classification task? We have these drivers driving on the highway and performing what's called the n-back task: 0-back, 1-back, 2-back. That task involves hearing numbers being read to you and then recalling those numbers one at a time. So for 0-back, the system gives you a number, seven, and you just have to say that number back, seven, and it keeps repeating; that's easy, it's supposed to be the easy task. 1-back is when you hear a number, you have to remember it, and for the next number you have to say the number previous to it, so you kind of have to keep one number in your memory at all times and not get distracted by the new information coming in. For 2-back you have to do that two numbers back, so you have to use memory more and more with 2-back; cognitive load is higher and higher.
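The n-back protocol itself is simple enough to sketch (a toy generator, just to make the 0-back / 2-back difference concrete):

```python
import random

def n_back_targets(digits, n):
    """For an auditory n-back run: the correct response at step i is the digit
    heard n steps earlier (None while fewer than n digits have played)."""
    return [digits[i - n] if i >= n else None for i in range(len(digits))]

digits = [random.randint(0, 9) for _ in range(8)]
print("heard:  ", digits)
print("0-back: ", n_back_targets(digits, 0))   # just repeat what you heard
print("2-back: ", n_back_targets(digits, 2))   # recall the digit two back
```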
Okay, so what do we do? We use face alignment, face frontalization, detect the eye closest to the camera, and extract the eye region, and now we have these nice raw pixels of the eye region across six seconds of video. We take that, feed it into a 3D convolutional neural network, and classify, simply, one of three classes: 0-back, 1-back, 2-back. So we have a ton of data of people on the highway performing these n-back tasks, and that forms the supervised learning training data for the classification. That's it: the input is 90 images, at 15 frames a second, and the output is one of three classes. Face frontalization, I should mention, is a technique developed for face recognition, because most face recognition tasks require frontal face orientation; it's also what we use here to normalize everything so that we can focus in on the exact blink. It takes whatever the orientation of the face is and projects it into the frontal position, taking the raw pixels of the face, detecting the eye region, zooming in, and grabbing the eye.
And this is where the intuition builds; it's a fascinating one. What's being plotted here is the relative movement of the pupil, the relative movement of the eye, for different cognitive loads: a cognitive load of zero on the left, when your mind is not that lost in thought, and a cognitive load of two on the right, when it is lost in thought. The eye moves a lot less; the eye is more focused on the forward roadway. That's an interesting finding, but it's only in aggregate, and that's what the neural network is tasked to do: extract it on a frame-by-frame basis. This is a standard 3D convolutional architecture, again taking in the image sequence as the input, with cognitive load classification as the output, and on the right is the accuracy it's able to achieve, 86%. That's pretty cool, from real-world data. The idea is that you can just plop in a webcam, get the video going into the neural network, and it predicts a continuous stream, from zero to two, of cognitive load, because each of the 0-back, 1-back, 2-back classes has a confidence associated with it, so you can turn that into a real value between zero and two.
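One simple way to collapse those three class confidences into the continuous zero-to-two value (an assumption about how it might be done, not necessarily the exact method used) is an expected value:

```python
import numpy as np

def continuous_load(class_probs):
    """Collapse softmax confidences over {0-back, 1-back, 2-back} into a
    single value between 0 and 2 via the expected class index."""
    p = np.asarray(class_probs, dtype=float)
    p = p / p.sum()
    return float(np.dot(p, [0, 1, 2]))

print(continuous_load([0.1, 0.3, 0.6]))   # 1.5 -> leaning toward high load
```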
What you see here is a plot of three of the people on the team driving a car and performing a conversation task, with white showing the cognitive load frame by frame, at thirty frames a second, estimating the cognitive load of each of the drivers from zero to two on the y-axis. These are high cognitive load, and shown on the bottom in red and yellow are high and medium cognitive load; when everybody's silent, the cognitive load goes down. So we can now, with this simple neural network and the training data that we formed, extend that to any arbitrary new data set and generalize. Okay, those are some examples of how neural networks can be applied, and why is this important? Again, while we focus on the perception tasks of using neural networks, of using sensors and signal processing, to determine where we are in the world, where the different obstacles are, and to form trajectories around those obstacles, we are still far away from completely solving that problem, I would argue 20-plus years away. The human will have to be involved, and so when the system is not able to control, when the system is not able to perceive, when there's some flawed aspect of the perception or the driving policy, the human has to be involved, and that's where we have to let the car know what the human is doing. That's the essential element of human-robot interaction. The most popular car in the
United States today is the Ford F-150: no automation. The thing that sort of inspires us and makes us think that transportation can be fundamentally transformed is the Google self-driving car, Waymo. And although our guest speakers and all these folks work on fully autonomous vehicles, if you look at it, the only ones who, at a mass scale, are actually injecting, or beginning to inject, automation into our daily lives are the ones in between: the Teslas, the L2 systems, the Tesla Autopilot system, Super Cruise, the S90s, the vehicles that are slowly adding some degree of automation and teaching human beings how to interact with that automation. And here, again, the path towards mass-scale automation, where steering wheels are removed from consideration, where humans are removed, I believe is more than two decades away. On the path to that we have to understand and create successful human-robot interaction, to approach autonomous vehicles, autonomous systems, in a human-centered way. The mass-scale integration of these human-centered systems, like the Tesla vehicles (Tesla is just a small company right now; these kinds of L2 technologies have not truly penetrated the market, have not penetrated our vehicles, even the brand-new vehicles being released today), I believe happens in the early 2020s, and that's going to form the core of the algorithms that will eventually lead to full autonomy, all of that.