Transcript
Z2GfE8pLyxc • MIT 6.S094: Deep Learning for Human Sensing
Today we will talk about how to apply the methods of deep learning to understanding, to sensing, the human being. The focus will be on computer vision, the visual aspects of a human being. Of course we humans express ourselves visually, but also through audio, voice, and through text, beautiful poetry and novels and so on. We're not going to touch those today; we're just going to focus on computer vision: how we can use computer vision to extract useful, actionable information from images and video of human beings, in particular in the context of the car.

So what are the requirements for successfully applying deep learning methods in the real world? When we're talking about human sensing, we're not talking about basic face recognition of celebrity images; we're talking about using computer vision, deep learning methods, to create systems that operate in the real world. And for them to operate in the real world there are several things. They sound simple; some are much harder than they sound. Ordered roughly from most to less critical: first, and most important, is data. Data is everything: real-world data. We need a lot of real-world data to form the data set on which these supervised learning methods can be trained. I'll say this over and over throughout the day today: data is everything. That means data collection is the hardest part and the most important part. We'll talk about how that data collection is carried out here in our group at MIT, all the different ways to capture human beings in the driving context and in the road user context, pedestrians, cyclists. But it starts and ends with data. The fun stuff is the algorithms, but the data is what makes it all work: real-world data.

Okay, then once you have the data... okay, data isn't everything, I lied, because you have to actually annotate it. So what do we mean by data? There's raw data: video, audio, lidar, all the types of sensors we'll talk about that capture real-world road user interaction. You have to reduce that into meaningful, representative cases of what happens in the real world. In driving, 99% of the time driving looks the same; it's the 1%, the interesting cases, that we're interested in, and what we want is to train learning algorithms on that 1%. So we have to collect 100 percent, we have to collect all the data, and then figure out automated and semi-automated ways to find the pieces of that data that can be used to train neural networks and that are representative of the general kinds of things that happen in this world.

Efficient annotation. Annotation isn't just about drawing bounding boxes on images of cats. Annotation tooling is key to unlocking real-world performance, systems that successfully solve some problem, accomplish some goal on real-world data. That means designing annotation tools for a particular task. Annotation tools used for glance classification, for determining where drivers are looking, are very different than annotation tools used for body pose estimation, which are very different than the tooling we use for SegFuse, where we're investing thousands of dollars for the competition for this class to annotate full scene segmentation, where every pixel is colored. There needs to be tooling for each one of those elements, and they're key. That's an HCI question, that's a design question; there's no deep learning, there's no robotics in that question. It's: how do we leverage human computation, the human brain, to most effectively label images such that we can train neural networks on them?
Hardware. In order to train these networks, in order to parse the data we collect, and we'll talk about this, we now have over five billion images of driving data; to parse that, you can't do it on a single machine. You have to do large-scale distributed compute and large-scale distributed storage.

And finally, the stuff that's the most exciting, that this class and many classes and much of the literature is focused on: the algorithms. The deep learning algorithms, the machine learning algorithms, the algorithms that learn from data. Of course that's really exciting and important, but what we find time and time again in real-world systems is that, as long as these algorithms learn from data, as long as this is deep learning, the data is what's much more important. Of course it's nice for the algorithms to be calibration-free, meaning they self-calibrate: we don't need to have the sensors in the exact same position every time. That's a very nice feature; the robustness of the system is then generalizable across multiple vehicles and multiple scenarios. And one of the key things that comes up time and time again, and we'll mention today, is that a lot of the algorithms developed in deep learning for computer vision are focused on single images. Now, the real world happens in both space and time, and we have to have algorithms that capture the visual characteristics but also look at the sequence of images, the sequence of those visual characteristics, which forms the temporal dynamics, the physics of this world. So it's nice when those algorithms are able to capture the physics of the scene.

The big takeaway, if you leave with anything today, unfortunately, is that the painful, boring stuff of collecting data, of cleaning that data, of annotating that data in order to create successful systems is much more important than good algorithms or great algorithms. It's important to have good algorithms, as long as you have neural networks that learn from that data.

Okay, so today I'd like to talk about human imperfections and the various detection problems, pedestrian detection, body pose, glance, emotion, and cognitive load estimation, that we can use to help those humans as they operate in the driving context. And finally, to continue with the idea that the vision of fully autonomous vehicles, as some of our guest speakers have spoken about and as Sterling Anderson will speak about tomorrow, is really far away, and that humans will be an integral part of operating and cooperating with the AI systems. I will continue on that line of thought to try to motivate why we need to continue approaching the autonomous vehicle, the self-driving car paradigm, in a human-centered way.

Okay, first, before we talk about human imperfections, let's just pause and acknowledge that humans are amazing. We're actually really good at a lot of things. It's sometimes sort of fun to talk about how terrible we are as drivers, how distracted we are, how irrational we are, but we're actually really damn good at driving. Here's a video of soccer player Messi, the best soccer player in the world obviously, and a state-of-the-art robot on the right. Well, it's not playing soccer, but I assure you the American Ninja Warrior, Kacy, is far superior to the DARPA humanoid robotics systems shown on the right. Okay, so continuing on the line of thought to challenge us here, that humans are amazing:
there was a record high in 2016: in the United States, after many years, we crossed the forty thousand fatalities mark; more than forty thousand people died in car crashes in the United States. But that's over three point two trillion miles traveled, so that's one fatality per eighty million miles, and roughly a one in 625 chance of dying in a car crash in your lifetime. Interesting side fact for anyone in the United States: folks who live in Massachusetts are the least likely to die in a car crash; Montana is the most likely. So for everyone that thinks Boston driving is terrible, maybe that adds some perspective. Here's a visualization of Waze data across a period of a day, showing you the rich blood flow of the city, the traffic flow of the city, people getting from A to B at a mass scale, and doing it, surviving doing it. Okay, humans are amazing, but they're also flawed.

Texting, sources of distraction with a smartphone, eating, the secondary tasks of talking to other passengers, grooming, reading, using the navigation system, yes, sometimes watching video, and manually adjusting the radio: 3,000 people were killed and 400,000 were injured in motor vehicle crashes involving distraction in 2014. Distraction is a very serious issue for safety. Texting: every day more and more people text, smartphones are proliferating through our society; 170 billion text messages are sent in the United States every month, and that's in 2014, you can only imagine what it is today. Eyes off road for five seconds: that's the average time your eyes are off the road while texting. If you're traveling 55 miles an hour, those five seconds are enough time to cover the length of a football field. So you're blindfolded, you're not looking at the road, and in five seconds, the average time of texting, you're covering an entire football field. So many things can happen in that moment of time. That's distraction.

Drunk driving: 31% of traffic fatalities involve a drunk driver. Drugged driving: 23% of nighttime drivers tested positive for illegal, prescription, or over-the-counter drugs. Distracted driving, as I said, is a huge safety risk. Drowsy driving, people driving tired: nearly three percent of all traffic fatalities involve a drowsy driver. If you are uncomfortable with videos that involve risk, I urge you to look away. These are videos collected by AAA of teenagers, from a very large-scale naturalistic driving data set, capturing clips of teenagers being distracted on their smartphones. [Music] Once you take it in, that's the problem we're up against.

So in the context of human imperfections, we have to ask ourselves, for autonomous vehicles that are using artificial intelligence to aid the driving task: do we want to go, as I mentioned a couple of lectures ago, the human-centered way or the full autonomy way? The tempting path is towards full autonomy, where we remove this imperfect, flawed human from the picture altogether and focus on the robotics problem of perception and control and planning and driving policy. Or do we work together, human and machine, to improve safety, to alleviate distraction, to bring driver attention back to the road, and use artificial intelligence to increase safety through collaboration, through human-robot interaction, versus removing the human completely from the picture? As I've mentioned, and as Sterling will certainly talk about tomorrow, and rightfully so, and as Emilio talked about on Tuesday, the L4 way is grounded in
literature; it's grounded, in some sense, in common sense: you can count on the fact that the natural flaws of human beings, to overtrust, to misbehave, to be irrational about their risk estimates, will result in improper use of the technology. And that leads to what I've shown before, the public perception of what drivers do in semi-autonomous vehicles: they begin to overtrust. The moment the system works well, they begin to overtrust, they begin to do stuff they're not supposed to be doing in the car, taking it for granted. A recent video that somebody posted shows a common, more practical concern that people have: the traditional way to ensure the physical engagement of the driver is to require that they touch the steering wheel every once in a while, and of course there are ways to bypass the need to touch the steering wheel. Some people hang objects off of the steering wheel; in this case, brilliantly I have to say, they shove an orange into the wheel to make the touch sensor fire and therefore be able to take their hands off while on Autopilot. That kind of thing makes us believe that humans will always find a way to misuse this technology. However, I believe that's not giving the technology enough credit. Artificial intelligence systems, if they're able to perceive the human being, are also able to work with the human being, and that's what I'd like to talk about today: teaching cars to perceive the human being.

And it all starts with data. It's all about data; as I mentioned, data is everything in these real-world systems. With the MIT naturalistic driving data set we have 25 vehicles, of which 21 are equipped with Tesla Autopilot. We instrument them; this is how we do the data collection. Two cameras on the driver: one camera on the face, capturing high-definition video of the face; that's where we get the glance classification, the emotion recognition, the cognitive load, everything coming from the face. Then there's another camera, a fisheye, looking at the body of the driver, and from that comes the body pose estimation, hands on wheel, activity recognition. And then one video camera looking out, for the full scene segmentation, for all the scene perception tasks. Everything is recorded synchronized together, with GPS, with audio, with all the CAN data coming from the car, on a single device; synchronization of this data is critical. So that's one road trip in the data, and there are thousands like it, traveling hundreds of miles, sometimes hundreds of miles under automated control in Autopilot. That's the data. Again, as I said, data is everything, and from this data we can both gain understanding of what people do, which is really important for understanding how successful autonomy can be deployed in the real world, and form training data for the deep neural networks to perform the perception tasks better.

Twenty-five vehicles, 21 Teslas: Model S, Model X, and now Model 3. Over a thousand miles collected a day, every single day; we have thousands of miles in the Boston, Massachusetts area, driving around, all of that video being recorded, now over five billion video frames. There are several ways to look at autonomy. One of the big ones is safety; that's what everybody talks about, how do we make these things safe. But the other one is enjoyment: do people actually want to use it? We can create a perfectly safe system, we can create it right now; we've had it forever, before we even had
cars: a car that never moves is a perfectly safe system. Well, not perfectly, but almost. But it doesn't provide a service that's valuable, it doesn't provide an enjoyable driving experience. So, okay, what about slow-moving vehicles? That's an open question. The reality is, with these Tesla vehicles and L2 systems doing automated driving, people are driving 33% of their miles using Tesla Autopilot. What does that mean? That means people are getting value from it: a large fraction of their driving is done in an automated way. That's value, that's enjoyment.

The glance classification algorithm we'll talk about today is one example of what we use to understand what's in this data, shown with the bar graphs there in red and blue: red is during manual driving, blue is during Autopilot driving. We look at glance classification regions of where drivers are looking, on-road and off-road, and whether that distribution changes between automated and manual driving. With these glance classification methods we can determine that there's not much difference, at least until you dig into the details, which we haven't done; in the aggregate there's not a significant difference. That means people are getting value, enjoying using these technologies, but they're staying attentive, or at least not necessarily attentive, but physically engaged. When your eyes are on the road you might not be attentive, but you're at the very least physically in position, your body is positioned in such a way, your head is looking at the forward roadway, that you're able to be alert and take in the forward roadway. So they're using it and they don't overtrust it, and that, I think, is the sweet spot that human-robot interaction needs to achieve: the human, through experience, through exploration, through trial and error, exploring and understanding the limitations of the system to a degree that overtrust doesn't occur. That seems to be happening in this system, and using the computer vision methods I'll talk about, we can continue to explore how that can be achieved in other systems as the fraction of automated driving increases from 30% to 40% to 50% and so on.

It's all about the data, and I'll harp on this again. The algorithms are interesting, and I will mention them, but of course it's the same convolutional neural networks, the same networks that take in raw pixels and extract features of interest. It's 3D convolutional neural networks that take in sequences of images and extract the temporal dynamics along with the visual characteristics of the individual images. It's RNNs and LSTMs that use convolutional neural networks to extract features and look at the dynamics over time in the images. These are pretty basic architectures, the same kinds of deep neural network architectures, but they rely fundamentally and deeply on the data, on real-world data.

So let's start where, on the human sensing side, it perhaps all began decades ago, which is pedestrian detection. To put it in context, here are the human sensing tasks shown from left to right. On the left, in green, are the easier tasks, tasks of sensing some aspect of a human being; pedestrian detection, which is detecting the full body of a human being in an image or video, is one of the easier computer vision tasks. On the right, in red, microsaccades, the tremors of the eye, measuring the pupil diameter, measuring cognitive load, the fine blink dynamics of the eye, the velocity of the blink, micro-glances, and eye pose are much harder
problems. So body pose estimation, pedestrian detection, face classification, detection, recognition, head pose estimation: all those are easier tasks. Anything that starts getting smaller, looking at the eye, anything that starts getting fine-grained, is much more difficult.

So we start at the easiest: pedestrian detection. It has the usual challenges of all of computer vision that we've talked about: the various styles of appearance, the inter-class variation, the different possible articulations of our bodies, superseded only perhaps by cats, but we humans are pretty flexible as well; and the presence of occlusion, from the accessories that we wear, to self-occlusion, to occluding each other. Crowded scenes have a lot of humans in them and they occlude each other, and therefore being able to disambiguate, to figure out each individual pedestrian, is a very challenging problem.

So how do people approach this problem? Well, the need is to extract features from raw pixels, whether that was Haar cascades, HOG, or CNNs through the decades. The sliding window approach was used because pedestrians can be small in an image or big, so there's the problem of scale. You use a sliding window to detect where the pedestrian is: you have a classifier that's given a single image such as this, and you slide that classifier across the image to find where all the pedestrians in the scene are. You can use non-neural-network methods or convolutional neural networks for that classifier. It's extremely inefficient. Then came along R-CNN, Fast R-CNN, Faster R-CNN. These are networks that, as opposed to doing a complete sliding window approach, are much more intelligent, clever, about generating the candidates to consider: as opposed to considering every possible position and scale of a window, they generate a small subset of candidates that are more likely, and finally use a CNN to classify, for those candidates, whether there's a pedestrian or not, whether there's an object of interest or not, a face or not, and use non-maximum suppression, because there are overlapping bounding boxes, to figure out the most likely bounding box around the pedestrian, around the object. That's R-CNN, and there are a lot of variants now. Mask R-CNN is really the state-of-the-art localization network; on top of the bounding box it also performs segmentation. There's VoxelNet, which does three-dimensional localization in lidar data, in point clouds, so it's not just working on images but in 3D, but it's all kind of grounded in the R-CNN framework.

Okay, data. We have large-scale data collection going on here in Cambridge; if you've seen cameras and lidar at various intersections around MIT, we're part of that. For example, here's one of the intersections where we're collecting about 10 hours a day, instrumented with the various sensors I'll mention, and we see about 12,000 pedestrians a day across that particular intersection. We're using 4K cameras, stereo vision cameras, 360 cameras, now the Insta360, which is an 8K 360 camera, GoPros, lidar of various sizes including the 64-channel, and recording. This is where the data comes from: this is from the 360 video, this is from the lidar data of the same intersection, this is from the 4K camcorders pointing at a different intersection, capturing the entire 360 view, with the vehicles approaching and the pedestrians making crossing decisions.
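To make the detection step concrete, here is a minimal sketch of running an off-the-shelf R-CNN-style detector over one frame and keeping only the confident pedestrian boxes. It uses a pretrained Faster R-CNN from torchvision as a stand-in for whatever detector is actually used in this pipeline; the score threshold and file path are arbitrary placeholders.

```python
# Minimal sketch: pedestrian (person) detection with a pretrained Faster R-CNN.
# The detector here is a generic stand-in, not the specific model from the lecture.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained detector: region proposals + CNN classification + NMS under the hood.
# (On older torchvision versions, use pretrained=True instead of weights="DEFAULT".)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

PERSON_LABEL = 1        # COCO class id for "person"
SCORE_THRESHOLD = 0.8   # arbitrary confidence cutoff

def detect_pedestrians(image_path):
    """Return [x1, y1, x2, y2] boxes for confident person detections in one frame."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        output = model([image])[0]  # dict with 'boxes', 'labels', 'scores'
    keep = (output["labels"] == PERSON_LABEL) & (output["scores"] > SCORE_THRESHOLD)
    return output["boxes"][keep].tolist()

# Hypothetical usage on a single intersection frame:
# boxes = detect_pedestrians("intersection_frame.jpg")
```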
This is about understanding the negotiation, the nonverbal negotiation that pedestrians perform when choosing to cross or not, especially when they're jaywalking, and everybody jaywalks; if you're familiar with this particular intersection, there are more jaywalkers than non-jaywalkers. It's a fascinating one. So we record everything about the driver and everything about the pedestrians. Again, this is where R-CNN comes in: you do bounding box detection of the pedestrians, and here the vehicles as well, and that allows you to convert this raw data into hours of pedestrian crossing decisions and begin to interpret them. That's pedestrian detection: bounding boxes.

Body pose estimation is the more difficult task. Body pose estimation is finding the joints: the hands, the elbows, the shoulders, the hips, knees, feet, the landmark points in the image, the x-y positions marking those joints. So why is that important in driving? For example, it's important for determining the vertical position, the alignment, of the driver. Seatbelt and airbag testing is always performed with a dummy in the standard frontal dummy position. With greater degrees of automation comes more capability and flexibility for the driver to get misaligned from that standard dummy position, and so body pose, or at least upper body pose estimation, allows you to determine how often drivers get out of line from the standard position, the general movement; and then you can look at hands on wheel, smartphone detection, activity, and add context to the glance estimation that we'll talk about.

Some of the more traditional methods were sequential: detect the head first, and then step by step detect the shoulders, the elbows, the hands. The holistic approach, which has been the very powerful, successful way to do multi-person pose estimation, is performing a regression that detects body parts from the entire image. It's not sequentially stitching bodies together; it's detecting the left elbow, the right elbow, the hands individually, performing that detection and then stitching everything together afterwards, allowing you to deal with the crazy deformations of the body that happen, the occlusions and so on, because you don't need all the joints to be visible. And with a cascade of pose regressors, meaning convolutional neural networks that take in a raw image and produce an x-y position as their estimate of each individual joint (input is an image, output is an estimate of a joint: elbow, shoulder, whatever, one of several landmarks), you can build on top of that: every subsequent estimator zooms in on that particular area and performs a finer and finer grained estimate of the exact position of the joint, repeating over and over.

So through this process we can do part detection in multi-person scenes, scenes that contain multiple people. We can detect the head, the neck, the hands, the elbows, shown in the various images on the right, without any understanding of who the head, the elbows, the hands belong to; it's just performing detection without trying to do individual person detection first. The next step is connecting those parts together with part affinity fields: first you detect individual parts, then you connect them together, and then through bipartite matching you determine which person each individual body part most likely belongs to, so you stitch the different people together in the scene after the detection is performed with the CNN.
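As a rough illustration of extracting body keypoints, here is a sketch using an off-the-shelf keypoint detector (not the part-affinity-field pipeline described above) that pulls out the shoulder keypoints and approximates the neck point as their midpoint, which is the point tracked over time in the slouching analysis mentioned next. The model choice and threshold are assumptions.

```python
# Minimal sketch: upper-body keypoints with a pretrained Keypoint R-CNN, used here as a
# stand-in for the multi-person pose pipeline (part detection + part affinity fields).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# COCO keypoint indices for the joints used here.
LEFT_SHOULDER, RIGHT_SHOULDER = 5, 6

def neck_points(image_path, score_threshold=0.8):
    """For each confidently detected person, return an approximate neck point:
    the midpoint between the two shoulder keypoints."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = model([image])[0]  # 'keypoints': [num_people, 17, 3] as (x, y, visibility)
    necks = []
    for kps, score in zip(out["keypoints"], out["scores"]):
        if score < score_threshold:
            continue
        ls, rs = kps[LEFT_SHOULDER], kps[RIGHT_SHOULDER]
        necks.append(((ls[0] + rs[0]).item() / 2, (ls[1] + rs[1]).item() / 2))
    return necks
```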
We use this approach to detect the upper body specifically: the shoulders, the neck, and the head, eyes, nose, ears. That is used to determine the position of the driver relative to the standard dummy position. For example, looking at 30-minute periods of Autopilot driving, we can plot time on the x-axis and, on the y-axis, the position of the neck point I pointed out in the previous slide, the midpoint between the two shoulders, over time relative to where it began. This is the slouching, the sinking into the seat, and allowing the car, and the designers of safety systems, to know that information is really important.

We can use the same body pose algorithm from the perspective of the vehicle, the outside-the-vehicle perspective. So the vehicle looking out is doing, as opposed to just plain pedestrian detection, body pose estimation. Again, here in Kendall Square, vehicles crossing, observing pedestrians making crossing decisions, and performing body pose estimation, which allows you to generate visualizations like this and gain understanding like this. On the x-axis is time; the top plot, in blue, is the speed of the vehicle, the ego vehicle from which the camera is observing the scene; and the bottom, in green, is a binary value: zero when the pedestrian is not looking at the car, one when the pedestrian is looking at the car. So we can look at thousands of episodes like this, crossing decisions, nonverbal communication decisions, and determine, using body pose estimation, the dynamics of this nonverbal negotiation. Here, nearby at the Media Lab crossing, a pedestrian approaches, and we can see in green when the pedestrian glances at the car, looks away, glances, looks away: fascinating glance behavior. Interestingly, most people look away before they cross. Same thing here; this is just an example, we have thousands of these. Body pose estimation allows you to get this fine-grained information about pedestrian glance behavior, pedestrian body behavior, hesitation.

Glance classification. One of the most important things in driving is determining where drivers are looking. If there's any sensing that I advocate, that has the most impact in the driving context, it is for the car to know where the driver is looking, at the very crude region-level of whether the driver is looking on-road or off-road. That's what we mean by glance classification. It's not the standard gaze estimation problem of determining x-y-z, where the eye pose and the head pose combine to determine where the driver is looking. No, this is classifying two regions, on-road and off-road, or six regions: on-road, off-road left, right, center stack, rearview mirror, and instrument cluster. So it's region-based glance allocation, not the geometric gaze estimation problem. Why is that important? It allows you to address it as a machine learning problem. It's a subtle but critical point: every problem we try to solve in human sensing, in driver sensing, has to be learnable from data; otherwise it's not amenable to application in the real world. We can't design systems in the lab that are deployed without learning if they involve a human. It's possible to do SLAM, localization, by having really good sensors and doing localization with those sensors
without much learning; it's not possible to design systems that deal with lighting variability and the full variability of human behavior without being able to learn. Gaze estimation, the geometric approach of finding the landmarks in the face and from those landmarks determining the geometry, the orientation of the head and the orientation of the eyes, has no learning in it outside of actually training the systems to detect the different landmarks. If we convert this into a gaze classification problem, shown here, glance classification is taking the raw video stream and determining, in post, so humans are annotating this video, which region the driver is looking at. That's what we're able to do by converting the problem into a simple variant of classification: on-road, off-road, left, right. The same can be done for pedestrians: left, forward, right. You can annotate regions of where they are looking and, using that kind of classification approach, determine whether they are looking at the cars or not, looking away, looking at their smartphone, without doing 3D gaze estimation. Again, it's a subtle point, but think about it: if you wanted to estimate exactly where they're looking, you need that ground truth, and you don't have it. In real-world data there's no way to get the information about exactly where people were looking; you're only inferring. So you have to convert it into a region-based classification problem in order to be able to train your networks on it.

And the pipeline is the same. The source video here is the face, the 30-frames-a-second video coming in of the driver's face, the human face. There is some degree of calibration required: you have to determine approximately where the sensor taking in the image is, especially for the glance classification task, because, being region-based, it needs to estimate where the forward roadway is, where the camera frame is relative to the world frame. Then video stabilization and face frontalization, all the basic processing that removes the vibration, the noise, the physical movement of the head, the shaking of the car, in order to be able to determine things about eye movement and blink dynamics. And finally, with the neural networks, there is nothing left except taking in the raw video of the face for the glance classification task, and of the eye for the cognitive load task. Raw pixels are the input to these networks, and the output is whatever the training data is, and we'll mention each one: whether that's cognitive load, glance, emotion, or drowsiness, the input is the raw pixels and the output is whatever you have data for. Data is everything here.

The face alignment problem, the traditional geometric approach, is designing algorithms that accurately detect the individual landmarks in the face and from those estimate the geometry of the head pose. For the classification version we perform the same kind of face detection and alignment to determine where the head is, but once we have that, we pass in just the raw pixels and perform classification, as opposed to doing the estimation. It's classification, allowing you to perform what's shown there on the bottom: real-time classification of where the driver is looking: road, left, right, center stack, instrument cluster, and rearview mirror. And as I mentioned, annotation tooling is key.
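Here is a minimal sketch of what the glance classification network boils down to once the problem is framed this way: raw, stabilized face-crop pixels in, one of six regions out. The architecture below is a generic small CNN for illustration, not the network actually used; layer sizes and input resolution are assumptions.

```python
# Minimal sketch: glance classification as plain image classification.
# Input: a stabilized crop of the driver's face; output: one of six glance regions.
# Architecture is illustrative only, not the lecture's actual network.
import torch
import torch.nn as nn

GLANCE_REGIONS = ["road", "left", "right", "center_stack", "rearview_mirror", "instrument_cluster"]

class GlanceNet(nn.Module):
    def __init__(self, num_classes=len(GLANCE_REGIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):                     # x: [batch, 3, H, W] face crops
        h = self.features(x)
        return self.classifier(h.flatten(1))  # logits over the six regions

# Hypothetical usage on a batch of face crops:
# logits = GlanceNet()(face_crops)
# region = GLANCE_REGIONS[logits.argmax(dim=1)[0]]
```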
We have a total of 5 billion video frames, one and a half billion of the face; that would take tens of millions of dollars to annotate fully just for glance classification. So we have to figure out what to annotate in order to train the neural networks to perform this task, and what we annotate is the things the network is not confident about: the moments of high lighting variation, the partial occlusions from the light or self-occlusion, moving out of frame, the out-of-frame occlusions, all the difficult cases. Going from frame to frame to frame here, through the pipeline from top to bottom: whenever the classification has low confidence, we pass it to the human. It's simple: we rely on the human only when the classifier is not confident. And the fundamental trade-off in all of these systems is what accuracy we're willing to put up with. Here, in red and blue: red is a human decision, blue is a machine task. In red, we select the video we want to classify; in blue, the neural network performs the face detection task, localizes the camera, determines the angle of the camera, and provides a trade-off between accuracy and the percentage of frames it can annotate. Certainly a neural network could annotate glance for the entire data set, but it would achieve an accuracy, in the case of glance classification, in the low 90s percent on the six-region glance task. If you want higher accuracy, it will only be able to achieve that for a smaller fraction of frames. That's the choice, and then a human goes in and annotates the frames that the algorithm was not confident about. The algorithm is then trained on the frames annotated by the human, and this process repeats over and over until everything is annotated.

Yes? Yes, absolutely. The question was: do you ever observe that the classifier is highly confident about the incorrect class? Right, and then how do you deal with that, how do you account for the fact that highly confident predictions can be highly wrong, false positives that you're really confident in? At least in our experience, there's no good answer for that except more and more training data on the things you're not confident about; that usually seems to generalize, and we don't encounter obvious large categories of data where we're really confident about the wrong thing. Usually some degree of human annotation fixes most problems; annotating the low-confidence part of the data solves most of the incorrect cases, but of course that's not always true in the general case, and you can imagine a lot of scenarios where it isn't. For example, one thing we always do is, for each individual person, annotate a large amount of the data manually no matter what, so we make sure the neural network has seen that person in the various ways their face can look: with glasses, with different hair, with different lighting. We want to manually annotate that, and over time we allow the machine to do more and more of the work.

What results, in the glance classification case, is that you can do real-time classification: you can give the car information about whether the driver is looking on-road or off-road. This is critical information for the car to understand.
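Stepping back to the annotation loop just described, here is a minimal sketch of the confidence-based frame selection: the network annotates frames it is confident about and routes everything below a confidence threshold to a human annotator, after which the model is retrained and the loop repeats. The threshold value and the helper functions are hypothetical placeholders.

```python
# Minimal sketch of the annotation loop: machine labels confident frames,
# humans label the rest, the model is retrained, and the process repeats.
# route_to_human(), retrain(), and the threshold are hypothetical placeholders.
import torch
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.95  # arbitrary; trades accuracy vs. fraction auto-annotated

def split_by_confidence(model, frames):
    """Return (auto_labels, uncertain_indices) for a batch of frames."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(frames), dim=1)   # [batch, num_classes]
    confidence, labels = probs.max(dim=1)
    confident = confidence >= CONFIDENCE_THRESHOLD
    auto_labels = {i: labels[i].item() for i in range(len(frames)) if confident[i]}
    uncertain_indices = [i for i in range(len(frames)) if not confident[i]]
    return auto_labels, uncertain_indices

# One pass of the loop (sketch):
# auto, uncertain = split_by_confidence(model, frames)
# human_labels = route_to_human(frames[uncertain])            # hypothetical annotation tool call
# model = retrain(model, frames[uncertain], human_labels)     # then repeat on the remaining data
```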
I want you to pause for a second and realize that, for those of you who have driven any kind of car with any kind of automation, it has no idea what you're up to at all. It has no information about the driver except whether they're touching the steering wheel or not. More and more now, with the GM Super Cruise vehicle, and Tesla has now added a driver-facing camera, manufacturers have slowly started to think about perceiving the driver, but most vehicles on the road today have no knowledge of the driver. This knowledge is almost common sense and trivial for the car to have; it's common sense how important it is to know where the driver is looking. That's the glance classification problem. And again, emphasizing: there have been three decades of work on gaze estimation, yet gaze estimation is doing head pose estimation, the geometric orientation of the head, combined with the orientation of the eyes, to determine where the person is looking. We convert that into a classification problem. The standard gaze estimation definition is not a machine learning problem; classification is a machine learning problem. This transformation is key.

Emotion. Human emotion is a fascinating thing. It's the same kind of pipeline: stabilization, cleaning of the data, raw pixels in, and then the classification is emotion. The problem with emotion, if I may speak as an expert human, not an expert in emotion, just an expert at being human, is that there are a lot of ways to taxonomize emotion, to categorize emotion, to define emotion, whether that's the primary emotions of the Parrott scale, love, joy, surprise, anger, sadness, fear, and there are a lot of ways to mix those together or break them apart into hierarchical taxonomies. The way we think about it in the driving context, at least, is that there is the general emotion recognition task, which I'll mention, detecting the broad categories of emotion: joy and anger, disgust and surprise; and then there is application-specific emotion recognition, where you're using the facial expressions, all the various ways we can deform our face to communicate information, to answer a specific question about the interaction of the driver.

So first, for the general case, these are the building blocks. There are countless ways of deforming the face that we use to communicate with each other; there are 42 individual facial muscles that can be used to form those expressions. One of the tools we work with is the Affectiva SDK. Its task, the general emotion recognition task, is taking in raw pixels and determining categories of emotion, the various subtleties of that emotion, in the general case producing a classification of anger, disgust, fear, surprise, and so on. Essentially what these algorithms are doing, whether they use deep neural networks or not, whether they use face alignment to do the landmark detection and then track those landmarks over time to detect the facial actions, is mapping the expressions, the various things we can do with our eyebrows, nose, mouth, and eyes, to emotion. I'd like to highlight one because I think it's illustrative: joy. An expression of joy is smiling, so there's an increased likelihood that you observe a smiling expression on the face when joy is experienced, or, vice versa, if there's an increased probability of a smile, there's an increased probability of the emotion of joy being experienced. And joy being experienced comes with a decreased likelihood of brow raising and brow furrowing. So if you see a smile, that's a plus for joy; if you see a brow raise or a brow furrow, that's a minus for joy.
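As a toy illustration of this expression-to-emotion mapping, here is a sketch that scores joy from a few expression probabilities, with a smile contributing positively and brow raise or brow furrow contributing negatively. The weights are invented for illustration; a real system, like the SDK mentioned above, learns these mappings from annotated data.

```python
# Toy sketch: mapping facial expression probabilities to a joy score.
# Weights are invented for illustration; real systems learn this from data.
def joy_score(expressions):
    """expressions: dict of expression probabilities in [0, 1],
    e.g. {"smile": 0.9, "brow_raise": 0.1, "brow_furrow": 0.0}."""
    return (
        +1.0 * expressions.get("smile", 0.0)        # smile is evidence for joy
        - 0.5 * expressions.get("brow_raise", 0.0)  # brow raise is evidence against
        - 0.5 * expressions.get("brow_furrow", 0.0) # brow furrow is evidence against
    )

# joy_score({"smile": 0.9, "brow_raise": 0.0, "brow_furrow": 0.0})  # -> 0.9
```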
That's the general emotion recognition task; it's been well studied, and it's sort of the core of the affective computing movement, from the visual, computer vision perspective. For the application-specific perspective, which is what we're really focused on, again data is everything: what are you annotating? Here we have a large-scale data set of drivers interacting with a voice-based navigation system. They're tasked, in various vehicles, with entering a navigation destination, so they're talking to their GPS using their voice. Depending on the vehicle, depending on the system, in most cases this is an incredibly frustrating experience. So we have them perform this task, and then the annotation is self-report: after the task, they say, on a scale of 1 to 10, how frustrating was this experience. What you see on top are the expressions detected for a satisfied person, who reported a 10 on satisfaction, so a 1 on the frustration scale, perfectly satisfied with the voice-based interaction. On the bottom is a frustrated person, a 9 on the frustration scale. The strongest expression there, remember, joy, the smile, was the strongest indicator of frustration for all our subjects. The smile was the thing that was always there for frustration; there was also various frowning that followed, and shaking of the head and so on, but the smiles were there. So that shows you the clean difference between the general emotion recognition task and the application-specific one. Perhaps they experienced an absurd moment of joy at the frustration they were feeling, and you can get philosophical about it, but the practical reality is they were frustrated with the experience, and we're using the 42 muscles of the face, the expressions they form, to classify frustrated or not. The data does the work, not the algorithms; it's the annotation. A quick mention for the AGI class next week, the artificial general intelligence class: one of the competitions we're doing is a JavaScript face trained with a neural network to form various expressions to communicate with the observer, so we're interested in creating emotion, which is a nice mirror coupling of the emotion recognition problem. It's going to be super cool.

Cognitive load. We're starting to get to the eyes. Cognitive load is the degree to which a human being is accessing their memory, or, loosely, how lost in thought they are, how hard they're working in their mind to recollect something, to think about something. And to do a quick pause on the eyes as the window to cognitive load, the eyes as the window to the mind: there are different ways the eyes move. There are the pupils, the black part of the eye; they expand and contract based on various factors, including the lighting variations in the scene, but they also expand and contract based on cognitive load. That's a strong signal. The eyes can also move around: there are ballistic movements, saccades, when we look around and the eyes jump around the scene; and there's something called smooth pursuit, when, connecting to our animal past, you see a delicious meal flying or running by and your eyes follow it perfectly, not jumping around. So when we read a book our eyes
are using saccadic movements, where they jump around, and with smooth pursuit the eye moves perfectly smoothly. Those are the kinds of movements we have to work with, and cognitive load can be detected by looking at various factors of the eye: the blink dynamics, the eye movement, and the pupil diameter. The problem is that in the real world, with real-world data and lighting variations, everything goes out the window in terms of using pupil diameter, which is the standard non-contact way to measure cognitive load in the lab, where you can control lighting conditions and use infrared cameras. When you can't, all that goes out the window and all you have is the blink dynamics and the eye movement. So, neural networks to the rescue: 3D convolutional neural networks. In this case we take sequences of images of the eye through time and use 3D convolutions as opposed to 2D convolutions. On the left is everything we've talked about previous to this, 2D convolutions, where the convolution filter operates on the x-y 2D image and every channel is operated on individually, separately. 3D convolutions convolve across multiple images, across multiple channels, and are therefore able to learn the dynamics of the scene through time as well, not just spatially but temporally.

And data, data is everything. For cognitive load we have in this case 92 drivers. So how do we perform the cognitive load classification task? We have these drivers driving on the highway and performing what's called the n-back task: zero-back, one-back, two-back. The task involves hearing numbers read to you and then recalling those numbers one at a time. With zero-back, the system gives you a number, seven, and you just have to say that number back, seven, and it keeps repeating. That's easy; it's supposed to be the easy task. One-back is when you hear a number, you have to remember it, and for the next number you have to say the number previous to it, so you have to keep one number in your memory at all times and not get distracted by the new information coming in. With two-back you have to do that two numbers back, so you have to use memory more and more; with two-back, cognitive load is higher and higher. Okay, so what do we do? We use face alignment, face frontalization, and detection of the eye closest to the camera, and extract the eye region. Now we have these nice raw pixels of the eye region across six seconds of video, and we put that into a 3D convolutional neural network and classify simply one of three classes: zero-back, one-back, and two-back. So we have a ton of data of people on the highway performing these n-back tasks, and that forms the classification, the supervised learning training data. That's it: the input is 90 images, at 15 frames a second, and the output is one of three classes. Face frontalization, I should mention, is a technique developed for face recognition, because most face recognition tasks require a frontal face orientation; it's also what we use here to normalize everything so that we can focus in on the exact blink. It takes whatever the orientation of the face is and projects it into the frontal position, taking the raw pixels of the face, detecting the eye region, zooming in, and grabbing the eye.
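Here is a minimal sketch of the kind of 3D-convolutional classifier just described: a stack of eye-region frames goes in (90 frames at 15 fps, i.e. six seconds), and one of the three n-back classes comes out. Layer sizes and input resolution are assumptions; the point is that Conv3d filters convolve across time as well as space.

```python
# Minimal sketch: 3D CNN for cognitive load classification from eye-region video.
# Input: 90 grayscale eye-region frames (6 s at 15 fps); output: 0-back / 1-back / 2-back.
# Layer sizes and input resolution are illustrative assumptions.
import torch
import torch.nn as nn

class CognitiveLoadNet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            # Conv3d filters span (time, height, width), capturing blink/eye-movement dynamics.
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.AdaptiveAvgPool3d((4, 4, 4)),
        )
        self.classifier = nn.Linear(16 * 4 * 4 * 4, num_classes)

    def forward(self, x):  # x: [batch, 1, 90, H, W] eye-region clips
        return self.classifier(self.features(x).flatten(1))

# clip = torch.randn(1, 1, 90, 64, 64)   # one hypothetical 6-second eye-region clip
# logits = CognitiveLoadNet()(clip)
# The class confidences can also be turned into a continuous 0-2 load estimate, as described next.
```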
This is where the intuition builds, and it's a fascinating one. What's being plotted here is the relative movement of the pupil, the relative movement of the eye, at different cognitive loads: a cognitive load of zero on the left, when your mind is not that lost in thought, and a cognitive load of two on the right, when it is lost in thought. The eye moves a lot less; it's more focused on the forward roadway. That's an interesting finding, but it's only in aggregate, and that's what the neural network is tasked with extracting on a frame-by-frame basis. This is a standard 3D convolutional architecture, again taking in the image sequence as the input, with cognitive load classification as the output, and shown on the right is the accuracy it's able to achieve: 86%. That's pretty cool for real-world data. The idea is that you can just plop in a webcam, feed the video into the neural network, and predict a continuous stream of cognitive load from zero to two, because each of the zero-back, one-back, two-back classes has a confidence associated with it, so you can turn that into a real value between zero and two. What you see here is a plot of three of the people on the team driving a car and performing a conversation task, with the cognitive load of each driver, estimated frame by frame at thirty frames a second, shown in white from zero to two on the y-axis, and red and yellow on the bottom showing high and medium cognitive load. When everybody is silent, the cognitive load goes down. So with this simple neural network, with the training data that we formed, we can extend to any arbitrary new data set and generalize.

Okay, those are some examples of how neural networks can be applied, and why is this important again? While we focus on the perception tasks, using neural networks, sensors, and signal processing to determine where we are in the world, where the different obstacles are, and to form trajectories around those obstacles, we are still far away from completely solving that problem, I would argue 20-plus years away. The human will have to be involved, and so when the system is not able to control, when the system is not able to perceive, when there's some flawed aspect of the perception or the driving policy, the human has to be involved, and that's where we have to let the car know what the human is doing. That's the essential element of human-robot interaction. The most popular car in the United States today is the Ford F-150: no automation. The thing that inspires us and makes us think that transportation can be fundamentally transformed is the Google self-driving car, Waymo, and all our guest speakers and all the folks working on autonomous vehicles. But if you look at it, the only ones who at a mass scale are actually beginning to inject automation into our daily lives are the ones in between: the Teslas, the L2 systems, the Tesla Autopilot system, Super Cruise, the S90s, the vehicles that are slowly adding some degree of automation and teaching human beings how to interact with that automation. And here, again, is the path towards mass-scale automation, where the steering wheel is removed, where the human is removed from consideration; I believe that is more than two decades away. On the path to that, we have to understand and create successful human-robot interaction, to approach autonomous vehicles, autonomous systems, in a human-centered way. The mass-scale integration of these human-centered systems, like the Tesla vehicles, and Tesla is just a small company right now: these kinds of L2 technologies have not
truly penetrated the market, have not penetrated our vehicles, even the new vehicles being released today. I believe that happens in the early 2020s, and that's going to form the core of the algorithms that will eventually lead to full autonomy. All of that data, as I mentioned with Tesla, with 32% of miles being driven on Autopilot, all of that is training data for the algorithms; the edge cases arise there, and that's where we get the data. Our data set at MIT is 400,000 miles; Tesla has a billion miles. That's all training data on the stairway to mass-scale automation.

Why is this important, beautiful, and fundamental to the role of AI in society? I believe that self-driving cars, when they're approached in this way, focused on human-robot interaction, are personal robots. They're not perception-control systems, tools like a Roomba performing a particular task. When human life is at stake, when there's a fundamental transfer of the life of a human being, giving their life over to an AI system directly, one on one, that is a relationship indicative of a personal robot. It requires all the elements of understanding, communication, and trust. It is fascinating to understand how a human and a robot can form enough trust to create almost a one-to-one understanding of each other's mental state, to learn from each other. Oh boy. So, one of my favorite movies, Good Will Hunting; we're in Boston, Cambridge; I'm going to regret this one. This is Robin Williams speaking about human imperfections, and I'd like you to take this quote and, every time he mentions the girl, replace it with the car. Robin Williams is talking about his wife, who passed away in the movie, talking about her imperfections: people call these things imperfections, but they're not, that's the good stuff, and then we get to choose who we let into our weird little worlds; the little idiosyncrasies that only I know about, that's what made her my wife. You're not perfect, sport, and let me save you the suspense: this girl you met, she isn't perfect either. But the question is whether or not you're perfect for each other. [Music]

So the approach we're taking in building the autonomous vehicle we have here at MIT in our group is the human-centered approach to autonomous vehicles. We're going to release it in March of 2018 on the streets of Boston; those who would like to help, please do. I will run a course on deep learning for understanding the human at CHI 2018, going through tutorials that go far beyond the visual, convolutional-neural-network-based detection of various aspects of the face and body; we'll look at natural language processing, voice recognition, and GANs. If you're going to CHI, please join. Next week we have an incredible course that aims to understand, to begin to explore, the nature of intelligence, natural and artificial. We have Josh Tenenbaum, Ray Kurzweil, Lisa Feldman Barrett, Nate Derbinsky looking at cognitive modeling architectures, Andrej Karpathy, Stephen Wolfram, Richard Moyes talking about autonomous weapon systems and AI safety, Marc Raibert from Boston Dynamics and the amazing, incredible robots they have, Ilya Sutskever from OpenAI, and myself. So what's next? For folks registered for this course, you have to submit by tonight a DeepTraffic
entry that achieves a speed of 65 miles an hour, and I hope you continue to submit more than that to win the competition; the high-performer award will be given to the very few folks who achieve 70 miles an hour or faster. We will continue rolling out SegFuse, having hit a few snags and invested a few thousand dollars in the process of annotating a large-scale data set for you; we'll continue that competition, which will take us to a submission to NIPS, where we hope to submit the results, along with DeepCrash, the deep reinforcement learning competition. These competitions will continue through May 2018; I hope you stay tuned and participate. There are upcoming classes: the AGI class, which I encourage you to come to, is going to be fascinating, with so many cool, interesting ideas that we're going to explore; it's going to be awesome. There's an introduction to deep learning course that I'm also part of, which will get a little more applied and show folks interested in the very basic algorithms of deep learning how to get started with them hands-on. And there's an awesome class that ran last year, for those who took this class last year we also talked about it, on the global business of AI and robotics. The slides are online; I encourage you to click the link there and register. It's in the spring, once a week, and it truly brings together a lot of cross-disciplinary folks to talk about ideas of artificial intelligence and the role of AI and robotics in society. It's an awesome class. And if you're interested in applying deep learning methods in the automotive space, come work with us; we have a lot of fascinating problems to solve, or collaborate. So with that, I'd like to thank everybody here, everybody across the community that's been contributing; we have thousands of submissions coming in for DeepTraffic, and I'm just truly humbled by the support we've been getting, and the team behind this class is incredible. Thank you to NVIDIA, Google, Amazon Alexa Auto, Autoliv, and Toyota. And today we have shirts: extra large, extra extra large, and medium over there; small and large over there; the big and small people over here, and the medium-sized people over here. So just grab one and enjoy. Thank you very much. [Applause]