All right, so the human side of AI: how do we turn the camera back in on the human? We've been talking about perception, how to detect cats and dogs, pedestrians, lanes, how to steer a vehicle based on the external environment. The thing that's really fascinating and severely understudied is the human side. You talk about the Tesla: we have cameras in 17 Teslas driving around Cambridge, because Tesla is one of the only vehicles allowing you to experience, in a real way on the road, the interaction between the human and the machine. And the thing that we don't have, that deep learning needs on the human side of semi-autonomous and fully autonomous vehicles, is video of drivers. That's what we're collecting, and that's what my work is in: looking at billions of video frames of human beings driving 60 miles an hour plus on the highway in their semi-autonomous Teslas. What are the things that we want to know about the human? If we were a deep learning therapist and we tried to break apart the different things we can detect from this raw set of pixels, we can look here, from green to red, at the different computer vision detection problems. Green means it's less challenging: it's feasible even under poor lighting conditions, variable pose, a noisy environment, poor resolution. Red means it's really hard no matter what you do. That starts on the left with face detection and body pose, some of the best studied and easier computer vision problems, for which we have huge datasets, and ends with microsaccades, the slight tremors of the eye that happen at a rate of a thousand times a second. All right, first: why do we even care about the human in the car? One is trust. To build trust, the car needs to have some awareness of the biological thing it's carrying inside, the human inside.
You kind of assume the car knows about you, because you're sitting there controlling it. But if you think about it, almost every single car on the road today has no sensors with which it's perceiving you. Some cars have a pressure sensor on the steering wheel, and some kind of sensor detecting that you're sitting in the seat. That's the only thing it knows about you. That's it. So how is this same car that's driving 70 miles an hour on the highway autonomously supposed to build trust with you if it doesn't perceive you? That's one of the critical things here. If there's something I'm constantly advocating, it's that we should have a driver-facing camera in every car, despite the privacy concerns. You have a camera on your phone and you don't have as much of a privacy concern there; and despite the privacy concerns, the safety benefits are huge and the trust benefits are huge. So let's start with the easy one: detecting body pose. Why do we care? There's seat belt design. There are crash test dummies, which are used to design the passive safety systems in our cars, and they make certain assumptions about body shapes: male, female, child body shapes. But they also make assumptions about the position of your body in the seat. They have the optimal position, the position they assume you take. The reality is, in a Tesla, when the car is driving itself, the variability goes up. If you remember the deformable cat, you start doing a little bit more of that: you start to reach back into the back seat, into your purse or your bag for your cell phone, these kinds of things. And that's when the crashes happen. We need to know how often that happens; the car needs to know that you're in that position, and that's critical for that very serious moment when the actual crash happens. How do you do this? This is a deep learning class, right? So: deep learning to the rescue.
Whenever you have these kinds of tasks of detecting, for example, body pose, you're detecting points: points on the shoulders, points on the head, five or ten points along the arms, the skeleton. How do you do that? You have a CNN, a convolutional neural network, that takes this input image and gives as output, it's a regressor, an x, y position of whatever you're looking for: the left shoulder, the right shoulder. Then you have a cascade of regressors that give you all these points, the shoulders, the arms, and so on. And then, through time, on every single frame you make that prediction and you optimize. You can make certain assumptions about physics: your arm can't be in one place in one frame and be over here in the next frame; it moves smoothly through space. Under those constraints you can minimize the temporal error from frame to frame. Or you can just dump all the frames in together, as if they were different channels. Like RGB is three channels, you can think of channels in time: you dump all those frames together into what are called 3D convolutional neural networks, and then you estimate the body pose in all the frames at once. There are some datasets for sports, and we're building our own. I don't know who that guy is. Let's fly through this a little bit. Next is what's called gaze classification. Gaze is another word for glance, right? It's a classification problem. Here's one of the TAs for this class, again not here because he's married and had to be home; I know where his priorities are at. This is on camera; he should be here. This is what we're recording in the Tesla, this is a Tesla vehicle. In the bottom right there's a blue icon that lights up, automatically detected, if it's operating under Autopilot; that means the car is currently driving itself. There are five cameras: one on the forward roadway, one on the instrument cluster, one on
the center stack, one on the steering wheel, and one on his face. Then it's a classification problem: you dump the raw pixels into a convolutional neural network with six classes, predicting where the person is looking: forward roadway, left, right, center stack, instrument cluster, rearview mirror. You give it millions of frames for every class. Simple. And it does incredibly well at predicting where the driver is looking. The process is the same for the majority of the driver state problems that have to do with the face, and the face has so much information: where you're looking, emotion, drowsiness, different degrees of frustration. I'll fly through those as well, but the process is the same. There's some pre-processing, because this is in-the-wild data: there's a lot of crazy light going on, there's noise, there's vibration from the vehicle. So first you have to do video stabilization; you have to remove all that vibration and noise as best as you can. There are a lot of algorithms for that, non-neural-network algorithms, boring, but they work, for removing the noise and the effects of sudden light variations and the vibrations of the vehicle. There's automated calibration: you have to estimate the frame of the camera, the position of the camera, and estimate the identity of the person you're looking at. The more you can specialize the network to the identity of the person, and the identity of the car the person is riding in, the better the performance on the different driver state classifications. So you personalize the network: you have a background model that works on everyone, and then you specialize each individual network to that one individual. This is transfer learning. Then there is face frontalization, a fancy name for the fact that no matter where they're looking, you want to transform that face so the eyes and the nose are in the exact same position in the image.
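The classification step at the end of that glance pipeline comes down to a six-way softmax over whatever scores the network produces. A minimal sketch in plain Python, not the actual system: the class names follow the list above, the network itself is omitted, and the logits are stand-ins.

```python
import math

# Six glance regions, as listed in the lecture.
GLANCE_CLASSES = ["forward", "left", "right", "center_stack",
                  "instrument_cluster", "rearview_mirror"]

def softmax(logits):
    """Numerically stable softmax over the six glance-region scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_glance(logits):
    """Map six raw scores (e.g. from a CNN) to a glance label and probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return GLANCE_CLASSES[best], probs[best]
```

With hypothetical logits favoring the first class, `classify_glance([4.0, 1.0, 0.5, 0.2, 0.1, 0.0])` returns the label `"forward"` with its probability.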
That way, if you want to look at the eyes and study the subtle movement of the eyes, the subtle blinking, the dynamics of the eyelid, the velocity of the eyelid, it's always in the same place, so you can really focus in and remove the effects of any other motion of the head. And then, this is the beauty of deep learning, right? There is some pre-processing, because this is real-world data, but you just dump the raw pixels in and predict whatever you need. What do you need? One is emotion. We had a study where people used a crappy and a good voice-based navigation system. The crappy one got them really frustrated, and they self-reported whether it was a frustrating experience or not on a scale of 1 to 10, so that gives us ground truth. We had a bunch of people use this system and rate themselves as frustrated or not, and then we can train a neural network to predict: is this person frustrated or not? I think we've seen a video of that. It turns out smiling is a strong indication of frustration. You can also predict drowsiness in this way, gaze estimation in this way, and cognitive load, which I'll briefly look at. The process is all the same: you detect the face, you find the landmark points on the face for the face alignment and face frontalization, and then you dump the raw pixels in for classification, step five. You can use SVMs there, or you can use what everyone uses now, convolutional neural networks. The one part where CNNs still struggle to compete is the alignment problem. This is where I talked about the cascaded regressors: finding the landmarks on the eyebrows, the nose, the jawline, the mouth. There are certain constraints there, and algorithms that can utilize those constraints effectively can often perform better than end-to-end regressors that don't have any concept of what a face is shaped like. And there are huge datasets, and we're part of the awesome community that's building those datasets for
face alignment. Okay, so this is again the TA, in his younger form. This is a live, in-the-car, real-time system predicting where they're looking. This is taking slow steps toward the exciting direction that machine learning is headed, which is unsupervised learning. The less you have to have humans look through the data and annotate it, the more power these machine learning algorithms get, right? Currently, supervised learning is what's needed: you need human beings to label a cat and label a dog. But what if a human being only has to label 1%, or one-tenth of a percent, of a dataset, only the hard cases? The machine can come to the human and be like: I don't know what I'm looking at in these pictures. Because of partial occlusions, whether it's your own arm or light conditions, we're not good at dealing with occlusions. We're not good with crazy light drowning out the image; this is what the Google self-driving car actually struggled with when they were trying to use their vision sensors. Moving out of frame, all kinds of occlusions, these are really hard for computer vision algorithms, and in those cases we want the machine to step in and pass the image on to the human, like: help me out with this. And the other part is the corner cases. In driving, for example, 90-plus percent of the time all you're doing is staring forward at the roadway in the same way. That's where the machine shines, that's where automated annotation shines, because it has seen that face for hundreds of millions of frames already in that exact position; it can do all the hard work of annotation for you. It's in the transitions away from those positions that it needs a little bit of help, just to make sure that this person really did start looking away from the road to the rearview mirror, and you bring those frames up. So, using optical flow, putting the optical flow into the convolutional neural network, you use that to predict when something has changed.
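The machine-versus-human split described above comes down to a confidence threshold: frames the model is sure about get auto-labeled, and the hard cases go to a person. A toy sketch, with a hypothetical threshold value rather than whatever the real pipeline uses:

```python
def route_frames(confidences, threshold=0.95):
    """Split frames between machine auto-annotation and human review.

    confidences: per-frame confidence (max class probability) from the model.
    threshold: hypothetical cutoff; confident frames are auto-labeled, while
    the hard cases (occlusion, glare, glance transitions) go to a human.
    Returns (machine_frame_indices, human_frame_indices).
    """
    machine, human = [], []
    for i, c in enumerate(confidences):
        (machine if c >= threshold else human).append(i)
    return machine, human
```

For example, `route_frames([0.99, 0.5, 0.97, 0.2])` sends frames 0 and 2 to the machine and frames 1 and 3 to the human, so the human only ever sees the small uncertain fraction.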
When something has changed, you bring that frame up for annotation. All of this is to build a giant, billions-of-frames annotated dataset of ground truth on which to train your driver state algorithms. And in this way you can control the tradeoff: on the x-axis is the fraction of frames that a human has to annotate, 0% on the left, 10% on the right, and then there's the accuracy tradeoff. The more the human annotates, the higher the accuracy, approaching 100%. But you can still do pretty well: for this gaze classification task, with an 84-fold, almost two orders of magnitude, reduction in human annotation. This is the future of machine learning, and hopefully one day, no human annotation. The result is millions of images like this, video frames, same thing. Driver frustration: this is what I was talking about. The frustrated driver is the one on the bottom, so a lot of movement of the eyebrows and a lot of smiling, and that's true subject after subject. And the satisfied driver, we don't say happy, the satisfied driver is cold and stoic, and that's true subject after subject, because driving is a boring experience and you want it to stay that way. Yes, question? Great question, absolutely. So these are cars owned by MIT; there is somebody in the back. [Audience] But then my emotions, whether I'm happy, might have nothing to do with my driving experience. So the comment was: my emotions might have nothing to do with the driving experience. Yes, and let me continue that comment: with your emotions, you're often an actor on a stage for others. When you're alone you might not express emotion; you're often really expressing emotion for others. Your frustration, the "oh, what the heck," that's for the passenger. And that's absolutely right. So, one of the cool things we're doing: as I said, we now have over a billion video frames from the Tesla, and we're
collecting huge amounts of data in the Tesla. And emotion is a complex thing, right? In this case, we knew the ground truth of how frustrated they were. In naturalistic data, when it's just people driving around, we don't know how they're really feeling at the moment; we're not asking them to enter into an app how they're feeling right now. But we do know certain things. We know that people sing a lot. That has to be a paper at some point; it's awesome, people love singing. That doesn't happen in this kind of data, because there's somebody sitting in the car, and I think the expression of frustration is affected the same way. Yes? So the comment is that the solo dataset is probably going to be very different from a non-solo dataset, with a passenger, and that's very true. The tricky thing about driving, and this is why it's a huge challenge for self-driving cars, for the external-facing sensors and for the internal-facing sensors analyzing human behavior, is that 99.9% of driving is the same thing. It's really boring. So finding the interesting bits is actually pretty complicated. That has to do with emotion: singing is easy to find, because we can track the mouth pretty well, so whenever you're talking or singing we can find that, but finding subtle expressions of emotion is hard when you're solo. And cognitive load, that's a fascinating thing. It's similar to emotion but a little more concrete, in the sense that there's a lot of good science on ways to measure cognitive load, cognitive workload, how occupied your mind is; mental workload is another term used. The window to the cognitive workload soul is the eyes. So first of all, the eyes move in two major ways. Well, they move in a lot of ways, but two major ways. One is saccades: these are ballistic movements, they jump around whenever you look around the room.
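Ballistic jumps like this are usually picked out of gaze recordings by a simple velocity rule: anything moving faster than some cutoff is a saccade, anything slower is fixation or a smooth movement. A toy sketch, where the 30 deg/s cutoff and the 500 Hz sampling rate are made-up illustrative values, not the lecture's pipeline:

```python
def classify_eye_movement(angles, dt=0.002, threshold=30.0):
    """Label each inter-sample interval as 'saccade' or 'slow'.

    angles: gaze angle in degrees, one value per sample.
    dt: sample period in seconds (0.002 s = 500 Hz, hypothetical).
    threshold: angular velocity in deg/s above which we call it a
    saccade (hypothetical value for illustration).
    """
    labels = []
    for a, b in zip(angles, angles[1:]):
        velocity = abs(b - a) / dt  # deg/s between consecutive samples
        labels.append("saccade" if velocity > threshold else "slow")
    return labels
```

A 5-degree jump between two samples at 500 Hz is an angular velocity of 2500 deg/s, so it gets labeled a saccade, while tiny drifts stay below the cutoff.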
When you read, your eyes are jumping around. And if you just follow this bottle with your eyes, your eyes are actually going to move smoothly: smooth pursuit. Somebody actually just told me today that this probably has to do with our hunting background as animals; I don't know how that helps, frogs track flies really well, so, I don't know, anyway. The point is there are smooth pursuit movements, where the eyes move smoothly, and those are all indications of certain aspects of cognitive load. And then there are very subtle movements which are almost imperceptible for computer vision, and these are microsaccades, tremors of the eye. Here is work from Bill Freeman magnifying those subtle movements; these are taken at 500 frames a second. And for cognitive load: when the pupil, that black dot in the middle of the eye, just in case we don't know what a pupil is, gets larger, that's an indicator of high cognitive load. But it also gets larger when the light is dim, so there's this complex interplay. We can't rely, in the wild, outside in the car or just in general outdoors, on using pupil size, even though pupil size has been used effectively in the lab to measure cognitive load. It can't be reliably used in the car. And the same with blinks: when there's higher cognitive load, your blink rate decreases and your blink duration shortens. Okay, I think I'm just repeating the same thing over and over, but you can imagine how we can predict cognitive load, right? We extract video of the eye. Here is the primary eye of the person the system is observing; it happens to be the same TA once again. We take a sequence of 90 images, so that's 6 seconds at 15 frames a second, and we dump that into a 3D convolutional network. That means it's 90 grayscale channels, not 90 separate frames.
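Treating the 90 frames as one (time, height, width) volume means a 3D convolution slides a small spatiotemporal kernel over all three axes at once. A minimal NumPy sketch of the "valid" case; deep-learning frameworks do the same thing, just batched and with learned kernels, and the frame sizes here are toy stand-ins:

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Minimal 'valid' 3D convolution (really cross-correlation, as in
    deep-learning frameworks) over a (time, height, width) frame stack."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# Stand-in for 6 s of grayscale eye video at 15 fps: one (90, H, W) volume.
frames = np.random.rand(90, 8, 8)
# A hand-made kernel that responds to change over time (a temporal edge).
temporal_edge = np.zeros((2, 3, 3))
temporal_edge[0] = 1.0
temporal_edge[1] = -1.0
features = conv3d_valid(frames, temporal_edge)  # shape (89, 6, 6)
```

On a static scene (identical frames) this particular kernel outputs all zeros, which is exactly the "nothing changed" signal; motion of the eyelid or iris lights it up.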
And then the prediction is one of three classes of cognitive load: low cognitive load, medium cognitive load, and high cognitive load. There's ground truth for that, because we had over 500 different people do different tasks of various cognitive load. After some frontalization, again, where no matter where the person is looking, the image of the face is transposed in such a way that the corners of the eyes always remain in the same position, we find the eye: active appearance models find 39 points on the eyelids and the iris, and four points on the pupil. We put all of that into a 3D CNN model: the registered eye-image sequence on the left, the 3D CNN model in the middle, the cognitive load prediction on the right. This code, by the way, is freely available online. All you have to do is dump in a webcam video stream; the CNN runs faster than real time and predicts cognitive load. It's the same process as detecting the identity of the face, the same process as detecting where the driver is looking, the same process as detecting emotion, and all of those require very little hyperparameter tuning of the convolutional neural networks. They only require huge amounts of data. And why do we care about detecting what the driver is doing? I think Eric has mentioned this. Oh man, this is the comeback of the slide; let's criticize it for being a very cheesy slide. On the path toward full automation, we're likely to take gradual steps. Okay, enough of that; this is better. And especially given today, with our new president: this is pickup truck country, this is manually controlled vehicle country, for quite a while. We like control, and control being given to somebody else, to the machine, will be a gradual process, a gradual process of that machine earning trust. And through that process, the machine, like the Tesla, like the BMW, the Mercedes, the Volvo, that's now playing
with these ideas, is going to need to see what the human is doing. And to see what the human is doing: we have billions of miles of forward-facing data; what we need is billions of miles of driver-facing data as well. We're in the process of collecting that, and this is a pitch for automakers and everybody to buy cars that have a driver-facing camera. Let me sort of close. I said we need a lot of data, but I think through this class, and through your own research, you'll find that we're in the very early stages of discovering the power of deep learning. For example, as Yann LeCun said recently, it seems that the deeper the network, the better the results, in a lot of really important cases, even though the data is not increasing. So why does a deeper network give better results? This is a mysterious thing we don't understand. There are these hundreds of millions of parameters, and from them is emerging some kind of structure, some kind of representation of the knowledge that we're giving it. One of my favorite examples of this emergent concept is Conway's Game of Life. Those of you who know what this is will probably criticize me for it being as cheesy as the stairway slide, but I think it's actually such a simple and brilliant example of how, like a neuron in a neural network, a really simple computational unit can produce incredible power when you just combine a lot of them in a network. This is called a cellular automaton. Every single cell is operating under a simple rule. You can think of it as a cell living and dying: it's filled in black when it's alive and white when it's dead. If it's alive and has two or three live neighbors, it survives to the next time step; otherwise it dies. And if it's dead and has exactly three live neighbors, it comes back to life. That's a simple rule.
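That rule fits in a few lines of code. A sketch of one update step, using a sparse set of live (row, column) cells and the standard survive-on-2-or-3, born-on-3 rule the lecture describes:

```python
from collections import Counter

def life_step(alive):
    """One update of Conway's Game of Life on a set of live (row, col) cells.

    A live cell with two or three live neighbors survives; a dead cell
    with exactly three live neighbors comes to life; everything else dies.
    """
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in alive
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    return {cell for cell, n in neighbor_counts.items()
            if n == 3 or (n == 2 and cell in alive)}
```

Iterating `life_step` on a small seed is enough to watch the patterns the slide shows: a horizontal "blinker" of three cells flips to vertical and back, and larger seeds produce the complex behavior under discussion.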
Whatever cell you look at, all it's doing is operating under this very local rule, same as a neuron, or the way we're currently training neural networks: we're optimizing over a local gradient, the same kind of local rules. And what happens if you run this system operating under these really local rules is what you get on the right. Again, you have to go home, hopefully no drugs involved, and open up your mind and see how amazing that is. Because what happens is that a local computational unit that knows very little about the world somehow produces really complex emergent patterns, and we don't understand why. In fact, under different rules, incredible patterns emerge, and it feels like living creatures communicating when you just watch it. Not these examples, this is the original, the ones that get complex and interesting; but even in these examples, the complex geometric patterns that emerge are incredible, and we don't understand why. Same with neural networks: we don't understand why, and we need to, in order to see how these networks will be able to reason. Okay, so what's next? I encourage you to read the Deep Learning book; it's available online at deeplearningbook.org. There's a ton of amazing papers coming out every day on arXiv. I'll put these links up, but there are a lot of good collections of strong papers, lists of papers. There is the literally awesome list, the Awesome Deep Learning Papers list on GitHub; it calls itself awesome, and it happens to be awesome. And there are a lot of blogs that are just amazing; that's how I recommend you learn machine learning, on blogs. And if you're interested in the application of deep learning in the automotive space, you can come do research in our group; just email me. Anyway, we have three winners: Jeffrey Hugh, Michael Gump, and... are you here? Yes? Hey, how do you say your
name? No, that's not your name? All right. Oh, I see, you're here. So he achieved a stunning speed. This was kind of incredible: I didn't know what kind of speed we were going to be able to achieve. I thought 73 was unbeatable, because we played with it for a while and couldn't reach 73. We designed a deterministic algorithm, meaning it's cheating, and that cheating algorithm got 74, I believe. And folks have come up with algorithms that have beaten 73 and even that 74. So this is really incredible. And the other two guys, all three of you, get a free term of the Udacity self-driving car engineer nanodegree. Thanks to those guys for giving that award and bringing their army of brilliant people; they have people who are obsessed with self-driving cars, and we've received over 2,000 submissions for this competition, a lot of them from those guys, and they're just brilliant. It's really exciting to have such a big community of deep learning folks working in this field. So this, for the rest of eternity, well, we're going to change this up a little bit, but this is actually the three winning neural networks running side by side. You can see the number of cars passed there: first place is on the left, then second place and third place. And in fact, third place is almost winning... wait, no, second place is winning currently. That just tells you the random nature of competition: sometimes you win, sometimes you lose. The actual evaluation process runs through a lot of iterations and takes the median evaluation. With that, let me thank... well, wait, wait, there was a question about the winning networks. Yeah, so all three winners wrote me a note about how their networks work. I did not read those notes, which tells you how crazy this has been. I'll
post their winning networks online, and I encourage you to continue competing and continue submitting networks. This will run for a while, and we're working on a journal paper for this game; we're trying to find the optimal solutions. Okay, so this is the first time I've ever taught a class, and obviously the first time teaching this class, so thank you so much for being a part of it. Thank you. Thank you to Eric. If you didn't get a shirt, please come down and get a shirt; just write your email on the index note. Thank you.