Thank you everyone for braving the cold and the snow to be here. This is 6.S094: Deep Learning for Self-Driving Cars. It's a course where we cover two topics: deep learning, a set of techniques that have taken a leap in the last decade in our understanding of what artificial intelligence systems are capable of doing, and self-driving cars, systems that can take these techniques and integrate them in a meaningful, profound way into our daily lives, in a way that transforms society. That's why both of these topics are extremely important and extremely exciting.

My name is Lex Fridman, and I'm joined by an amazing team of engineers in Jack Terwilliger, Julia Kindelsberger, Dan Brown, Michael Glazer, Li Ding, Spencer Dodd, and Benedikt Jenik, among many others. We build autonomous vehicles here at MIT, not just ones that perceive and move about the environment, but ones that interact, communicate, and earn the trust and understanding of the human beings inside the car, the drivers and the passengers, and the human beings outside the car, the pedestrians, other drivers, and cyclists.

The website for this course is selfdrivingcars.mit.edu. If you have questions, email deepcars@mit.edu; the Slack is deep-mit. For registered MIT students: you have to register on the website and, by midnight Friday, January 19th, build a neural network that achieves a speed of 65 miles per hour in the new DeepTraffic 2.0 and submit it to the competition. It's much harder and much more interesting than last year's, for those of you who participated. There are three competitions in this class: DeepTraffic, SegFuse, and DeepCrash. There are guest speakers coming from Waymo, Google, Tesla, and from those starting new autonomous vehicle startups, Voyage, nuTonomy, and Aurora, some of them coming straight from CES. And we have shirts: for those of you who braved the snow, and continue to do so, toward the end of the class there will be free shirts. Yes, I said "free" and "shirts" in the same sentence. You should be here.

Okay, first, the DeepTraffic competition. There are a lot of updates, and we'll cover those on Wednesday. It's a deep reinforcement learning competition. Last year we received over 18,000 submissions. This year we're going bigger: not only can you control one car with a neural network, you can control up to ten. This is multi-agent deep reinforcement learning. This is super cool.

Second, SegFuse, a dynamic driving scene segmentation competition. You're given the raw video, the kinematics of the vehicle (the movement of the vehicle), and the state-of-the-art segmentation; for the training set you're given ground truth labels: pixel-level labels, scene segmentation, and optical flow. With those pieces of data, your task is to try to perform better than the state of the art in image-based segmentation. Why is this critical, fascinating, and an open research problem? Because robots that act in this world, in physical space, must not only use these deep learning methods to interpret the spatial, visual characteristics of a scene; they must also interpret, understand, and track the temporal dynamics of the scene. This competition is about temporal propagation of information, not just scene segmentation: you must understand the scene in space and time.

And finally, DeepCrash, where we use deep reinforcement learning, slamming cars thousands of times here at MIT, at the gym. You're given data on a thousand runs of a car that, knowing nothing, uses a monocular camera as its single input,
driving at over 30 miles an hour through a scene. It has very little control authority and very little capability to localize itself; it must act very quickly. You're given a thousand runs to learn from; we'll discuss this in the coming weeks. Everyone's submission is evaluated in simulation, but the top four submissions we put head to head at the gym, and until a winner is declared, we keep slamming cars at 30 miles an hour. DeepCrash.

Also on the website from last year, and on GitHub, is DeepTesla, which uses the large-scale naturalistic driving dataset we have to train a neural network to do end-to-end steering: it takes in monocular video of the forward roadway and produces steering commands for the car.

Lectures: today we'll talk about deep learning; tomorrow, autonomous vehicles. Deep reinforcement learning is on Wednesday; driving scene understanding, so segmentation, is Thursday. On Friday we have Sacha Arnoud, the director of engineering at Waymo. Waymo is one of the companies that is truly taking huge strides in fully autonomous vehicles; they're taking the fully L4/L5 autonomous vehicle approach, and since he's also the head of perception for them, it's fascinating to learn what kinds of problems they're facing and what kind of approach they're taking. We have Emilio Frazzoli; one of last year's speakers, Sertac Karaman, said Emilio is the smartest person he knows. Emilio Frazzoli is the CTO of nuTonomy, an autonomous vehicle company that was just acquired by Delphi for a large sum of money, and they're doing a lot of incredible work in Singapore and here in Boston. Next Wednesday we're going to talk about the topic of our research, my personal fascination: deep learning for driver state sensing, understanding the human, perceiving everything about the human being inside the car and outside the car. One talk I'm really excited about is Oliver Cameron on Thursday. He is now the CEO of the autonomous vehicle startup Voyage; he was previously the director of the self-driving car program at Udacity. He will talk about how to start a self-driving car company; for those of you MIT folks and entrepreneurs who want to start one yourself, he'll tell you exactly how. It's super cool. And then Sterling Anderson, who was previously the director of the Tesla Autopilot team and is now a co-founder of Aurora, the self-driving car startup I mentioned that has now partnered with NVIDIA and many others.

So, why self-driving cars? This class is about applying data-driven learning methods to the problem of autonomous vehicles. Why are self-driving cars a fascinating and interesting problem space? Quite possibly, in my opinion, this is the first wide-reaching and profound integration of personal robots in society. Wide-reaching because there are one billion cars on the road; even a fraction of that will change the face of transportation and how we move about this world. Profound, and this is an important point that's not always understood, because there's an intimate connection between a human and a vehicle when there's a direct transfer of control: a transfer of control that takes his or her life into the hands of an artificial intelligence system. I show a few quick clips here; you can Google "first time with Tesla Autopilot" on YouTube and watch people perform that transfer of control. There's something magical about a human and a robot working together that will transform what artificial intelligence is
in the 21st century. And this particular autonomous system, this AI system of self-driving cars, operates at such a scale, and its life-critical nature is so profound, that it will truly test the capabilities of AI. There is a personal connection here, and I will argue throughout these lectures that we cannot escape considering the human being. The autonomous vehicle must not only perceive and control its movement through the environment; it must also perceive everything about the human driver and the passengers, and interact, communicate, and build trust with that driver. Because in my view, as I will argue throughout this course, an autonomous vehicle is more of a personal robot than it is a perfect perception-control system, because perfect perception and control in this world full of humans is extremely difficult, and full autonomy could be two, three, four decades away. Autonomous vehicles are going to be flawed. They're going to have flaws, and we have to design systems that effectively transfer control to human beings when they can't handle the situation. And that transfer of control is fascinating for AI, because perception of obstacles and obstacle avoidance is the easy problem, the safe problem. Going 30 miles an hour, navigating through the streets of Boston, is easy. It's when you have to get to work and you're late, or you're sick of the person in front of you, so you want to pull into the opposing lane and speed up: that's human nature, and we can't escape it. Our artificial intelligence systems can't escape human nature; they must work with it.

What's shown here is one of the algorithms we'll talk about next week for cognitive load, where 3D convolutional neural networks take in the raw eye region, the blinking and the pupil movement, to determine the cognitive load of the driver. We'll see how we can detect everything about the driver: where they're looking, emotion, cognitive load, body pose estimation, drowsiness.

The movement toward full autonomy is so difficult that, I would argue, it almost requires human-level intelligence. The two-, three-, four-decade-out journey for artificial intelligence researchers to achieve full autonomy will require solving some of the fundamental problems of creating intelligence, and that's something we'll discuss in much more depth, and with a broader view, in two weeks in the Artificial General Intelligence course, where we have Andrej Karpathy from Tesla, Ray Kurzweil, and Marc Raibert from Boston Dynamics, who asked for the dimensions of this room because he's bringing robots. Nothing else was told to me; it'll be a surprise.

So that is why I argue for a human-centered artificial intelligence approach, where every algorithm we design considers the human. For the autonomous vehicle, on the left, the perception, scene understanding, and control problem, as we'll explore through the competitions and the assignments of this course, can handle 90 percent, and an increasing share, of the cases. But it's the remaining 10, 1, 0.1 percent of cases, as we get better and better, that we're not able to handle through these methods, and that's where perceiving the human is really important. This is the video from last year of the Arc de Triomphe (thank you; I didn't know its name last year, I know it now). That is one of millions of cases where human-to-human interaction, not the basic perception-control problem, is the dominant driver.

So why deep learning in this space? Because deep learning is a set of methods that do well with a lot of
data. To solve these problems where human life is at stake, we have to have techniques that learn from data, that learn from real-world data. This is the fundamental reality of artificial intelligence systems that operate in the real world: they must learn from real-world data, whether that's on the left, the perception and control side, or on the right, the human side: perception of the human, and communication, interaction, and collaboration with the human, human-robot interaction.

Okay, so what is deep learning? It's a set of techniques. If you allow me the definition of intelligence as the ability to accomplish complex goals, then I would argue that a definition of understanding, maybe reasoning, is the ability to turn complex information into simple, useful, actionable information, and that is what deep learning does. Deep learning is representation learning, or feature learning if you will. It's able to take raw, complicated information that's hard to do anything with and construct hierarchical representations of that information, to be able to do something interesting with it. It is the branch of artificial intelligence most capable of, and most focused on, this task of forming representations from data. Whether it's supervised or unsupervised, whether it's with the help of humans or not, it's able to find structure in the data such that you can extract simple, useful, actionable information.

On the left, from Ian Goodfellow's book, is the basic example of image classification. The input is the image on the bottom, the raw pixels, and as we go up the stack, as we go up the layers, higher and higher-order representations are formed: from edges to contours to corners to object parts, and finally the full object, the semantic classification of what's in the image. This is representation learning. A favorite example of mine is one from four centuries ago: our place in the universe, and representing that place relative to Earth or relative to the Sun. On the left is our current belief; on the right is the one that was widely held until a few centuries ago. Representation matters, because what's on the right is much more complicated than what's on the left.

You can think of a simple case, where the task is to draw a line that separates green triangles and blue circles. In the Cartesian coordinate space on the left, the task is very difficult, impossible to do well; on the right, in polar coordinates, it's trivial. This transformation is exactly what we need to learn. This is representation learning. You can take a similar task of having to draw a line that separates the blue curve and the red curve. On the left, if we draw a straight line, there's no way to do it with zero error, with 100 percent accuracy; shown is our best attempt. But what we can do with deep learning, with a single-hidden-layer network, is warp the topology, the mapping of the space, shown in the middle, in such a way that a straight line can be drawn to separate the blue curve and the red curve. Learning the function in the middle is what we're able to achieve with deep learning: taking raw, complicated information and making it simple, actionable, useful.
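To make the coordinate-transform idea concrete, here is a tiny sketch of my own (not from the lecture materials): two classes arranged in concentric rings cannot be separated by a straight line in Cartesian coordinates, but after a hand-coded map to polar coordinates, a single threshold on the radius separates them perfectly. This is exactly the kind of transformation a learned hidden layer can discover on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes: an inner ring (class 0) and an outer ring (class 1).
# No straight line in (x, y) separates them.
n = 200
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
radius = np.where(np.arange(n) < n // 2, 1.0, 3.0) + rng.normal(0.0, 0.1, n)
labels = (np.arange(n) >= n // 2).astype(int)
x, y = radius * np.cos(theta), radius * np.sin(theta)

# Hand-coded "representation learning": map to polar coordinates.
r = np.sqrt(x**2 + y**2)

# In the new representation, a single threshold (a line) separates the classes.
predictions = (r > 2.0).astype(int)
print(f"accuracy of a linear rule in polar coordinates: {(predictions == labels).mean():.2f}")
```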
The point is that this ability to learn from raw sensory information means we can do a lot more with a lot more data. Deep learning gets better with more data, and that's important for real-world applications, where edge cases are everything.

This is us driving with two perception-control systems. One is a Tesla vehicle with the Autopilot version 1 system, which uses a monocular camera to perceive the external environment and produce control decisions; the other is our own neural network, running on a Jetson TX2, taking in the same monocular camera feed and producing control decisions. The two systems argue, and when they disagree, they raise a flag to say that this is an edge case that needs human intervention. Covering such edge cases using machine learning is the main problem of artificial intelligence applied to the real world; it is the main problem to solve.

Okay, so what are neural networks? They're inspired, very loosely (and I'll discuss the key differences between our own brains and artificial neural networks, because there are a lot of insights in those differences), by biological neural networks. Here is a simulation of a thalamocortical brain network, which is only 3 million neurons and 476 million synapses; the full human brain is a lot more than that: a hundred billion neurons, 1,000 trillion synapses. There's inspirational music with this one that I didn't realize was here; it should make you think about artificial neural networks. Let's just let it play.

The human neural network is a hundred billion neurons, 1,000 trillion synapses. One of the state-of-the-art artificial neural networks, ResNet-152, has about 60 million synapses. That's a difference of about seven orders of magnitude: human brains have roughly ten million times more synapses than artificial neural networks, plus or minus an order of magnitude depending on the network.

So what's the difference between a biological neuron and an artificial neuron? Topology: the human brain has no layers; artificial neural networks are stacked in layers and fixed for the most part. There is chaos, very little structure, in how neurons in the human brain are connected; a neuron is often connected to 10,000-plus other neurons, so the number of synapses feeding into an individual neuron is huge. Asynchrony: the human brain works asynchronously; artificial neural networks work synchronously. Learning algorithm: for artificial neural networks the only one, the best one, is backpropagation, and we don't know how human brains learn. Processing speed: this is one of the only advantages we have; artificial neurons are faster, but biological neurons are also extremely power-efficient. And there is a division into two stages, training and testing, with artificial neural networks; biological neural networks, as you're sitting here today, are always learning. The only profound similarity, the inspiring one, the captivating one, is that both are distributed computation at scale. There is an emergent aspect to neural networks: the basic element of computation, a neuron, is extremely simple, but when connected together, beautiful, amazing, powerful approximators can be formed.

A neural network is built up from these computational units. There's a set of edges with weights on them; the weights are multiplied by the input signal; a bias is added; and a nonlinear activation function determines whether the neuron gets activated or not, as visualized here. These neurons can be combined in a number of ways: they can form a feed-forward neural network, or they can feed back into themselves, to have state, memory, in recurrent neural networks.
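In code, that single computational unit is just a weighted sum, a bias, and a nonlinearity. A minimal sketch of my own, with made-up weights and inputs:

```python
import numpy as np

def sigmoid(z):
    """Nonlinear activation: squashes the pre-activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, plus a bias, through an activation."""
    return sigmoid(np.dot(weights, inputs) + bias)

# Made-up example: three inputs, three weights, one bias.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.5])
b = 0.1
print(neuron(x, w, b))  # a single activation value in (0, 1)
```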
The ones on the left are the most successful for most applications, in computer vision especially; the ones on the right are very popular whenever temporal dynamics, time series of any kind, are involved. In fact, the ones on the right are much closer to the way our human brains work than the ones on the left, but that's also why they're really hard to train.

One beautiful aspect of this emergent power of multiple neurons connected together is the universality property: with a single hidden layer, these networks can learn to approximate any function. That's an important property to be aware of, because the limits here are not in the power of the networks; the limits are in the methods by which we construct and train them.

What kinds of machine learning, of deep learning, are there? We can separate them into two categories: memorizers, the approaches that essentially memorize patterns in the data, and approaches that, we can loosely say, are beginning to reason, to generalize over the data with minimal human input. Shown in blue are the quote-unquote teachers: how much human input is needed to make each method successful. For supervised learning, which is where most of deep learning's successes come from, most of the data is annotated by human beings; the human is at the core of the success. Most of the data used for training needs to be annotated by humans, with some additional successes coming from augmentation methods that extend the data on which these networks are trained. Then there are the semi-supervised, reinforcement, and unsupervised learning methods that we'll talk about later in the course. That's where, we hope, the near-term successes are, and the unsupervised learning approaches are where the true excitement about the possibilities of artificial intelligence lies: being able to make sense of our world with minimal input from humans.

So we can think of two kinds of deep learning impact. One is special-purpose intelligence: taking a problem, formalizing it, collecting enough data on it, and being able to solve a particular case that provides value. Of particular interest here is a network that estimates apartment costs in the Boston area: you give it the number of bedrooms, the square footage, and the neighborhood, and it provides as output the estimated cost. On the right is actual data on apartment costs; we're standing in an area that runs over three thousand dollars for a studio apartment, and some of you may be feeling that pain.

And then there's general-purpose intelligence, or something that feels like it's approaching general-purpose intelligence, which is reinforcement and unsupervised learning. Here, with Andrej Karpathy's Pong from Pixels, is a system that takes in an 80-by-80-pixel image and, with no other information, is able to win at this game. No information except a sequence of images, raw sensory information, the same kind of low-level data human beings take in through visual, audio, and touch senses, and it's able to learn to win. It's a very simplistic, artificially constructed world, but nevertheless a world where no features are engineered by hand: only raw sensory information is used to win, with very sparse, minimal human input. We'll talk about that on Wednesday with deep reinforcement learning.
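Circling back to the special-purpose apartment example: concretely, such an estimator is just a small feed-forward regression network. A minimal sketch, entirely illustrative, with made-up feature encodings and layer sizes:

```python
import torch
import torch.nn as nn

# Hypothetical features: [num_bedrooms, square_feet, neighborhood_id].
# A real model would one-hot encode the neighborhood; this is just a sketch.
model = nn.Sequential(
    nn.Linear(3, 16),   # 3 input features -> 16 hidden units
    nn.ReLU(),
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 1),   # single output: estimated monthly cost
)

listing = torch.tensor([[1.0, 550.0, 7.0]])  # a made-up studio-sized listing
print(model(listing))  # meaningless until the network is trained on labeled data
```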
For now, though, we'll focus on supervised learning, where there is input data, there is a learning system we're trying to train, and there is a correct output labeled by human beings. That's the general training process for a neural network: input data, labels, and the training of that model, so that in the testing stage, given new input data it has never seen before, it's tasked with producing guesses and is evaluated on them. For autonomous vehicles, that means being released, either in simulation or in the real world, to operate.

How do neural networks learn? In the training stage there's a forward pass, taking the input data and producing a prediction; then, given that there's ground truth in the training stage, we have a measure of error based on a loss function, which punishes the synapses, the connections, the parameters that were involved in making that wrong prediction, and backpropagates the error through those weights. We'll discuss that in a little more detail in a bit.

So what can we do with deep learning? You can do one-to-one mapping, and the input can be anything: a number, a vector of numbers, a sequence of numbers, a sequence of vectors of numbers. Anything you can think of, from images to video to audio to text, can be represented this way, and the output likewise can be a single number, or images, video, text, audio. One-to-one mapping; one-to-many; many-to-one; many-to-many; and many-to-many with different starting points for the data, asynchronous.

Some quick terms that will come up. Deep learning is the same as neural networks, really deep, large neural networks; it's the subset of machine learning that has been extremely successful in the past decade. Multilayer perceptron, deep neural network, recurrent neural network, long short-term memory network (LSTM), convolutional neural network, and deep belief networks: all of these will come up in the slides. And there are specific operations, layers, within these networks: convolution, pooling, activation, and backpropagation, a concept we'll discuss in this class.

Activation functions: there are a lot of variants. In the left column is the activation function, with the input on the x-axis and the output on the y-axis. For the sigmoid function, in case the font is too small: the output is not centered at zero, and it suffers from vanishing gradients. The tanh function is centered at zero, but it still suffers from vanishing gradients. Vanishing gradients means that when the input is very low or very high, the derivative of the function, as you see in the right column, is very low, so the learning is very slow. ReLU is also not zero-centered, but it does not suffer from vanishing gradients.

Backpropagation is the process of learning. It's the way we compute the loss function, in the bottom right of the slide (taking the actual output of the network from a forward pass, subtracting the ground truth, squaring, dividing by two), and use that loss to construct a gradient, to backpropagate the error to the weights that were responsible for making either a correct or an incorrect decision. So the subtasks are: there's a forward pass; there's a backward pass; and a fraction of the gradient is subtracted from each weight. That's it. That process is modular, local to each individual neuron, which is why we're able to distribute it, to parallelize it across a GPU.
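To make that concrete, here is a minimal forward and backward pass for a single sigmoid neuron with the squared-error loss described above, L = (ŷ − y)²/2, written out by hand. This is my own sketch with made-up numbers; real frameworks compute these gradients automatically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])   # made-up inputs
y = 1.0                     # ground-truth label
w = np.array([0.3, 0.3])    # initial weights
b = 0.0                     # initial bias
lr = 0.5                    # learning rate: the fraction of the gradient we subtract

for step in range(100):
    # Forward pass: prediction and loss L = (y_hat - y)^2 / 2.
    y_hat = sigmoid(np.dot(w, x) + b)
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: chain rule, local to this neuron.
    dL_dz = (y_hat - y) * y_hat * (1.0 - y_hat)  # sigmoid derivative is small when
                                                 # saturated: vanishing gradients
    w -= lr * dL_dz * x   # subtract a fraction of the gradient from each weight
    b -= lr * dL_dz

print(f"final loss: {loss:.5f}")
```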
So, learning: these computational units are extremely simple, and they're extremely simple to correct when they make an error, when they're part of a larger network that makes an error. It all boils down to an optimization problem, where the objective, the utility function, is the loss function, and the goal is to minimize it. We have to update the parameters, the weights and the biases, to decrease that loss function, and that loss function is highly nonlinear.

Depending on the activation functions, different properties, different issues arise. There are vanishing gradients for sigmoid, where learning can be slow. There are dying ReLUs, where the derivative is exactly zero for inputs less than zero; there are solutions to this, like leaky ReLUs, and a bunch of details you may discover when you try to win the DeepTraffic competition. But for the most part these are the main activation functions, and it's the choice of the neural network designer which one works best. There are saddle points; all the problems from numerical nonlinear optimization arise here. It's hard to break symmetry, and stochastic gradient descent, without any tricks, can take a very long time to arrive at the minimum.

One of the biggest problems in all of machine learning, and certainly deep learning, is overfitting. You can think of the blue dots in the plot here as the data to which we want to fit a curve; we want to design a learning system that approximates the regression of this data. In green is a sine curve: simple, and it fits well. Then there's a ninth-degree polynomial, which fits even better in terms of training error, but it clearly overfits: on other data it has not yet seen, it's likely to produce a high error. It's overfitting the training set. This is a big problem for small datasets, and we have to fix it with regularization.

Regularization is a set of methodologies that prevent overfitting, that prevent learning the training data so well that we're unable to generalize to the testing stage. The main symptom of overfitting is that the error keeps decreasing on the training set but increases on the test set. There are a lot of techniques in traditional machine learning that deal with this, cross-validation and so on, but because of the cost of training neural networks, it's traditional to use what's called a validation set: you create a subset of the training data that you keep away, for which you have the ground truth, and use it as a representative of the test set. You perform early stopping, or more realistically just save a checkpoint often, to see how the performance on the validation set changes as training evolves, and you can stop when the performance on the validation set gets a lot worse: it means you're overtraining on the training set. In practice, of course, we run training much longer and see which snapshot, which checkpoint, of the network performs best.

Dropout is another very powerful regularization technique, where we randomly remove some of the nodes in the network, along with their incoming and outgoing edges. What that really looks like is a probability of keeping a node, and in many deep learning frameworks today it comes as a dropout layer. It's essentially a probability, usually greater than 0.5, that a node will be kept; for the input layer the keep probability should be much higher, or, more effectively, what works well there is just adding noise. What's the point? You want to create enough diversity in the training process that what is learned generalizes to the test set, as you'll see with the DeepTraffic competition.
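Here is a minimal sketch of how dropout appears in practice, with illustrative layer sizes. One caution: PyTorch's nn.Dropout takes the probability of dropping a unit, whereas the lecture describes the probability of keeping one, so a keep probability of 0.6 means p=0.4:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.4),  # each hidden unit is dropped with probability 0.4 (kept with 0.6)
    nn.Linear(64, 2),
)

x = torch.randn(8, 32)  # made-up batch

model.train()           # dropout is active during training...
print(model(x).shape)

model.eval()            # ...and disabled automatically at test time
print(model(x).shape)
```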
There are also the L2 and L1 penalties, weight decay, a weight penalty, where weights that grow too large are penalized. The L2 penalty keeps the weights small unless the error derivative is huge; it produces smoother models, and when there are two similar inputs, it prefers to distribute the weight, putting half on each, as opposed to putting all the weight on one of the edges, which makes the network more robust. The L1 penalty has the one benefit that really large weights are allowed to stay: it allows a few weights to remain very large. These are the regularization techniques, and I wanted to mention them because they're useful for some of the competitions in this course. I also recommend you go to the TensorFlow playground, playground.tensorflow.org, to play with some of these parameters: online, in the browser, you can play with different inputs, different features, different numbers of layers, and different regularization techniques, and build your intuition about classification and regression problems on different input datasets.
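Before moving on, a quick sketch of how the L2 penalty shows up in code. In most frameworks it's a single argument on the optimizer; here, PyTorch's weight_decay, which is roughly equivalent to adding a λ·Σw² term that shrinks the weights at every step (the 1e-4 is just an illustrative value):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# weight_decay applies an L2 penalty on the weights at every update step.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(16, 10), torch.randn(16, 1)  # made-up batch
loss = nn.MSELoss()(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()  # gradient step plus the L2 shrinkage on the weights
```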
So what changed? Why, after many decades, and two winters, are neural networks now again dominating the artificial intelligence community? CPUs, GPUs, ASICs: computational power has skyrocketed, from Moore's law to GPUs. There are huge datasets, including ImageNet and others. There is research: backpropagation in the '80s, convolutional neural networks, LSTMs; there have been a lot of interesting breakthroughs in how to design these architectures and build them so they're efficiently trainable on GPUs. There is the software infrastructure, from being able to share data and code on GitHub, to being able to implement neural networks as a stack of layers, as opposed to implementing everything from scratch, with TensorFlow, PyTorch, and other deep learning frameworks. And there's huge financial backing from Google, Facebook, and so on.

To understand why deep learning works so well and where its limitations are, we need to understand where our own intuition comes from about what is hard and what is easy. The important thing about computer vision, which is a lot of what this course is about, even in its deep reinforcement learning formulation, is that visual perception, for us human beings, formed 540 million years ago. That's 540 million years' worth of data. Abstract thought formed only about a hundred thousand years ago: several orders of magnitude less data. So predictions that seem trivial to us human beings can be completely challenging for, and gotten completely wrong by, neural networks. Here on the left is a prediction of a dog; with a little bit of distortion, of noise, added to the image, producing the image on the right, the network confidently, with 99-plus percent confidence, predicts that it's an ostrich.

And there are all these problems to deal with, whether in computer vision, text, or audio data; all of this variation arises in vision. Illumination variability: the set of pixels, the raw numbers, look completely different depending on the lighting conditions, and lighting variability is the biggest problem in driving. Pose variability: objects need to be learned from every different perspective. I'll discuss this when we get to sensing the driver: most of the deep learning work done on the face, on the human, is done on the frontal or semi-frontal face, and there's very little work on the full 360-degree pose variability a human being can take on. And intra-class variability: for the classification problem, for the detection problem, there are a lot of different kinds of cats, dogs, cars, bicyclists, pedestrians.

So that brings us to object classification, and I'd like to take you through where deep learning has taken big strides over the past several years, leading up to 2018. Object classification is when you take a single image and you have to say the one class that's most likely to be in that image. The most famous variant of that is the ImageNet challenge. ImageNet is a dataset of 14 million images with 21,000 categories. For, say, the category of fruit, there's a total of 188,000 images of fruit, and there are 1,200 images of Granny Smith apples; that gives you a sense of what we're talking about here. This has been the source of a lot of interesting breakthroughs in deep learning, and a lot of the excitement.

The first big successful network in deep learning, at least the one that became famous, was AlexNet, in 2012, which took a significant leap in performance on the ImageNet challenge. It was one of the first neural networks successfully trained on the GPU, and it achieved an incredible performance boost over the previous year. The challenge is: given a single image, you have five guesses, and one of them has to be correct. The question of human annotation often comes up: how do you know the ground truth? Human-level performance is 5.1 percent error on this task. The way the annotation for ImageNet is performed, there's a Google search where you pull images already labeled for you, and then the annotation that humans on Mechanical Turk perform is just binary: is this a cat or not a cat? They're not tasked with performing very high-resolution semantic labeling of the image.

So, from 2012 with AlexNet to today, and the big transition in 2018 of the ImageNet challenge leaving Stanford and going to Kaggle, which is sort of a monumental step: in 2015, with the ResNet network, was the first time that human-level performance was exceeded. I think this is a very important marker of where deep learning is, for what I would argue is a toy example, despite the fact that it's 14 million images. We're developing state-of-the-art techniques here, and the next stage, now that we're exceeding human-level performance on this task, is to take these methods into the real world, to perform scene perception, to perform driver state perception. In 2016 and 2017, CUImage and then SENet won; SENet has a unique new addition to the previous formulations and achieved an error of 2.25 percent on the ImageNet classification challenge. It's an incredible result.

Okay, so you have this image classification architecture that takes in a single image, passes it through convolution, pooling, convolution, and, at the end, fully connected layers, and performs a classification task or a regression task. You can swap out those final layers to perform other kinds of tasks, including image captioning with recurrent neural networks and so on, or localization of bounding boxes.
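Here is a sketch of what "five guesses" looks like in practice with a modern pretrained network. This is illustrative: it assumes torchvision's bundled ImageNet weights for ResNet-50 and a hypothetical local image file, dog.jpg:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing: resize, crop, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(pretrained=True)  # weights trained on ImageNet
model.eval()

image = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # hypothetical input image
with torch.no_grad():
    probs = torch.softmax(model(image), dim=1)

# The top-5 rule: the prediction counts as correct if any of these is the true label.
top5 = torch.topk(probs, k=5)
print(top5.indices)  # the five guessed class indices
print(top5.values)   # and their confidences
```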
Or you can do fully convolutional networks, which we'll talk about on Thursday: you take an image as input and produce an image as output, where the output image, in this case, is a segmentation, where color indicates the category of the object. It's pixel-level segmentation: every single pixel in the image is assigned a class, a category that the pixel belongs to. This is the kind of output that's overlaid on top of the other sensory information coming from the car in order to perceive the external environment.

You can continue to extract information from images this way, producing image-to-image mappings, for example to colorize images, going from grayscale to color. Or you can use that kind of heat-map information to localize objects in the image: as opposed to just classifying that this is an image of a cow, R-CNN, Fast R-CNN, Faster R-CNN, and a lot of other localization networks let you propose candidates for where exactly the cow is located in the image, and thereby perform object detection, not just object classification.

2017 brought a lot of cool applications of these architectures. One is background removal, again mapping from image to image: the ability to remove the background from selfies, from pictures of human or human-like faces; the reference, with some incredible animations, is at the bottom of the slide, and the slides are now available online. Pix2PixHD: there's been a lot of work on GANs, generative adversarial networks, and in driving in particular, GANs have been used to generate examples from source data. In the case of Pix2PixHD, that means taking coarse, pixel-level semantic labels and producing photorealistic, high-definition images of the forward roadway. This is an exciting possibility for self-driving cars: to generate a variety of cases to learn from, to augment the data, to change the way different roads look, the road conditions, the way vehicles, cyclists, and pedestrians look.

Then we can move on to recurrent neural networks. Everything I've talked about so far was one-to-one mapping, from image to image or image to number, but recurrent neural networks work with sequences. We can use sequences to generate handwriting; to generate text captions from an image, based on the localization of the various detections in that image; and to do video description generation: taking a video and combining convolutional neural networks with recurrent neural networks, using the convolutional network to extract features frame to frame, and feeding those extracted features into the RNN to generate a description of what's going on in the video. There are a lot of exciting approaches for autonomous systems, especially drones, where the time to make a decision is short; same with an RC car traveling 30 miles an hour. Attention mechanisms, for steering the attention of the network, have been very popular for localization tasks, and for limiting how much of the image, how many pixels, need to be considered in the classification task. We can model the way a human being looks around an image to interpret it and have the network do the same, and we can use that kind of steering to draw images as well.
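As a rough skeleton of that CNN-plus-RNN pattern for video, entirely illustrative: a real description system needs a trained vocabulary, a decoding loop, and usually attention. Here, a CNN backbone extracts per-frame features, and an LSTM consumes the resulting sequence:

```python
import torch
import torch.nn as nn
from torchvision import models

class VideoDescriber(nn.Module):
    """Sketch: a CNN extracts per-frame features; an LSTM models the sequence."""

    def __init__(self, vocab_size=1000, hidden=256):
        super().__init__()
        cnn = models.resnet18(pretrained=False)                # frame feature extractor
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])   # drop the classifier head
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.to_vocab = nn.Linear(hidden, vocab_size)          # word logits per time step

    def forward(self, frames):                  # frames: (batch, time, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))  # run the CNN on every frame
        feats = feats.view(b, t, 512)           # back to a sequence of feature vectors
        out, _ = self.rnn(feats)                # temporal modeling over the sequence
        return self.to_vocab(out)               # (batch, time, vocab) logits

video = torch.randn(2, 8, 3, 224, 224)          # made-up batch: 2 clips, 8 frames each
print(VideoDescriber()(video).shape)            # torch.Size([2, 8, 1000])
```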
Finally, the big breakthroughs came from approaches like Pong from Pixels: reinforcement learning on raw sensory data, the deep RL methods we'll talk about on Wednesday. I'm really excited that the underlying methodology of DeepTraffic and DeepCrash uses neural networks as the approximators inside reinforcement learning approaches.

AlphaGo, in 2016, achieved a monumental task, one that, when I first started in artificial intelligence, I was told was impossible for a system to accomplish: winning at the game of Go against the top human players in the world. However, that method was trained on human expert positions: the AlphaGo system was trained on previous games played by human experts. In an incredible accomplishment, AlphaGo Zero, in 2017, was able to beat AlphaGo and many of its variants by playing itself, starting from zero information: no knowledge of human experts, no games, no training data, very little human input. What's more, it was able to generate moves that were surprising to human experts. I think it was Einstein who said that the key mark of intelligence is imagination. I think it's beautiful to see an artificial intelligence system come up with something that truly surprises human experts.

For the gambling junkies: DeepStack and a few other variants were used in 2017 to win at heads-up poker, again an incredible result that I was always told would be impossible for any machine learning method to achieve. It was able to beat a professional player, and several competitors have come along since. We've yet to be able to win in a setting with multiple players. For those of you not familiar, heads-up poker is one-on-one, a much smaller, easier space to solve; there are a lot more human-to-human dynamics going on when there are multiple players. But that's the task for 2018.

And the drawbacks: this is one of my favorite videos (I show it often) of Coast Runners, for these deep reinforcement learning approaches. The definition of the reward function controls how the actual system behaves, and this will be extremely important for us with autonomous vehicles. Here, the boat is tasked with gaining the highest number of points, and it figures out that it does not need to race, which is the whole point of the game, in order to gain points; instead it picks up the green circles that regenerate over and over. This is counterintuitive behavior that you would not expect when you first design the reward function, and this is a very formal, simple system. It is nevertheless extremely difficult to come up with a reward function that makes a system operate the way you expect it to. Very applicable to autonomous vehicles.

And of course, on the perception side, as I mentioned with the ostrich and the dog: with a little bit of noise, state-of-the-art neural networks predict, with 99.6 percent confidence, that the noise shown up top is a robin, a cheetah, an armadillo, a lesser panda. These are outputs from actual state-of-the-art neural networks, taking in noise and producing a confident prediction. This should tune our intuition: the spatial, visual characteristics of an image do not necessarily convey to the network the level of hierarchy necessary to function in this world the way we do. In a similar way to the dog and the ostrich, a network, with a little bit of noise added, can confidently make the wrong prediction, thinking a school bus is an ostrich and a speaker is an ostrich.
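Such adversarial examples are easy to produce once you have gradient access to the network. Here is a minimal sketch of one standard technique, the fast gradient sign method (not necessarily the method behind the examples in the slides): it nudges each pixel a tiny amount in whatever direction increases the loss.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real photo
true_label = torch.tensor([207])                        # hypothetical correct class

# Forward pass and loss with respect to the true label.
loss = nn.CrossEntropyLoss()(model(image), true_label)
loss.backward()

# Fast gradient sign method: a small step in the direction that increases the loss.
epsilon = 0.007  # a perturbation barely visible to a human
adversarial = (image + epsilon * image.grad.sign()).clamp(0.0, 1.0)

# The perturbed image often receives a different, confidently wrong prediction.
print(model(image).argmax(dim=1), model(adversarial).argmax(dim=1))
```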
They're easily fooled; but then, not really, because they perform well the task that they were trained to do. We have to make sure we keep our intuition calibrated to the way machines learn, not the way humans have learned over the 540 million years of data we've gained through evolution.

The current challenges we're taking on. First, transfer learning: there's a lot of success in transfer learning between domains that are very close to each other, image classification from one domain to the next; there's a lot of value in forming representations of the way scenes look in order to then do scene segmentation, in the driving case for example. But we're not able to make any bigger leaps in transfer learning. The biggest challenge for deep learning is to generalize, to generalize across domains. It lacks the ability to reason in the way we defined understanding previously: the ability to turn complex information into simple, useful information, to handle domain-specific, complicated sensory information that doesn't relate to the initial training set. That's the open challenge for deep learning: train on very little data, and then go and reason and operate in the real world.

Right now, neural networks are very inefficient. They require big data. They require supervised data, which means the cost of human input; they're not fully automated. Despite the fact that the feature learning, the big breakthrough, is performed automatically, you still have to do a lot of design of the actual architecture of the network, and all the hyperparameter tuning needs to be performed with human input; perhaps a slightly more educated kind of human input, from former PhD students, postdocs, and faculty, is required to tune these hyperparameters, but human input is still necessary. These systems cannot be left alone, for the most part. Defining the reward, as we saw with Coast Runners, is extremely difficult for systems that operate in the real world. Transparency: neural networks are currently black boxes, for the most part. Except through a few successful visualization methods that visualize different aspects of the activations, they're not able to reveal to us humans why they work or where they fail. There's a philosophical question here for autonomous vehicles: perhaps we as human beings won't care, if the system works well enough. But I would argue that it will be a long time before systems work well enough that we don't care; we will care, and we'll have to work together with these systems, and that's where transparency, communication, and collaboration are critical. And edge cases: it's all about edge cases in robotics, in autonomous vehicles. 99.9 percent of driving is really boring; it's the same thing, especially highway driving, traffic driving. The obstacle avoidance, the car following, the lane centering: all these problems are trivial. It's the edge cases, the trillions of edge cases, that need to be generalized over on a very small amount of training data.

So again, I return to why deep learning. I mentioned a bunch of challenges, and this is an opportunity: an opportunity to come up with techniques that operate successfully in this world. I hope the competitions we present in this class, in the autonomous vehicle domain, will give you some insight and an opportunity to apply these methods to cases that are open research problems: semantic segmentation for external perception, control of the vehicle in DeepTraffic, control of the vehicle in underactuated, high-speed conditions in DeepCrash, and driver state perception.
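On the transfer learning point above, where current successes come from nearby domains, the standard concrete recipe is to reuse a pretrained backbone and retrain only a new final layer. A minimal sketch, illustrative only, assuming a new two-class target task:

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on ImageNet classification.
model = models.resnet18(pretrained=True)

# Freeze the learned representation...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the final layer for a new, nearby task (here: 2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head is trained; the transferred features stay fixed.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

x = torch.randn(4, 3, 224, 224)   # made-up batch of images
y = torch.tensor([0, 1, 1, 0])    # made-up labels
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optimizer.step()
```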
So with that, I wanted to introduce deep learning to you today, before we get to the fun of autonomous vehicles tomorrow. We'd like to thank NVIDIA, Google, Autoliv, Toyota, and, at the risk of setting off people's phones, Amazon Alexa Auto. But truly, I would like to say that I've been humbled over the past year by the thousands of messages we've received, by the attention, by the 18,000 competition entries, by the many brilliant people across the world, not just here at MIT, that I got a chance to interact with. I hope we go bigger and do some impressive stuff in 2018. Thank you very much, and tomorrow is self-driving cars. [Applause]