Sacha Arnoud, Director of Engineering, Waymo - MIT Self-Driving Cars
LSX3qdy0dFg • 2018-02-16
Today we have the Director of Engineering and Head of Perception at Waymo, a company that has recently driven over four million miles autonomously and, in so doing, inspired the world in what artificial intelligence and good engineering can do. So please give a warm welcome to Sacha Arnoud. [Applause]

Thanks a lot, Lex, for the introduction. Well, it's a pretty packed house. I'm really excited, and thanks a lot for giving me the opportunity to come and share my passion for self-driving cars, to share with you all the great work we've been doing at Waymo over the last 10 years, and to give you more details on the recent milestones we've reached.

As you'll see, we'll cover a lot of different topics, some more technical, some more about context, but whatever the content, I have three main objectives that I'd like to convey today, so keep that in mind as we go through the presentation. My first one is to give you some background around the self-driving space: what's happening there and what it takes to build self-driving cars, but also some behind-the-scenes views and tidbits on the history of machine learning and deep learning, and how it all came together within the big Alphabet family, from Google to Waymo. Another objective is to give you some technical meat around the techniques that are working today on our self-driving cars. During the class you've heard a lot about different deep learning techniques, models, architectures, and algorithms, and I'll try to put that into a coherent whole so that you can see how those pieces fit together to build the system we have today. And, as Lex mentioned, it takes a lot more than algorithms to build a sophisticated system such as our self-driving cars; fundamentally it takes a full industrial project to make that happen, and I'll try to give
you some color, which hopefully is different from what you've heard during the week, on what it takes to actually carry out such an industrial project in real life and essentially productionize machine learning.

We hear a lot about self-driving cars; it's a very hot topic, and for very good reasons. I can tell you for sure that 2017 was a great year for Waymo. Only a year ago, in January 2017, Waymo became its own company. That was a major milestone and a testimony to the robustness of the technology, allowing us to move to a productization phase. What you see in the picture here is our latest-generation self-driving vehicle. It is based on the Chrysler Pacifica; you can already see a bunch of sensors, and I'll come back to those and give you more insight into what they do and how they operate, but that's the latest and greatest.

Self-driving indeed draws a lot of attention, and for very good reason. I personally believe, and I think you will agree with me, that self-driving really has the potential to deeply change the way we think about mobility and the way we move people and things around. To cover just a few aspects, without going into too many details: safety is one of the main motivations. 94% of US crashes today involve human error, and a lot of those errors are around distraction and things that could be avoided, so safety is a big piece of it. Accessibility, access to mobility, is also a big motivation of ours; self-driving technology has the potential to make it much more available and cheaper for more people to be able to move around. And last but not least is efficiency, a collective efficiency. We spend a lot of time in our cars in long commute hours (I personally spend a lot of time in commute hours), and that time we spend in traffic probably could be better spent
doing something other than driving through complicated situations. And beyond traffic, self-driving technology has the potential to deeply change the way we think about traffic, parking spots, urban environments, city design. That's why it's such an exciting topic, and that's why we made it our mission at Waymo: fundamentally, to make it safe and easy to move people and things around.

That's a nice mission, and we've been on it for a very long time. The whole adventure started close to 10 years ago, in 2009, under the umbrella of a Google project that you may have heard of, called Chauffeur. Back in those days (remember, we were before the deep learning days, at least in the industry) the first objective of the project was to assemble a first prototype vehicle, take off-the-shelf sensors, put them together, and try to decide whether self-driving was even a possibility. It's one thing to have a prototype somewhere, but is this even worth pursuing? That's a very common way for Google to tackle problems. The genesis of that work was a pretty aggressive objective: the first milestone for the team was to assemble ten 100-mile loops in Northern California, around Mountain View, for a total of 1,000 miles, and see if they could build a first system able to drive those loops autonomously. And the team was not afraid: those loops went through some very aggressive patterns. Some of them go through the Santa Cruz Mountains, an area in California that, as I'll show you in a video, has very small roads with two-way traffic, cliffs with negative obstacles, and complicated patterns like that. Some of
those paths went on highways, including some of the busiest ones. Some of those routes went around Lake Tahoe, up in the Sierras in California, where you can encounter different kinds of weather and, again, different kinds of road conditions. Those routes went over bridges (the Bay Area has quite a few bridges to go through), and some of them even went through dense urban areas: you can see San Francisco being driven, you can see parts of Monterey being driven, and as you see in the video, those truly bring dense-urban-area challenges.

So, since I promised it, here you're going to see some footage of the driving, and it's kind of working. Here, with better quality, you see the roads I was talking about in the Santa Cruz Mountains: driving at night, animals crossing the street, freeway driving, driving through another dense area (there's an aquarium there, a pretty popular one). That's the famous Lombard Street in San Francisco that you may have heard of, which always brings a unique set of challenges, between fog, slopes, and in that case even sharp turns. That was all the way back in 2010: those ten loops were successfully completed, 100% autonomously, back in 2010, more than eight years ago.

On the heels of that success, the team and Google decided that self-driving was worth pursuing and moved forward with the development and testing of the technology. So we've been at it for all those years and have been working very hard on it. Historically, Waymo, and I think all the other companies out there, have been relying on what we call safety drivers, who still sit behind the wheel even when the car is driving autonomously, able to take over at any time, so that we maintain safe operations. And over all those years we've been accumulating miles and
knowledge, developing many iterations of the system. Across all those years, we reached a major milestone, as Lex mentioned, back in November, when for the first time we reached a level of confidence and maturity in the system such that we felt confident, and proved to ourselves, that it was safe to remove the safety driver. As you can imagine, that's a major milestone, because it takes a very high level of confidence to not have that backup solution of a safety driver to take over should something arise. Here I'm going to show you a small video, a quick capture of that event. The video is from one of the first times we did that; since then we've been continuously operating driverless self-driving cars in the Phoenix area in Arizona to expand our testing. Here you can see our Chrysler Pacifica; members of the team are acting as the passengers, getting in the back seat, and you can notice that there is no driver in the driver's seat. We are running a ride-hailing kind of service: the passengers simply press a button, the application knows where they want to go, and the car goes, with no one in the driver's seat.

We started with a fairly constrained geographical area in Chandler, close to Phoenix, Arizona, and we are working hard to expand our testing and the scope of our operating area. This goes well beyond a single car on a single day: not only do we do it continuously, but we also have a growing fleet of self-driving cars that we are deploying there, all the way to looking at a product launch pretty soon.

So I've talked about 2010, and we are in 2018 and we're getting there, but it took quite a bit of time. One of the key ideas that I'd like to convey today, and that I will come back to during the presentation, is how much work it takes to really take a demo, or something that's working in a lab, into something
that you feel safe to put on the roads, to get all the way to that depth of understanding, that depth of perfection in your technology, so that you can operate safely. One way to say that is: when you are 90% done, you still have 90% to go. The first 90% of the technology takes only 10% of the time. In other words, you need to 10x: you need to 10x the capabilities of your technology; you need to 10x your team size and find ways for more engineers and researchers to collaborate; you need to 10x the capabilities of your sensors; you need to 10x, fundamentally, the overall quality of the system, and your testing practices, as we'll see, and a lot of other aspects of the program. That's what we've been working on.

Beyond the context of self-driving cars, I want to spend a little bit of time giving you an inside view of the rise of deep learning. As I mentioned, back in 2009-2010 deep learning was not readily available yet in full capacity in the industry, and over those years it took a lot of breakthroughs to reach that stage, one of them being the algorithmic breakthrough that deep learning gave us. So I'll give you a little bit of a backstage view of what happened at Google during those years. As you know, Google committed itself to machine learning and deep learning very early on. You may have heard of what we call internally the Google Brain team, a team fundamentally hard at work on the bleeding edge of research, which is well known, but also leading the development of the tools and infrastructure of the whole machine learning ecosystem at Google, to essentially allow many teams to develop machine learning at scale, all the way to successful products. They've been pushing the deep learning technology and the field in many directions, from computer vision
to speech understanding to NLP, and all those directions are things you can see in Google products today: whether you're talking about the Assistant, Google Photos, speech recognition, or even Google Maps, you can see the impact of deep learning in all those areas.

Many years ago, I myself was part of the Street View team, where I was leading an internal project that we called Street Smart. The goal we had with Street Smart was to use deep learning and machine learning techniques to analyze street imagery (and as you know, that's a very big and varied corpus) so that we could extract elements that are core to our mapping strategy, and that way build a better Google Maps. For instance, in this picture, a piece of a panorama from Street View imagery, you can see a lot of elements that, if you could find and properly localize them, would drastically help you build better maps. Street numbers, obviously, which are really useful to map addresses. Street names, which, when combined with similar techniques on aerial views, help you properly draw all the roads and give a name to them; and those two combined actually allow you to do very high-quality address lookups, which is a common query on Google Maps. General text, and more specifically text on business facades, which allows you to localize business listings you may have gotten by other means to actual physical locations, or even to build some of those local listings directly from scratch. And more traffic-oriented patterns, whether traffic lights or traffic signs, that can then be used for navigation and ETA predictions.

As I mentioned, one of the hard pieces of that mission was to map addresses at scale, so you can imagine that we had a breakthrough when we first were able to properly find those street numbers out of
the Street View imagery, out of the facades. Solving that problem actually requires a lot of pieces. Not only do you need to find where the street number is on the facade, which, if you think about it, is a fairly hard semantic problem (what's the difference between a street number versus another kind of number versus other text?), but then you obviously need to read it, because there's no point having pixels if you cannot understand the number on the facade, all the way to properly geo-localizing it so that you can put it on Google Maps. That first deep learning application that succeeded in production (and it's all the way back in 2012 that we had the first system in production) was really the first breakthrough we had across Alphabet in our ability to properly understand real-scene situations.

Here I'm going to show you a video that sums it up. Every one of those segments is a view, starting from the car and going to the physical number, of all those house numbers that we've been able to detect and transcribe. Here, that's in Sao Paulo, and you can see that when all that data is put together, it gives you a very consistent view of the addressing scheme. Another example: we do similar things in Paris, where we have more imagery, more views of those physical numbers, so that if you triangulate, you're able to localize them very accurately and have very accurate maps. The last example I'm going to show is in Cape Town, South Africa, where again the impact of that deep learning work has been huge in terms of quality. Many countries today actually have upwards of 95% of addresses mapped that way.

You can see a lot of parallelism between that work on Street View imagery and doing the same on the real scene from the car, but doing it on the car is even harder,
because you need to do it in real time, very quickly, with low latency, and you also need to do it in an embedded system. The cars have to be entirely autonomous: you cannot rely on a connection to a Google data center. First, you don't have the time, in terms of latency, to send data back and forth, but also you cannot rely on a connection for the safe operation of your system. So you need to do the processing within the car. There's a paper you can read, dating all the way back to 2014, where for the first time, using slightly different techniques, we were able to put deep learning to work inside that constrained, real-time environment and start to have impact, in that case around pedestrian detection.

As I said, there are a lot of analogies: to properly drive a scene, just like in Street View, you need to see the traffic light and understand whether the light is red or green, and that's essentially what allows you to proceed. Obviously, driving is even more challenging beyond the real-time aspect. You saw the cyclist going through; you have other things happening in the scene that you need to detect, properly understand, interpret, and predict. And I explicitly took a night-driving example to show you that while you can choose when you take pictures for Street View, and do it in perfect conditions, driving requires you to take the conditions as they are and deal with them. So from the very beginning there has been a lot of cross-pollination between the Street View work and the self-driving work. Here I picked a few papers that we did in Street View that, if you read them, you'll see directly apply to some of the things we do on the cars. And obviously that collaboration between Google Research and Waymo historically went well beyond Street View, across all the research groups, and it
still is a very strong collaboration going on, one that enables us to stay on the bleeding edge of what we can do.

Now that we've looked a little bit at how things happened, I want to spend more time going into the details of what's going on in the cars today and how deep learning is actually impacting our current system. If I looked at the course program properly, during the week you went through the major pieces that you need to master to make a self-driving car. I'm sure you heard about mapping; localization, meaning putting the car within those maps and understanding where you are with pretty good accuracy; perception; and scene understanding, which is a higher-level semantic understanding of what's going on in the scene, starting to predict what the agents around you are going to do so that you can do better motion planning. There is also a whole robotics aspect: at the end of the day, the car in many ways acts like a robot, whether around the sensor data or the control interfaces to the car, and everyone who has dealt with real-world robotics will agree with me that it's not a perfect world and you need to deal with those errors. Other pieces you may have talked about are around simulation, and essentially validation of whatever system you put together. Machine learning and deep learning have been having a deep impact on a growing set of those areas, but for the next minutes I'm going to focus on the perception piece, which is a core element of what the self-driving car needs to do.

So what is perception? Fundamentally, perception is the system in the car that needs to build an understanding of the world around it, and it does that using two major inputs. The first one is priors on the scene. To give you an example: it would be a little silly to have to recompute the actual location of the
road, or the actual connectivity of every intersection, once you get to the scene, because those things you can pre-compute in advance, saving your onboard computing for the tasks that are more critical. That's often referred to as the mapping exercise, but really it's about reducing the computation you're going to have to do on the car once it drives. The other big input, obviously, is what the sensors give you once you get to the spot. Sensor data is the signal that tells you what is not like what you mapped, and the things that move: is the traffic light red or green, where are the pedestrians, where are the cars, what are they doing?

As we saw in the initial picture, we have quite a set of sensors on our self-driving cars: vision systems, radar, and lidar are the three big families of sensors we have. One point to note here is that they are designed to be complementary. First, in their placement on the car: we don't put them all in the same spot, because blind spots are a major issue and you want good coverage of the field of view. Second, they are complementary in their capabilities. For instance, cameras are very good at giving you a dense representation, a very dense set of information that contains a lot of semantic content; you can really see a large number of details. But they are not very good at giving you depth: it's much harder and computationally expensive to get depth information out of camera systems. Systems like lidar, on the other hand, will give you a very good depth estimate when they hit objects, but they're going to lack a lot of the semantic information that you
will find in camera systems. So all those sensors are designed to be complementary in terms of their capabilities. And it goes without saying that the better your sensors are, the better your perception system is going to be. That's why at Waymo we took the path of designing our own sensors in-house, enhancing what's available off the shelf today, because it's important for us to go all the way and build a self-driving system we can believe in.

So that's what perception does: take those two inputs and build a representation of the scene. At the end of the day, you have to realize that this work of perception is really what deeply differentiates what you need to do in a self-driving system as opposed to a lower-level driving-assistance system. In many cases, for instance if you do cruise control or a lot of the lower-level driving assistance, a lot of the strategies can be about not bumping into things: if you see things moving around you, you group them, you segment them appropriately into blocks of moving things, and you don't hit them; that's good enough in most cases. When you don't have a driver in the driver's seat, the challenge totally changes scale. To give you an example: if you're in a lane and you see a bicyclist moving slowly in the lane to the right of you, and there's a car next to you, you need to understand that there's a chance that car is going to want to avoid that bicyclist and is going to swerve, and you need to anticipate that behavior so that you can properly decide whether to slow down and give space for the car, or speed up and have the car go in behind you. Those are the kinds of behaviors that go well beyond not bumping into things, and that require a much deeper understanding of the world going on around you. So let me put it in pictures, and we'll come back to that example in
a concrete case. Here is a typical scene like ones we've encountered. You have a police car that has pulled over, probably pulled someone over there; you have a cyclist on the road moving forward; and we need to drive through that situation. The first thing you have to do is the basics: out of your sensor data, understand that a set of point clouds and pixels belongs to the cyclist; find that you have two cars in the scene, the police car and the car parked in front of it; understand the policeman as a pedestrian. That's a basic level of understanding. Obviously you need more than that; you need to go deeper into your semantics. If you understand that the flashing lights are on, you understand that the police car is an active emergency vehicle performing something on the scene. If you understand that this car is parked, that's a very valuable piece of information that tells you whether you can pass it or not. Something you may not have noticed is that there are cones on the scene that would prevent you, for instance, from taking that pathway if you wanted to. The next level, getting closer to behavior prediction: if you also understand that the police car has an open door, then all of a sudden you can start to expect a behavior where someone is going to get out of that car. And the way you would swerve, if you were to decide to swerve, or the way someone getting out of that car would impact the trajectory of the cyclist, is something you need to understand in order to drive properly and safely. Only when you have that depth of understanding can you start to come up with realistic behavior predictions and trajectory predictions for all those agents in the scene, so that you can come up with a proper strategy for your planner and controls. So how is deep learning playing into that whole space, and how is it
being used to solve many of those problems? Remember when I said that when you're 90% done you still have 90% to go; I think that's where it starts to bite us. I also talked about how robotics, and having sensors in real life, is not a perfect world, and that's actually a big piece of the puzzle. I wish sensors would give us perfect data all the time, a perfect picture we could readily feed to deep learning, but unfortunately that's not how it works. Here, for instance, you see an example with a pickup truck. The imagery doesn't show it well, but there is smoke coming out of the exhaust, and that exhaust is triggering lidar laser points, which are not relevant for behavior prediction or for your driving behavior. Those points are safe to drive through, and very safe to ignore in terms of scene understanding. So filtering the flood of data coming off your sensors is a very important task, because it reduces the computation you're going to have to do in order to operate safely. A more subtle but important case is reflections. We are driving a scene; there's a car here in the camera picture, and the car is reflected in a bus. If you just do naive detection, especially if the bus moves along with you, which is very typical, then you can all of a sudden think you have two cars on the scene, and if you take that phantom car seriously, all the way to impacting your behavior, you're going to make mistakes. Here I showed you an example of reflections in the visual range, but this affects all sensors in slightly different ways. You could have the same effect, for instance, with lidar data: when you drive on a freeway, a road sign on top of the freeway will reflect in the back
window of the car in front of you, showing a reflected sign on the road. You'd better understand that the thing you see on the road is actually a reflection, and not try to swerve to avoid it on a sixty-five-mile-per-hour trajectory. So that's a big, complicated challenge. By the way, a lot of the signal-processing pieces actually already use machine learning and deep learning too, because, as in the reflection case, you can do some tricks to understand the difference in the signal, but at the end of the day, for some of them, you're going to need a higher level of understanding of the scene: realizing, for instance, that it's not possible for a car to be hiding behind the bus, given my field of view.

But assume we are able to get to proper, filtered sensor data that we can start to process with our machine learning. The very next thing you typically want to do is apply some kind of convolutional layers on top of that imagery. If you're not familiar with convolutional layers: they are a very popular way to do computer vision, because they rely on connecting neurons with kernels that learn, layer after layer, features of the imagery. Those kernels typically work locally, on a region of the image: they can pick up lines, they can pick up contours, and as you build up layers, they capture higher and higher levels of feature representation that ultimately tell you what's happening in the image. That's a very common technique, and much more efficient than, say, fully connected layers, which wouldn't work as well here. Unfortunately, a lot of the state of the art is in 2D convolutions, which were developed on imagery.
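To make the kernel idea concrete, here is a minimal pure-Python sketch of a single convolutional pass, not any production code: a small hand-written kernel slides over a toy image and responds to a local pattern (a vertical edge), which is exactly the kind of low-level feature the first layers learn.

```python
# Toy 2D convolution (really cross-correlation, as in deep nets).
# A small kernel slides over the image; each output value is the
# kernel's response to the local region it covers.

def conv2d(image, kernel):
    """Valid-mode 2D convolution over lists of lists of numbers."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out

# A vertical-edge kernel: responds where brightness changes left-to-right.
edge_kernel = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

# 4x4 toy image: dark left half, bright right half.
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]

feature_map = conv2d(image, edge_kernel)  # strong response along the edge
```

In a real network the kernels are learned rather than hand-written, and many such maps are stacked and fed through nonlinearities, but the sliding local computation is the same.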
Typically, 2D convolutions require a fairly dense input. For imagery that's great, because pixels are very dense: you always have a pixel next to the next one; there is not a lot of void. If you were, for instance, to apply convolutions to a very sparse laser point cloud, you would have a lot of holes, and they don't work nearly as well. So typically what we do first is project the sensor data onto 2D planes and do the processing on those. Two very typical views that we use: the first one is a top-down view; the bird's view gives you a Google Maps kind of view of the scene, which is great, for instance, to map cars and objects moving along the scene, but it's harder to put the imagery pixels you saw from the car into those top-down views. The other common one is the driver view, a projection onto a plane from the driver's perspective, which is much better at utilizing imagery, because that's essentially how the imagery got captured in the first place. Here, for instance, you'll see how, if your sensors are properly registered, you can use both lidar and imagery signals together to better understand the scene.

The first kind of processing you can do is what's called segmentation: once you have pixels or laser points, you need to group them together into objects that you can then use for further understanding and processing. Unfortunately, a lot of the objects you encounter while driving don't have a predefined shape. Here is an example with snow, but if you think about vegetation, or trash bags for instance, you can't come up with a prior understanding of how they're going to look, so you have to be ready for those objects to have any shape.
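The top-down projection mentioned above can be sketched in a few lines. This is an illustrative toy, with made-up grid extents and cell size: sparse 3D returns are rasterized into a dense 2D grid of per-cell counts, giving 2D convolutions something dense to operate on.

```python
# Sketch: rasterize a sparse 3D point cloud into a dense top-down
# (bird's-eye) grid. Real pipelines encode richer per-cell features
# (height, intensity, ...); here each cell just counts hits.

def to_topdown_grid(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=1.0):
    """points: iterable of (x, y, z) in the car frame, x forward, y left.
    z is simply dropped in this toy version."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = [[0 for _ in range(ny)] for _ in range(nx)]
    for x, y, _z in points:
        if x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]:
            i = int((x - x_range[0]) / cell)
            j = int((y - y_range[0]) / cell)
            grid[i][j] += 1
    return grid

# A few returns off an object about 10 m ahead, slightly to the left,
# plus one out-of-range return that gets dropped.
cloud = [(10.2, 1.4, 0.5), (10.7, 1.1, 0.9), (10.4, 1.8, 1.3), (55.0, 0.0, 0.0)]
grid = to_topdown_grid(cloud)  # three hits land in the same 1 m cell
```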
That's the sliding-window approach. It works pretty well, but as you can imagine it's a little expensive computationally. If you've seen the old dot-matrix printers, it's very analogous: the printer had to go and print the page point by point, and it worked, but it was pretty slow. So you need to be very conscious of which areas of the scene you apply it to in order to stay efficient. Fortunately, many of the objects you need to care about have predefined priors. For instance, if you take a car from the top-down, bird's-eye view, it's going to be a rectangle, and you can take that shape prior into consideration. In most cases even the driving lanes carry priors: whether cars go forward or come the other way, they're going to travel in the direction of the lanes; same for roads and streets. So you can use those priors to do more efficient deep learning, which in the literature comes under the idea of single-shot multi-box detectors. Here again you would start with the convolutional towers, but you do only one pass of convolution. It's the same difference as between a dot-matrix printer and a press that prints the whole page at once; it's not a perfect analogy, but I think it conveys the idea pretty well. So here you would train a deep net that directly takes the whole projection of the sensor data and outputs boxes that encode the priors you have.
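To sketch the single-shot idea in the simplest possible terms: instead of classifying every window, one pass produces a coarse score map, and each confident cell emits a box built from a shape prior. The score map, threshold, and car-sized anchor below are hypothetical stand-ins; a real single-shot detector regresses box offsets from learned features.

```python
import numpy as np

def single_shot_boxes(score_map, anchor=(2.0, 4.5), thresh=0.5):
    """One pass over a coarse score map: each cell whose score clears the
    threshold emits a box built from a shape prior (anchor), instead of
    classifying every pixel. `anchor` is a hypothetical (width, length)
    prior, roughly car-sized in metres for a top-down view.
    Returns (row, col, width, length) boxes."""
    w, l = anchor
    rows, cols = np.nonzero(score_map > thresh)
    return [(int(r), int(c), w, l) for r, c in zip(rows, cols)]

# Pretend output of a single convolutional pass over a top-down projection:
scores = np.zeros((4, 4))
scores[2, 1] = 0.9   # one confident "car" cell
print(single_shot_boxes(scores))  # → [(2, 1, 2.0, 4.5)]
```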
Here, for instance, I can show you how such a thing would work for cone detection. You can see that we don't have all the fidelity of the per-pixel cone detection, but we don't really care about that: we just need to know there is a cone somewhere, and we take a box prior. What that image is also meant to show is that, since it's a lot cheaper computationally, you can run it over a pretty wide range of space, and even if you have a lot of cones it's still going to be a very efficient way to get that data. Remember the flashing lights on top of the police car? Even if you properly detect and segment the cars on the road, many cars carry very special semantics. On that slide I'm showing you many examples of EVs, emergency vehicles, that you need to understand visually: first that it is an EV, and then whether the EV is active or not. School buses are not actually emergency vehicles, but whether the bus has its lights on, or has its stop sign swung open on the side, carries heavy semantics that you need to understand. So how do you deal with that? Back to the deep learning techniques: one thing you could do is take that patch, build a new convolutional tower with a classifier on top, and essentially build a school-bus classifier, a school-bus-with-lights-on classifier, a school-bus-with-stop-sign-open classifier. I'm pretty sure that would work, but obviously it would be a lot of work and pretty expensive to run on the car, because convolutional layers typically are the most expensive pieces of a neural net. A better thing to do is to use embeddings. If you're not familiar with them, embeddings essentially are vector representations of objects, which you can learn with deep nets, that carry some semantic meaning.
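To make the shared-embedding idea concrete, here's a tiny numerical sketch: a single shared "tower" (here just a random linear map standing in for learned convolutions) produces one embedding per vehicle, and several cheap linear heads reuse that same embedding to answer different semantic questions. All weights and names below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

W_embed = rng.normal(size=(16, 8))       # shared "tower" weights
w_is_school_bus = rng.normal(size=8)     # one cheap head per question
w_lights_on = rng.normal(size=8)
w_stop_sign_out = rng.normal(size=8)

def embed(patch):
    """Run the expensive shared tower once per object."""
    return np.tanh(patch @ W_embed)

def head(embedding, w):
    """A cheap linear head reuses the embedding instead of re-running
    convolutions; returns a binary answer."""
    return float(embedding @ w) > 0.0

patch = rng.normal(size=16)              # stand-in for pixel features
e = embed(patch)
answers = {name: head(e, w) for name, w in [
    ("school_bus", w_is_school_bus),
    ("lights_on", w_lights_on),
    ("stop_sign_out", w_stop_sign_out)]}
print(answers)
```

The design point is that the 16-dimensional patch is reduced once to an 8-dimensional embedding, and every additional semantic question costs only one cheap dot product, not another convolutional tower.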
For instance, given a vehicle, you can build a vector that carries the information that the vehicle is a school bus, whether the lights are on, whether the stop sign is open, and then you're back in a vector space, much smaller and much more efficient, that you can operate in to do further processing. Historically, those embeddings have been most closely associated with word embeddings: in a typical text, out of every word, you build the vector that represents the meaning of that word, and then if you look at the sequence of those vectors and operate in the vector space, you start to understand the semantics of the sentences. One of the early projects you can look at is called word2vec, which was done in an NLP group at Google, where they were able to build such things, and they discovered that the embedding space actually carried some interesting vector-space properties: if you took the vector for king, minus the vector for man, plus the vector for woman, you ended up with a vector whose closest word would essentially be queen. That's to show you how powerful those vector representations can be in the amount of information they can contain. Let's talk about pedestrians. We talked about semantic image segmentation, remember, the ability to go pixel by pixel for things that don't really have a shape, and we talked about using shape priors. Pedestrians actually combine the complexity of those two approaches, for many reasons. One is that they are deformable: pedestrians come in many shapes and poses, as you can see here; I think here you have someone on a skateboard, someone crouching, more unusual poses that you need to understand. And the recall you need to have on pedestrians is very high.
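The king/man/woman/queen arithmetic mentioned a moment ago can be sketched with tiny hand-built vectors; real word2vec embeddings are learned from large text corpora, so these two-dimensional vectors are purely illustrative.

```python
import numpy as np

# Tiny hand-built word vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender".
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(v, exclude=()):
    """Word whose vector has the highest cosine similarity to v."""
    return max((w for w in vecs if w not in exclude),
               key=lambda w: v @ vecs[w] /
                             (np.linalg.norm(v) * np.linalg.norm(vecs[w])))

target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude=("king", "man", "woman")))  # → queen
```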
Pedestrians also show up in many different situations. Here, for instance, you clearly have pedestrians you need to see, because there's a good chance, when you do your behavior prediction, that this person is going to jump out of a car, and you need to be ready for that. Last but not least, predicting the behavior of pedestrians is really hard, because they can move in any direction. A car moving in a given direction, you can safely bet it's not going to drastically change its angle at a moment's notice; but if you take children, for instance, it's a little more complicated: they may not pay attention, and they may jump in any direction, and you need to be ready for that. So it's harder in terms of shape priors, it's harder in terms of recall, and it's also harder in terms of prediction, and then you need a fine understanding of the semantics. Another example we encountered: you get to an intersection and there's a visually impaired person jaywalking across it, and you obviously need to understand all of that to know that you need to yield to that person. Pretty clearly, a person on the road, maybe you should yield to them; not easy. Here, for instance, there is, I don't actually know if it's a real person or a mannequin, something that frankly really looks like a pedestrian, which you should probably classify as a pedestrian, but it's lying on the bed of a pickup truck, and obviously you shouldn't yield to that person, because yielding to a pedestrian at 35 miles per hour means hitting the brakes pretty hard, with the risk that entails. So you need to understand that this person is traveling with the truck, is not actually on the road, and that it's okay not to yield to them. Those are examples of the rich semantics you need to understand.
Obviously, one way to get there is to start understanding the behavior of things over time. Everything we've talked about up to now, in how we use deep learning to solve some of these problems, was on a pure per-frame basis; but understanding that a person is moving with the truck, versus a jaywalker in the middle of the intersection, is the kind of information you can only get if you observe behavior over time. Back to the embeddings: if you have vector representations of those objects, you can start to track them over time. A common technique to get there is recurrent neural networks, which essentially are networks that build up a state that gets better and better as they receive more sequential observations of your pattern. For instance, coming back to the words example I gave earlier: you see one word and its vector representation, then another one, and you understand a bit more of what the author is trying to say; a third word, a fourth word, and by the end of the sentence you have a good understanding and you can start to translate sentences. It's a similar idea here: if you have a semantic representation encoded in an embedding for the pedestrian and the car under them, and you track that over time and build a state that accumulates more and more meaning as time goes by, you're going to get closer and closer to a good understanding of what's going on in the scene. My point here is that vector representations combined with recurrent neural networks are a common technique that can help you figure that out.
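Here's a minimal sketch of that recurrent-state idea: a vanilla recurrent cell folds one per-frame observation at a time into a running state, and the state strengthens as consistent evidence accumulates. The two-dimensional "evidence" inputs and hand-picked weights below are invented for illustration; a real system would feed learned embeddings into a trained recurrent network.

```python
import numpy as np

def rnn_step(h, x, W, U):
    """One step of a vanilla recurrent cell: fold a new per-frame
    observation x into the running state h."""
    return np.tanh(W @ x + U @ h)

# Hypothetical per-frame evidence: x[0] suggests the tracked object moves
# with the truck, x[1] suggests it moves independently.
W = np.array([[1.0, -1.0]])
U = np.array([[0.5]])

h = np.zeros(1)
for frame in [np.array([0.6, 0.1])] * 5:   # five consistent observations
    h = rnn_step(h, frame, W, U)
    print(h)   # the state grows with each consistent observation
```

The state after one frame is weaker than after five: just like hearing one word versus the full sentence, more sequential observations give the network a firmer semantic reading.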
Back to the point that when you're 90% done, you still have 90% to go. To get to the last leg of my talk today, I want to give you some appreciation for what it takes to truly build a machine learning system at scale and industrialize it. Up until now we've talked a lot about algorithms, and as I said earlier, the efficiency of those algorithms has been a breakthrough for us in succeeding at the self-driving task, but it takes a lot more than algorithms to actually get there. The first piece where you need to 10x is around labeling efforts. A lot of the algorithms we talked about are supervised, meaning that even if you have a strong network architecture and you come up with the right one, in order to train that network you need a representative, high-quality set of labeled data that maps some input to the output you want the network to predict: that's a pedestrian, that's a car, that's a pedestrian, that's a car, and the network will learn, in a supervised way, how to build the right representations. Obviously the unsupervised space is a very active domain of research, and our own research team at Waymo, in collaboration with Google, is working in that domain, but today a lot of it is still supervised. To give you orders of magnitude, here, on a logarithmic scale, are the sizes of a couple of data sets. You may be familiar with ImageNet, which is in the 15-million-labels range; that guy jumping represents the number of seconds from birth to college graduation, so that's more of a historical tidbit. And remember the find-the-house-number-on-the-facade problem: back in those days it took us a multi-billion-label data set to actually teach the network. Those were very early days, and today we do a lot more with a lot less, but that's to give you an idea of scale. Being able to have labeling operations that produce large, high-quality labeled data sets is key to your success, and that's a big piece of the puzzle you need to solve.
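One way labeling operations get cheaper, in the active-learning spirit described next, is to let the current model pre-label everything and spend scarce human time on the examples it is least sure about. This is a generic sketch of that selection step, not Waymo's pipeline; the frame names and confidences are invented.

```python
def pick_for_labeling(scores, budget=2):
    """Active-learning style selection: send the examples the current model
    is least sure about to human labelers first. `scores` maps example id
    to the model's confidence in its own pre-label (0..1)."""
    by_uncertainty = sorted(scores, key=lambda ex: abs(scores[ex] - 0.5))
    return by_uncertainty[:budget]

# Hypothetical model confidences for five unlabeled frames:
confidences = {"frame_a": 0.97, "frame_b": 0.52, "frame_c": 0.10,
               "frame_d": 0.49, "frame_e": 0.88}
print(pick_for_labeling(confidences))  # → ['frame_d', 'frame_b']
```

Frames the model already classifies with high confidence (0.97 or 0.10) keep their machine-generated labels, and the human budget goes to the genuinely ambiguous ones near 0.5.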
Today we do a lot better: not only do we require less data, but we can also generate those data sets much more efficiently. You can use machine learning itself to come up with labels and have operators, using prior models, fix only the discrepancies or mistakes rather than labeling the whole thing from scratch; that's the whole space of active learning and related techniques. Combining those techniques, you can get to completion faster, but it's still very common to need labels in the millions range to train a robust solution. Another piece is around computing power. Again, here's a historical tidbit around the street-number models: here is the detection model and here is the transcriber model. The comparison is only worth what it's worth, but if you look at the number of neurons, or the number of connections per neuron, which are two important parameters of a neural net, that gives you an idea of scale: it's many orders of magnitude away from what the human brain can do, but you start to be competitive in some cases. Again, historical data, but the main point here is that you need a lot of computation: you need access to a lot of computing to either train those models or run inference with them in real time on the scene, and that requires a lot of very robust engineering and infrastructure development to get to those scales. Google is pretty good at that, and we at Waymo have access to the Google infrastructure and tools to essentially get there. The way it's happening at Google is around TensorFlow. Maybe you've heard about it as a programming language to program machine learning and encode network architectures, but TensorFlow is actually also the
whole ecosystem that combines all those pieces together and enables machine learning at scale at Google and Waymo. As I said, it's a language that allows teams to collaborate and work together; it's a data representation in which you can represent your labeled data sets, for instance, or your training batches; and it's a runtime that you can deploy onto Google data centers, and it's good that we have access to that computing power. Another piece is accelerators. Back in the early days we had CPUs to run deep learning models at scale, which is less efficient; over time GPUs came into the mix, and Google has been proactive in developing a very advanced set of hardware accelerators. So beyond the GPUs you may have heard about, there are TPUs, tensor processing units, which are proprietary chipsets that Google deploys in its data centers to train and run inference on these deep learning models more efficiently, and TensorFlow is the glue that allows you to deploy at scale across those pieces; a very important piece to get there. So it's nice: you're smart, you build a smart algorithm, you were able to collect enough data to train it, great, ship it. Well, a self-driving system is pretty sophisticated; it's a complex system to understand, and it's a complex system that requires extensive testing, and I think the last leg you need to cover to do machine learning at scale, and with a high safety bar, is around your testing program. There are three legs that we use to make sure that our machine learning is ready for production: one is real-world driving, another one is simulation, and the last one is structured testing; I'll come back to that. In terms of real-world driving, obviously there is no way around it: if you want to encounter situations and see and understand how you behave, you need to drive. As you can see, the driving at Waymo has been accelerating over time and is still
accelerating: we crossed three million miles driven back in May 2017, and only six months later, in November, we reached four million, so that's an accelerating pace. Obviously not every mile is equal, and what you care about are the miles that carry new and important situations, so what we do is drive in many different situations: those miles were acquired across more than 20 cities, in many weather conditions and many environments. To give you an order of magnitude, that's about 160 times around the globe. Even more to the point, though it's hard to estimate, that's probably around 300 years of human-driving equivalent, so in that data set you potentially have 300 years of experience that your machine learning can tap into to learn what to do. Even more important is your ability to simulate. The software changes regularly, and if for each new revision of the software you had to go and re-drive four million miles, that would not be very practical; it would take a lot of time. So the ability to build good enough simulation that you can replay all the miles you've driven against any new iteration of the software is key for deciding whether the new version is ready or not. Even more important is your ability to make those miles more efficient and tweak them. Here is a screenshot of an internal tool that we call Carcraft, which essentially gives us the ability to fuzz, or change, the parameters of an actual scene we've driven: what if the cars were driving at a slightly different speed, what if there was an extra car in the scene, what if a pedestrian crossed in front of the car? You can use the actual real-world miles as a base and then augment them into new situations to test your self-driving system against. That's a very powerful way to drastically multiply the impact of any mile you drive.
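That kind of scenario fuzzing can be sketched very simply: take one logged scene and enumerate variations of its parameters. The scenario fields and parameter ranges below are hypothetical stand-ins, not Carcraft's actual representation.

```python
import itertools

def fuzz_scenario(base, speed_deltas, ped_crossing_opts):
    """Generate variations of one logged drive: tweak the other car's
    speed and optionally add a crossing pedestrian. Field names are
    hypothetical stand-ins for a real scenario description."""
    variants = []
    for dv, ped in itertools.product(speed_deltas, ped_crossing_opts):
        v = dict(base)
        v["other_car_speed"] = base["other_car_speed"] + dv
        v["pedestrian_crossing"] = ped
        variants.append(v)
    return variants

logged = {"other_car_speed": 12.0, "pedestrian_crossing": False}
runs = fuzz_scenario(logged, speed_deltas=[-2.0, 0.0, 2.0],
                     ped_crossing_opts=[False, True])
print(len(runs))  # → 6 simulated variations of one real-world scene
```

Even this toy version shows the multiplier effect: one driven scene and two fuzzed parameters already yield six test scenarios, and the product grows combinatorially as you add parameters.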
Simulation is another of those massive-scale projects that you need to cover, so a couple of orders of magnitude