Sacha Arnoud, Director of Engineering, Waymo - MIT Self-Driving Cars
LSX3qdy0dFg • 2018-02-16
today we have the Director of Engineering, Head of Perception at Waymo, a company that's recently driven over four million miles autonomously, and in so doing inspired the world in what artificial intelligence and good engineering can do. So please give a warm welcome to Sacha Arnoud.
[Applause]
Thanks a lot, Lex, for the introduction. Well, it's a pretty packed house; thanks a lot, I'm really excited. Thanks for giving me the opportunity to come and share my passion for self-driving cars, to share with you all the great work we've been doing at Waymo over the last 10 years, and to give you more details on the recent milestones we've reached.
As you'll see, we'll cover a lot of different topics, some more technical, some more about context, but within the content I have three main objectives that I'd like to convey today, so keep them in mind as we go through the presentation. The first one is to give you some background around the self-driving space, what's happening there, and what it takes to build self-driving cars, but also to give you some behind-the-scenes views and tidbits on the history of machine learning and deep learning, and how it all came together within the big Alphabet family, from Google to Waymo.
The second objective I have is to give you some technical meat around the techniques that are working today on our self-driving cars. During the class you've heard a lot about different deep learning techniques, models, architectures, and algorithms, and I'll try to put that into a coherent whole so that you can see how those pieces fit together to build the system we have today. And last, as Lex mentioned, it takes a lot more than algorithms to build a sophisticated system such as our self-driving cars; fundamentally it takes a full industrial project to make that happen. So I'll try to give you some color, which hopefully is different from what you've heard during the week, on what it takes to actually carry out such an industrial project in real life and essentially productionize machine learning.
We hear a lot about self-driving cars; it's a very hot topic, and for very good reasons. I can tell you for sure that 2017 was a great year for Waymo. Actually, only a year ago, in January 2017, Waymo became its own company, so that was a major milestone and a testament to the robustness of the technology, that we could move to a productization phase. What you see in the picture here is our latest-generation self-driving vehicle. It is based on the Chrysler Pacifica; you can already see a bunch of sensors, and I'll come back to that and give you more insight into what they do and how they operate, but that's the latest and greatest.
Self-driving indeed draws a lot of attention, and for very good reason. I personally believe, and I think you will agree with me, that self-driving really has the potential to deeply change the way we think about mobility and the way we move people and things around. To cover just a few aspects, without going into too many details: safety is one of the main motivations. 94% of US crashes today involve human error, and a lot of those errors are around distraction and things that could be avoided, so safety is a big piece of it. Disability, and access to mobility in general, is also a big motivation of ours: the self-driving technology has the potential to make it much more available and cheaper for more people to be able to move around. And last but not least is efficiency, a collective efficiency. We spend a lot of time in our cars in long commute hours (I personally spend a lot of time in commute hours), and that time we spend in traffic could probably be better spent doing something other than having to drive. Beyond traffic, the self-driving technology has the potential to deeply change the way we think about parking spots, urban environments, and city design. That's why it's a very exciting topic.
That's why we made it our mission at Waymo, fundamentally, to make it safe and easy to move people and things around. It's a nice mission, and we've been on it for a very long time. The whole adventure started close to 10 years ago, in 2009, and at the time it started under the umbrella of a Google project that you may have heard of, called Chauffeur. Back in those days (remember, we were before the deep learning days, at least in the industry) the first objective of the project was to assemble a first prototype vehicle, take off-the-shelf sensors, put them together, and try to go and decide if self-driving is even a possibility. It's one thing to have some prototype somewhere, but is this even a thing that is worth pursuing? That's a very common way for Google to tackle problems. So the genesis for that work was to come up with a pretty aggressive objective.
The first milestone for the team was essentially to assemble ten 100-mile loops in Northern California, around Mountain View, for a total of 1,000 miles, and try to see if they could build a first system that would be able to go and drive those loops autonomously. And the team was not afraid: those loops went through some very aggressive patterns. Some of them go through the Santa Cruz Mountains, which is an area in California that, as I'll show you in a video, has very small roads with two-way traffic, cliffs with negative obstacles, and complicated patterns like that. Some of those paths were going on highways, including some of the busiest highways. Some routes were going around Lake Tahoe, which is in the Sierras in California, where you can encounter different kinds of weather and, again, different kinds of road conditions. Those routes were going over bridges, and the Bay Area has quite a few bridges to go through. Some of them were even going through dense urban areas: you can see San Francisco being driven, you can see some of the Monterey centers being driven, and as you'll see in the video, those truly bring dense urban area challenges.
Since I promised it, here you're going to see some footage of the driving, and it's kind of working. Here, with better quality, you see the roads I was talking about in the Santa Cruz Mountains: driving in the night, animals crossing the street, freeway driving. Going through Monterey, another area that is challenging; there's an aquarium there, a pretty popular one. That's the famous Lombard Street in San Francisco that you may have heard of, which always brings a unique set of challenges between fog and slopes, and in that case even sharp turns. That was all the way back in 2010: those ten loops were successfully completed, 100% autonomously, back in 2010, so that's more than eight years ago. On the heels of that success, the team and Google decided that self-driving was worth pursuing, and moved forward with the development and testing of the technology. So we've been at it for all those years and have been working very hard on it.
Historically, Waymo, and I think all the other companies out there, have been relying on what we call safety drivers, who still sit behind the wheel even while the car is driving autonomously: you still have a safety driver who is able to take over at any time and make sure that we have safe operations. As we've been accumulating miles and knowledge, and developing many iterations of the system across all those years, we reached a major milestone, as Lex mentioned, back in November, where for the first time we reached a level of confidence and maturity in the system such that we felt confident, and proved to ourselves, that it was safe to remove the safety driver. As you can imagine, that's a major milestone, because it takes a very high level of confidence to not have that backup solution of a safety driver to take over should something arise. So here I'm going to show you a small video, a quick capture of that event. The video is from one of the first times we did that; since then, we've been continuously operating driverless self-driving cars in the Phoenix area in Arizona to expand our testing.
Here you can see the video. You can see our Chrysler Pacifica; here we have members of the team who are acting as the passengers, getting in the back seat, and you can notice that there is no driver in the driver's seat. Here we are running a car-hailing kind of service: the passenger simply presses a button, the application knows where they want to go, and the car goes, with no one in the driver's seat. We started with a fairly constrained geographical area in Chandler, close to Phoenix, Arizona, and we have been hard at work since then to expand the testing and the scope of our operating area. And it goes well beyond a single car on a single day: not only do we do that continuously, but we also have a growing fleet of self-driving cars that we are deploying there, on the way to a product launch pretty quickly.
I've talked about 2010, and we are in 2018, and we're getting there, but it took quite a bit of time. I think one of the key ideas that I'd like to convey here today, and that I will come back to during the presentation, is how much work it takes to really take a demo, or something that's working in a lab, into something that you feel safe to put on the roads, to get all the way to that depth of understanding, that depth of perfection in your technology, such that you can operate safely. One way to say it is that when you are 90% done, you still have 90% to go: the first 90% of the technology takes only 10% of the time. In other words, you need to 10x. You need to 10x the capabilities of your technology; you need to 10x your team size and find ways for more engineers and more researchers to collaborate together; you need to 10x the capabilities of your sensors; you need to 10x, fundamentally, the overall quality of the system, and your testing practices, as we'll see, and a lot of the aspects of the program. And that's what we've been working on.
Beyond the context of self-driving cars, I want to spend a little bit of time giving you kind of an insider's view of the rise of deep learning. As I mentioned, back in 2009-2010 deep learning was not readily available yet, in full capacity, in the industry, and over those years it took a lot of breakthroughs to be able to reach that stage. One of them was the algorithmic breakthrough that deep learning gave us, and I'll give you a little bit of a backstage view on what happened at Google during those years. As you know, Google committed itself to machine learning and deep learning very early on. You may have heard of what we call internally the Google Brain team, which is a team fundamentally hard at work on the bleeding edge of research, which is well known, but also leading the development of the tools and infrastructure of the whole machine learning ecosystem at Google, enabling many teams to develop machine learning at scale, all the way to successful products. They've been pushing the deep learning technology, and the field, in many directions, from computer vision to speech understanding to NLP, and all those directions are things that you can see in Google products today: whether you're talking about the Assistant, Google Photos, speech recognition, or even Google Maps, you can see the impact of deep learning in all those areas.
Actually, many years ago I myself was part of the Street View team, and I was leading an internal project that we called StreetSmart. The goal we had with StreetSmart was to use deep learning and machine learning techniques to go and analyze street imagery, which as you know is a very big and varied corpus, so that we could extract elements that are core to our mapping strategy, and that way build a better Google Maps. For instance, in this picture, which is a piece of a panorama from Street View imagery, you can see that there are a lot of pieces that, if you could find them and properly localize them, would drastically help you build better maps. Street numbers, obviously, are really useful to map addresses. Street names, when combined with similar techniques on other views, help you properly draw all the roads and give a name to them, and those two combined actually allow you to do very high-quality address lookups, which is a common query on Google Maps. General text, and more specifically text on business facades, allows you to not only localize business listings that you may have gotten by other means to actual physical locations, but even build some of those local listings directly from scratch. And there are more traffic-oriented patterns, whether it's traffic lights or traffic signs, that can then be used for navigation, ETA predictions, and things like that.
That was our mission, and as I mentioned, one of the hard pieces of it is to map addresses at scale. So you can imagine that we had a breakthrough when we first were able to properly find those street numbers out of the Street View imagery, out of the facades. Solving that problem actually requires a lot of pieces. Not only do you need to find where the street number is on the facade, which if you think about it is a fairly hard semantic problem (what's the difference between a street number versus another kind of number versus other text?), but then obviously you need to read it, because there's no point having the pixels if you cannot understand the number that's on the facade, all the way to properly geo-localizing it so that you can put it on Google Maps. That first deep learning application that succeeded in production, and it's all the way back in 2012 that we had the first system in production, was really the first breakthrough that we had across Alphabet in our ability to properly understand real scene situations.
So here I'm going to show you a video that kind of sums it up. Every one of those segments is actually a view, starting from the car and going to the physical number, for all those house numbers that we've been able to detect and transcribe. Here, that's in São Paulo, and you can see that when all that data is put together, it gives you a very consistent view of the addressing scheme. Here's another example, doing similar things, in Paris, where we have more imagery, so more views of each of those physical numbers; when you can triangulate them, you're able to localize them very accurately and have very accurate maps. The last example I'm going to show is in Cape Town, South Africa, where again the impact of that deep learning work has been huge in terms of quality. Many countries today actually have upwards of 95% of addresses mapped that way.
You can see a lot of parallelism between that work on street imagery and doing the same on the real scene from the car. But obviously doing it on the car is even harder, because you need to do it in real time, very quickly, with low latency, and you also need to do it in an embedded system: the cars have to be entirely autonomous, so you cannot rely on a connection to a Google data center. First, you don't have the time, in terms of latency, to send data back and forth, but more fundamentally you cannot rely on a connection for the safe operation of your system, so you need to do the processing within the car. There's a paper that you can read that dates all the way back to 2014 where, for the first time, by using slightly different techniques, we were able to put deep learning to work inside that constrained real-time environment and start to have impact, in that case around pedestrian detection.
As I said, there are a lot of analogies: you can see that to properly drive a scene, just like in Street View, you need to see the traffic light and you need to understand whether the light is red or green, and that's essentially the same kind of processing. Obviously driving is even more challenging, beyond the real-time aspect: if you saw the cyclist going through, you have stuff happening on the scene that you need to detect and properly understand, interpret, and predict. And at the same time, I explicitly took a night-driving example to show you that while you can choose when you take pictures for Street View and do it in perfect conditions, driving requires you to take the conditions as they are and deal with them. So from the very early beginning there's been a lot of cross-pollination between the street scene work and the car work; here I picked a few papers that we did in Street View that, if you read them, you'll see directly apply to some of the stuff we do on the cars. And obviously that collaboration between Google research and Waymo historically went well beyond Street View only, across all the research groups, and there still is a very strong collaboration going on that enables us to stay on the bleeding edge of what we can do.
Now that we've looked a little bit at how things happened, I want to spend more time going into the details of what's going on in the cars today and how deep learning is actually impacting our current system. If I followed the course schedule properly, I think during the week you went through the major pieces that you need to master to make a self-driving car. I'm sure you heard about mapping; localization, so putting the car within those maps and understanding where you are within them with pretty good accuracy; perception; scene understanding, which is a higher-level semantic understanding of what's going on in the scene, starting to predict what the agents around you are going to do so that you can do better motion planning. Obviously there is a whole robotics aspect: at the end of the day the car in many ways acts like a robot, whether it's around the sensor data or even the control interfaces to the car, and everyone who has dealt with real hardware in robotics will agree with me that it's not a perfect world, and you need to deal with those errors. Other pieces that you may have talked about are around simulation, and essentially validation of whatever system you put together. Machine learning and deep learning have been having a deep impact on a growing set of those areas, but for the next minutes here I'm going to focus more on the perception piece, which is a core element of what the self-driving car needs to do.
So what is perception? Fundamentally, perception is the system in the car that needs to build an understanding of the world around it, and it does that using two major inputs. The first one is priors on the scene. To give you an example, it would be a little silly to have to recompute the actual location of the road, or the actual connectivity of every intersection, once you get to the scene, because those things you can pre-compute in advance, and save your onboard computing for the tasks that are more critical. That's often referred to as the mapping exercise, but really it's about reducing the computation you're going to have to do on the car once it drives. The other big input, obviously, is what the sensors are going to give you once you get on the spot: sensor data is the signal that's going to tell you what is not like what you mapped, and the live things: is the traffic light red or green, where are the pedestrians, where are the cars, what are they doing?
As we saw in the initial picture, we have quite a set of sensors on our self-driving cars: vision systems, radar, and lidar, the three big families of sensors we have. One point to note here is that they are designed to be complementary. They are complementary, first, in their placement on the car: we don't put them all in the same spot, because blind spots are a major issue and you want good coverage of the field of view. The other piece is that they are complementary in their capabilities. For instance, cameras are very good at giving you a dense representation: it's a very dense set of information, and it contains a lot of semantic information; you can really see a large number of details. But they are not really good at giving you depth: it's much harder, and computationally expensive, to get depth information out of camera systems. A system like a lidar, on the other hand, when it hits objects, will give you a very good depth estimation, but it's going to lack a lot of the semantic information that you'll find in camera systems. So all those sensors are designed to be complementary in terms of their capabilities. It goes without saying that the better your sensors are, the better your perception system is going to be, and that's why at Waymo we took the path of designing our own sensors in-house, enhancing what's available off the shelf today, because it's important for us to go all the way in being able to build a self-driving system that we could believe in. And so that's what perception does: it takes those two inputs and builds a representation of the scene. At the end of the day, you have to realize that that work of perception is really what deeply differentiates what you need to do in a self-driving system as opposed to a lower-level driving-assistance system. In many cases, if you do cruise control, or a lot of lower-level driver assistance, a lot of the strategies can be built around not bumping into things: if you see things moving around you, you group them, you segment them appropriately into blocks of moving things, and you don't hit them; you're good enough in most cases.
When you don't have a driver in the driver's seat, obviously the challenge totally changes scale. To give you an example: if you're in your lane and you see a bicyclist moving slowly in the lane to the right of you, and there's a car next to you, you need to understand that there's a chance that that car is going to want to avoid the bicyclist and is going to swerve, and you need to anticipate that behavior so that you can properly decide whether you want to slow down and give space to the car, or speed up and have the car go behind you. Those are the kinds of behaviors that go well beyond not bumping into things, and that require a much deeper understanding of the world that's going on around you. So let me put it in pictures, and we'll come back to that example with a concrete case.
Here is a typical scene that we encountered. You have a police car that pulled over, probably pulled someone over there; you have a cyclist on the road moving forward; and we need to drive through that situation. The first thing you have to do, obviously, is the basics: out of your sensor data, understand that a set of point clouds and pixels belongs to the cyclist, find that you have two cars on the scene, the police car and the car parked in front of it, and understand the policeman as a pedestrian. That's the basic level of understanding. Obviously you need a little more than that; you need to go deeper in your semantics. If you understand that the flashing lights are on, you understand that the police car is acting as an EV, an emergency vehicle, and is performing something on the scene. If you understand that this car is parked, that's a valuable piece of information that's going to tell you whether you can pass it or not. Something you may not have noticed is that there are cones on the scene that would prevent you, for instance, from taking that pathway if you wanted to. At the next level, getting closer to behavior prediction, if you also understand that the police car has an open door, then all of a sudden you can start to expect a behavior where someone is going to get out of that car, and the way you would swerve, if you were to decide to swerve, or the way someone getting out of that car would impact the trajectory of the cyclist, is something you need to understand in order to properly and safely drive. Only when you have that depth of understanding can you start to come up with realistic behavior predictions and trajectory predictions for all those agents on the scene, so that you can come up with a proper strategy for your planning and control. So how is deep learning playing into that whole space, and how is deep learning impacting how we solve many of those problems?
Remember when I said that when you're 90% done you still have 90% to go? I think this is where that starts to bite us. I also talked about how robotics, and having sensors in real life, is not a perfect world; that is actually a big piece of the puzzle. I wish sensors would give us perfect data all the time, a perfect picture of reality that we could use to do our deep learning, but unfortunately that's not how it works. Here, for instance, you see an example with a pickup truck: the imagery doesn't show it, but you have smoke coming out of the exhaust, and that exhaust is triggering lidar laser points. That's not very relevant for any behavior prediction or for your driving behavior, and it's safe to go and drive through those points, so they are very safe to ignore in terms of scene understanding. Filtering the whole bunch of data coming off your sensors is a very important task, because it reduces the computation you're going to have to do in order to operate safely. A more subtle but important one is around reflections. We are driving a scene; there's a car here on the camera picture, and the car is reflected in a bus. If you just do naive detection, especially if the bus moves along with you and everything moves, which is very typical, then you can all of a sudden think you have two cars on the scene, and if you take that phantom car too seriously, all the way to impacting your behavior, obviously you're going to make mistakes. Here I showed you an example of reflections in the visual range, but that affects all sensors in slightly different ways. You could have the same effect, for instance, with lidar data: when you drive on a freeway and there's a road sign on top of the freeway, it will reflect in the back window of the car in front of you, showing a reflected sign on the road. You'd better understand that the thing you see on the road is actually a reflection, and not try to swerve around it to avoid it on a 65-mile-per-hour trajectory.
So that's a big, complicated challenge. But assume we are able to get to proper sensor data that we can start to process with our machine learning. By the way, a lot of the signal-processing pieces actually already use machine learning and deep learning too, because, as you can see for instance in the reflection space, you can do some tricks to understand the difference in the signal, but at the end of the day, for some of those cases, you're going to need a higher level of understanding of the scene, to realize that it's not possible for a car to be hiding behind the bus given your field of view, for instance. But assuming you have filtered your sensor data, the very next thing you typically want to do is apply some kind of convolution layers on top of that imagery. If you're not familiar with convolution layers: they're a very popular way to do computer vision, because they rely on connecting neurons with kernels that are going to learn, layer after layer, features of the imagery. Those kernels typically work locally on a region of the image, and they can understand lines, they can understand contours, and as you build up layers they're going to understand higher and higher levels of feature representations that ultimately tell you what's happening in the image. That's a very common technique, and much more efficient than, say, fully connected layers, which wouldn't work here. But a lot of the state of the art is actually in 2D convolutions: they've been developed on imagery, and they typically require a fairly dense input. For imagery that's great, because pixels are very dense: you always have a pixel next to the next one, and there is not a lot of void. If you were, for instance, to apply plain convolutions to a very sparse laser point cloud, then you would have a lot of holes, and those don't work nearly as well.
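As a minimal sketch of that stacked-convolution idea, here is a small tower in tf.keras; the patch size, channel counts, and class head are illustrative placeholders, not Waymo's actual architecture:

```python
import tensorflow as tf

def small_conv_tower(num_classes: int) -> tf.keras.Model:
    """Stacked 2D convolutions: early layers learn local features such as
    edges and contours, deeper layers learn higher-level representations."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu",
                               input_shape=(64, 64, 3)),   # low-level: edges
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),  # mid-level: contours
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),  # higher-level parts
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```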
So typically what we do is first project the sensor data into 2D planes and do the processing on those. There are two very typical views that we use. The first one is top-down: a bird's-eye view is going to give you a Google Maps kind of view of the scene, which is great, for instance, to map cars and objects moving along the scene, but it's harder to put the imagery pixels you saw from the car into those top-down views. The other famous, common one is the driver view: a projection onto a plane from the driver's perspective, which is much better at utilizing imagery, because essentially that's how the imagery got captured in the first place.
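A hypothetical sketch of that top-down projection: scattering sparse lidar returns into a dense 2D grid that 2D convolutions can consume. The grid extent and cell size here are made-up values:

```python
import numpy as np

def bev_projection(points: np.ndarray, extent: float = 40.0,
                   cell: float = 0.25) -> np.ndarray:
    """points: (N, 3) lidar returns (x, y, z) in the car frame.
    Returns a dense top-down occupancy grid centered on the car."""
    size = int(2 * extent / cell)
    grid = np.zeros((size, size), dtype=np.float32)
    # keep only returns inside the square [-extent, extent] around the car
    mask = (np.abs(points[:, 0]) < extent) & (np.abs(points[:, 1]) < extent)
    ij = ((points[mask, :2] + extent) / cell).astype(int)
    grid[ij[:, 0], ij[:, 1]] = 1.0  # mark occupied cells
    return grid
```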
Here, for instance, you're going to see how, if your sensors are properly registered, you can use both lidar and imagery signals together to better understand the scene. The first kind of processing you can do is what is called segmentation: once you have pixels or laser points, you need to group them together into objects that you can then use for better understanding and processing. Unfortunately, a lot of the objects you encounter while driving don't have a predefined shape. Here are two examples with snow, but if you think about vegetation, or trash bags for instance, you can't come up with a prior understanding of how they're going to look, and so you have to be ready for those objects to have any shape. One of the techniques that works pretty well is to build a smaller convolution network that you slide across the projection of your sensor data: the sliding-window approach. Here, for instance, if you have a pixel-accurate snow detector that you slide across the image, then you'll be able to build a representation of those patches of snow and drive appropriately around them. That works pretty well, but as you can imagine it's a little computationally expensive: if you remember them, it's like the old dot-matrix printers, where the printer had to go and print the page point by point. It worked pretty well, but it was pretty slow. It's very analogous to that: it works well, but you need to be very conscious of which area of the scene you apply it to, to stay efficient.
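As a sketch, the sliding-window idea looks something like this; `patch_classifier` stands in for a small convolutional net such as a per-patch snow detector, and the patch size, stride, and threshold are illustrative:

```python
import numpy as np

def sliding_window_detect(image: np.ndarray, patch_classifier,
                          patch: int = 16, stride: int = 8):
    """Run a small classifier at every stride across the projected sensor
    image, dot-matrix style: one network evaluation per window."""
    h, w = image.shape[:2]
    hits = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            score = patch_classifier(image[y:y + patch, x:x + patch])
            if score > 0.5:  # illustrative confidence threshold
                hits.append((x, y, patch, patch, score))
    return hits  # accurate but expensive: cost grows with the area covered
```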
Fortunately, many of the objects you need to care about have predefined priors. For instance, if you take a car from the top-down, bird's-eye view, it's going to be a rectangle, and you can take that shape prior into consideration. In most cases, even for the driving lanes, objects are going to move in similar directions: whether they go forward or come the other way, they're going to go in the direction of the lanes, and the same goes for roads and streets. You can use those priors to do some more efficient deep learning, known in the literature under the ideas of single-shot multibox detectors. Here, again, you would start with convolution towers, but you do only one pass of convolution. It's the same difference as between a dot-matrix printer and a laser printer that prints the page at once; it's not a perfect analogy, but I think it conveys the idea pretty well. Here you would train a deep net that directly takes the whole projection of the sensor data and outputs boxes that encode the priors you have. For instance, I can show you how such a thing would work for cone detection. You can see that we don't have all the fidelity of the per-pixel cone detection, but we don't really care about that; we just need to know there is a cone somewhere, and we take a box prior. What that image is also meant to show is that since it's a lot cheaper computationally, you can run it over a pretty wide range of space, and even if you have a lot of cones, it's still going to be a very efficient way to get that data.
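A sketch of that single-shot idea in tf.keras: one convolutional pass over the whole projection, with the last layer directly emitting, for every grid cell, a confidence plus offsets against the box prior. Shapes and channel counts are illustrative, not Waymo's model:

```python
import tensorflow as tf

def single_shot_head(grid_size: int = 80, channels: int = 8) -> tf.keras.Model:
    """One pass over the whole top-down grid, laser-printer style."""
    inp = tf.keras.layers.Input(shape=(grid_size, grid_size, channels))
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same",
                               activation="relu")(x)
    # per-cell outputs: [objectness, dx, dy, dw, dh] relative to the box prior
    boxes = tf.keras.layers.Conv2D(5, 1, padding="same")(x)
    return tf.keras.Model(inp, boxes)
```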
Remember, we talked about the flashing lights on top of the police car. Even if you properly detect and segment cars on the road, many cars carry very special semantics. On this slide I'm showing you many examples of EVs, emergency vehicles, that you need to visually understand: you need to understand, first, that it is an EV, and second, whether the EV is active or not. School buses are not actually emergency vehicles, but whether the bus has its lights on, or has its stop sign open on the side, carries heavy semantics that you need to understand. So how do you deal with that? Back to the deep learning techniques: one thing you could do is take that patch, build a new convolution tower, and put a classifier on top of it, and essentially build a school-bus classifier, a school-bus-with-lights-on classifier, a school-bus-with-stop-sign-open classifier. I'm pretty sure that would work pretty well, but obviously it would be a lot of work, and pretty expensive to run on the car, because convolution layers typically are the most expensive pieces of a neural net. A better thing to do is to use embeddings. If you're not familiar with them, embeddings essentially are vector representations of objects that you can learn with deep nets, and that really carry some semantic meaning of those objects. For instance, given a vehicle, you can build a vector that's going to carry the information that the vehicle is a school bus, whether the lights are on, whether the stop sign is open, and then you're back in a vector space, which is much smaller and much more efficient, that you can operate in to do further processing.
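A sketch of that embedding idea: one shared convolution tower maps a vehicle patch to a compact vector, and cheap linear heads decode the semantic attributes from the vector instead of running a full tower per question. The attribute names and dimensions here are illustrative:

```python
import tensorflow as tf

patch = tf.keras.layers.Input(shape=(64, 64, 3))          # cropped vehicle patch
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(patch)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
embedding = tf.keras.layers.GlobalAveragePooling2D()(x)   # the learned vector

# cheap heads sharing the one expensive tower: one semantic question each
heads = {name: tf.keras.layers.Dense(1, activation="sigmoid",
                                     name=name)(embedding)
         for name in ("school_bus", "lights_on", "stop_sign_open")}
model = tf.keras.Model(patch, heads)
```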
Historically, those embeddings have been most closely associated with word embeddings. In a typical text, if you were able to build those vectors out of words, so out of every word in a piece of text you build the vector that represents the meaning of that word, then if you look at the sequence of those words and operate in the vector space, you start to understand the semantics of those sentences. One of the early projects that you can look at is called word2vec, which was done in an NLP group at Google, where they were able to build such things, and they discovered that the embedding space actually carried some interesting vector-space properties, such as: if you took the vector for "king", minus the vector for "man", plus the vector for "woman", you ended up with a vector whose closest word would essentially be "queen". That's to show you how powerful those vector representations can be in the amount of information they can contain.
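That classic word2vec arithmetic, sketched here with a stand-in embedding table; with actually trained vectors, the nearest neighbor of king - man + woman comes out as queen:

```python
import numpy as np

# stand-in for a trained word2vec table; real vectors come from training
vectors = {w: np.random.randn(300) for w in ("king", "man", "woman", "queen")}

def nearest(query: np.ndarray, vocab: dict) -> str:
    """Return the word whose vector has the highest cosine similarity."""
    return max(vocab, key=lambda w: np.dot(vocab[w], query) /
               (np.linalg.norm(vocab[w]) * np.linalg.norm(query) + 1e-8))

query = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(query, vectors))  # with trained vectors this prints "queen"
```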
Let's talk about pedestrians. We talked about semantic image segmentation, the ability to go pixel by pixel for things that don't really have a shape, and we talked about using shape priors. Pedestrians actually combine the complexity of those two approaches, for many reasons. One is that they are obviously deformable, and pedestrians come in many shapes and poses: as you can see here, you have someone on a skateboard, someone crouching, more unusual poses that you need to understand. Also, the recall you need to have on pedestrians is very high, and pedestrians show up in many different situations. For instance, here you clearly have a pedestrian that you need to see, because there's a good chance, when you do your behavior prediction, that that person is going to jump out of the car, and you need to be ready for that. Last but not least, predicting the behavior of pedestrians is really hard, because they can move in any direction. For a car moving in a given direction, you can safely bet it's not going to drastically change angle at a moment's notice; but if you take children, for instance, it's a little more complicated: they may not pay attention, they may jump in any direction, and you need to be ready for that. So it's harder in terms of shape priors, it's harder in terms of recall, and it's also harder in terms of prediction, and you need a fine understanding of the semantics. Another example here that we encountered: you get to an intersection and you have a visually impaired person who is jaywalking through the intersection, and you obviously need to understand all of that to know that you need to yield to that person. Pretty clearly, a person on the road: maybe you should yield to him. Not easy, though. For instance, here, I don't know if it's a real person or a mannequin, but we have something that frankly really looks like a pedestrian, that you should probably classify as a pedestrian, but it's lying in the bed of a pickup truck, and obviously you shouldn't yield to that person, because yielding to a pedestrian at 35 miles per hour means hitting the brakes pretty hard, with the risk of being rear-ended. So obviously you need to understand that that person is traveling with the truck, is not actually on the road, and it's okay not to yield to him. Those are examples of the richness of the semantics you need to understand.
One way to do that is to start to understand the behavior of things over time. Everything we talked about up until now, in how we use deep learning to solve some of these problems, was on a pure per-frame basis, but understanding that that person is moving with the truck, versus the jaywalker in the middle of the intersection, is the kind of information you can only get at if you observe behavior over time. Back to the embeddings: if you have vector representations of those objects, you can start to track them over time. A common technique that you can use to get there is recurrent neural networks, which essentially are networks that build a state that gets better and better as they get more sequential observations of your pattern. For instance, coming back to the words example I gave earlier: you see one word and its vector representation, then another one, and you understand more of what the author is trying to say; by the third word, the fourth word, at the end of the sentence you have a good understanding and you can start to translate the sentence. It's a similar idea here: if you have a semantic representation, encoded in an embedding, of the pedestrian and the car under him, and you track that over time and build a state that gets more and more meaning as time goes by, you're going to get closer and closer to a good understanding of what's going on in the scene. My point here is that those vector representations, combined with recurrent neural networks, are a common technique that can help you figure that out.
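As a sketch of that combination: feed the per-frame object embeddings into a recurrent network so the hidden state accumulates evidence over time. The embedding size and the behavior classes are illustrative placeholders:

```python
import tensorflow as tf

# input: a sequence of per-frame embeddings for one tracked object
track_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(None, 128)),  # state builds over time
    tf.keras.layers.Dense(3, activation="softmax"),     # e.g. 3 behavior classes
])

frames = tf.random.normal([1, 10, 128])  # one object observed for 10 frames
behavior_probs = track_model(frames)     # the estimate sharpens as frames accrue
```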
Back to the point that when you're 90% done you still have 90% to go. To get to the last leg of my talk today, I want to give you some appreciation for what it takes to truly build a machine learning system at scale and industrialize it. Up until now we talked a lot about algorithms; as I said earlier, algorithms, and the efficiency of those algorithms, have been a breakthrough for us in succeeding at the self-driving task, but it takes a lot more than algorithms to actually get there. The first piece that you need to 10x is around labeling efforts. A lot of the algorithms we talked about are supervised, meaning that even if you have a strong network architecture and you come up with the right one, in order to train that network you need to come up with a representative, high-quality set of labeled data that maps some input to the output you want the network to predict: that's a pedestrian, that's a car, that's a pedestrian, that's a car, and the network will learn, in a supervised way, how to build the right representations. The unsupervised space is a very active domain of research, and our own research team at Waymo, in collaboration with Google, is active in that domain, but today a lot of it is still supervised. To give you orders of magnitude, here are the sizes of a couple of datasets, represented on a logarithmic scale. You may be familiar with ImageNet, which I think is in the 15-million-labels range; for comparison, one of the points represents the number of seconds from birth to college graduation, which is more of an amusing tidbit. And remember the find-the-street-number-on-the-facade problem: back in those days it took us a multi-billion-label dataset to actually teach the network. Those were very early days, and today we do a lot better, but that's to give you an idea of scale. Being able to have labeling operations that produce large, high-quality labeled datasets is key to your success, and that's a big piece of the puzzle you need to solve.
Today we do a lot better: not only do we require less data, but we also can generate those datasets much more efficiently. You can use machine learning itself to come up with labels, and use operators and, more importantly, pre-trained models, where human labels are used more and more to fix the discrepancies or the mistakes of the model, rather than having to label the whole thing from scratch; that's the whole space of active learning and similar ideas. Combining those techniques together, you can get to completion faster, but it's still very common to need labeled datasets in the millions range to train a robust solution.
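A hypothetical sketch of that model-assisted loop; every name here (`predict_with_confidence`, `human_review`, `train`) is a placeholder, not a real Waymo pipeline:

```python
def labeling_round(model, unlabeled_batch, human_review, train):
    """Pre-label with the current model; humans only fix the uncertain cases."""
    auto, needs_review = [], []
    for example in unlabeled_batch:
        label, confidence = model.predict_with_confidence(example)  # placeholder
        if confidence > 0.95:
            auto.append((example, label))          # accept the machine label
        else:
            needs_review.append((example, label))  # human corrects, not from scratch
    corrected = human_review(needs_review)
    return train(model, auto + corrected)          # retrain on the combined set
```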
Another piece is around computing power. Again, here is kind of a historical tidbit around the street number models: here is the detection model, and here is the transcriber model. The comparison is only worth what it's worth, but if you look at the number of neurons, or the number of connections per neuron, which are two important parameters of any neural net, that gives you an idea of scale. It's obviously many orders of magnitude away from what the human brain can do, but you start to be competitive, in some cases, with small animal brains. Again, historical data, but the main point here is that you need a lot of computation: you need access to a lot of computing to either train those models or run inference on them in real time on the scene, and that requires a lot of very robust engineering and infrastructure development to get to those scales. Google is pretty good at that, and we at Waymo have access to the Google infrastructure and tools to essentially get there.
I don't know if you've heard, but the way it's happening at Google is around TensorFlow. Maybe you've heard about it as more of a programming language to program machine learning and encode network architectures, but actually TensorFlow is also, or is becoming, the whole ecosystem that combines all those pieces together to do machine learning at scale at Google and Waymo. As I said, it's a language that allows teams to collaborate and work together. It's a data representation in which you can represent your labeled datasets, for instance, or your training batches. And it's a runtime that you can deploy onto Google data centers, and it's good that we have access to that computing power. Another piece is accelerators. Back in the early days we had CPUs to run deep learning models at scale, which is less efficient; over time GPUs came into the mix, and Google has been proactive in developing a very advanced set of hardware accelerators. Beyond GPUs, you may have heard about TPUs, Tensor Processing Units, which are proprietary chipsets that Google deploys in its data centers and that are used to train and infer those deep learning models more efficiently, and TensorFlow is the glue that allows you to deploy at scale across those pieces. A very important piece to get there.
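As a toy sketch of those pieces in one place: TensorFlow as the language (the model definition), as the data representation (a tf.data pipeline over training batches), and as the runtime (the same program can be deployed on CPUs, GPUs, or TPUs). The data and shapes here are made up:

```python
import tensorflow as tf

# data representation: a pipeline over (features, labels) training batches
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]),
     tf.random.uniform([1024], maxval=2, dtype=tf.int32))
).batch(64)

# language: the network architecture, encoded once, shareable across teams
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# runtime: the same fit() call runs on CPU, GPU, or TPU back ends
model.fit(dataset, epochs=1)
```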
So, it's nice: you're smart, we built a smart algorithm, we were able to collect enough data to train it, great, ship it? Well, a self-driving system is pretty sophisticated. It's a complex system to understand, and a complex system that requires extensive testing, and I think the last leg that you need to cover to do machine learning at scale, and with a high safety bar, is around your testing program. We have three legs that we use to make sure that our machine learning is ready for production: one is around real-world driving, another one is around simulation, and the last one is around structured testing. I'll come back to that. In terms of real-world driving, obviously there is no way around it: if you want to encounter situations and see and understand how you behave, you need to drive. As you can see, the driving at Waymo has been accelerating over time, and it's still accelerating: we crossed three million miles driven back in May 2017, and only six months later, in November, we reached four million, so that's an accelerating pace. Obviously not every mile is equal, and what you care about are the miles that carry new and important situations, so what we do is drive in many different situations: those miles were acquired across 20 cities, in many weather conditions and many environments. It's a lot: to give you an order of magnitude, that's about 60 times around the globe. Even more importantly, and it's hard to estimate, that's probably around 300 years of human-driving equivalent. So in that dataset, potentially, you have 300 years of experience that your machine learning can tap into to learn what to do. Even more important is your ability to simulate.
Obviously the software changes regularly, and if for each new revision of the software you need to go and re-drive four million miles, it's not very practical; it's going to take a lot of time. So the ability to have good enough simulation that you can replay all those miles you've driven against any new iteration of the software is key for deciding whether the new version is ready or not. Even more important is your ability to make those miles even more efficient, and to tweak them. Here is a screenshot of an internal tool that we call Carcraft, which essentially gives us the ability to fuzz, or change, the parameters of an actual scene we've driven: what if the cars were driving at a slightly different speed? What if there was an extra car on the scene? What if a pedestrian crossed in front of the car? You can use the actual driven miles as a base and then augment them into new situations that you can test your self-driving system against. That's a very powerful way to drastically multiply the impact of any mile you drive, and simulation is another of those massive-scale projects that you need to cover, a couple of orders of magnitude more.
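A hypothetical sketch of that scene-fuzzing idea; the `Scene`-like object, its `agents` and `add_crossing_pedestrian` members, and `replay` are illustrative placeholders, not the real Carcraft interface:

```python
import copy
import itertools

def fuzz_scene(logged_scene, speed_scales=(0.8, 1.0, 1.2),
               add_pedestrian=(False, True)):
    """Sweep variations of one logged scene: agent speeds, a crossing walker."""
    for scale, ped in itertools.product(speed_scales, add_pedestrian):
        variant = copy.deepcopy(logged_scene)
        for agent in variant.agents:
            agent.speed *= scale               # what if cars moved differently?
        if ped:
            variant.add_crossing_pedestrian()  # what if someone crossed in front?
        yield variant

# each variant is then replayed against the self-driving stack, e.g.:
# for variant in fuzz_scene(scene): result = replay(driving_stack, variant)
```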