Drago Anguelov (Waymo) - MIT Self-Driving Cars

Q0nGo2-y0xY • 2019-02-12

Kind: captions
Language: en
all right welcome back to 6.S094
deep learning for self-driving cars
today we have Drago Anguelov
principal scientist at Waymo aside
from having the coolest name in
autonomous driving Drago has done a lot
of excellent work in developing applying
machine learning methods to autonomous
vehicle perception and more generally in
computer vision and robotics he's now
helping Waymo lead the world in
autonomous driving 10 plus million miles
achieved autonomously to date which is
an incredible accomplishment so it's
exciting to have Drago here with us to
speak please give him a big hand
[Applause]
hi thanks for having me I will tell you
a bit about our work and the the
exciting nature of self-driving and the
problem and our solutions so my talk is
called taming the long tail of
autonomous driving challenges my
background is in perception in robotics
so I did PhD at Stanford with Daphne
Koller and worked closely with one of
the pioneers in the space professor
Sebastian Thrun I spent eight years at
Google doing research on perception also
work on Street View developing deep
models for detection neural net
architectures I was briefly at Zoox I was
heading the 3d perception team at Zoox
we built another perception system for
autonomous driving and I've been leading
the research team at Waymo most
recently
so I want to tell you a little bit about
Waymo when we started Waymo actually
this month has its 10-year anniversary
it started when Sebastian Thrun
convinced the Google leadership to try
an exciting new moonshot
and the goal that they set for
themselves was to drive 10 different
segments that were 100 miles long and
later that year they succeeded and drove
an order of magnitude more than anyone
has ever driven
in 2015 we brought this car to the road
it was built ground up as a study in
what fully driverless mobility would be
like in 2015 we put this vehicle in
Austin and it completed the world's
first fully autonomous ride on public
roads and the person inside this car is
a fan of the project and he is blind so
we did not want this to be just a demo
fully driverless experience we worked
hard and in 2017 we launched a fleet of
fully self-driving vehicles on the
streets of the Phoenix metro area
and we have been doing driverless fully
driverless operations ever since
so I wanted to give you a feel for what
fully driverless experience is like
[Music]
and so we continued last year we
launched our first commercial service in
the metro area of Phoenix there people
can call a Waymo on their phone it can
come pick them up and help them with
errands or go to school and we've been
already learning a lot from these
customers and we're looking to grow and
expand the service and bring it to more
people so in the process of growing the
service we have driven 10 million miles
on public roads like Lex said both
driverlessly and also with
human drivers to collect data and we've
driven all kinds of scenarios cities
capturing a diverse set of conditions
and a diverse set of situations in which
we develop our systems
I want to tell you I mean about the long
tail of events this is all the things we
need to handle to enable a truly self
driving future and I guess all the
problems that come with this and offer
some solutions and show you how Waymo has been
thinking about these issues so as we
drove 10 million miles of course we
still find scenarios new ones that we
have not seen before we still keep
collecting them right and so when you
think about self-driving vehicles they
need to have the following properties
first a vehicle needs to be capable it
needs to be able to handle the entire
task of driving so you cannot just do a
subset and remove the human operator
from the vehicle and also all of these
tasks obviously you need to do well and
safely and that is the requirement to
achieving self-driving at scale and when
you think about this now the question is
well how many of these capabilities and
how many scenarios do you really need to
handle well it turns out well the world
is quite diverse and complicated and
there is a lot of rare situations and
all of them need to be handled well
right and they call this the long tail
the long tail of situations it's
one type of effort to get
self-driving for the common cases
and then it's another effort to tame
the rest and they really really
matter and so I'll show you some for
example
this is us driving in the street and
let's see if you can tell what is
unusual in this video you see so this I
can play it one more time so there's a
bicyclist and he is carrying a stop sign
and I don't know where he picked it up
but it's certainly not a stop sign we
need to stop for unlike others right and
so you need to understand that let me
show you another scenario this is
another case where we are happily
staying there and then the vehicle stops
and a big pile of poles comes our way
right and you need to potentially
understand that and learn to avoid it
generally well different types of
objects can fall on the road it's not
just poles here's another interesting
scenario this happens a lot it's
called construction and there's various
aspects of it one of them is someone
closed the lane put a bunch of
cones and we learn and this is our
vehicle correctly identifying where it's
supposed to be driving between all of
these cones and and successfully
executing it so yeah we drive for a
while and this is something that
happens fairly often
if you drive a lot another case is this
one I think you can you can understand
what happened here and you can notice
actually so we hear the siren so we we
have the ability to understand sirens to
special vehicles and you can see we hear
it and stop and some guys are much later
than us braking at the last moment
letting the emergency vehicle pass and
here's another scenario potentially I
want to show you let's see if you can
understand what happened
so let me play one more time did you
guys see
so we stopped at there's a green light
we're about to go and someone goes at
high speed running a red light without
any remorse right and we successfully
stop and prevent issues right and so
sometimes you have the rules of the way
and you have your road and people don't
always abide by them and that's
something that you know I don't want to
just directly go in front of that person
even if they're breaking the law so
hopefully with this I convince you that
the situations that can occur are diverse
and challenging and there's quite a few
of them and I want to take you a little bit
on a tour of what makes this challenging
and then tell you some ways in which we
think about it and how we're handling it
and so to do this we're going to delve a
little bit more into the main tasks for
self-driving which is perception
prediction and planning so I'll tell you
a little bit about those right and
perception these are the core AI aspects
of the car usually this task there's
others we can talk about others as well
in a little bit but let's focus on
these first so perception is mapping
from sensory inputs and potentially prior
knowledge of the environment to a scene
representation and that scene
representation can contain objects it
can contain scene semantics potentially you
can construct the map you can learn
about objects or relationships and so on
and perception the space of things you
need to handle in perception is fairly
hard it's a complex mapping right so you
have sensors the pixels come lidar
points come or radar scans come and you
have multiple axes of variability in the
environment so obviously there's a lot
of objects they have different types
appearance poses I don't know if you
see this well they're a bunch of people
dressed as dinosaurs in this case people
generally are fairly creative in how
they dress vehicles can also be
different types people come in different
poses and we have seen it all right so
that's one of the aspects
there's different environments that
these objects appear in so there are
times of day seasons day night different
for example highway environment suburban
street and so on and then there's a
different variability axis and this is a
little more slightly more abstract that
different objects can come in these
environments in different configurations
and can have different relationships and
so things like occlusion there's a guy
carrying a big board there is
reflections there is well people riding
on horses and so on and so why am I
showing this because I just want to show
you the space right so in most cases you
care about most objects in most
environments in most reasonable
configurations and that's a space that
you need to map from from the sensor
inputs to a representation that makes
sense and you need to learn
this mapping function or represent it
somehow right and so let's go to the
next step which is prediction so apart
from just understanding what's happening
in the world you need to be able to
anticipate and predict what some of the
actors in the world are going to do the
actors being mostly people and people is
honestly what makes driving quite
challenging this is one of the aspects
that do so it's you know vehicle needs
to be out there and be a full-fledged
traffic scene participant and this
anticipation of agent behavior sometimes
needs to be fairly long-term so
sometimes when you want to make a
decision you want to validate or
convince yourself it does not interfere
with what anyone else is going to do and
it can go from one second to maybe ten
seconds or more you need to anticipate
the future so what goes into
anticipating the future well you can
watch past behavior so if I'm
going this way maybe I will continue I'm
going there maybe I'm very aggressively
walking and maybe I'm more likely to do
aggressive motions in the future high
level scene semantics well I'm in a
presentation room I'm sitting here at
the front giving a talk I'll probably
stay here and continue even though
stranger things have happened
and of course there's subtle appearance
cues so for example if a person's
watching our vehicle and moving towards
them we can be fairly confident they're
paying attention and not going to do
anything particularly dangerous if
someone's not paying attention or being
distracted or you know there is a person
in the car waving at us various gesture
cues the blinkers from the vehicles
these are all signals and subtle
signals that we need to understand
in order to be able to behave well and
last but not least even when you predict
how other agents behave agents are also
affected by the other agents in the
environment as well so everyone can
affect everyone else and you need to be
mindful of this so I'll give you an
example of this I think this is one of
the issues that really needs to be
thought about we are all interacting
with each other so here's the case our
Waymo vehicle is driving and there is
two bicyclists in red going around a
parked car and what happens is we
correctly anticipate that as they bike
they will go around the car and we slow
down and let them pass right so we're
reasoning that they will interact with
the parked car this is the this is the
prediction our most likely prediction
for the rear bicyclists we anticipate
that they will do this and we correctly
handle this okay so this illustrates
prediction and here planning this is our
decision-making machine it produces
vehicle behavior typically ends up in
control commands to the vehicle
accelerate slow down steer the wheel and you need
to generate behavior that ultimately has
several properties too and it's important
to think of them which is safe safety
comes first comfortable for the
passengers and also sends the right
signals to the other traffic
participants you because they can
interact with you and they will react to
your actions you need to be mindful and
you need to of course make progress you
need to deliver your passengers so you
need to trade all of these in a
reasonable way right and it can
be
fairly sophisticated reasoning in
complex environments I'll show you just
one scene this is a complex I
think school gathering there's a bicyclist
trailing us vehicles really close
hemming us in a bunch of pedestrians
and we need to make progress and here is
us we're driving and reasonably well in
crowded scenes and that is part of the
prerequisite of bringing this technology
to all the dense urban environments
being able to do so how are we going to
do it well I gave it up I'm a machine
learning person I think when you have
these complicated models and systems
machine learning is a really great tool
to model complex actions complex mapping
functions features right and so we're
going to learn our system and we've been
doing this I mean we're not the only one
so obviously this this is now a machine
learning revolution and machine learning
is permeating all parts of the Waymo
stack all of these systems that I'm
talking about it helps us perceive the
world it helps us making decisions about
what others are going to do it helps us
make our own decisions and machine
learning is a tool to handle the long
tail right and I'll now tell you a little
more on this how so I have this allegory
about machine learning that I like to
think about so there is a classical
system and there is a machine learning
system and to me a classical system and
I've been there I've done well early
machine learning also systems also can
be a bit classical you're the artisan
you're the expert you have your tools
and you need to build this product and
you have your craft and you go and take
your tools and build it right and you can
fairly quickly get something reasonable
but then it's harder to change it's
harder to evolve if if you learn new
things now I need to go back and maybe
the tools don't quite fit and you need
to essentially keep keep tweaking it and
starts becoming the more complicated the
product becomes the harder it is to do
and machine learning modern machine
learning is like a factory right so
machine learning you build the factory
which is the machine learning
infrastructure and then you feed data in
this Factory and get nice models to
solve your problems right and so kind of
infrastructure is at the heart of this
new paradigm you need to build the
factory all right once you do it now you
can iterate it's scalable right just
keep the right data keep feeding the
machine keeps giving you good models so
what is the ml factory for self-driving
models well roughly it goes like
this we have a software release we put
it on the vehicle we're able to drive we
drive we collect data we collect it and
we store it and then we select some some
parts of this data and we send it to
labelers and the labelers label parts
of the data that we find interesting and
that's the knowledge that we want to
extract from the data these are the
labels the annotations the results we
want for our models right there is and
then what we're going to do is we're
gonna train machine learning models on
this data after we have the models we
will do testing and validation validate
that they're good to put on our vehicles
once they're good to put on our vehicles
we go and collect more data and then the
process starts going again and again
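As a rough illustration of the loop just described (drive, collect, select, label, train, validate, release, repeat), here is a minimal Python sketch. Everything in it is a hypothetical stand-in: the class `Factory`, its method names, and the idea that a "model" is simply the set of scenario types it has labels for are assumptions made for illustration, not a description of Waymo's infrastructure.

```python
from dataclasses import dataclass, field


@dataclass
class Factory:
    """Toy stand-in for the data -> labels -> model loop (all names hypothetical)."""
    labeled: dict = field(default_factory=dict)  # scenario -> label
    known: set = field(default_factory=set)      # scenarios the released model covers

    def select(self, logs):
        # Keep only scenarios the released model has not covered yet.
        return [s for s in logs if s not in self.known]

    def label(self, scenarios):
        # Stand-in for sending the selected data to human labelers.
        for s in scenarios:
            self.labeled[s] = f"label:{s}"

    def train(self):
        # Stand-in for training: the "model" is the set of labeled scenarios.
        return set(self.labeled)

    def validate(self, model):
        # Release only if the candidate covers everything the last release did.
        return self.known <= model

    def iterate(self, logs):
        # One turn of the loop: select, label, train, validate, release.
        self.label(self.select(logs))
        candidate = self.train()
        if self.validate(candidate):
            self.known = candidate
        return self.known


factory = Factory()
factory.iterate(["lane_keep", "lane_keep", "cones"])
factory.iterate(["lane_keep", "cyclist_with_stop_sign"])
```

In the second iteration only the new scenario is selected for labeling, which mirrors the point that each pass should pick data that was not selected before.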
right so you collect more data now you
select new data that you have not
selected before right you add it to your
data set you keep training the model and
iterate iterate iterate it's a nice
scalable set up of course
this needs to be automated it needs to
be scalable itself it's a game of
infrastructure right and at Waymo we
have the beautiful advantage to be
really well set up with regards to the
machine learning infrastructure and I'll
tell you a bit about its ingredients and
how we how we go about it so ingredient
one is computing software infrastructure
and we're part of alphabet Google and we
are able to first of all leverage
tensorflow the deep learning framework
we have access to the experts that wrote
TensorFlow and know it in depth we
have data centers to run large-scale
parallel compute and also train models
we have specialized hardware for
training models which you know make it
cheaper and more affordable and faster
so you can iterate better ingredient two
high quality labeled data we have the
scale to collect and store hundreds and
thousands and more miles to millions of
miles and just collecting and storing
ten million miles is not necessarily
the best thing you can do right because
there is a decreasing utility to the
data so most of the data comes from
common scenarios you may be already good
at them and that's where the long tail
comes right so so it's really important
how you select the data and so this is
important part of this pipeline so while
you're running a release on the vehicle we
have a bunch of models we have a bunch
of understanding about the world and
we can annotate the data as we go and
you can use this knowledge to decide
what data is interesting how to store it
which data we can potentially even
ignore so then once we do that again we
need to be very careful how to select
data we want to select data for examples
that are interesting in some way and
complement capture these long tail cases
that we potentially may not be doing so
well on and so you know for this there
is we have active learning and data
mining pipelines given exemplars find
the rare examples look for parts of your
system which are uncertain or you know
inconsistent over time and and go and
label those cases last but not least we
also produce auto labels so how can you
do that well when you collect data you
also see the future for many of the
objects what they did and so because of
that now knowing the past and the future
you can annotate your data better and
then go back to your model that does not
know the future and try to replicate
that with that model right and so you
need to do all of this is part of the
system ingredient number three high
quality models we're part of larger
alphabet and Google and deepmind and
generally alphabet is the leader in AI
when I was at Google we were very early
on the deep learning revolution I happen
to have the chance to be there at the
time it was 2013 when I got on to do
deep learning and a lot of things were
not understood and we were there working
on it earlier than most people and so
through that we had the opportunity and
the chance to develop some of the in my
time the team I managed invented
neural net architectures like Inception
which became popular later we invented
at the time the state of the art object
detection fast object detector called
SSD
and we won ImageNet 2014 and now if you
go to the conferences Google and
DeepMind are the leaders in perception and
reinforcement learning and smart agents
and you know there is like state of the
art say semantic segmentation networks
pose estimation and so on the object
detection of course goes without saying
and so we collaborate with Google and
DeepMind on projects improving our
models and so this is my factory for
self-driving models and I want to tell
you something that kind of captures all
of these ideas infrastructure data and
models in one this is a project we did
recently and today we put online in our
blog about automatic machine learning
for tuning and adjusting architectures
of neural networks so so what what did
we do so there is a team at Google
working on auto ml automatic machine
learning and usually networks themselves
are complex architectures they're crafted
by practitioners artisans of networks
in some way and sometimes you know we
have very high latency constraints in
the models we have some compute
constraints the networks are specialized
it takes
often people months to find the right
architecture that's most performant low
latency and so on and so there's a way
to offload this work to the machines you
can have machines themselves once you
pose the problem go and find your
good network architecture that's both
low latency and high performance right
and so that's what we do and we drive in
a lot of scenarios and we as we keep
collecting data and finding new cities
or new examples the architectures may
change and we want to easily find that
and keep evolving that without too much
effort right so so we worked with the
Google researchers and they had a strong
work where they invented well they
developed a system that searched the
space of architectures and found a set
of components of neural networks it's a
small sub network called a NAS cell and
this is a diagram of a NAS cell it's a
set of layers put together that you
can then replicate in the network to
build a larger network and they
discovered it on a small vision dataset it's
called CIFAR-10 it's from
the early days of deep learning
it was a very popular dataset and you
can quickly train models and explore
the large search space so the first
thing we did is we took some problems
that we have in our stack one of them
being lidar segmentation so you have a
map representation and some lidar
points and you essentially segment the
lidar points you say this
point is part of a vehicle that point is
part of vegetation and so on this is a
standard problem so what we first did at
Waymo is we explored several hundred
NAS cell combinations to see what
performs better on this task and we
found one of two things happened for
the various versions that we found one
of them is we can find models with
similar quality but much lower latency
and less compute and then there is
models of a bit higher quality at the
same latency it's essentially we found
better models than the human engineers
did and similar results were obtained
for other problems lane detection as
well with this transfer learning
approach of course you can also do
end-to-end architecture search so there's
no reason why what was found on CIFAR-10 is
best suited for our more specialized
problems and so we went about this more
from the ground up so let's find exactly
deeper search much much larger space
not limited to the NAS cells themselves
and so the way to do this is because our
networks are trained on quite a lot of
data and take quite a while to converge
and it takes some compute we went to
define a proxy task this is a smaller
task simplified but correlates with the
larger task and we do this by some
experimentation of what would be a proxy
task and once we establish a proxy task
now we execute the search algorithms
developed by the Google researchers and
so we train up to 10,000 architectures
with different topology and capacity and
once we find the top hundred models now
we train the large networks on those
models all the way and pick the best
ones right and so this way we can
explore much larger space of network
architectures so what happened so on the
Left this is 4,000 different models
spanning the scale and latency and
quality and in red was the transfer
model so after the first round of
search we actually did not produce the
better model than the transfer which
already leveraged their insight so then
we took the learnings and the best
models from this search and did the
second round of search which was in
yellow which allowed us to beat it and
third we also executed a reinforcement
learning algorithm developed by their
researchers on 6,000 different
architectures and that one was able to
significantly improve on the red dot
which also significantly improves on the
in-house algorithm
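The two-stage idea described above (score many architectures on a cheap proxy task, then fully train only the top candidates) can be sketched in a few lines of Python. This is a toy under stated assumptions: architectures are made-up `(width, depth)` pairs, `true_quality` is an invented scoring function, and the proxy is simply that score plus noise. The search algorithms mentioned in the talk are not reproduced here; this sketch uses plain random sampling.

```python
import random


def two_stage_search(candidates, proxy_score, full_score, top_k):
    """Rank all candidates on a cheap proxy, then fully evaluate only the top_k."""
    shortlist = sorted(candidates, key=proxy_score, reverse=True)[:top_k]
    return max(shortlist, key=full_score)


rng = random.Random(0)

# Hypothetical search space: 10,000 architectures described by (width, depth).
archs = [(rng.randint(1, 64), rng.randint(1, 20)) for _ in range(10_000)]

# Fixed per-architecture noise so the proxy is a deterministic function.
noise = {arch: rng.gauss(0.0, 5.0) for arch in set(archs)}

full_calls = 0  # counts how many expensive evaluations were needed


def true_quality(arch):
    """Made-up 'full training' score, peaking at width 40 and depth 12."""
    global full_calls
    full_calls += 1
    width, depth = arch
    return -((width - 40) ** 2) - (depth - 12) ** 2


def proxy_quality(arch):
    """Cheap, noisy estimate that correlates with the full score."""
    width, depth = arch
    return -((width - 40) ** 2) - (depth - 12) ** 2 + noise[arch]


best = two_stage_search(archs, proxy_quality, true_quality, top_k=100)
```

The point of the design is the cost split: the proxy is evaluated on every candidate, while the expensive evaluation runs only on the shortlist of one hundred.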
so that's one example where
infrastructure data and models combine
and shows how you can keep automating
the factory that is all good but we keep
finding new examples in the world and
for some situations we have fairly few
examples as well right and so there are
cases where the models are uncertain or
potentially can make mistakes and you
need to be robust to those I mean you
cannot put the product and say well our
network just don't handle some case and
it's so so we have designed a system to
be robust even when ml is not
particularly confident and how do you do
this so one part is of course you want
redundant and complementary sensors so we
have a 360-degree field of view on
our vehicles both in camera lidar and
radar and they're complementary
modalities first of all you know an
object is seen in all of them second of
all they all have different strengths
and different modes of failure and so
whenever one of them tends to fail the
others usually work fine and so that
that helps a lot make sure we do not
miss anything
also we design our system to be a hybrid
system and this is the point I want to
make right so I mean some of these
mapping problems or you know problems
where neural net models are very
complicated they're high dimensional the
image has a lot of pixels lidar has a
lot of lidar points right the networks
can end up pretty big and it may not be
so easy to train with very few examples
with the current state of the art and so
the state of the art keeps improving of
course so this is their zero short and
one-shot learning but we can also well
the state of the art is improving in the
models we can also leverage expert
domain knowledge and so what does that
do so humans can help develop the right
input representations they can put an
expert bias that constrains the
representation to fewer parameters that
already describe the task and then with
that bias it is easier to learn models
with fewer examples and there is also of
course experts can put in their
knowledge in terms of designing the
algorithm which incorporates it as well
right and so our system is this hybrid
an example of what that looks like for
perception is well even if
there's cases where the machine learning
system may be not confident we still
have tracks and obstacles from lidar
and radar scans and we make sure that we
we drive relative to those safely and in
prediction and planning if we're not
confident in our predictions we can
drive more conservatively and over time
as the factory is running and our models
become more powerful of course improve
and we get more data of all the cases
the scope of ml grows right and the
the set of cases that you can
handle with it increases and so there's
two ways to attack the tail you
both protect against it but you also
keep growing ml and making the system more
performant I'm going to tell you now how
we deal with large-scale testing which
is another key problem it's very
important in in the pipeline and also in
getting the vehicles on the road so how
do you normally develop a self-driving
algorithm well the ideal thing you're
gonna do is you make your algorithm
change and you would put it on the
vehicle and drive a bunch and say now it
looks great alright let's make the
next one the problem is I mean we have a
big fleet we have a lot of data but some
of the conditions and situations occur
very very rarely and so if you do this
you're gonna wait a long time
furthermore you don't just want to take
your code and put it on a vehicle you
need to test it even before that you
don't want to like you want very
strongly tested code in public streets
so you can do structured testing we have
a 90 acres air force base place where we
can test very important situations and
situations that occur rarely it's an
example of such a situation and so you
can do this as well so you can select
and deliberately and safely stage
conditions but now again you
cannot do this for all situations so what
do you do a simulator right and so how
much we need to simulate well we
simulate a lot so we simulate the
equivalent of 25,000 cars virtual cars
driving ten million miles a day and
over seven billion miles simulated
it's a key part of our release process
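To put the figures just quoted in proportion, the arithmetic can be worked out directly. The three inputs come from the talk; the derived numbers are only back-of-the-envelope values, and the day count assumes the current daily rate held throughout, which is an assumption made here and likely understates the real time span.

```python
virtual_cars = 25_000
sim_miles_per_day = 10_000_000
sim_miles_total = 7_000_000_000
road_miles_total = 10_000_000  # public-road miles quoted earlier in the talk

# Average simulated distance per virtual car per day.
miles_per_car_per_day = sim_miles_per_day / virtual_cars   # 400.0

# Days needed to reach the total at today's rate (an assumption, see above).
days_at_current_rate = sim_miles_total / sim_miles_per_day  # 700.0

# Simulated miles per real public-road mile.
sim_to_road_ratio = sim_miles_total / road_miles_total      # 700.0
```

In other words, one day of simulation at this rate covers as many miles as the whole public-road total quoted earlier.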
so why do you need to simulate this much
right well I hopefully I convinced you
there is a variety of cases to worry
about and that you need to test right
through so far and furthermore it goes
all the way bottom-up so as you change
perception for example slightly
different segmentation or detection the
changes can go through the system and
you know the results can change
significantly and you need to be robust
to this you need to test all the way so
what to simulate one thing you can do is
create scenarios from scratch working
with safety experts NHTSA and analyzing
what are the conditions which typically lead to
to accidents so you can do that of
course you can do it manually you can
create them what else could you do well
you want to leverage your driving data
you have all your logs you have a bunch
of situations there right so you can
pick interesting situations from your
logs and furthermore what you can do is
to take all these situations and
create variations of this situation so
you get even more scenarios so here's an
example of a log simulation I'll play
twice first time look at the image this
is what happened in the real world the
first time so in the real world we
mostly stayed in the middle lane and
stopped if you see what's happened in
simulation in simulation our algorithm
decided this time to merge to the left
lane and stopped and everything was fine
things were safe things were happy what
can go wrong in simulation from logs
well let's say this is another scenario
slightly different visualization our
vehicle when it drove the real world was
where the green vehicle is now in
simulation we drove differently and we
have the blue vehicle right and so we're
driving BAM what happened well there is
a purple
the pesky purple agent who in the
real world saw that we passed them
safely and so it was safe for them to go
but it's no longer safe because we
changed what we did so the insight is in
simulation our actions affect the
environment and it needs to be accounted
for so what does that mean if you want
to have effective simulations on a large
scale you need to simulate realistic
driver and pedestrian behavior so you
know you could think of a simple model
well how do you do a proxy or what's a good
approximation of a realistic behavior
well you can do a brake and swerve model
so you just say well there is some
normal way reactions happen you know I
have a reaction time and braking profile
it may be swerving profile so if an
agent sees someone in front of them maybe
they just apply it as an algorithm all
right hopefully I convinced you that
behavior can be fairly complicated and
this will not always produce a
believable reaction especially in
complex interactive cases such as merges
lane changes intersections and so on
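For reference, the simple reactive model mentioned above (a reaction time plus a braking profile) can be written down as a short Python sketch. The function names and the default values of one second reaction time and 5 m/s² deceleration are assumptions chosen for illustration, and the swerve part is left out. It uses the standard constant-deceleration stopping distance, d = v·t_r + v²/(2a).

```python
def stopping_distance(speed_mps, reaction_time_s, decel_mps2):
    """Distance covered while reacting, plus distance covered while braking."""
    if decel_mps2 <= 0:
        raise ValueError("deceleration must be positive")
    reaction_part = speed_mps * reaction_time_s
    braking_part = speed_mps ** 2 / (2.0 * decel_mps2)
    return reaction_part + braking_part


def can_stop_in_time(gap_m, speed_mps, reaction_time_s=1.0, decel_mps2=5.0):
    """True if a simulated agent with this fixed profile stops within the gap."""
    return stopping_distance(speed_mps, reaction_time_s, decel_mps2) <= gap_m
```

A model like this has only a couple of parameters, which is exactly why it falls short in interactive cases such as merges: it reacts to an obstacle ahead but has no notion of negotiating with other agents.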
right so what could you do
you could learn an agent from real
demonstrations well you went and
collected all this data in the world you
have a bunch of information of how
vehicles pedestrians behave you can
learn the model and use that okay so
what is an agent let's look a little bit
an agent receives sensor information
maybe context about the environment and
it develops a policy it develops a
reaction that's the driver agent and
applies acceleration and steering then
gets new sensor information new map
information place in the map and it
continues and if it's our own vehicle
then you also have a router that's an
explicit intent generator which says
well the passenger wants you to go over
there why don't we try to make a right
turn now so you also get an intent and
this is an agent you know it could be in
simulation it could be in the real world
roughly this is the picture and this is
an end-to-end agent end to end learning
is popular right to its best
approximation if you learn
a good policy this way you can apply it
and have very believable agent reactions
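The agent picture described above (observe, apply a policy, output acceleration and steering, observe again) can be sketched as a small Python loop. The types `Observation` and `Action`, the simple kinematic update in `step`, and the time step of 0.1 seconds are all assumptions made for illustration; a real agent would consume sensor and map data rather than a four-number state.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Observation:
    x: float
    y: float
    heading: float                 # radians
    speed: float                   # meters per second
    intent: Optional[str] = None   # e.g. a route hint from the router


@dataclass
class Action:
    acceleration: float  # meters per second squared
    yaw_rate: float      # radians per second, standing in for steering


Policy = Callable[[Observation], Action]


def step(obs: Observation, action: Action, dt: float = 0.1) -> Observation:
    """Advance a very simple kinematic model by one time step."""
    return Observation(
        x=obs.x + obs.speed * math.cos(obs.heading) * dt,
        y=obs.y + obs.speed * math.sin(obs.heading) * dt,
        heading=obs.heading + action.yaw_rate * dt,
        speed=obs.speed + action.acceleration * dt,
        intent=obs.intent,
    )


def rollout(policy: Policy, obs: Observation, steps: int) -> Observation:
    """Run the observe -> act -> update loop for a fixed number of steps."""
    for _ in range(steps):
        obs = step(obs, policy(obs))
    return obs


def constant_accel_policy(obs: Observation) -> Action:
    # Trivial hand-written policy; a learned policy would replace this.
    return Action(acceleration=1.0, yaw_rate=0.0)


final = rollout(constant_accel_policy, Observation(0.0, 0.0, 0.0, 0.0), steps=10)
```

Because the policy is just a function from observation to action, the same loop works whether the policy is hand-written, as here, or learned from demonstrations.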
right and so I'm going to tell you a
little bit about work we did in this
direction so we put a paper on arXiv
about a month ago I believe we took
60 hours of footage of driving and we
try to see how well we can imitate it
using a deep neural network all right
and so one option is to do exactly the
same to antigen policy but we wanted to
make a task easier how well we have a
good perception system at Weymouth so
why don't we use its products for that
agent also can simplify the input
representation a bit that is good if
bigdhaas becomes easier controllers are
well understood we can use an existing
controller so no need to worry about
acceleration and arcs we can generate
trajectories now if you want to see in a
little more detail to understand what the
representation is so this is
our agent vehicle which is the self-driving
vehicle in this case but could be a
simulation agent and we render an image
with it at the center and potentially we
augment it we can
generate a little bit of rotation to the
image just so we don't over bias
the orientation a specific way all right
and it's an 80 by 80 box so we roughly
see about 60 meters in front of us and
40 meters to the side in the center and
now we render a road map in this box
which is the map like which lanes you're
allowed to drive on there's traffic lights
and generally at intersections we render
what lanes are allowed to go
and how the traffic lights
permit it or do not permit it then you
can render speed limits the objects that
result from your perception system you
render your current vehicle where it
believes it is and you render the pose
history so you give an image of
where the agent's been in the last
few steps and last but
not least you render the intent so the
intent is where you want to go so
conditioned on this intent and this input
you want to predict the future waypoints
for this vehicle right so that's the
task
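
To make the agent-centered top-down input concrete, here is a minimal sketch of how world points might be rasterized into one channel of such an image. The talk gives only the rough geometry (an 80 by 80 box, about 60 meters ahead and 40 meters to each side), so the 1 meter per pixel resolution, the function names, and the placement of the agent in the image are assumptions.

```python
import math


def world_to_pixel(point_xy, agent_xy, agent_heading_rad,
                   extent_m=80.0, forward_m=60.0, meters_per_px=1.0):
    """Map a world point to (row, col) in an agent-centered, heading-up image.

    Returns None when the point falls outside the rendered box.
    """
    dx = point_xy[0] - agent_xy[0]
    dy = point_xy[1] - agent_xy[1]
    cos_h, sin_h = math.cos(agent_heading_rad), math.sin(agent_heading_rad)
    forward = dx * cos_h + dy * sin_h      # distance along the agent heading
    left = -dx * sin_h + dy * cos_h        # distance to the agent's left
    row = int(round((forward_m - forward) / meters_per_px))
    col = int(round((extent_m / 2.0 - left) / meters_per_px))
    size = int(round(extent_m / meters_per_px))
    if 0 <= row < size and 0 <= col < size:
        return row, col
    return None


def render_channel(points, agent_xy, agent_heading_rad, **kwargs):
    """Rasterize a list of world points (e.g. a lane centerline) into a binary grid."""
    size = int(round(kwargs.get("extent_m", 80.0) / kwargs.get("meters_per_px", 1.0)))
    grid = [[0] * size for _ in range(size)]
    for point in points:
        pixel = world_to_pixel(point, agent_xy, agent_heading_rad, **kwargs)
        if pixel is not None:
            grid[pixel[0]][pixel[1]] = 1
    return grid
```

Each input described above (road map, traffic lights, speed limits, perceived objects, pose history, intent) would be its own channel rendered in this same frame, so the network sees them aligned pixel for pixel.
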
and you can phrase it as a supervised
learning problem and just learn
a policy with this network that
approximates what you've seen in the
world with 60 hours of data of course
learning agents there is a well-known
problem it's identified in a paper called
DAgger by Stephane Ross who is
actually at Waymo now and Andrew Bagnell
so it's easy for small errors to accumulate over
time so even though in each step
you could do a relatively good
estimate if you string 10 steps together
you can end up very different from where
agents have been before right and there
are techniques to handle this right one
thing we did is synthesize perturbations
so you have a trajectory and we
deform the trajectory and
force the vehicle to learn to come back
to the middle of the lane so that's
something you can do that's reasonable
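
One simple way to synthesize such a perturbation is sketched below. It is an assumption-laden illustration rather than the method from the paper: the function name and the cosine-shaped falloff are made up here, and the sketch only shows the geometric deformation, not how training targets are built from it.

```python
import math


def perturb_trajectory(points, index, offset_m):
    """Push an interior point sideways and blend the offset smoothly to zero at both ends.

    points: list of (x, y) waypoints along the logged trajectory.
    index: interior waypoint that receives the full lateral offset.
    offset_m: lateral displacement in meters (positive = to the left of travel).
    """
    n = len(points)
    if not 0 < index < n - 1:
        raise ValueError("perturb an interior point so both ends stay fixed")
    perturbed = []
    for i, (x, y) in enumerate(points):
        # Linear ramp 0 -> 1 -> 0 peaking at `index`, smoothed with a cosine.
        ramp = i / index if i <= index else (n - 1 - i) / (n - 1 - index)
        weight = 0.5 * (1.0 - math.cos(math.pi * ramp))
        # Local direction of travel, estimated from neighboring waypoints.
        x0, y0 = points[max(i - 1, 0)]
        x1, y1 = points[min(i + 1, n - 1)]
        tx, ty = x1 - x0, y1 - y0
        norm = math.hypot(tx, ty) or 1.0
        nx, ny = -ty / norm, tx / norm  # unit normal pointing left
        perturbed.append((x + weight * offset_m * nx, y + weight * offset_m * ny))
    return perturbed
```

Training on deformed trajectories whose later waypoints rejoin the original path gives the network examples of recovering from an offset, which the plain logs rarely contain.
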
now you know if you just have direct
imitation based supervision we are
trying to pass the vehicle parked in the street
and it's stopping and never continuing
so now we did perturbations and well it
kind of ran through the vehicle right so
that's not enough so we need more right
it's not actually an easy problem so in
addition to having this
agent RNN which essentially takes the
past and creates memory of its
past decisions and keeps iterating
predicting multiple points in the future
so it predicts the trajectory piecemeal
in the future
how about we also learn about collisions
and staying on the road and so on so
we augment the network and now the
network also starts predicting a
mask for the road and now we have a loss
here I don't know if I can point so here
you have a road mask loss you say hey if
you drive or generate motions that take you
outside the road that's probably not
good hey if you ever cause collisions
so your perception network
takes the other objects and
predicts their motions so we predict here
our motion where the road is and the
other agents' motion in the future and
we're trying to make sure there's no
collisions and that we stay on the road
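
The off-road and collision penalties can be illustrated as overlap losses between the predicted agent footprint and masks of places the agent should not be. This is a hedged sketch on plain nested lists: the function names are hypothetical, and the exact loss terms in the paper may differ.

```python
def overlap_loss(pred_footprint, penalty_mask):
    """Mean of predicted occupancy times penalty, over all cells of the grid."""
    total = 0.0
    count = 0
    for pred_row, mask_row in zip(pred_footprint, penalty_mask):
        for p, m in zip(pred_row, mask_row):
            total += p * m
            count += 1
    return total / count


def environment_losses(pred_footprint, road_mask, obstacle_mask):
    """Penalize predicted footprint that lands off the road or on other agents.

    pred_footprint: grid of predicted agent occupancy probabilities in [0, 1].
    road_mask: grid with 1 where the road is drivable, 0 elsewhere.
    obstacle_mask: grid with 1 where other agents are predicted to be.
    """
    off_road_mask = [[1.0 - r for r in row] for row in road_mask]
    return {
        "off_road": overlap_loss(pred_footprint, off_road_mask),
        "collision": overlap_loss(pred_footprint, obstacle_mask),
    }
```

Because these terms depend only on the predicted footprint and the masks, they also apply to states that never appear in the demonstrations, such as the perturbed trajectories.
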
so you add this structure and
that adds a lot more constraints to the
system as it trains so it's not
just limited to what
it's explicitly seeing it allows it to
reason about things it has not
explicitly seen as well and so now
here's an example of us driving with
this network and you
can see that we're predicting the future
with the yellow boxes and we're
driving safely through intersections and
complex scenarios it actually handles a lot
of scenarios very well if you're
interested I welcome you to go read the
paper it handles most of the simple
situations fine so now we have our past
two approaches passing a parked car
one of them stops and never restarts the
other one hits the car now it actually
handles it fine
and beyond that afterwards we can stop
at a stop sign happily which is the red
line over there and it does all of these
operations and what we did beyond this
is we took the system that has learned on
imitation data and we actually drove our
real Waymo car with it so we took it to
Castle Air Force Base staging grounds
and this is it driving a road it's never
seen before and stopping at stop signs
and so that's all great we could use
it as an agent in the simulation world and we
could drive a car with it but it has
some issues so let's look on the left so
here it is driving and it was
driving too fast so because our range is
limited it didn't know it had to make a
turn and it overran the turn so it just
drove off the road that's one thing that
can happen so you know one area of
improvement more range here it is
another time so yellow is by the way
what we did in the real world and green
is what we do in the simulation in that
example and here we're trying to execute
a complex maneuver a u-turn we're
sitting there and we don't try to do it
and we almost do it but not quite and at
least we end up in the driveway and
there are the interactive situations
when they get really complex this
network also does not do too well right
and so what does that tell us well the long
tail came again
in testing right there's again you can
learn the policy for a lot of the common
situations but actually in testing some
of the things you really care about is
the long tail you want to test to the
corner cases you want to test in the
scenarios where someone is obnoxious and
adversarial and there's something not
too kosher right so one way to think of
it is this right this is the
distribution of human behavior and of
course it goes along multiple axes it could
be you know aggressive or conservative
right and then somewhere in between you
could be a super expert driver or super
inexperienced and somewhere in between
and so on so like our end-to-end model
it's fairly it's an unbiased representation
meaning it could in theory learn
any policy right I mean if you see
everything you want to know about the
environment by and large but it's
complex and this is similar a bit to the
models as well some of the models we
talked about before you can end up with
complex model if you have complex input
this is images that are 80 by 80 with
multiple channels it's a large input
space the model can have tens of
millions of parameters now if you have
an example if you have a case where you
have two or three examples in your whole
60 hours of driving there's no guarantee
that your 10 million parameter model
will learn it well right and so it's
really good when you have a lot of
examples it's really trying to do well
in those and then you have the long tail
so what do you do well we can improve
the representation you know we can
improve our model this is you know there
is a lot of room to to keep evolving
this and then this area will keep
expanding right and that's one good
direction there is a lot of interesting
questions how to do that and we're
working on a lot of them there is actually
some exciting work hopefully I get to
share with you another time something
else you can do if you remember from my
slide about the hybrid system when you
go to the long tail you can do
essentially a similar thing which is a
simpler biased expert designed input
distribution that is much easier to
learn with few examples you can also of
course use expert designed models
and so in this case you still will
produce something reasonable by
inputting this human knowledge and you
could have many models I mean there's
not one you could just tune to various
aspects of this distribution you can
have little models for all the aspects
you care about you can mix and match it
so that's another way to do it so let me
tell you about one such model the
trajectory optimization agent so we take
inspiration from motion control theory
and we want to plan a good trajectory
for the vehicle the agent vehicle
that satisfies a bunch of constraints
and preferences and so one insight to
this is that we already know what the
agent did in the environment last time
so you have fairly strong idea about the
intent and that helps you when you
specify the preferences because you can
say okay well give me a
trajectory that minimizes some set of
costs which are preferences on the
trajectory typically called potentials
what is a potential well at different
parts of the trajectory you can add this
attractor potential saying well try to
go where you used to be before for
example and that's the benefit in
simulation you have observed what was
done so this is a bit simpler and of
course you can have a repeller potential
don't hit things don't run into
vehicles right so to first approximation
that's what it roughly looks like and
so now where is the learning right well
it's still a machine learning model there
is a representation these potentials have
parameters it's the steepness of
this curve sometimes they are
multi-dimensional right there's
a few parameters typically we're talking
a few dozen parameters or less all right
and you can learn them too so there is a
technique called inverse reinforcement
learning
you want to learn these parameters that
produce trajectories that come close to
the trajectories you've observed in the
real world so if you pick a bunch
of trajectories that represent a certain
type of behavior you want to model
then tune the parameters to behave like it then
you want to generate reasonable
trajectories continuous and feasible
that satisfy this right and this is part
of this optimization you can solve this
actually and so then you can tune these
agents so here's some agents I want to
show you so this is a complex
interactive scenario with two vehicles of course but
you can see on the right
is the aggressive guy blue is the agent
red is our vehicle we're testing in
simulation and so let me play one more
time so essentially
on the left is the conservative driver
on the right is the aggressive driver
and they pass us and induce very
different reactions in our vehicle so
the aggressive guy went and passed us and
pushed us further into that lane and we
switch much much later in the other case when
you have a conservative driver we are in
front of them and they're not bugging us
and we execute much earlier and
switch into the right lane where we want
to go all right so these are agents that
can test your system well now you have
different scenarios in this case
depending what agent you put in and I'll
show you a little more scenarios so it's
not just a two agent game I mean we can do
things like merging from one side of the
highway to the next and this type of
agent can generate fairly reasonable
behaviors it slows down for
an annoying slow vehicle in front lets the
vehicles on this side pass and still
completes the mission and you can
generate multiple futures with this
agent so here's an example again on the
right will be an aggressive guy right
and on the left is the more
conservative person the aggressive guy
found a gap between the two vehicles and
just went for it right and you can test
your stack this way and one more I
wanted to show you is an aggressive
motorcycle driving so you can have an
agent that tests
you can test the reaction to a motorcycle
that is weaving in the lane right
so I guess what's my takeaway from this
story about testing in the long tail you
need a menagerie of agents at the
moment right so if you think of it right
learning from demonstration is key
you can encode some simple models by
hand but ultimately the
task of modeling agent behavior is
complex and it's much better learned and
so here's the space of models so you
can have not learned you can just replay
the log like I showed then you can
have designed trajectories for agents
for this reaction do this for that
reaction do that then you can have the
brake and swerve model where mostly if
there's someone in front of an agent it
just does a deterministic brake
trajectory optimization which I just
showed now our mid to mid model and
potentially an end-to-end top-down model
top-down meaning you have like a top
view of the environment there's many
other representations possible this is a
very interesting space ultimately I
wanted to show you there's many possible
agents and they have different utility
and they have different number of
examples you need to train them with and
so one other takeaway I wanted to tell
you is smart agents are critical for autonomy
at scale this is something I truly
believe working in the space and this
direction is exciting and
ultimately one of the exciting problems
where there's still a lot of interesting
progress to be made and why well if you
have accurate models of human behavior
of drivers and pedestrians they help
achieve several things first you will make
better decisions when you drive yourself
you'll be able to anticipate what others
will do better and that will be helpful
second you can develop a robust
simulation environment with those
insights also very important
third well our vehicle is also one more
agent in the environment it's an agent
we have more control over than the others but
a lot of these insights apply and so this
is very exciting and interesting so I
wanted to finish the talk just maybe as
a mental exercise right when you think
of a system that is tackling a complex
AI challenge like self-driving what are
good properties of the system to
have and how do you think about
a scalable system and to me there's this
mental test right we want to grow and
handle and you know bring our service to
more and more environments more and more
cities how do you scale to dozens or
hundreds of cities so as we talked about
the longtail each new environment can
bring new challenges and they can be
complex intersections and cities like
Paris
there's Lombard Street in San
Francisco and from there there's narrow
streets in European towns there's all
kinds the long tail keeps
coming as you keep driving new
environments in Pittsburgh people drive
the famous Pittsburgh left they take
different precedence than usual the
local customs of driving of behaving all
of this needs to be accounted for as you
expand and this makes the system
potentially more complex or
harder to tune to all environments right
but it's important because ultimately
that's the only way you can scale so how
do you what should the scalable process
do so in my mind let's say you have a
very good self-driving system I mean this
very much parallels the factory analogy
I'm just going to repeat it one more
time you take your vehicles you put a
bunch of Waymo cars and you drive a long
time in that environment with drivers
maybe 30 days maybe more at least that
long
and you collect all the data right and
then your system should be able to
improve a lot on the data you have collected
right so drive a bunch obviously you
don't want to train the system too
much in the real world while it's
driving but you want to train it after
you've collected data about the
environment so it needs to be trainable
on collected data it's very important
for a system to be able to quantify or
have a notion you can elicit from it whether
it's incorrect or not confident right
because then you can take action and
this is an important property that I
think people should think of when they
design systems how they elicit this then
you can take an action you can ask
questions to raters that's fairly legit
typical active learning is a bit like
this right and it's usually based on
some amount of low confidence or
surprise
those are the examples you want to send
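
A common way to pick low-confidence examples for raters is to rank model outputs by predictive entropy. The sketch below assumes each example comes with a list of class probabilities; the function names and the entropy criterion are illustrative choices, not something the talk specifies.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability vector; higher means less confident."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` examples the model is least sure about.

    predictions: dict mapping an example id to its predicted class probabilities.
    """
    ranked = sorted(predictions.items(), key=lambda item: entropy(item[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:budget]]
```

For instance, with `{"a": [0.98, 0.01, 0.01], "b": [0.4, 0.3, 0.3], "c": [0.7, 0.2, 0.1]}` and a budget of 2, the selection is `["b", "c"]`, the two least confident predictions.
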
and even better
the system could potentially directly
update itself and this is an interesting
question how those systems update
themselves in light of new knowledge and
we have a system that clearly does this
right and typically do it with reasoning
and what is reasoning right so I have an
answer it is one answer there's possibly
others right but one way is you can
check and enforce consistency of your
beliefs and you can look for
explanations of the world that are
consistent and so if you have a
mechanism in the system that can do this
this allows the system to improve itself
without necessarily being fed purely
labeled data it can improve itself
from just collected data and I think
it's interesting to think of systems
where you can do reasoning and
representations that these models need
to have right and last but not least we
need scalable training and testing
infrastructure right this is part of the
factory I was talking about I'm very
lucky at Waymo to have wonderful
infrastructure
and you know it allows this virtuous
cycle to happen
thank you appearance trouble thank you
so much for the talk really appreciate
it so if you were to train off of image
and lidar data synthetic image and lidar
data would you weight the
synthetic data differently than
real-world data when training your
models so there's actually a lot of
interesting research in the field there
are people who train on simulator data but also
train adaptation models that make
simulator data look like real data right
so you're essentially you're trying to
build consistency or it leads to
training on simulator scenarios but if
you learn a mapping from simulator
scenes to real scenes right you could
potentially train on the transformed
simulator data already that's
transforming with other models there's
many ways to do this ultimately right so
achieving realism in simulator

Summary

The following is a comprehensive, structured summary of the transcript of the presentation by Drago Anguelov (Principal Scientist at Waymo) on applying *deep learning* to autonomous vehicles.

---

# Tackling the "Long Tail" Challenge of Autonomous Vehicles with Deep Learning: Insights from Waymo

### Executive Summary
This video features a presentation by Drago Anguelov on the journey and technology behind Waymo's development of autonomous vehicles. Its main focus is how *machine learning* (ML) and *deep learning* are used to address the *long tail* challenge: rare and unexpected situations on the road. The discussion covers three main pillars (Perception, Prediction, Planning), the importance of large-scale data and simulation infrastructure, and a hybrid-system approach that combines ML with expert-designed safety rules to ensure reliability.

### Key Takeaways
*   **Main Challenge:** The greatest difficulty in autonomous driving lies not in common situations but in rare (*long tail*) cases such as fallen poles, unusual pedestrians, or complex road construction.
*   **Three Technology Pillars:** Waymo's system relies on **Perception** (understanding the environment), **Prediction** (anticipating the actions of others), and **Planning** (making safe and comfortable driving decisions).
*   **ML Factory:** Waymo has built an efficient cyclical infrastructure: collect data over millions of miles, label it, train models, and validate them before deployment.
*   **Massive Simulation:** To test many scenarios without endangering lives, Waymo uses simulation equivalent to 25,000 virtual cars driving 10 million miles every day.
*   **Hybrid Approach:** Because ML models are not yet perfect, Waymo uses a hybrid system that combines ML with traditional engineered algorithms (*expert design*) to handle uncertainty and maintain safety.

---

### Detailed Breakdown

#### 1. Introduction & History of Waymo
*   **Speaker:** Drago Anguelov, Principal Scientist at Waymo, with a PhD from Stanford and experience at Google (Street View) and Zoox.
*   **Waymo's Achievements:**
    *   Celebrating its 10th anniversary at the time of the presentation.
    *   A fleet with more than 10 million miles of experience on public roads.
    *   **Key Milestones:**
        *   *2015:* The "Firefly" prototype provided the first fully autonomous ride (a blind passenger in Austin).
        *   *2017:* Launch of a fully autonomous fleet in Phoenix.
        *   *Commercial Service:* Launch of the first autonomous *ride-hailing* service in Phoenix.
*   **The "Long Tail" Problem:** Driving requires handling highly varied and rarely occurring situations, from a cyclist carrying a stop sign to emergency vehicles and complicated construction zones.

#### 2. Core Technical Tasks: Perception, Prediction, and Planning
*   **Perception:** Maps sensor inputs (camera, LiDAR, radar) into a semantic representation. The challenge is the variability of objects (appearance and pose, such as people dressed in dinosaur costumes) and of the environment (weather, glare).
*   **Prediction:**
    *   The vehicle must anticipate the behavior of other "actors" (pedestrians, cars) over a long horizon (1 to 10 seconds ahead).
    *   The system considers past behavior, semantic context, and subtle visual cues (eye contact, body movement, turn signals).
    *   It is important to understand interactions between agents (for example, a pedestrian reacting to an approaching car).
*   **Planning:**
    *   Produces the vehicle's behavior (acceleration, braking, steering).
    *   The goal is to balance **Safety** (the top priority), passenger **Comfort**, **Communication** with other road users, and trip **Efficiency**.

#### 3. Machine Learning Infrastructure (The ML Factory)
*   **Factory Analogy:** A shift from classical systems that resemble a "craftsman" (hard to scale) to modern ML systems that resemble a "factory" (infrastructure + data = scalable models).
*   **Cycle:** Software release -> Drive & collect data -> Storage -> Data selection -> Labeling -> Model training -> Validation -> Deployment -> Repeat.
*   **Key Raw Materials:**
    *   **Compute:** Leverages the Google/Alphabet ecosystem, TensorFlow, data centers, and specialized hardware for fast, cheap training.
    *   **Labeled Data:** Millions of examples are collected to train robust models.

#### 4. Architecture Innovation & Collaboration (AutoML)
*   **DeepMind Collaboration:** Waymo works with Google Brain and DeepMind to improve perception and *reinforcement learning*.
*   **Neural Architecture Search (AutoML):**
    *   Designing neural network architectures by hand takes months.
    *   Waymo uses machines to search for optimal architectures that deliver high performance at low latency.
    *   The result is models that surpass human-engineered designs in efficiency and accuracy for LiDAR segmentation and lane detection.

#### 5. Robustness, Uncertainty, and Hybrid Systems
*   **Redundancy:** Uses 360-degree sensors (camera, LiDAR, radar) that complement one another to cover the weaknesses of each modality.
*   **Hybrid System:**
    *   Combines ML with expert domain knowledge.
    *   If the ML model is not confident (for example, about an unfamiliar object), the system falls back to tracking physical objects from LiDAR/radar or makes a very conservative driving decision.
    *   The goal is to protect the system from failures on tail cases while continuing to improve ML capability.

#### 6. Large-Scale Simulation & Testing
*   **Need for Simulation:** Real-world testing is not enough to capture every rare case. Waymo uses a structured test facility (Castle Air Force Base) and virtual simulation.
*   **Scale:** Waymo's simulation is equivalent to 25,000 virtual cars driving 10 million miles per day (more than 7 billion miles in total).
*   **Log Simulation:** Creates scenarios from real data (*driving logs*) with variations, to test how the vehicle reacts when decisions change.
*   **Agent Modeling:**
    *   Simulation needs realistic driver and pedestrian behavior.
    *   **End-to-End Learning:** A model is trained to imitate driving from data (60 hours of footage). However, it often fails in complex or "long tail" situations (such as a difficult U-turn) because there are too few examples in the data.
    *   **Trajectory Optimization:** Uses *Inverse Reinforcement Learning* to model behavior (aggressive vs. conservative) with fewer parameters, yet with greater physical accuracy.
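
The trajectory-optimization agent summarized above can be illustrated with a toy cost made of an attractor potential (stay near the logged path), a repeller potential (keep away from an obstacle), and a smoothness term. Everything in this sketch is an assumption made for illustration: the Gaussian repeller, the weights, and the finite-difference gradient descent are not taken from the talk, and learning the weights from demonstrations with inverse reinforcement learning is not shown.

```python
import math


def trajectory_cost(traj, reference, obstacle,
                    attract_w=1.0, repel_w=5.0, repel_sigma=1.0, smooth_w=1.0):
    """Sum of attractor, repeller, and smoothness potentials along a trajectory."""
    cost = 0.0
    for (x, y), (rx, ry) in zip(traj, reference):
        cost += attract_w * ((x - rx) ** 2 + (y - ry) ** 2)           # attractor
        d2 = (x - obstacle[0]) ** 2 + (y - obstacle[1]) ** 2
        cost += repel_w * math.exp(-d2 / (2.0 * repel_sigma ** 2))    # repeller
    for i in range(1, len(traj) - 1):
        ax = traj[i - 1][0] - 2.0 * traj[i][0] + traj[i + 1][0]
        ay = traj[i - 1][1] - 2.0 * traj[i][1] + traj[i + 1][1]
        cost += smooth_w * (ax * ax + ay * ay)                        # smoothness
    return cost


def optimize_trajectory(reference, obstacle, steps=300, lr=0.02, eps=1e-4, **weights):
    """Gradient descent with finite-difference gradients; endpoints stay fixed."""
    traj = [list(p) for p in reference]
    for _ in range(steps):
        grads = []
        for i in range(1, len(traj) - 1):
            for k in (0, 1):
                saved = traj[i][k]
                traj[i][k] = saved + eps
                up = trajectory_cost(traj, reference, obstacle, **weights)
                traj[i][k] = saved - eps
                down = trajectory_cost(traj, reference, obstacle, **weights)
                traj[i][k] = saved
                grads.append((i, k, (up - down) / (2.0 * eps)))
        for i, k, g in grads:
            traj[i][k] -= lr * g
    return [tuple(p) for p in traj]


def clearance(traj, obstacle):
    """Smallest distance between any waypoint and the obstacle."""
    return min(math.hypot(x - obstacle[0], y - obstacle[1]) for x, y in traj)
```

Changing a single weight such as `repel_w` changes the character of the agent: a larger value yields a more cautious trajectory that gives the obstacle more room, and a smaller value yields a bolder one. That is the sense in which a handful of parameters can span conservative and aggressive behavior.
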

#### 7. Scaling to New Cities & Q&A
*   **Scalability Challenge:** To expand to dozens of cities, the system must handle new environments (complex intersections, narrow European streets) and local driving customs.

---

## Conclusion & Closing Message
Overall, the presentation shows how Waymo uses *deep learning* and large-scale infrastructure to take on the *long tail* challenge in autonomous driving. By combining data, simulation, and a hybrid approach that pairs ML with traditional engineering, its system is designed to remain safe and reliable across a wide range of situations. This suggests that combining advanced technology with strict safety principles is the main foundation for the future of autonomous transportation.


file updated 2026-02-13 13:23:56 UTC