Transcript

kxi-_TT_-Nc • Sergey Levine: Robotics and Machine Learning | Lex Fridman Podcast #108
Kind: captions
Language: en
the following is a conversation with
Sergey Levine a professor at Berkeley
and a world-class researcher in deep
learning reinforcement learning robotics
and computer vision including the
development of algorithms for end-to-end
training of neural network policies that
combine perception and control scalable
algorithms for inverse reinforcement
learning and in general deep RL
algorithms quick summary of the ads two
sponsors cash app and expressvpn please
consider supporting the podcast by
downloading cash app and using code lex
podcast and signing up at
expressvpn.com/lexpod click the links buy the
stuff it's the best way to support this
podcast and in general the journey I'm
on if you enjoy this thing subscribe on
YouTube review it with five stars on
apple podcast follow on Spotify
support it on patreon or connect with me
on Twitter at lex
fridman as usual i'll do a few minutes
of ads now and never any ads in the
middle that can break the flow of the
conversation this show is presented by
cash app the number one finance app in
the App Store when you get it use code lex
podcast cash app lets you send money to
friends buy bitcoin and invest in the
stock market with as little as one
dollar since cash app does fractional
share trading let me mention that the
order execution algorithm that works
behind the scenes to create the
abstraction of the fractional orders is
an algorithmic marvel so big props to the
cash app engineers for taking a step up
to the next layer of abstraction over
the stock market making trading more
accessible for new investors and
diversification much easier so again if
you get cash app from the App Store or
Google Play and use the code lex podcast
you get $10 and cash app will also donate
$10 to FIRST an organization that is
helping to advance robotics and stem
education for young people around the
world this show
is also sponsored by expressvpn get it
at expressvpn.com/lexpod to support
this podcast and to get an extra three
months free on a one-year package I've
been using expressvpn for many years I
love it
I think expressvpn is the best VPN out
there they told me to say it but it
happens to be true my humble opinion it
doesn't log your data it's crazy fast
and is easy to use literally just one
big power on button again it's probably
obvious to you but I should say it again
it's really important that they don't
log your data
it works on Linux and every other
operating system but Linux of course is
the best operating system shout out to
my favorite flavor
Ubuntu MATE 20.04 once again get it at
expressvpn.com/lexpod to support
this podcast and to get an extra three
months free on a one-year package and
now here's my conversation with sergey
levine what's the difference between a
state-of-the-art human such as you and I
well I don't know if we qualify as
state-of-the-art humans but a state-of-the-art
human and a state-of-the-art robot it's
a very interesting question
robot capability is it's kind of a I
think it's a very tricky thing to to
understand because there are some things
that are difficult that we wouldn't
think are difficult and some things that
are easy that we wouldn't think are easy you
see and there's also a really big gap
between capabilities of robots in terms
of hardware and their physical
capability and capabilities of robots in
terms of what they can do autonomously
there is a little video that I think
robotics researchers really like to show
especially robotics learning researchers
like myself from 2004 from Stanford
which demonstrates a prototype robot
called the PR one and the PR one was a
robot that was designed as a home
assistance robot and there's this
beautiful video showing the pr1 tidying
up a living room putting away toys and
at the end bringing a beer to the person
sitting on the couch which looks really
amazing and then the punch line is that
this
is entirely controlled by a person yes so
you can so that in some ways the gap
between a state-of-the-art human and a
state-of-the-art robot if the robot has
a human brain is actually not that large
now obviously like human bodies are
sophisticated and very robust and
resilient in many ways but on the whole
if we're willing to like spend a bit of
money and do a bit of engineering we can
kind of close the hardware gap almost
but the intelligence gap that one is
very wide and when you say hardware you
you're referring to the physical sort of
the actuators the actual body of the robot
as opposed to the hardware on which the
cognition the nervous the hardware of
the nervous system yes exactly I'm
referring to the body rather than the
mind so what so that means that the kind
of the work is cut out for us like while
we can still make the body better we
kind of know that the big bottleneck
right now is really the mind and how big
is that gap how big is the how big is
the difference in your in your sense of
ability to learn ability to reason
ability to perceive the world between
humans and our best robots the gap is
very large and the gap becomes larger
the more unexpected events can happen in
the world so essentially the spectrum
along which you can measure the the size
of that gap is the spectrum of how open
the world is if you control everything
in the world very tightly if you put the
robot in like a factory and you tell it
where everything is and you rigidly
program its motion then it can do things
you know one might even say in a
superhuman way it can move faster it's
stronger it can lift up a car and things
like that but as soon as anything starts
to vary in the environment now it'll
trip up and if many many things vary
like they would like in your kitchen for
example then things are pretty much like
wide open now again we're gonna stick a
bit on the philosophical questions but
how much on the human side of the
cognitive abilities in your sense is
nature versus nurture so so how much of
it is product of evolution and how much
of it something we'll learn from sort of
scratch yeah well from the day we're born
I'm going to read into your question as
asking about the implications of this
for AI because I'm not a
biologist I can't really like speak
authoritatively also until in garnet if if
it's so if it's all about learning then
there's more hope for AI so the way that
I look at this is that you know well
first of course biology is very messy
and it's if you ask the question how
does a person do something or how does a
person's mind do something you come up
with a bunch of hypotheses and
oftentimes you can find support for many
different often conflicting hypotheses
one way that we can approach the
question of what the implication of this
for AI is we can think about what's
sufficient so you know maybe a person is
from birth very very good at some things
like for example recognizing faces
there's a very strong evolutionary
pressure to do that if you can recognize
your mother's face then you're more
likely to survive and therefore people
are good at this but we can also ask
like what's what's the minimum
sufficient thing right and one of the
ways that we can study the minimal
sufficient thing is we could for example
see what people do in unusual situations
if you present them with things that
evolution couldn't have prepared them
for you know our daily lives actually do
this to us all the time we we didn't
evolve to deal with you know automobiles
and spaceflight and whatever so they're
all these situations that we can find
ourselves in and we do very well there
like I can give you a joystick to
control a robotic arm which you've never
used before and you might be pretty bad
for the first couple of seconds but if I
tell you like your life depends on using
this robotic arm to like open this door
you'll probably manage it even though
you've never seen this device before you
may have never even used joystick
controllers and you'll kind of muddle through it
and that's not your evolved natural
ability that's your flexibility
your your adaptability and that's
exactly why our current robotic systems
really kind of fall flat but I wonder
how much general almost what we think of
as common sense
pre-trained models underneath all that
so that ability to adapt to a joystick
is requires you to have a kind of you
know I'm human so it's hard for me to
introspect all the knowledge I have
about the world but it seems like there
might be an iceberg underneath of the
amount of knowledge
you actually bring to the table now
that's kind of the open question there's
absolutely an iceberg of knowledge that
we bring to the table but I think it's
very likely that iceberg of knowledge is
actually built up over our lifetimes
because we have you know we have a lot
of prior experience to draw on and it
kind of makes sense that the right way
for us to you know to optimize our
efficiency our evolutionary fitness and
so on is to utilize all that experience
to build up the best iceberg we can get
and that's actually one you know well
that sounds an awful lot like what
machine learning actually does I think
that for modern machine learning it's
actually a really big challenge to take
this unstructured massive experience and
distill out something that looks like a
common sense understanding of the world
and perhaps part of that isn't it's not
because something about machine learning
itself is is broken or hard but because
we've been a little too rigid in
subscribing to a very supervised very
rigid notion of learning you know kind
of the input-output x's go to
y's sort of model and maybe what we
really need to to do is to view the
world more as like a massive experience
that is not necessarily providing any
rigid supervision but sort of providing
many many instances of things that could
be and then you take that and you
distill it into some sort of common
sense understanding I see what you're
you're painting an optimistic beautiful
picture especially from the robotics
perspective because that means we just
need to invest in both better learning
algorithms figure out how we can get
access to more and more data for those
learning algos to extract signal from
and then accumulate that iceberg of
knowledge it's a beautiful picture it's
a hopeful one I think it's potentially a
little bit more than just that and this
is this is where we perhaps reach the
limits of our current understanding but
one thing that I think that the research
community hasn't really resolved in a
satisfactory way is how much it matters
where that experience comes from like
you know do you just like download
everything on the internet and cram it
into essentially the 21st century analog
of the giant language model and then see
what happens or does it actually matter
whether your machine
experiences the world or in a sense that
actually attempts things observes the
outcome of its actions and kind of
augments the experience that way that it
chooses which parts of the world it gets
to interact with and observe and learn
from right it may be that the world is
so complex that simply obtaining a large
mass of sort of iid samples of the world
is is a very difficult way to go but if
you are actually interacting with the
world and essentially performing this
sort of hard negative mining by attempting what
you think might work observing the
sometimes happy and sometimes sad
outcomes of that and augmenting your
understanding using that experience and
you're just doing this continually for
many years maybe that sort of data in
some sense is actually much more
favourable to obtaining a common sense
understanding well one reason we might
think that this is true is that you know
the what we associate with common sense
or lack of common sense is often
characterized by the ability to reason
about kind of counterfactual questions
like you know if I were to you know
here's this bottle of water sitting on
the table everything is fine if I were to knock
it over which I'm not going to do but if
I were to do that what would happen and
I know that nothing good would happen
from that but if I have a bad
understanding of the world I might think
that that's a good way for me to like
you know gain more utility if I actually
go about my daily life doing the things
that my current understanding of the
world suggests will give me high utility
in some ways I'll get exactly the the
right supervision to tell me not to do
those those bad things and to keep doing
the good things so there's a spectrum
between iid random walk through the
space of data and then there's and what
we humans do or I don't even know if we
do it optimally but there might be
beyond what so this open question that
you raised where do you think systems
intelligent systems that would be able
to deal with this world fall can we do
pretty well by reading all of Wikipedia
sort of randomly sampling it like
language models do or do we have to be
exceptionally selective and intelligent
about which aspects of the world we interact
with so I think this is first an
open scientific problem and I don't have
like a clear answer but I can speculate
a little bit and what I would speculate
is that you don't need to be super super
careful I think it's less about like
being careful to avoid the useless stuff
and more about making sure that you hit
on the really important stuff so perhaps
it's okay if you spend part of your day
just you know guided by your curiosity
visiting interesting regions of the of
your state space but it's important for
you to you know every once in a while
make sure that you really try out the
solutions that your current model of the
world suggests might be effective and
observe whether those solutions are
working as you expect or not and perhaps
some of that is really essential to have
kind of a perpetual improvement loop
like this perpetual improvement loop is
really like but that's really the key
the key that's going to potentially
distinguish the best current methods
from the best methods of tomorrow in a
sense how important do you think is
exploration or total out-of-the-box
thinking exploration in this space as
you jump to totally different domains so
you kind of mentioned there's an
optimization problem you kind of kind of
explore the specifics of a particular
strategy whatever the thing you're
trying to solve how important is it to
explore totally outside of the
strategies that have been working for you
so far what's your intuition there yeah
I think it's a very problem dependent
kind of question and I think that that's
actually you know in some ways that
question gets at one of the big
differences between sort of the classic
formulation of a reinforcement learning
problem and some of the sort of more
open-ended reformulations of that
problem that have been explored in
recent years so classically
reinforcement learning is framed as a
problem of maximizing utility like any
kind of rational AI agent and then
anything you do is in service to
maximizing that utility but a very
interesting kind of way to look at it
I'm not necessarily saying that's the best
way to look at it but an interesting
alternative way to look at these
problems is as something where you first
get to explore the world
however you please and then afterwards
you will be tasked with doing something
and that might suggest somewhat
different solutions so if you don't know
what you're going to be tasked with
doing and you just want to prepare
yourself optimally for whatever your
uncertain future holds maybe then you
will choose to attain some sort of
coverage build up sort of an arsenal of
cognitive tools if you will such that
later on when someone tells you now your
job is to fetch the coffee for me you'll
be well prepared to undertake that task
and that you see that as the modern
formulation of the reinforcement
learning problem as the kind of the more
multi task the general intelligence kind
of formulation I think that's one
possible vision of where things might be
headed I don't think that's by any means
the mainstream or standard way of doing
things and it's not like if I had to but
I like it it's a beautiful vision so
maybe you actually take a step back what
is the goal of robotics what's the
general problem robotics is trying to
solve you actually kind of painted two
pictures here one is the narrow one is
the general what in your view is the big
problem of robotics again ridiculously
philosophical questions I think that you
know maybe there are two ways I can
answer this question one is there's a
very pragmatic problem which was like
what would make robots what would sort
of maximize the usefulness of robots and
there the answer might be something like
a system where a system that can perform
whatever task a human user sets for it
you know within the physical constraints
of course if you tell it to teleport to
another planet it probably can't do
that but if you if you ask it to do
something that's within its physical
capability then potentially with a
little bit of additional training or a
little bit of additional trial and error
it ought to be able to figure it out in
much the same way as like a human
teleoperator ought to figure out how to
drive the robot to do that that's kind
of a very pragmatic view of what it
would take to kind of solve the the
robotics problem if you will but I think
that there is a second answer and that
answer that the answer is a lot closer
to why I want to work on on robotics
which is that I think it's it's less
about what it would take to do a really
good job
in the world of robotics but more the
other way around what robotics can bring
to the table
to help us understand artificial
intelligence so your dream fundamentally
is to understand intelligence yes I
think that's the dream for many people
who actually work in this space I think
that there is there's something very
pragmatic and very useful about studying
robotics but I do think that a lot of
people that go into this field actually
you know the things that they draw
inspiration from are the potential for
robots to like help us learn about
intelligence and about ourselves that's
that's fascinating that robotics is
basically the space by which you can get
closer to understanding the fundamentals
of artificial intelligence so what is it
about robotics that's different from
some of the other approaches so if we
look at some of the early breakthroughs
in deep learning or in the computer
vision space and the natural language
processing there was really nice clean
benchmarks that a lot of people competed
on and thereby came out with a lot of
brilliant ideas what's the fundamental
difference to you between computer
vision purely defined on ImageNet and
kind of the bigger robotics problem so
there are a couple of things one is that
with robotics you kind of have you kinda
have to take away many of the crutches
so you have to deal with with both the
the the particular problems of
perception control and so on but you
also have to deal with the integration
of those things and you know classically
we've always thought of the integration
as kind of a separate problem so a classic
kind of modular engineering approach is
that we solve individual subproblems
then wire them together and then the
whole thing works and one of the things
that we've been seeing over the last
couple of decades is that well maybe
studying the thing as a whole might lead
to just like very different solutions
than if we were to study the parts and
wire them together so the integrative
nature of robotics research helps us see
you know the different perspectives on
the problem another part of the answer
is that with robotics it it casts a
certain paradox into very clear relief
so this is sometimes referred to as
Moravec's paradox the idea that in artificial
intelligence things that are very
hard for people can be very easy for
machines and vice versa things that are
very easy for people can be very hard
for machines so you know integral and
differential calculus is pretty
difficult to learn for people but if you
program a computer to do it it can derive
derivatives and integrals for you all
day long without any trouble
whereas some things like you know
drinking from a cup of water very easy
for a person to do very hard for a robot
to deal with and sometimes when we see
such blatant discrepancies that gives us
a really strong hint that we're missing
something important so if we really try
to zero in on those discrepancies we
might find that little bit that we're
missing and it's not that we need to
make machines better or worse at math
and better at drinking water but just
that by studying those discrepancies you
might find some new insight so that that
could be that could be in any space it
doesn't have to be robotics but you're
saying yeah I guess it's kind of
interesting that robotics seems to have
a lot of those discrepancies so the the
the Hans Moravec's paradox is probably
referring to the space of the the
physical interaction I think you said
object manipulation walking all the kind
of stuff we do in the physical world
that well how do you make sense if you
were to try to disentangle the the
Moravec paradox like why is there such a
gap in our intuition about it why do you
think manipulating objects is so hard
from everything you've learned from
applying reinforcement learning in this
space yeah I think that one reason is
maybe that for many of the problems for
many of the other problems that we've
studied in AI and computer science and
so on the notion of input/output and
supervision is much much cleaner so
computer vision for example deals with
very complex inputs but it's
comparatively a bit easier at least up
to some level of abstraction to cast it
as a very tightly supervised problem
it's comparatively much much harder to
cast robotic manipulation as a very
tightly supervised problem you can do it
it just doesn't
work all that well so you could say that
well maybe we get a labeled data set where
we know exactly which motor commands to
send and then we train on that but for
various reasons that's not actually like
such a great solution and it also
doesn't seem to be even remotely similar
to how people and animals learn to do
things because we're not told by like
our parents here is how you fire your
muscles in order to walk we you know we
do get some guidance but the really
low-level detailed stuff we figure out
mostly on our own and that's what you
mean by tightly coupled that every
single little sub action gets a
supervised signal of whether it's a good
one or not right so so while in computer
vision you could sort of imagine up to a
level of abstraction that maybe you know
somebody told you this is a car and this
is a cat and this is a dog in motor
control it's very clear that that was
not the case if we look at sort of the
subspaces of robotics that again as you
said robotics integrates all of them
together and we'll get to see how this
beautiful mess interplays but so there's
nevertheless still perception so it's
the the computer vision problem
broadly speaking understanding the
environment then there's also maybe you
can correct me on this kind of
categorization of the space then there's
prediction in trying to anticipate what
things are going to do into the future
in order for you to be able to act in
that world and then there's also this
game theoretic aspect of how your
actions will change the behavior of
others in this kind of space what and
this is bigger than reinforcement
learning this is just broadly looking at
the problem of Robotics what's the
hardest problem here or is there or is
what you said true that when you start
to look at all of them together that's
an int that's a whole nother thing like
you can't even say which one
individually is harder because all of
them together you should only be looking
at them all together I think when you
look at them all together some things
actually become easier and I think
that's actually pretty important so we
had you know back in 2014 we had some
work basically our first work on end-to-end
reinforcement learning for robotic
manipulation skills from vision which
you know at the time was something that
seemed a little inflammatory and
controversial in the robotics world but
other than the the inflammatory and
controversial part of it
the point that we were actually trying
to make in that work is that for the
particular case of combining perception
and control you could actually do better
if you treat them together than if you
try to separate them and the way that we
try to demonstrate this is we picked a
fairly simple motor control task where a
robot had to insert a little red
trapezoid into a trapezoidal hole and we
had our separated solution which
involved first detecting the hole using
a pose detector and then actuating the arm to
put it in and then our end-to-end solution
which just mapped pixels to the torques
and one of the things we observed is
that if you use the end-to-end solution
essentially the pressure on the
perception part of the model is actually
lower like it doesn't have to figure out
exactly where the thing is in 3d space
it just needs to figure out where it is
you know distributing the errors in such
a way that the horizontal difference
matters more than the vertical
difference because vertically just
pushes it down all the way until it
can't go any further and there
perceptual errors are a lot less harmful
whereas perpendicular to the direction
of motion perceptual errors are much
more harmful so the point is that if you
combine these two things you can trade
off errors between the components
optimally to best accomplish the task
and the components can actually be weaker
while still leading to better overall
performance it's a profound idea I mean in
in the space of pegs and things like
that it's quite simple it's almost
tempting to overlook but that seems to
be at least intuitively an idea that
should generalize to basically all
aspects of perception and control of course
where one strengthens the other yeah and
and we you know people who have studied
sort of perceptual heuristics in humans
and animals find things like that all
the time so one one very well-known
example of this is something called the
gaze heuristic which is a little trick
that you can use to intercept a flying
object so if you want to catch a ball
for instance you could try to localize
it in 3d space estimate its velocity
estimate the effect of wind resistance
solve a complex system of differential
equations in your head or you can
maintain a running speed so the object
stays in the same position in your
field of view so if it dips a little bit
you speed up if it rises a little bit
you slow down and if you follow the
simple rule you'll actually arrive at
exactly the place where the object lands
and you'll catch it and humans use it
when they play baseball human pilots use
it when they fly airplanes to figure out
if they're about to collide with
somebody frogs use this to catch insects
and so on and so on so this is something
that actually happens in nature and I'm
sure this is just one instance of it
that we were able to identify just
because it's you know that scientists
were able to identify because it's so
prevalent but there are probably many others
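
The gaze heuristic described above can be sketched as a small simulation. This is only an illustrative sketch, not anything presented in the conversation: it assumes a drag-free ball in two dimensions, a runner who locks in the gaze angle when the ball reaches its apex, and a simple proportional rule that sets the runner's velocity from the change in that angle. The function name `simulate_gaze_heuristic` and all parameters (`gain`, `max_speed`, `dt`, the launch velocities) are hypothetical choices made for the example.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def simulate_gaze_heuristic(vx=2.0, vz=15.0, runner_start=10.0,
                            gain=100.0, max_speed=10.0, dt=0.001):
    """Return (final runner position, true landing position) in metres.

    The ball is launched from x=0 toward +x; the runner stands further along
    +x and starts applying the rule once the ball reaches its apex.
    """
    def ball(t):
        # Drag-free projectile: horizontal position and height at time t.
        return vx * t, vz * t - 0.5 * G * t * t

    t = vz / G                       # time at which the ball is at its apex
    runner = runner_start
    bx, bh = ball(t)
    target_angle = math.atan2(bh, runner - bx)  # gaze elevation to hold

    while True:
        t += dt
        bx, bh = ball(t)
        if bh <= 0.0:                # ball has reached the ground
            break
        angle = math.atan2(bh, runner - bx)
        # Ball rising in view -> back away (+x); dipping -> run in (-x).
        velocity = gain * (angle - target_angle)
        velocity = max(-max_speed, min(max_speed, velocity))
        runner += velocity * dt

    landing = vx * (2.0 * vz / G)    # closed-form landing point, for checking only
    return runner, landing


if __name__ == "__main__":
    runner, landing = simulate_gaze_heuristic()
    print(f"runner stopped at {runner:.2f} m, ball landed at {landing:.2f} m")
```

The runner never estimates the ball's velocity or solves for its trajectory; the only quantity it uses is the gaze angle, and the closed-form landing point is computed purely to check the result.
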
do you have a just so we can zoom in as we
talk about robotics do you have a
canonical problem sort of a simple clean
beautiful representative problem in
robotics that you think about when
you're thinking about some of these
problems we talked about robotic
manipulation to me that seems
intuitively at least the robotics
community is converging towards that as
a space that's the canonical problem if
you agree then maybe do you zoom in on some
particular aspect of that problem that
you just like like if we solve that
problem perfectly it'll unlock a major
step in towards human level intelligence
I don't think I have like a really great
answer to that and I think partly the
reason I don't have a great answer kind
of has to do with the it has to do with
the fact that the difficulty is really
in the flexibility and adaptability
rather than in doing a particular thing
really really well so it's hard to just
say like oh if you can I don't know like
shuffle a deck of cards as fast as like
a Vegas right a casino dealer then
you'll you'll be very proficient it's
really the ability to quickly figure out
how to do some arbitrary new thing well
enough so that you can move on to
the next arbitrary thing but the the
source of newness and uncertainty have
you found problems in which it's easy to
generate new newnesses yeah
new types of newness yeah so
a few years ago so if you'd asked me
this question around like 2016 maybe I
would have probably said that robotic
grasping is a really great example of
that because it's a task with great
real-world utility like you will get a
lot of money if you can do it well what
is robotic grasping picking up any
object with a robotic hand exactly so
you'll get a lot of money if you do it
well because lots of people want to run
warehouses with robots and it's highly
non-trivial because very different
objects will require very different
grasping strategies but actually since
then people have gotten really good at
building systems to solve this problem
to the point where I'm not actually
sure how much more progress we can make
with that as like the main guiding thing
but it's kind of interesting to see the
kind of methods that have what actually
worked well in that space because a
robotic grasping classically used to be
regarded very much as kind of an almost
like a geometry problem so you people
who have studied the history of computer
vision will find this very familiar that
it's kind of in the same way that in the
early days of computer vision people
thought of it very much as like an
inverse graphics thing in robotic
grasping people thought of it as an
inverse physics problem essentially you
look at what's in front of you figure
out the shapes then use your best
estimate of the laws of physics to
figure out where to put your fingers and
you pick up the thing and it turns out
that what works really well for robotic
grasping instantiated in many different
recent works including our own but also
ones from many other labs is to use
learning methods with some combination
of either exhaustive simulation or like
actual real-world trial-and-error and
turns out that those things actually
work really well and then you don't have
to worry about solving geometry problems
or physics problems so what are just by
the way in the grasping what are the
difficulties that have been worked on so
one is like the materials of things
maybe occlusions on the perception side
why is it such a difficult why is
picking stuff up such a difficult
problem yeah it's a difficult problem
because the number of things that you
might have to deal with or the variety
of things that you have to deal with is
extremely large
and oftentimes things that work for one
class of objects won't work for other
classes of objects so if you if you get
really good at picking up boxes and now
you have to pick up plastic bags you
know you just need to employ a very
different strategy and there are many
properties of objects that are more than
just their geometry it has to do with
you know the bits that that are easier
to pick up the bits that are hard to
pick up the bits that are more flexible
the bits that will cause the thing to
pivot and bend and drop out of your hand
versus the bits that result in a
secure grasp things that are flexible
things that if you pick them up the
wrong way they'll fall upside down and
the contents will spill out so there's
all these little details that come up
but the task is still kind of can be
characterized as one task like there's a
very clear notion of you did it or you
didn't do it so in terms of spilling
things there creeps in this notion that
starts to sound and feel like common
sense reasoning do you think solving the
general problem of robotics requires
common sense reasoning requires general
intelligence this kind of human level
capability of you know like you said be
robust and deal with uncertainty but
also be able to sort of reason and
assimilate different pieces of knowledge
that you have yeah what do you what are
your thoughts on the needs of common
sense reasoning in the space of the
general robotics problem so I'm gonna
slightly dodge that question and say
that I think I think maybe actually it's
the other way around is that studying
robotics can help us understand how to
put common sense into our AI systems one
way to think about common sense is that
and and why our current systems might
lack common sense is that common sense
is a property is an emergent property of
actually having to interact with a
particular world a particular universe
and get things done in that universe so
you might think that for instance like a
an image captioning system maybe it
looks at pictures of the world and it
types out English sentences so it kind
of it kind of deals with our world
and then you can easily construct
situations where image captioning
systems do things that defy common sense
like give it a picture of a person
wearing a fur coat and it'll say it's a
teddy bear but I think what's really
happening in those settings is that the
system doesn't actually live in our
world it lives in its own world that
consists of pixels and English sentences
and doesn't actually consist of like you
know having to put on a fur coat in the
winter so you don't get cold so perhaps
the the reason for the disconnect is
that the systems that we have now
simply inhabit a different universe and
if we build AI systems that are forced
to deal with all of the messiness and
complexity of our universe maybe they
will have to acquire our common sense to
essentially maximize their utility
whereas the systems we're building now
don't have to do that they can take some
shortcut that's fascinating
you've a couple of times already sort of
reframed the role of robotics in this
whole thing and for some reason I don't
know if my way of thinking is common but
I thought like we need to understand and
solve intelligence in order to solve
robotics and you're kind of framing it
as no robotics is one of the best ways
to just study artificial intelligence
and build sort of like robotics is like
the right space in which you get to
explore some of the fundamental learning
mechanisms fundamental sort of
multimodal multitask aggregation of
knowledge mechanisms that are required
for general intelligence that's a really
interesting way to think about it but
let me ask about learning can the
general sort of robotics the epitome of
the robotics problem be solved purely
through learning perhaps end-to-end
learning sort of learning from scratch
as opposed to injecting human expertise
and rules and heuristics and so on I
think that in terms of the spirit of the
question I I would say yes I mean I
think that although in some ways it may
be like an overly sharp dichotomy like
you know I think that in some ways when
we build algorithms we you know at some
point a person does something like yeah
there's always a person turned on the
computer first
you know implemented TensorFlow but yeah
I think that in terms of the in terms of
the point that you're getting at I do
think the answer is yes I think that I
think that we can solve many problems
that have previously required meticulous
manual engineering through automated
optimization techniques and actually one
thing I will say on this topic is I
don't think this is actually a very
radical or very new idea I think people
have have been thinking about automated
optimization techniques as a way to do
control for a very very long time and in
some ways what's changed is really more
the name so you know today we would say
that oh my robot does machine learning
it does reinforcement learning maybe in
the 1960s you'd say oh my robot is doing
optimal control and maybe the difference
between typing out a system of
differential equations and doing
feedback linearization versus training
a neural net it's not such a large
difference it's just you know pushing
the optimization deeper and deeper into
the thing well you think that were but
with the especially deep learning that
the accumulation of experiences in data
form to form deep representations starts
to feel like knowledge as opposed to
optimal control so this feels like
there's an accumulation of knowledge to
the learning process yes yeah so I think
that is a good point that one big
difference between learning based
systems and classic optimal control
systems is that learning based systems
in principle should get better and
better
the more they do something right and I
do think that that's actually a very
very powerful difference so if you look
back at the world of expert systems
symbolic AI and so on of using logic to
accumulate expertise human expertise
human encoded expertise do you think
that will have a role at some point
you know deep learning machine
learning reinforcement learning have had
incredible results and breakthroughs and
have inspired thousands maybe
millions of researchers but you know
there's this less popular now but it
used to be part of the idea of symbolic
AI do you think that will have a role
I think in some ways the kind of the the
descendants of symbolic AI actually
already have a role so you know this is
the the highly biased history from my
perspective you say that well initially
we thought that rational decision-making
involves logical manipulation so you
have some model of the world expressed
in terms of logic you have some
query like what action do I take in
order for X to be true and then you
manipulate your logical symbolic
representation to get an answer what
that turned into somewhere in the 1990s
is well instead of building kind of
predicates and statements that have true
or false values we'll build probabilistic
systems where things have probabilities
associated and probabilities of being
true and false and that turned into Bayes
nets and that provided sort of a boost
to what were really you know still
essentially logical inference systems
just probabilistic logical inference
systems and then people said well let's
actually learn the individual
probabilities inside these models and
then people said well let's not even
specify the nodes and the models let's
just put a big neural net in there but
in many ways I see these as actually kind of
descendants from the same idea it's
essentially instantiating rational
decision-making by means of some
inference process and learning by means
of an optimization process so so in a
sense I would say yes that it has a
place and in many ways that place is or
you know it already holds that place
it's already in there yeah it's just
different it looks slightly different
than it was before yeah but but at
some there are some things that that we
can think about that make this a little
bit more obvious like if I train a big
neural net model to predict what will
happen in response to my robot's actions
and then I run probabilistic inference
meaning I invert that model to figure
out the actions that lead to some
plausible outcome like to me that seems
like a kind of logic you have a model of
the world it just happens to be
expressed by a neural net and you are
doing some inference procedure some sort
of manipulation on that model to figure
out you know the answer to a query that
you have it's the interpretability it's
the explainability though that seems
to be lacking more so because the nice
thing about sort of expert
systems is you can follow the reasoning
of the system that to us mere humans is
somehow compelling it it would it's just
I don't know what to make of this fact
that there's a human desire for
intelligent systems to be able to
convey in a poetic way to us why it made
the decisions it did like tell a
convincing story and perhaps that's like
a silly human thing like we shouldn't
expect that of intelligent systems like
we should be super happy that there is
intelligent systems out there but if I
were to sort of psychoanalyze the
researchers at the time I would say
expert systems connected to that part
that desire for AI researchers for
systems to be explainable
I mean maybe on that topic do you have a
hope that sort of the inferences of
learning based systems will be as
explainable as the dream was with expert
systems for example I think it's a very
complicated question because I think
that in some ways the question of
explainability is kind of very closely
tied to the question of of like
performance like you know why do you
want your system to explain itself well
so that it's so that when it screws up
you can kind of figure out why it did it
right but it's nice but in some ways
that that's a much bigger problem actually
like your system might screw up and then
it might screw up at how it explains
itself or you might have some bugs
somewhere so that it's not actually
doing what it was supposed to do so you
know maybe a good way to view that
problem is really as a problem as a
bigger problem of verification and
validation of which explainability is
sort of one component I see I just
see it differently I see explainability
you you put it beautifully I think you
actually summarized the field of
explainability but to me there's
another aspect of explainability
which is like storytelling that has
nothing to do with errors or with
like the the survey it doesn't it uses
errors as as elements of its story as
opposed to a fundamental need to be
explainable when errors occur it's just
that for other intelligent systems to
be in our world we seem to want to tell
each other stories and that that's true
in the political world is true in the
academic world and that I you know
neural networks are less capable of
doing that or perhaps they're equally
capable of storytelling maybe
it doesn't matter what the
fundamentals of the system are you just
need to be a good storyteller maybe one
specific story I can tell you about in
that space is actually about some work
that was done by by my former
collaborator who's now a professor at
MIT named Jacob Andreas Jacob actually
works on natural language processing but
he had this idea to do a little bit of
work in reinforcement learning
on how natural language can basically
structure the internals of policies
trained with RL and one of the things he
did is he set up a model that attempts
to perform some task that's defined by
a reward function but the model reads in
a natural language instruction so this
is a pretty common thing to do in
instruction following so you tell it
like you know go to the red house and
then it's supposed to go to the red house but
then one of the things that Jacob did is
he treated that sentence not as a
command from a person but as a
representation of the internal kind of
state of the of the of the mind of this
policy essentially so that when it was
faced with a new task what it would do
is it would basically try to think of
possible language descriptions attempt
to do them and see if they led to the
right outcome so it would kind of think
out loud like you know I'm faced with
this new task what am I gonna do let me
go to the red house now that didn't work
let me go to the Blue Room or something
let me go to the green plant and once it
got some reward it would say oh go to
the green plant that's what's working
I'm gonna go to the green plant and then
you could look at the string that it
came up with and that was a description
of how it thought it should solve the
problem so you could do you could
basically incorporate language as
internal state and you can start getting
some handle on these kinds of things and
then what I was kind of trying to get to
is that also if you add to the reward
function
the convincingness of the story so I have
another reward signal of like people who
review that story how much they like it
so that you know initially
that could be a hyper parameter or sort
of hard-coded heuristic type of thing
but it's an interesting notion of the
convincingness of the story becoming part of
the reward function the objective
function of the explainability
in the world of sort of Twitter and fake
news that might be a scary notion that
the the nature of truth may not be as
important as the convincingness the
how convinced you are in telling the
story around the facts well let me ask
the the basic question you're one of the
world-class researchers in reinforcement
learning deep reinforcement learning
certainly in the robotic space
what is reinforcement learning I think
what reinforcement learning refers to
today is really just the kind of the
modern incarnation of learning based
control so classically reinforcement
learning has a much more narrow
definition which is that it's you know
literally learning from reinforcement
like the thing does something and then
it gets a reward or punishment but
really i think the way the term is used
today is it's used more broadly
to refer to learning based control so some kind
of system that's supposed to be
controlling something and it uses data
to get better and what does control mean
is action the fundamental element
yeah it means making rational decisions
now and rational decisions are decisions
that maximize a measure of utility and
sequentially so many decisions time and
time and time again now like so it's
easier to see that kind of idea in the
space of maybe games in the space of
robotics
do you see it as bigger than that is it
applicable like where are the limits of
the applicability of reinforcement
learning yeah so rational
decision-making is essentially the the
encapsulation of the AI problem
viewed through a particular lens so any
problem that we would want a machine to
do an intelligent machine can likely be
represented as a decision-making problem
you know classifying images is a
decision-making problem although not a
sequential one typically you know
controlling a chemical plant is a
decision-making problem deciding what
videos to recommend on YouTube is a
decision-making problem and one of the
really appealing things about
reinforcement learning is if it does
encapsulate the range of all these
decision-making problems perhaps working
on reinforcement learning is you know
one of the ways to reach a very broad
swath of AI problems but what what do
you see as the fundamental difference
between reinforcement learning and maybe
supervised machine learning so the
reinforcement learning can be viewed as
a generalization of supervised machine
learning you can certainly cast
supervised learning as a reinforcement
learning problem you can just say your
loss function is the negative of your
reward but you have stronger assumptions
you have the assumption that someone
actually told you what the correct
answer was that your data was iid and so
on so you could view reinforcement
learning as essentially relaxing some of
those assumptions now that's not always
a very productive way to look at it
because if you actually have a
supervised learning problem you'll
probably solve it much more effectively
by using supervised learning methods
because it's easier but you can view
reinforcement learning as a generalization
no for sure but fundamentally
that's a mathematical statement that's
absolutely correct but it seems that
reinforcement learning the kind of tools
we bring to the table as of today
so maybe down the line everything will
be a reinforcement learning problem just
like you said
image classification should be mapped to
a reinforcement learning problem but
today the tools and ideas the way we
think about them are different sort of
supervised learning has been used very
effectively to solve basic narrow AI
problems the reinforcement learning kind
of represents the dream of AI it's very
much so in the research space now
captivating the imagination of people of
what we can do with intelligent systems
but it hasn't yet had as wide of an
impact as the supervised learning
approaches so my question
comes from a more practical sense like
what do you see as the gap between the
more general reinforcement learning
and the very specific yes it's a
decision-making problem with one
step in the sequence of the
supervised learning so from a practical
standpoint I think that one one thing
that is you know potentially a little
tough now and this is I think something
that we'll see this is a gap that we
might see closing over the next couple
of years is the ability of reinforcement
learning algorithms to effectively
utilize large amounts of prior data so
one of the reasons why it's a bit
difficult today to use reinforcement
learning for all the things that we
might want to use it for is that in most
of the settings where we want to do
rational decision-making it's a little
bit tough to just deploy some policy
that does crazy stuff and learns purely
through trial and error it's much easier
to collect a lot of data a lot of logs
of some other policy that you've got and
then maybe you you know if you can get a
good policy out of that then you deploy
it and let it kind of fine-tune a little
bit but algorithmically it's quite
difficult to do that so I think that
once we figure out how to get
reinforcement learning to bootstrap
effectively from large data sets then
we'll see very very rapid growth and
in applications of these technologies so
this is what's referred to as off policy
reinforcement learning or offline RL or
batch RL and I think we're seeing a lot
of research right now that that's
bringing us closer and closer to that
can you maybe paint a picture of the
different methods you said
off policy what's value-based
reinforcement learning what's policy
based what's model based what's off
policy on policy what are the different
categories of reinforcement learning yeah so one
way we can think about reinforcement
learning is that it's um it's in some
very fundamental way it's about learning
models that can answer kind of what-if
questions so what would happen if I take
this action that I haven't taken before
and you do that of course from
experience from data and oftentimes you
do it in a loop so you build a model
that answers these what-if questions use
it to figure out the best action you can
take and then go and try taking that and
see if the outcome agrees with what you
predicted
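
The loop just described (build a model that answers what-if questions, use it to pick the best action, try that action, and check whether the outcome agrees with the prediction) can be sketched with a toy example. This is only an illustration under stated assumptions: the one-dimensional world, the action set, the goal position, and every name here (`true_step`, `reward`, `WhatIfModel`, `run`) are hypothetical and do not come from the conversation.

```python
ACTIONS = [-1, 0, 1]   # hypothetical action set
GOAL = 6               # hypothetical target position


def true_step(state, action):
    """Real dynamics, hidden from the learner: each action moves twice as far."""
    return state + 2 * action


def reward(state):
    """Utility of a state: the closer to GOAL, the better."""
    return -abs(GOAL - state)


class WhatIfModel:
    """Answers 'what state would I reach if I took this action?' from experience."""

    def __init__(self):
        # Initial (wrong) guess: action a shifts the state by exactly a.
        self.effect = {a: a for a in ACTIONS}

    def predict(self, state, action):
        return state + self.effect[action]

    def update(self, state, action, next_state):
        # Overwrite the guess with what was actually observed.
        self.effect[action] = next_state - state


def run(steps=10):
    model = WhatIfModel()
    state = 0
    errors = []
    for _ in range(steps):
        # 1. Ask the model what-if questions and pick the best-looking action.
        action = max(ACTIONS, key=lambda a: reward(model.predict(state, a)))
        predicted = model.predict(state, action)
        # 2. Try the action in the real world.
        actual = true_step(state, action)
        # 3. Check whether the outcome agrees with the prediction, then learn.
        errors.append(abs(predicted - actual))
        model.update(state, action, actual)
        state = actual
    return state, errors


if __name__ == "__main__":
    final_state, errors = run()
    print(final_state, errors)
```

In this sketch the model only ever corrects its guess for actions it actually tries, so its answers about untried actions stay unverified guesses.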
so the different kinds of techniques are
basically different ways of doing
it so model based methods answer a
question of
what state you would get basically what
would happen to the world if you were to
take a certain action value based
methods they answer the question of what
value you would get meaning what utility
you would get but in a sense they're not
really all that different because
they're both really just answering these
what-if questions now unfortunately for
us with current machine learning methods
answering what-if questions can be
really hard because they are really
questions about things that didn't
happen if you want to answer what-if
questions about things that did happen
you wouldn't need to learn a model you
would just like repeat the thing that
worked before and that's really a big
part of why RL is a little bit tough so
if you have a purely on policy kind of
online process then you ask these
what-if questions you make some mistakes
then you're going to try doing those
mistaken things and then you observe
kind of the counter examples that'll
teach you not to do those things again
if you have a bunch of off policy data
and you just want to synthesize the best
policy you can out of that data then you
really have to deal with the the
challenges of making these these
counterfactual what's the policy yeah a
policy is a model or some kind of
function that maps from observations of
the world to actions so in reinforcement
learning we often refer to the the
current configuration of the world as
the state so we say the state kind of
encompasses everything you need to fully
define where the world is at at the
moment and depending on how we formulate
the problem we might say you either get
to see the state or you get to see an
observation which is some snapshot some
piece of the state so policy is just
includes everything in it in order to be
able to act in this world yes and so
what does off policy mean yeah so the
terms on policy and off policy refer to
how you get your data so if you get your
data from somebody else who was doing
some other stuff maybe you get your data
from some manually programmed system
that was you know just running in the
world before that's referred to as off
policy data but if you got the data by
actually acting in the world based on
what your current policy thinks is good
we call that on policy data and
obviously on policy data is more useful
to you because if your current policy
makes some bad decisions you will
actually see that those decisions are bad off
policy data however might be much easier
to obtain because maybe that's all the
log data that you have from before so we
talked about new offline talked about
autonomous vehicles so you can envision
off policy kind of approaches in
robotics spaces where there's really a ton
of robots out there but they don't get
the luxury of being able to explore
based on reinforcement learning
framework so how do we make again open
question but how do we make off policy
methods work yeah so this is something
that has been kind of a big open problem
for a while and in the last few years
people have made a little bit of
progress on that you know I can tell you
about and it's not by any means solved
yet but I can tell you some of the
things that for example we've done to
try to address some of the challenges it
turns out that one really big challenge
with off policy reinforcement learning
is that you can't really trust your
models to give accurate predictions for
any possible action so if I've never
tried to if in my data set I never saw
somebody steering the car off the road
onto the sidewalk my value function or
my model is probably not going to
predict the right thing if I ask what
would happen if I were to steer the car
off the road onto the sidewalk so one of
the important things you have to do to
get off policy RL to work is you have to
be able to figure out whether a given
action will result in a trustworthy
prediction or not and you can use kind
of distribution estimation methods kind
of density estimation methods to try to
figure that out so you could figure out
that well this action my model is
telling me that it's great but it looks
totally different from any action I've
taken before so my model is probably
not correct and you can incorporate
regularization terms into your learning
objective that will essentially tell you
not to ask those questions that your
model is unable to answer what would
lead to breakthroughs in this space do
you think like well what's needed is
this a data set question do we need to
collect big benchmark data sets that
allow us to explore the space is it a
new kinds of methodologies like what's
your sense or maybe coming together in a
space of robotics and defining the
right problem to be working on I think
for off policy reinforcement learning in
particular it's very much an algorithms
question right now and you know this is
something that I think it's great
because an algorithms question is you know
that that just takes some very smart
people to get together and think about
it really hard whereas if it was like a
data problem or hardware problem that
would take some serious engineering so
that's why I'm pretty excited about that
problem because I think that we're in a
position where we can make some real
progress on it just by coming up with
the right algorithms in terms of which
algorithms they could be you know that
the problems at their core are very
related to problems in you know things
like like causal inference right because
well you're really dealing with the
situations where you have a model a
statistical model that's trying to make
predictions about things that it hadn't
seen before
and if it's a if it's a model that's
generalizing properly that'll make good
predictions if it's a model that picks
up on spurious correlations that will
not generalize properly and then you can
you have an arsenal of tools you can use
you could for example figure out what
are the regions where it's trustworthy
or on the other hand you could try to
make it generalize better somehow or
some combination of the two is there
room for mixing sort of where most of it
like 90 95 percent is off policy you
already have the data set and then you
get to send the robot out to do a little
exploration like what what's that role
of mixing them together yeah absolutely
I think that this is something that you
actually described very well at the
beginning of the of our discussion when
you talk about the iceberg like this is
the iceberg that the 99% of your prior
experience that's your iceberg you'd use
that for off policy reinforcement
learning and then of course if you've
never you know opened that particular
kind of door with that particular lock
before then you have to go out and
fiddle with it a little bit and that's
that additional 1% to help you figure
out a new task and I think that's
actually like a pretty good recipe going
forward is this to you the most exciting
space of reinforcement learning now or
is there what's uh and maybe taking a
step back not just now but what's to you
the most beautiful idea apologize for
the romanticized question but the
beautiful idea or a concept in
reinforcement learning
in general I actually think that one of
the things that is a very beautiful idea
in reinforcement learning is just the
idea that you can obtain a near optimal
controller a near optimal policy
without actually having a complete model
of the world this is you know it's
something that feels perhaps kind of
obvious if you if you just hear the term
reinforcement learning or you think
about trial and error learning but from
a controls perspective it's a very weird
thing because classically you know we we
think about engineered systems and
controlling engineered systems as as the
problem of writing down some equations
and then figuring out given these
equations you know basically solve for
X figure out the the thing that
maximizes its performance and the the
theory of reinforcement learning
actually gives us a mathematically
principled framework to think to
reason about you know optimizing some
quantity when you don't actually know
the equations that govern that system
and that I don't to me that actually
seems kind of kind of you know very
elegant not something that sort of
becomes immediately obvious at least in
the mathematical sense does it make
sense to you that it works at all well I
think it makes sense when you take some
time to think about it but it is a
little surprising well then then taking
a step into the more deeper
representations which is also very
surprising of sort of the richness of
the state space the space of
environments that this kind of approach
can operate in can you maybe say what is
deep reinforcement learning well deep
reinforcement learning simply refers to
taking reinforcement learning algorithms
and combining them with high capacity
neural net representations which is you
know kind of it might at first seem like
a pretty arbitrary thing just take these
two components and stick them together
but the reason that it's it's something
that has become so important in recent
years is that reinforcement learning it
kind of faces an exacerbated version of
a problem that has faced many other
machine learning too
so if you if we go back to like you know
the early 2000s or the late 90s we'll
see a lot of research on machine
learning methods that have some very
appealing mathematical properties like
they reduce to convex optimization
problems for instance but they require
very special inputs they require a
representation of the input that is
clean in some way like for example clean
in the sense that the classes in your
multi-class classification problems
separate linearly so they have to
take in some kind of good
representation we call this a feature
representation and for a long time
people were very worried about features
in the world of supervised learning
because somebody had to actually build
those features so you couldn't just take
an image and plug it into your logistic
regression or your SVM or something
someone had to take that image and
process it using some handwritten code
and then neural nets came along and they
could actually learn the features and
suddenly we could apply learning
directly to the raw inputs which was
great for images but it was even more
great for all the other fields where
people hadn't come up with good features
yet and one of those fields actually
is reinforcement learning because in
reinforcement learning the notion of
features if you don't use neural nets
and you have to design your own features
it's very very opaque like it's very
hard to imagine like let's say I'm
playing chess or Go what is a feature
with which I can represent the value
function for Go or even the
optimal policy for Go linearly I I don't
even know how to start thinking about it
and and people tried all sorts of things
they would write down you know an expert
chess player looks for whether the the
knight is in the middle of the board or
not so that's a feature is knight in
middle of board and they would write
these like long lists of kind of
arbitrary made-up stuff and that was
really kind of getting us nowhere and
that's a little chess is a little more
accessible than the robotics problem
absolutely all right that's there's at
least experts in the different features
for chess but still like the neural
network there to me that's I mean
you put it eloquently and almost made it
seem like a natural step to add neural
networks but the fact that neural
networks are able to discover features
in the control problem it's very
interesting it's hopeful I'm not sure
what to think about it but it feels
hopeful that the control problem has
features to be learned
like I guess my question is is it
surprising to you how far the deep side
of deep reinforcement learning is able
to like what the space of problems it has
been able to tackle especially in
games with AlphaStar and AlphaZero
and just the the representation
power there and in the robotic space and
what is your sense of the limits of this
representation power in the control
context I think that in regard to the
limits that here I think that one thing
that makes it a little hard to fully
answer this question is because in
settings where we would like to push
these things to the limit we encounter
other bottlenecks so like the reason
that I can't get my robot to learn how
to like I don't know do the dishes in
the kitchen it's not because its neural
net is not big enough it's because when
you try to actually do trial and error
learning do reinforcement learning
directly in the real world where you
have the potential to gather these large
you know highly varied and
complex datasets you start running into
other problems like one problem you run
into very quickly it'll first sound like
a very pragmatic problem but actually
turns out to be a pretty deep scientific
problem take the robot put it in your
kitchen have it try to learn to do the
dishes with trial and error it'll break
all your dishes and then we'll have no
more dishes to clean now you might think
this is a very practical issue but
there's something to this which is that
if you have a person trying to do this
you know a person will have some degree
of common sense they'll break one dish
they'll be a little more careful with the
next one and if they break all of them
they're gonna go and get more or
something like that so there's all sorts
of scaffolding that that comes very
naturally to us for our learning process
like you know if I have to learn
something through trial and error I have
a common sense to know that I have to
you know try multiple times if I screw
something up I ask for help or I reset
things or something like that and all
that is kind of outside of the classic
reinforcement learning problem formulation
there are other things that can also be
categorized as
scaffolding but are very important like
for example where do you get your reward
function if I want to learn how to pour
a cup of water well how do I know if
I've done it correctly now that probably
requires an entire computer vision
system to be built just to determine
that and that seems a little bit
inelegant so there are all sorts of
things like this that start to come up
when we think through what we really
need to get reinforcement learning to
happen at scale in the real world and
many of these things actually
suggest a little bit of a shortcoming in
the problem formulation and a few deeper
questions that we have to resolve that's
really interesting I talked to like
David Silver about AlphaZero and it
seems like there's no again the we
haven't hit the limit at all in the
context when there is no broken dishes
so in the game in the case of Go you can
it's really about just scaling compute
so again like the bottleneck is the
amount of money you're willing to invest
in compute and then maybe the different
the scaffolding around how difficult it
is to scale compute maybe but there
there's no limit and it's interesting
now we move to the real world and
there's the broken dishes they solved it
and the reward function like you
mentioned that's really nice so how
do we push forward there do you think
there's there's this kind of sample
efficiency question that people bring up
or you know not having to break a
hundred thousand dishes is this an
algorithm question is this data
selection like question or what do you
think how do we how do we not break
too many dishes yeah well one way we can
think about that is that maybe we need
to be better at reusing our data
building that that iceberg so perhaps
perhaps it's too much to hope that you
can have a machine that in isolation in
a vacuum without anything else can
just master complex tasks in like in
minutes the way that people do but
perhaps it also doesn't have to perhaps
what it really needs to do is have an
existence a lifetime where it does many
things and the previous things that it
has done prepare it to do new things
more
and you know the study of these kinds of
questions typically falls under
categories like multitask learning or
meta learning but they all fundamentally
deal with the same general theme which
is use experience for doing other things
to learn to do new things efficiently
and quickly so what do you think about
if you just look at one particular case
study of Tesla Autopilot that is
quickly approaching towards a million
vehicles on the road where some
percentage of the time thirty forty
percent of the time is driven using the
computer vision multitask HydraNet
right and then the other percent that's
what they call it HydraNet the the
other percent is human controlled from
the human side how can we use that data
what's your sense like what's the signal
do you have ideas in this autonomous
vehicle space when people can lose their
lives you know it's a it's a safety
critical environment so how do we use
that data so I think that actually the
kind of problems that come up when we
want systems that are reliable and that
can kind of understand the limits of
their capabilities they're actually very
similar to the kind of problems that
come up when we're doing off
policy reinforcement learning so as I
mentioned before in off policy
reinforcement learning the big problem
is you need to know when you can trust
the predictions of your model because if
you if you're trying to evaluate some
pattern of behavior for which your model
doesn't give you an accurate prediction
then you shouldn't use that to to modify
your policy and it's actually very
similar to the problem that we're faced with
when we actually then deploy that thing
and we want to decide whether we trust
it in the moment or not so perhaps we
just need to do a better job of figuring
out that part and that's a very deep
research question of course it's also a
question that a lot of people are
working on so I'm pretty optimistic that
we can make some progress on that over
the next few years what's the role of
simulation in reinforcement learning
in deep reinforcement learning
reinforcement learning like how
essential is it it's been essential for
the breakthroughs so far for some
interesting breakthroughs do you think
it's a crutch that we rely on I mean
again this connects
to our off policy discussion but do you
think we can ever get rid of simulation
or do you think simulation will actually
take over we'll create more and more
realistic simulations that will allow us
to to solve actual real-world problems
like transfer the models we learn in
simulation to the real world yeah I
think that simulation is a very
pragmatic tool that we can use to get a
lot of useful stuff to work right now
but I think that in the long run we will
need to build machines that can learn
from real data because that's the only
way that we'll get them to improve
perpetually because if we can't have our
machines learn from real data if they
have to rely on simulated data
eventually the simulator becomes the
bottleneck in fact this is a general
thing if your machine has any bottleneck
that is built by humans and that doesn't
improve from data it will eventually be
the thing that holds it back and if
you're entirely relying on your
simulator that'll be the bottleneck if
you're entirely reliant on a
manually designed controller that's
going to be the bottleneck so simulation
is very useful it's very pragmatic but
it's not a substitute for being able to
utilize real experience and this is by
the way this is something that I think
is quite relevant now especially in the
context of some of the things we've
discussed because some of these kind of
scaffolding issues that I mentioned
things like the broken dishes and the
unknown reward function like these are
not problems that you would ever stumble
on when working in a purely simulated
kind of environment but they become very
apparent when we try to actually run
these things in the real world to
throw a brief wrench into our discussion
let me ask do you think we're living in
a simulation oh I have no idea do you
think that's a useful thing to even
think about about the there the the
fundamental physics nature of reality or
another perspective the reason I think
the simulation hypothesis is interesting
is it's to think about how difficult is
it to create sort of a virtual reality
game type situation that will be
sufficiently convincing to us humans or
sufficiently enjoyable that we
wouldn't want to leave that's actually a
practical engineering
and I I personally really enjoy virtual
reality but it's quite far away but I
kind of think about what would it take
for me to want to spend more time in
virtual reality versus the real world
and that's a that's a sort of a nice
clean question because at that point
we've reached if I want to live in a
virtual reality that means we're just a
few years away where majority of the
population lives in a virtual reality
and that's how we create the simulation
right you don't need to actually
simulate the you know the quantum
gravity and just every aspect of the of
the universe and that's really the
interesting question for reinforcement
learning too is if you want to make
sufficiently realistic simulations that
blur the difference between
sort of the real world and the
simulation thereby some of
the things we've been talking about kind
of the problems go away if we can create
actually interesting rich simulations
it's an interesting question and it
actually I think your question
casts your previous questions in a very
interesting light because in some ways
asking whether we can well the more
practical more kind of practical version
is like you know can we build simulators
that are good enough to train
essentially AI systems that will work in
the world and it's kind of interesting
to think about this about what this
implies if true it kind of implies that
it's easier to create the universe than
it is to create a brain and then it
seems like put this way it seems kind of
weird the aspect of the simulation most
interesting to me is the simulation of
other humans that seems to be a
complexity that makes the robotics
problem harder now I don't know if every
robotics person agrees with that notion
just as a quick aside what are your
thoughts about when the human enters the
picture of the robotics problem how does
that change the reinforcement learning
problem the the learning problem in
general yeah I think that's a it's a
kind of a complex question and I guess
my hope for a while had been that if we
build these
robotic learning systems that that are
multitask that utilize lots of prior
data and that learn from their own
experience the bit where they have to
interact with people will be perhaps
handled in much the same way as all the
other bits so if they have prior
experience interacting with people and
they can learn from their own experience
of interacting with people for this new
task maybe that'll be enough now of
course if it's not enough there
are many other things we can do and
there's quite a bit of research on that
in that area but I think it's worth a
shot to see whether the the the multi
agent interaction the the ability to
understand that other beings in the
world have their own goals and intentions
and thoughts and so on whether that kind
of understanding can emerge
automatically from simply learning to do
things with and maximize utility that
information arises from the data you've
said something about gravity sort of
that you don't need to explicitly inject
anything into the system they can be
learned from the data and gravity is an
example of something that could be
learned from data sort of like the
physics of the world like what what are
the limits of what we can learn from
data do you really do you think we can
so a very simple clean way to ask that
is do you really think we can learn
gravity from just data the idea the the
laws of gravity so something
that I think is a common kind of pitfall
when thinking about prior knowledge and
learning is to assume that just because
we know something then that it's better
to tell the Machine about that rather
than have it figure it out on its own in
many cases things that are important
that affect many of the events that the
Machine will experience are actually
pretty easy to learn like you know if
things if every time you drop something
it falls down like yeah you might not
get the you know you might get kind of
Newton's version not Einstein's
version but it'll be pretty good and it
will probably be sufficient for you to
act rationally in the world because you
see the phenomena all the time so things
that are readily apparent from the data
we might not need to specify those by
hand it might actually be easier to let
the Machine figure it out
it just feels like that there might be a
space of many local minima in
terms of theories of this world that we
would discover and get stuck on yeah of
course
Newtonian mechanics is not necessarily
easy to come by
yeah and well in fact in in some fields
of science for example human
civilization is itself full of these local
optima so for example if you think about
how people try to figure out biology and
medicine you know for the longest time
the kind of rules like the kind of
principles that serve us very well in
our day to day lives actually serve us
very poorly in understanding medicine
and biology we had kind of very
superstitious and weird ideas about how
the body worked until the advent of the
modern scientific method so that does
seem to be you know a failing of this
approach but it's also a failing of
human intelligence arguably maybe a
small aside but you know the idea
of self play is fascinating in
reinforcement learning sort of these
competitive and creating a competitive
context in which agents can play against
each other in a sort of at the same
skill level and thereby increasing each
other's skill it seems to be this kind of
self improving mechanism is
exceptionally powerful in the context
where it could be applied first of all
is that beautiful to you that this
mechanism works as well as it does and
also can be generalized to other contexts
like in the robotic space or anything
that's applicable to the real world I
think that it's a very interesting idea
and I suspect that the bottleneck to
actually generalizing it to the robotic
setting is actually gonna be the same as
as the bottleneck for everything else
that we need to be able to build
machines that can get better and better
through natural interaction with the
world and once we can do that then they
can go out and play with they can play
with each other they can play with
people they can play with the natural
environment but before we get there
we've got all these other problems
we have to get out of the way
there's no shortcut around that you have
to interact with the natural
environment
well because in in a self play setting
you still need a mediating mechanism so
the the reason that you know self play
works for a board game is because the
rules of that board game mediate the
interaction between the agents so the
kind of intelligent behavior that will
emerge depends very heavily on the
nature of that mediating mechanism so on
the side of reward functions that's
coming up with good reward function
seems to be the thing that we associate
with general intelligence like human beings
seem to value the idea of developing our
own reward functions of you know
arriving at meaning and so on and yet
for reinforcement learning we often kind
of specify that as the given what's your
sense of how we develop a reward for
good you know good reward functions yeah
I think that's a very complicated and
very deep question and you're completely
right that classically in reinforcement
learning this question has kind of been
treated as a non-issue that you sort of
treat the reward as this external thing
that comes from some other bit of your
biology and you don't worry about it
and I do think that that's actually you
know a little bit of a mistake that we
should worry about it and we can
approach it in a few different ways we
can approach it for instance by thinking
of rewards as a communication medium we
can say well how does a person
communicate to a robot what its
objective is you can approach it also as
sort of more of an intrinsic motivation
medium you could say can we write down
kind of a general objective that leads
to good capability like for example can
you write down some objective such that
even in the absence of any other task if
you maximize that objective you'll sort
of learn useful things this is
something that has sometimes been called
unsupervised reinforcement learning
which i think is a really fascinating
area of research especially today we've
done a bit of work on that recently one
of the things we've studied is whether
we can have some notion of of
unsupervised reinforcement learning by
means of you know information theoretic
quantities like for instance minimizing
a Bayesian measure of surprise this is
an idea that was you know pioneered
actually in the computational
neuroscience community by folks like
Karl Friston
we've done some work recently that shows
that you can actually learn pretty
interesting skills by essentially
behaving in a way that allows you to
make accurate predictions about the
world it seems a little circular do the
things that will lead to you getting the
right answer for prediction but you can
you know by doing this you can sort of
discover stable niches in the world you
can discover that if you're playing
Tetris then correctly you know clearing
the rows will let you play Tetris for
longer and keep the board nice and clean
which sort of satisfies some desire for
order in the world and as a result you
get some degree of leverage over your
domain so we're exploring that pretty
actively is there a role for a human
notion of curiosity in itself being the
reward sort of discovering new things
about the world so one of the
things that I'm pretty interested in is
actually whether discovering new things
can actually be an emergent property of
some other objective that quantifies
capability so new things for the sake of
new things maybe it's not maybe might
not by itself be the right answer but
perhaps we can figure out an objective
for which discovering new things is
actually the natural consequence that's
something we're working on right now but
I don't have a clear answer for you
there yet that's still work-in-progress
you mean just as a curious observation
to see sort of creative patterns of
curiosity on the way to optimize for a
particular protector on the way to
optimize for a particular measure of
capability are there ways to
understand or anticipate unexpected
unintended consequences of particular
reward functions
sort of anticipate the kind of
strategies that might be developed and
try to avoid highly detrimental strategies
yeah so classically this is something
that has been pretty hard in
reinforcement learning because it's
difficult for a designer to have good
intuition about you know what a learning
algorithm will come up with when they give
it some objective there are ways to
mitigate that one way to mitigate it is
to actually define an objective that
says like don't do weird stuff
you can actually quantify you can say
just like don't enter situations that
have low probability under the
distribution of states you've seen
before it turns out that that's actually
one very good way to do off policy
reinforcement learning actually so we
can do some things like that if we
slowly venture in speaking about reward
functions into greater and greater
levels of intelligence Stuart
Russell thinks about this the alignment
of AI systems with us humans so how do
we ensure that AGI systems align with
us humans it's a it's kind of a reward
function question of specifying the
behavior of AI systems such that their
success aligns with us with the broader
intended success interest of human
beings do you have thoughts on this do you
have kind of concerns of where
reinforcement learning fits into this or
are you really focused on the current
moment of us being quite far away and
trying to solve the robotics problem I
don't have a great answer to this but
you know and I do think that this is a
problem that's that's important to
figure out for my part I'm actually a
bit more concerned about the other side
of the of this equation that you know
maybe rather than unintended
consequences for objectives that are
specified too well I'm actually more
worried right now about unintended
consequences for objectives that are not
optimized well enough which might become
a very pressing problem when we for
instance try to use these techniques for
safety critical systems like cars and
aircraft and so on I think at some point
we'll face the issue of objectives being
optimized too well but right now I think
we're more likely to face the issue of
them not being optimized well enough but
you don't think unintended consequences
can arise even when you're far from
optimality sort of like on the path to
it oh no I think unintended
consequences can absolutely arise it's
just I think right now the bottleneck
for improving reliability safety and
things like that is more with systems
that like need to work better that
optimize their objective better
do you have thoughts concerns about
existential threats of human level
intelligence sort of if we put on our
hat of looking in ten twenty a hundred
five hundred years from now do you have
concerns about existential threats of AI
systems I think there are absolutely
existential threats from AI systems just
like there are for any powerful
technology but I think that the these
kinds of problems can take many forms
and and some of those forms will come
down to you know people with nefarious
intent some of them will come down to AI
systems that have some fatal flaws and
some of them will will of course come
down to AI systems that are too capable
in some way but among this set of
potential concerns I would actually be
much more concerned about the first two
right now and principally the one with
nefarious humans because you know just
through all of human history it's actually
the nefarious humans that have been the
problem not the nefarious machines then
I am about the others and I think that
right now the best that I can do to make
sure things go well is to you know build
the best technology I can and also
hopefully promote responsible use of
that technology do you think RL systems
have something to teach us humans
you said nefarious humans getting us in
trouble I mean machine learning systems
in some ways have revealed to us
the ethical flaws in our data in that
same kind of way can reinforcement
learning teach us about ourselves has it
taught something what have you learned
about yourself from trying to build
robots and reinforcement learning
systems I'm not sure what I've learned
about myself but maybe part of the
answer to your question might become a
little bit more apparent once we see
more widespread deployment of
reinforcement learning for decision
making support in you know in domains
like you know healthcare education
social media etc and I think we will see
some interesting stuff emerge there we
will see for instance what kind of
behaviors these systems come up with
in situations where there is
interaction with humans and and where
they have you know possibility of
influencing human behavior I think we're
not quite there yet but maybe in the
next two years we'll see some
interesting stuff coming out in that
area
I hope outside the research because the
the exciting space where this could be
observed is sort of large companies that
deal with large data and I hope there's
some transparency and one of the things
it's unclear when I look at social
networks and just online is why an
algorithm did something or whether you
know even an algorithm was involved and
that'd be interesting as a formal
research perspective just to to observe
the results of algorithms to open up
that data or to be sufficiently
transparent about the behavior of these
AI systems in the real world what's
your sense I don't know if you looked at
the blog post The Bitter Lesson by Rich
Sutton where he looks at sort of the big
lesson of research in AI in
reinforcement learning is that simple
methods general methods that leverage
computation seem to work well so
basically don't try to do any kind of
fancy algorithms just wait for
computation to get fast do you share
this kind of intuition I think the high
level idea makes a lot of sense I'm not
sure that my takeaway would be that we
don't need to work on algorithms I think
that my takeaway would be that we should
work on general algorithms and actually
I think that this idea of needing to
better automate the acquisition of
experience in the real world actually
follows pretty naturally from Rich
Sutton's conclusion so if the claim is
that automated general methods plus data
leads to good results then it makes
sense that we should build general
methods and we should build the kind of
methods that we can deploy and get them
to go out there and like collect their
experience autonomously I think that you
know one place where I think that the
current state of things falls a little
bit short of that is actually that the
going out there collecting the data
autonomously
which is easy to do in a simulator or board
game but very hard to do in the real
world yeah it keeps coming back to this
one problem right it's uh so your mind
is focused there now in this real world
it just seems scary the step of
collecting the data and it seems unclear
to me how we can do it effectively well
you know there's seven billion people in
the world each of them had to do that at
some point in their lives
and we should leverage that experience
that they've all done the we should be
able to try to collect that kind of data
okay big questions maybe stepping back
through your life what book or books
technical or fiction or philosophical
had a big impact on you on the way you
saw the world on the way you thought about
the world your life in general hmm and
maybe what books if it's different would
you recommend people consider reading on
their own intellectual journey it could
be within reinforcement learning but
could be very much bigger I don't know
if this is like a scientifically like
particularly meaningful answer but like
the honest answer is that I actually
found a lot of the work by Isaac Asimov
to be very inspiring when I was younger
I don't know if that has anything to do
with with AI necessarily you don't think
it had a ripple effect in your life
maybe it did but yeah I like I think
that a vision of a future where well
first of all artificial
intelligence systems artificial robotic
systems have you know kind of a big
place a big role in society and where we
try to imagine the sort of the the
limiting case of technological
advancement and how that might play out
in in our future history but yeah I
think that the that was in some way
influential I don't really know how but
and I would recommend it I mean if
nothing else you'd be well entertained
when did you first yourself like fall in love
with the idea of artificial intelligence
get captivated by this field so my
honest answer here is actually that I
only really started to think think about
it as something that I might
want to do actually in graduate school
pretty late and a big part of that was
that until you know somewhere around
2009 2010 it just wasn't really high on
my priority list because I I didn't
think that it was something where we're
going to see very substantial advances
in my lifetime and you know maybe in
terms of my career the time when I
really decided I wanted to work on this
was when I actually took a seminar
course that was taught by Professor Andrew
Ng and you know at that point I of
course had some had like a decent
understanding of the technical things
involved but one of the things that
really resonated with me was when he
said in the opening lecture something to
the effect of like well he used to have
graduate students come to him and talk
about how they want to work on AI and he
would kind of chuckle and give them some
math problem to deal with but now he's
actually thinking that this is an area
where we might see like substantial
advances in our lifetime and that kind
of got me thinking because you know in
an abstract sense yeah like you can kind
of imagine that but in a very real sense
when someone who had been working on
that kind of stuff their whole career
suddenly says that yeah like that had
that had some effect on me yeah this
might be a special moment in the history
of the field that this is where we might
see some some interesting breakthroughs
so in the space of advice for somebody who's
interested in getting started in
machine learning or reinforcement
learning what advice would you give to
maybe an undergraduate student or maybe
even younger how what are the first
steps to take and further on what are
the steps to take on that journey
so something that I think is important
to do is to is to not be afraid to like
spend time imagining the kind of outcome
that you might like to see so you know
one outcome might be a successful career
large paycheck or something or
state-of-the-art results in some
benchmark but hopefully that's not the
thing that's like the main driving force
for somebody but I I think that if
someone who's a student considering a
career in AI like takes a little while
sits down and thinks like what do I
really want to see what do I want to see a
machine do what do I want to
see a robot do
what do I want to see a natural language
system do just like imagine you know
imagine it almost like a commercial for
a future product or something or like
like something that you'd like to see in
the world and then actually sit down and
think about the steps that are necessary
to get there and hopefully that thing is
not a better number on imagenet
classification it's like it's probably
like an actual thing that we can't do
today that would be really awesome
whether it's a robot Butler or a you
know a really awesome healthcare
decision making support system whatever
it is that you find inspiring and I
think that thinking about that and then
backtracking from there and imagining
the steps needed to get there will
actually lead to much better research it'll
lead to rethinking the assumptions it'll
lead to working on the bottlenecks that
other people aren't working on and then
naturally to turn to you we've talked
about reward functions and you just gave
advice on looking forward I would
like to see what kind of change you
would like to make in the world what do
you think ridiculous big question what
do you think is the meaning of life what
is the meaning of your life what gives
you fulfillment purpose happiness and
meaning that's a very big question um
what's the reward function under which
you are operating yeah I think one thing
that does give you know if not meaning
at least satisfaction is some degree of
confidence that I'm working on a problem
that really matters I feel like it's
less important to me to like actually
solve a problem but it's it's quite nice
to take things to spend my time on that
I believe really matter and I I try
pretty hard to to look for that I don't
know if it's easy to answer this but if
you're successful what does that look
like what's the
dream now of course success is
built on top of success and you keep
going forever but what is the dream yeah
so one very concrete thing or maybe as
concrete as it's gonna get
here is is to see machines that actually
get better and better the you know the
longer they exist in the world and that
kind of seems like on the surface one
might even think that that's something
that we have today but I think we really
don't I think that there is unending
complexity in the universe and to date
all the machines that we've been able to
build don't sort of improve up to the
limit of that complexity they they hit a
wall somewhere maybe they hit a wall
because they're in a simulator that
is only a very limited very pale
imitation of the real world or they hit
a wall because they rely on a labeled
dataset but they never hit the wall of
like running out of stuff to see
so you know I'd like to build
a machine that that can go as far as
possible and that runs up against the
ceiling of the complexity of the
universe yes well I don't think there's
a better way to end it Sergey thank you
so much it's a huge honor I can't wait to
see the amazing work you have yet to
publish and in the education space in terms
of reinforcement learning thank you for
inspiring the world thank you for the
great research you do thank you thanks
for listening to this conversation with
Sergey Levine and thank you to our
sponsors cash app and expressvpn please
consider supporting this podcast by
downloading cash app and using code lex
podcast and signing up at
expressvpn.com/lexpod click all the links buy
all the stuff it's the best way to
support this podcast and the journey I'm
on if you enjoy this thing subscribe on
YouTube review it with five stars on Apple
Podcasts support it on Patreon or connect
with me on Twitter at Lex Fridman
spelled somehow if you can figure out
how without using the letter e just F R
I D M A N and now let me leave you with
some words from Salvador Dali
intelligence without ambition is a bird
without wings thank you for listening
and hope to see you next time