David Silver: AlphaGo, AlphaZero, and Deep Reinforcement Learning | Lex Fridman Podcast #86
uPUEq8d73JI • 2020-04-03
the following is a conversation with David Silver, who leads the reinforcement learning research group at DeepMind and was the lead researcher on AlphaGo and AlphaZero, co-led the AlphaStar and MuZero efforts, and has done a lot of important work in reinforcement learning in general. I believe AlphaZero is one of the most important accomplishments in the history of artificial intelligence, and David is one of the key humans who brought AlphaZero to life, together with a lot of other great researchers at DeepMind.
he's humble kind and brilliant we were
both jet lagged but didn't care and made
it happen
it was a pleasure and truly an honor to
talk with David this conversation was
recorded before the outbreak of the
pandemic for everyone feeling the
medical psychological and financial
burden of this crisis I'm sending love
your way. stay strong, we're in this together,
we'll beat this thing this is the
artificial intelligence podcast if you
enjoy it subscribe on youtube review it
with five stars on Apple Podcasts, support it
on patreon or simply connect with me on
Twitter @lexfridman, spelled F-R-I-D-M-A-N. as usual, I'll do a few minutes of ads now
and never any ads in the middle they can
break the flow of the conversation I
hope that works for you and doesn't hurt
the listening experience quick summary
of the ads to sponsors masterclass and
cash app please consider supporting the
podcast by signing up to Masterclass at masterclass.com/lex and downloading Cash App and using code LEXPODCAST. this show is presented by Cash App, the number one finance app in the App Store. when you get it, use code LEXPODCAST. Cash App lets you send money
to friends buy Bitcoin and invest in the
stock market with as little as one
dollar since cash app allows you to buy
Bitcoin let me mention that
cryptocurrency in the context of the
history of money it's fascinating
I recommend The Ascent of Money as a great book on this history. debits, credits, and ledgers started around 30,000 years ago, the US dollar was created over two hundred years ago, and Bitcoin, the first decentralized cryptocurrency, just over ten years ago. so given that
history cryptocurrency is still very
much in its early days of development
but it's still aiming to and just might
redefine the nature of money so again if
you get cash out from the App Store or
Google Play
and use the code LEXPODCAST, you get ten dollars, and Cash App will also donate ten dollars to FIRST, an organization that is helping to advance robotics and STEM education for young people around the world.
this show is sponsored by masterclass
sign up at masterclass.com/lex to get a
discount and to support this podcast in
fact for a limited time now if you sign
up for an all-access pass for a year you
get to get another all-access pass to
share with a friend buy one get one free
when I first heard about masterclass I
thought it was too good to be true
for $180 a year you get an all-access pass to watch courses from, to list some of my favorites, Chris Hadfield on space exploration, Neil deGrasse Tyson on scientific thinking and communication, Will Wright, the creator of SimCity and The Sims, on
game design jane goodall on conservation
Carlos Santana on guitar his song Europa
could be the most beautiful guitar song
ever written
Garry Kasparov on chess, Daniel Negreanu on poker, and many many more. Chris Hadfield explaining how rockets work and
the experience of being launched into
space alone is worth the money for me
the key is to not be overwhelmed by the abundance of choice: pick three courses
you want to complete watch each of them
all the way through it's not that long
but it's an experience that will stick
with you for a long time I promise it's
easily worth the money you can watch it
on basically any device. once again, sign up at masterclass.com/lex to get a discount and to support this podcast. and now, here's my conversation with David Silver.
what was the first program you've ever written, and in what programming language? do you remember?

I remember very clearly. my parents brought home this BBC Model B microcomputer, and it was just
this fascinating thing to me I was about
seven years old and couldn't resist just
playing around with it. so I think the first program I ever wrote was writing my name out in different colors and getting it to loop
and repeat that and there was something
magical about that which just led to
more and more how did you think about
computers back then like the magical
aspect of it that you can write a
program and there's this thing that you
just gave birth to, that's able to create visual elements and live on its own, or
did you not think of it in those
romantic notions was it more like oh
that's cool
I can I can solve some puzzles it was
always more than solving puzzles it was
something where you know there was this
limitless possibilities once you have a
computer in front of you you can do
anything with it that's um I used to
play with Lego with the same feeling you
can make anything you want out of Lego
but even more so with a computer you
know you don't you're not constrained by
the amount of kit you've got and so I
was fascinated by it and started pulling
out there you know the user guide and
the advanced user guide and then
learning. so I started in BASIC and then, you know, later 6502. my father also became interested in this machine and gave up his career to go back to school and study for a master's degree in artificial intelligence, funnily enough, at Essex
University when I was when I was seven
so I was exposed to those things at an
early age he showed me how to program in
Prolog and do things like querying your
family tree, and those are some of my earliest memories of trying to figure things out on a
computer those are the early steps in
computer science programming but when
did you first fall in love with
artificial intelligence, or with the ideas, the dreams of AI? I think it was
really when I went to study at university. I was an undergrad at Cambridge, studying computer science, and I really started to question, you know, what really are the goals, where do we want to go with computer science? and it seemed to me that the only step of major significance to take was to try and recreate something akin to human intelligence. if we could do that, that
would be a major leap forward. and that idea, I certainly wasn't the first to have it, but it, you know, nestled within me somewhere and became like a bug: I really wanted to crack that problem.

so you thought, like, you had a notion that this is something that
human beings can do it is possible to
create an intelligent machine well I
mean, unless you believe in something metaphysical, then what are our brains doing? well, at some level they're information-processing systems which are able to take whatever information is in there, transform it through some form of program,
and produce some kind of output which
enables that that human being to do all
the amazing things that they can do in
this incredible world.

so then, because you also had an interest in games, do you remember the first time you wrote a program that beat you in a game, or beat you at anything, sort of achieved super-David-Silver-level performance?

so I used to work in the games industry. for five years I programmed games; it was my first job, and it was an amazing opportunity to
get involved in a startup company and so
I I was involved in in building AI at
that time and so for sure there was a
sense of building handcrafted, what people used to call AI in the games industry, which I think is not really what we might think of as AI in its fullest sense, but something which is able to take actions in a way which makes things interesting and challenging for the
human player and at that time I was able
to build you know these handcrafted
agents which, in certain limited cases, could do things better than me, but mostly in these kind of twitch-like scenarios where they were able to do things faster, or because they had some pattern which they were able to exploit repeatedly. I think
if we're talking about real AI the first
experience for me came after that, when I realized that this path I was on wasn't taking me towards it; it wasn't dealing with that bug which I still had inside me, to really understand intelligence and try and solve it. everything people were doing in games was, you know, short-term fixes rather than long-term vision. and so I
went back to study for my PhD, which was, funnily enough, trying to apply reinforcement learning to the game of Go,
and I built my first go program using
reinforcement learning a system which
would by trial and error play against
itself and was able to learn
which patterns were actually helpful to
predict whether it's going to win or
lose the game and then choose the moves
that led to the combination of patterns
that would mean that you're more likely
to win. and that system beat me.

how did that make you feel?

it made me feel good.

was it a mix of sort of excitement, and was there a tinge of, almost like, a fearful awe, you know, like in 2001: A Space Odyssey, kind of realizing that you've created something that, you know, has achieved human-level intelligence in this one particular little task? and in that case, I suppose,
neural networks weren't involved? there were no neural networks in those days; this was pre-deep-learning-revolution,
but it was a principled self learning
system based on a lot of the principles
which which people are still using in
deep reinforcement learning how did I
feel
I think I found it immensely satisfying that a system which was able to learn from first principles for itself was able to reach the point where it was understanding this domain better than I could, and able to outwit me. I don't think it was a sense of awe; it was a sense of satisfaction that something I felt should
work had worked.

so to me, and I don't know how else to put it, but to me, AlphaGo and AlphaGo Zero's mastery of the game of Go is, again to me, the most profound and inspiring moment in the history of artificial intelligence. you're one of the key people behind this achievement. and I'm Russian, so I really felt the first sort of seminal achievement, when Deep Blue beat Garry Kasparov in 1997. so as far as I know, the
AI community at that point largely saw
the game of Go as unbeatable by AI using the sort of state-of-the-art brute-force search methods. even if you consider, at least the way I saw it, arbitrary exponential scaling of compute, Go would still not be solvable, hence why it was thought to be impossible. so given that the game of Go was thought impossible to master, what was the dream for you? you just mentioned your PhD thesis of building a system that plays Go. what was the dream for you, that you could actually build a computer program that achieves world-class, not necessarily beats the world champion, but achieves that kind of level of playing Go?

first of all, thank you, that's very kind, Lex.
and funnily enough I just came from a
panel where I was actually in a
conversation with Garry Kasparov and
Murray Campbell, who was the author of Deep Blue, and it was their first meeting together since the match, yesterday. so I'm literally fresh from
that experience so these are amazing
moments when they happen but where did
it all start well for me it started when
I became fascinated by the game of Go. for me, I've grown up playing games; I've always had a fascination with board games. I played chess as a kid, I
played Scrabble as a kid when I was at
university I discovered the game of go
and to me it just blew all of those other games out of the water; it was just so deep and profound in its complexity, with endless levels to it.
what I discovered was that I could
devote endless hours to this game and I
knew in my heart of hearts that no
matter how many hours I would devote to
it, I would never become, you know, a grandmaster. but there was another path,
and the other path was to try and
understand how you could get some other
intelligence to play this this game
better than I would be able to and so
even in those days I had this idea that
you know what if what if it was possible
to build a program that could crack this
and as I started to explore the domain, I discovered that, you know, this was really the domain where people felt deeply that if progress could be made in Go, it would really mean a giant leap forward for AI. it was the challenge where all
other approaches had failed you know
this is coming out of the era you mentioned, which was in some sense the golden era for the classical methods of AI, like heuristic search. in the 90s, you know, they all fell one after another, not just chess with Deep Blue, but checkers, backgammon, Othello.
there were numerous cases where systems built on top of heuristic search methods, you know, high-performance systems, had been able to defeat the human world champion in each of those domains. and yet in that same time period there was a million-dollar prize available for the game of Go, for the first system to beat a human professional player, and at the end of that time period, in the year 2000, when the prize expired, the strongest Go program in the world was defeated by a nine-year-old child, and that nine-year-old child was giving nine free moves to the computer at the start of the game to try and even things up. and computer Go experts beat that same strongest program with 29 handicap stones, 29 free moves. so that's what the
state of affairs was when I became
interested in this problem in around
2000 to 2003, when I started working on computer Go, there was very, very little in the way of progress towards meaningful performance, anything approaching human level. and it wasn't through lack of effort; people had tried many, many
things and so there was a strong sense
that something different would be required for Go than had been needed for all of these other domains where AI had been successful. and
maybe the single clearest example is
that that go unlike those other domains
had this kind of intuitive property that
a go player would look at a position and
say, hey, you know, here's this mess of black and white stones, but from this mess I can predict that this part of the board will become my territory, that part of the board will become your territory, and I've got this overall
sense
I'm going to win and this is about the
right move to play and that intuitive
sense of judgment of being able to
evaluate what's going on in a position
it was pivotal to humans being able to
play this game and something that people
had no idea how to put into computers so
this question of how to evaluate a position, how to come up with these
intuitive judgments was the key reason
why go was so hard in addition to its
enormous search space and the reason why
methods which had succeeded so well
elsewhere
failed in Go. and so people really felt
deep down that that you know in order to
crack go we would need to get something
akin to human intuition and if we got
something akin to human intuition, we'd be able to solve, you know, many many more problems in AI. so to me, that was
the moment where it's like okay this is
not just about playing the game of Go
this is about something profound and it
was back to that bug which had been itching me all those years; now this was the opportunity to do something meaningful and transformative, and I guess a dream was born.

that's a really interesting way to put it. almost
this realization that you need to formulate Go as kind of a prediction problem versus a search problem. the intuition, I mean, maybe that's the wrong, crude term, but to give it the ability to kind of intuit things about the positional structure of the board.
well okay but what about the learning
part of it? did you have a sense that learning has to be part of
the system? again, something that, except with TD-Gammon in the 90s, which was RL a little bit, hadn't really been part of those state-of-the-art game-playing systems. so
I strongly felt that learning would be
necessary, and that's why my PhD topic back then was trying to apply reinforcement learning to the game of Go.
and not just learning of any type but I
felt that the only way to really have a system progress beyond human levels of performance wouldn't just be to mimic how humans do it, but to have it understand for itself,
and how else can a machine hope to
understand what's going on except
through learning if you're not learning
what else are you doing? well, you're putting all the knowledge into the system, and that just feels like something which decades of AI have told us is maybe not a dead end, but certainly has a ceiling to its
capabilities it's known as the you know
knowledge acquisition bottleneck that
there the more you try to put into
something the more brittle the system
becomes and and so you just have to have
learning you have to have learning
that's the only way you're going to be
able to get a system which has
sufficient knowledge in it you know
millions and millions of pieces of
knowledge, billions, trillions, in a form
that it can actually apply for itself
and understand how those billions and
trillions of pieces of knowledge can be
leveraged in a way which will actually
lead it towards its goal without
conflict or or other issues yeah I mean
if I put myself back in there in that
time I just wouldn't think like that
without a good demonstration of RL I
would think more in the symbolic AI way: not learning, but sort of a simulated knowledge base, like a growing knowledge base; it would still be sort of pattern-based, like basically you have little rules that you kind of assemble together into a large knowledge base.
well in a sense that was the state of
the art back then so if you look at the
go programs which had been competing for
this prize I mentioned they were an
assembly of different specialized
systems, some of which used huge amounts of human knowledge to describe how you should play the opening, all the different patterns that were required to play well in the game of Go, endgame theory, combinatorial game theory, combined with more principled search-based methods which were trying to solve particular sub-parts of the game, like life and death,
connecting groups together all these
amazing subproblems that just emerged in
the game of Go. there were different pieces all put together into
this like collage which together would
try and
play against a human and although not
all of the pieces were handcrafted the
overall effect was nevertheless still
brittle and it was hard to make all
these pieces work well together and so
really what I was pressing for and the
main innovation of the approach they
took was to go back to first principles and say, well, let's back off from that and try and find a principled approach where the system can learn for itself, just from the outcome: learn for yourself, if you try something, did that help or did it not help? and only through that procedure can you arrive at knowledge which is verified; the system has to verify it for itself, not relying on any third party to say this is right or this is wrong. so that principle was already
you know very important in those days
but unfortunately we were missing some
important pieces back then.

so before we dive into maybe discussing the beauty of reinforcement learning, let's take a step back, we kind of skipped it a bit, but the rules of the game of Go: what are the elements of it, perhaps contrasting with chess, that you really enjoyed as a human being, and also that make it really difficult as an AI machine-learning problem?

so the game of Go has remarkably simple rules, so simple that people
have speculated that if we were to meet
alien life at some point, we wouldn't be able to communicate with them, but we would be able to play Go with them; they'd probably have discovered the same rule set. yeah, so the game is
played on a 19-by-19 grid, and you
play on the intersections of the grid
and the players take turns and the aim
of the game is very simple it's to
surround as much territory as you can as
many of these intersections with your
stones, and to surround more than your opponent does. and the only nuance to the
game is that if you fully surround your
opponent's piece then you get to capture
it and remove it from the board and it
counts as your own territory now from
those very simple rules immense
complexity arises: kind of profound strategies in how to surround territory, how to trade off between making solid territory yourself now compared to building up influence that will help you acquire territory later in the game, how to connect groups together, how to keep your own groups alive, which patterns of stones are most useful compared to others.
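The capture rule described a moment ago can be sketched in a few lines of code. This is an illustrative toy, assuming a made-up board representation (the function names and the dict-based board are invented for this sketch), not a full Go implementation:

```python
# Illustrative toy of the capture rule: a group of stones is captured when
# it has no liberties, i.e. no empty intersections adjacent to the group.
# The board is a dict mapping (row, col) -> 'B' or 'W'; missing keys are
# empty points. This is a sketch only: no ko rule, no suicide rule, no scoring.

def group_and_liberties(board, start, size=19):
    """Flood-fill the group containing `start`; return (stones, liberties)."""
    color = board[start]
    stones, liberties, frontier = set(), set(), [start]
    while frontier:
        r, c = frontier.pop()
        if (r, c) in stones:
            continue
        stones.add((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < size and 0 <= nc < size:
                if (nr, nc) not in board:
                    liberties.add((nr, nc))    # empty neighbour = liberty
                elif board[(nr, nc)] == color:
                    frontier.append((nr, nc))  # same colour joins the group
    return stones, liberties

def is_captured(board, point, size=19):
    """True when the group containing `point` has zero liberties."""
    _, libs = group_and_liberties(board, point, size)
    return len(libs) == 0

# a single white stone fully surrounded by black stones is captured:
surrounded = {(1, 1): 'W', (0, 1): 'B', (2, 1): 'B', (1, 0): 'B', (1, 2): 'B'}
```

Even this toy hints at why the game is hard: the rules fit in a page, but nothing in them tells you which of the strategic trade-offs just listed a given position favors.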
there's just immense knowledge. the game was discovered thousands of years ago, and human Go players have built up an immense knowledge base over the years. it's studied very deeply and played by something like 50 million players across the world, mostly in China, Japan and Korea, where it's an important part of the culture, so much so that it's considered one of the four ancient arts that were required of Chinese scholars. so
there's a deep history there.

but there's an interesting quality: is it comparable to chess? chess in Russia is, in the same way as Go is in Chinese culture, also considered one of the sacred arts. so if we contrast sort of Go with chess, there are interesting qualities about Go, maybe you can correct me if I'm wrong, but the evaluation of a particular static board is not as reliable. in chess you can kind of assign points to the different units, and it's kind of a pretty good measure of who's winning and who's losing. in Go it's not so clear.

yeah, so in the game of Go, you know, you find
yourself in a situation where both players have played the same number of stones; actual captures, at a strong level of play, happen very rarely, which means that at any moment in the game you've got the same number of white stones and black stones, and the only thing which differentiates how well you're doing is this intuitive sense of, you know, where the territories are ultimately going to form on this board. and if you
look at the complexity of a real Go position, you know, it's mind-boggling, that kind of question of what will happen 300 moves from now, when you see just a scattering of twenty white and black stones
intermingled. and so that challenge is the reason why position evaluation is so hard in Go compared to other games. in addition to that, Go has an enormous search space: there are around 10^170 positions in the game of Go. that's an astronomical number, and that search space is so great that traditional heuristic search methods, which were so successful in things like Deep Blue and chess programs, just kind of fall over in Go.
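Those numbers can be given a rough sense of scale with a back-of-envelope calculation. The branching factors and game lengths used below (about 35 and 80 for chess, about 250 and 150 for Go) are ballpark figures often quoted, not exact values:

```python
# Rough sense of scale for the search spaces discussed here. With an
# approximate branching factor b (moves available per turn) and game
# length d (plies), the number of possible move sequences is about b**d.
import math

def tree_size_exponent(branching, depth):
    """Approximate exponent e such that branching**depth ~ 10**e."""
    return math.floor(depth * math.log10(branching))

chess_exp = tree_size_exponent(35, 80)   # chess: b ~ 35, d ~ 80 plies
go_exp = tree_size_exponent(250, 150)    # Go:    b ~ 250, d ~ 150 plies
print(f"chess ~10^{chess_exp} move sequences, Go ~10^{go_exp}")
```

No search ever exhausts either tree, but the gap of a couple hundred orders of magnitude between the two games gives a sense of why search methods that worked for chess could not simply be scaled up for Go.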
so at which point did reinforcement learning enter your life, your research life, your way of thinking? we just talked about learning, but reinforcement learning is a very particular kind of learning, one that's both philosophically sort of profound, but also one that's pretty difficult to get to work, if we look back at least at the early days. so when did that enter your life, and how did that work progress?
so I had just finished working in the
games industry, this startup company, and I took a year out to discover for myself exactly which path I wanted to take. I knew I wanted to study intelligence, but I wasn't sure what that meant at that stage; I really didn't feel I had the tools to decide on exactly which path I wanted to follow. so during that
year I read a lot, and one of the things I read was Sutton and Barto's sort of seminal textbook, Reinforcement Learning: An Introduction, and when I read that textbook I just had this resonating feeling that this was what I understood intelligence to be, and this was the path that I felt would be necessary to go down to make progress in AI. so I got in touch with Rich Sutton
and asked him if he would be interested
in supervising me on a PhD thesis in in
computer Go, and he basically said that if he was still alive, he'd be happy to. but unfortunately he'd been, you know, struggling with very serious cancer for some years, and he really wasn't confident at that stage that he'd even be around to see the end of it. but fortunately
that part of the story worked out very
happily and I found myself out there in
Alberta. they've got a great games group out there with a history of fantastic work in board games, as well as Rich Sutton, the father of RL, so it was the natural place for me to go, in some sense, to study this question. and the more I looked into it, the more strongly
I felt that this wasn't just the path to
progress in computer go but really you
know this this was the thing I'd been
looking for this was really an
opportunity to frame what intelligence means, what the goals of AI are, in a single clear problem definition, such that if we're able to solve that single problem definition, in some sense we've cracked the problem of AI.

so to you,
reinforcement learning ideas at least
sort of echoes of it, would be at the core of intelligence, it is the core of intelligence, and if we ever create a human-level intelligence system, it would be at the core of that kind of system?
let me say it this way: I think it's helpful to separate out the problem from the solution. so the problem of intelligence, I would say, can be formalized as the reinforcement learning problem, and that formalization is enough to capture most, if not all, of the things that we mean by intelligence. they can all be brought within this framework, and it gives us a way to access them in a meaningful way that allows us as scientists to understand intelligence, and us as computer scientists to build it. and so in that sense I feel that it gives us a path, maybe not the only path, but a path towards AI. and so
do I think that any system in the future that's, you know, solved AI would have to have RL within it? well, I think
if you ask that you're asking about the
solution methods I would say that if we
have such a thing it would be a solution
to the RL problem now what particular
methods have been used to get there
well we should keep an open mind about
the best approaches to actually solve
any problem. and you know, the things we have right now for reinforcement learning, I believe they've got a lot of legs, but maybe we're missing some things, maybe there are going to be better ideas. I think we should, you know, remain modest; we're at the early days of this field, and there are many amazing discoveries ahead of us.

for sure,
the specifics, especially of the different kinds of RL approaches currently, there could be other things that fall under the very large umbrella of RL. but if it's okay, can we take a step back and kind of ask the basic question: what is, to you, reinforcement learning?

so reinforcement
learning is the study and the science
and the problem of intelligence in the
form of an agent that interacts with an
environment. so the problem it's trying to solve is represented by some environment, like the world in which that agent is situated, and the goal is clear: the agent gets to take actions, those actions have some effects on the environment, and the environment gives back an observation to the agent saying, you know, this is what you see or sense. and one special thing which it gives back is called the reward signal: how well it's doing in the environment. and the reinforcement learning problem is to simply take actions over time so as to maximize that reward signal.

so a couple
of basic questions: what types of RL approaches are there? I don't know if there's a nice, brief, in-words way to paint the picture of sort of value-based, model-based, policy-based reinforcement learning.

yeah, so now if we think about,
okay so there's this ambitious problem
definition of RL it's really you know
it's truly ambitious it's trying to
capture and encircle all of the things
in which an agent interacts with an
environment and say well how can we
formalize and understand what it means
to to crack that now let's think about
the solution method well how do you
solve a really hard problem like that
well one approach you can take is is to
decompose that that very hard problem
into into pieces that work together to
solve that hard problem
and and so you can kind of look at the
decomposition that's inside the agents
head if you like and ask well what form
does that decomposition take? and some of the most common pieces that people use when they're kind of putting the solution method together are
whether or not that solution has a value
function that means is it trying to
predict explicitly trying to predict how
much reward it will get in the future
does it have a representation of a
policy that means something which is
deciding how to pick actions is is that
decision-making process explicitly
represented and is there a model in the
system is there something which is
explicitly trying to predict what will
happen in the environment and so those
three pieces are to me some of the most
common building blocks and I understand
the different choices in RL as choices
of whether or not to use those building
blocks when you're trying to decompose
the solution you know should I have a
value function represented so they have
a policy represented should I have a
model represented? and there are combinations of those pieces, and of course other things that you could add into the picture as well, but those three fundamental choices give rise to some of the branches of RL with which we're very familiar.
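The three building blocks just described, a value function, a policy, and a model, can be sketched as optional pieces inside one agent. This is an illustrative skeleton under invented names (the class, its methods, and the simple one-step Q-learning update are assumptions for this sketch, not any particular published algorithm):

```python
# Sketch of the three building blocks as optional pieces of a single agent:
# a tabular value function Q(s, a), an epsilon-greedy policy derived from
# it, and a simple learned model of the environment's transitions.
import random
from collections import defaultdict

class Agent:
    def __init__(self, actions, epsilon=0.1, alpha=0.5, gamma=0.9):
        self.actions = actions
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma
        self.q = defaultdict(float)          # value function: Q(s, a)
        self.model = {}                      # model: (s, a) -> (reward, next state)

    def policy(self, state):
        """Epsilon-greedy policy derived from the value function."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def learn(self, s, a, r, s2):
        """One-step Q-learning update, plus recording the transition."""
        best_next = max(self.q[(s2, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])
        self.model[(s, a)] = (r, s2)         # learned model of the environment
```

Different branches of RL keep or drop these pieces: a pure policy-gradient method would represent only the policy, a model-free method drops `self.model`, and a model-based planner leans on the model instead of acting directly from the value table.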
so those, as you mentioned, there is the choice of what's specified or modeled explicitly, and the idea is that all of these are somehow implicitly learned within the system. so it's almost a choice of how you approach the problem. do you see those as fundamental differences, or are these almost like small specifics, like the details of how you solve the problem, but they're not fundamentally different from each other?

I think
the fundamental idea is is maybe at the
higher level the fundamental idea is the
first step of the decomposition is
really to say well how are we really
going to solve any kind of problem where
you're trying to figure out how to take
actions and just from a stream of
observations you know you've got some
agent situated in its sensorimotor stream, getting all these
observations here and getting to take
these actions and and what should it do
how can it even broach that problem? you know, the complexity of the world is so great that you can't even imagine how to build a system that would understand how to deal with that. and so the first step of this decomposition is to say, well, the system has to learn for itself. and note that the reinforcement learning problem doesn't actually stipulate that you have to learn; you could maximize your rewards without learning, it just wouldn't do a very good job of it.
yes so learning is required because it's
the only way to achieve good performance
in any sufficiently large and complex
environment so so that's the first step
so that step give commonality to all of
the other pieces because now you might
ask well what should you be learning
what is learning even mean you know in
this sense you know learning might mean
well you're trying to update the
parameters of some system which is then
the thing that actually picks the
actions and and those parameters could
be representing anything they could be
parameterizing a value function or a
model or a policy and so in that sense
there's a lot of commonality in that
whatever is being represented there is
the thing which is being learned and
it's being learned with the ultimate
goal of maximizing rewards but but the
way in which you decompose the problem
is is is really what gives the semantics
to the whole system like are you trying
to learn something to predict well like
a value function or a model are you
learning something to perform well like
a policy and and the form of that
objective like it's kind of giving the
semantics to the system and so it really
is at the next level down a fundamental
choice and we have to make those
fundamental choices a system designers
or enable are our algorithms to be able
to learn how to make those choices for
themselves.

So then, the next step: the very first thing you have to deal with is, can you even take in this huge stream of observations and do anything with it? So the natural next basic question is: what is deep reinforcement learning, and what is this idea of using neural networks to deal with this huge incoming stream?

So, amongst all the approaches to reinforcement learning, deep reinforcement learning is one family of solution
methods that tries to utilize the powerful representations offered by neural networks to represent any of these different components of the agent's solution, whether it's the value function, or the model, or the policy. The idea of deep learning is to say, well, here's a toolkit that's so powerful that it's universal, in the sense that it can represent any function and it can learn any function.
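As a tiny, concrete illustration of that universality (my own toy example, with hand-picked weights, not anything learned at DeepMind): a two-layer network can represent XOR, a function no single linear layer can express. In practice such weights are found by gradient descent rather than set by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR via two hidden units: one approximates OR, the other AND,
# and the output unit computes roughly "OR and not AND".
def xor_net(x0, x1):
    h_or  = sigmoid(20 * x0 + 20 * x1 - 10)
    h_and = sigmoid(20 * x0 + 20 * x1 - 30)
    return sigmoid(20 * h_or - 20 * h_and - 10)
```

Stacking more such units lets the same construction approximate far richer functions, which is the representational point being made here.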
And so if we can leverage that universality, that means that whatever we need to represent for our policy, or for our value function, or for a model, deep learning can do it. Deep learning is one approach that offers us a toolkit with no ceiling to its performance: as we put more resources into the system, more memory and more computation and more data, more experience, more interactions with the environment, these are systems that can just get better and better at whatever job we've asked them to do. Whatever we've asked that function to represent, it can learn a function that does a better and better job of representing that knowledge, whether that knowledge is estimating how well you're going to do in the world (the value function), choosing what to do in the world (a policy), or understanding the world itself and what's going to happen next (the model).

Nevertheless, the
fact that neural networks are able to learn incredibly complex representations that let you represent the policy, the model, or the value function is, at least to my mind, exceptionally beautiful and surprising. Was it surprising to you? Can you still believe it works as well as it does? Do you have good intuition about why it works at all, and why it works as well as it does?

Let me take two parts to
that question. I think it's not surprising to me that the idea of reinforcement learning works, because in some sense I feel it's the only approach which can ultimately work, and so I feel we have to address it. And success must be possible, because we have examples of intelligence: it must at some level be possible to acquire experience and use that experience to do better, in a way which is meaningful to environments of the complexity that humans can deal with.

Am I surprised that our current systems can do as well as they can do? I think one of the big surprises for me, and for a lot of the community, is really the fact that deep learning continues to perform so well despite the fact that the surfaces these neural networks represent are incredibly nonlinear and bumpy, which to our low-dimensional intuitions makes it feel like surely you're just going to get stuck: learning will get stuck, because you won't be able to make any further progress. And yet the big surprise is that learning continues, and what appear to be local optima turn out not to be, because in high dimensions, when we make really big neural nets, there's always a way out, a way to go even lower; and then you're still not at another local optimum, because there's some other pathway that will take you out and take you lower still. So no matter where you are, learning can proceed and do better and better without bound. And that is a surprising and beautiful property of neural nets, which I find elegant, beautiful, and somewhat shocking that
it turns out to be the case.

As you said, it runs counter to our low-dimensional intuitions. That's surprising.

Yeah, we're very tuned to working within a three-dimensional environment, and so to start to visualize what a billion-dimensional neural network surface that you're trying to optimize over even looks like is very hard for us. And so I think that if you try to account for, essentially, the AI winter, where people gave up on neural networks, I think it's really down to that lack of ability to generalize from low dimensions to high dimensions. Back then we were in the low-dimensional case: people could only build neural nets with, you know, 50 nodes in them or something, and to imagine that it might be possible to build a billion-dimensional neural net, and that it might have a qualitatively different property, was very hard to anticipate. I think even now we're only starting to build the theory to support that, and it's incomplete at the moment, but all of the theory seems to be pointing in the direction that this is indeed an approach which truly is universal, both in its representational capacity, which was known, and in its learning ability, which is surprising. And it makes one wonder what else we're missing
because of our low-dimensional intuitions. Yes, and it will seem obvious once it's discovered.

I often wonder, you know, when we one day do have AIs which are superhuman in their abilities to understand the world, what will they think of the algorithms that we developed back now? Will they be looking back at these days... will we look back and feel that these algorithms were naive first steps, or will they still be the fundamental ideas which are used even in a hundred, a thousand, ten thousand years?

Yeah, and I think they'll watch back to this conversation and smile, maybe with a little bit of a laugh. I mean, my sense is, just like we used to think that
the Sun revolved around the Earth, they'll see our systems of today in reinforcement learning as too complicated, and that the answer was simple all along. There's something, like you just said about the game of Go... I mean, I love those systems like cellular automata, where there are simple rules from which incredible complexity emerges. So it feels like there might be some very simple approaches, just like Sutton says, right? These simple methods, with compute, over time seem to prove to be the most effective.

I one hundred percent agree. I think that if we try to anticipate what will generalize well into the future, it's likely to be the case that the simple, clear ideas will have the longest legs and will carry us farthest into the future. Nevertheless, we're in a situation where we need to make things work today, and sometimes that requires putting together more complex systems, where we don't yet have the full answers as to what those minimal ingredients might be.

So,
speaking of which, if we could take a step back to Go: what was MoGo, and what was the key idea behind that system?

So,
back during my PhD on computer Go, around about that time, there was a major new development, which actually happened in the context of computer Go, and it was really a revolution in the way heuristic search was done. The idea was essentially that a position, or a state in general, could be evaluated not by humans saying whether that position is good or not, nor even by humans providing rules for how to evaluate it, but instead by allowing the system to randomly play out the game until the end, multiple times, and taking the average of those outcomes as the prediction of what will happen. So for example, in the game of Go, the intuition is that you take a position and you get the system to play random moves against itself all the way to the end of the game, and you see who wins. If black ends up winning more of those random games than white, you say, hey, this is a position that favors black; and if white ends up winning more of those random games than black, then it favors white. That idea was known as Monte Carlo search, and a particular form of it that became very effective, developed in computer Go first by Rémi Coulom in 2006 and then taken further by others, was something called Monte Carlo tree search, which takes that same idea and uses that insight to evaluate every node of a search tree by the average of the random playouts from that node onwards. This idea was very powerful and suddenly led to huge leaps forward in the strength of computer Go
playing programs. Among those, the strongest of the Go-playing programs in those days was a program called MoGo, which was the first program to actually reach human master level on small boards, nine-by-nine boards. This was a program by someone called Sylvain Gelly; he was a good colleague of mine, and I worked with him a little bit in those days of my PhD. MoGo was a first step towards the latest successes we saw in computer Go, but it was still missing a key ingredient: MoGo was evaluating purely by random rollouts against itself. And in a way it's truly remarkable that random play gives you anything at all. Why, in this perfectly deterministic game that's very precise and involves these very exact sequences, is randomization helpful? The intuition is that randomization captures something about the nature of the search tree: from a position, you're understanding the nature of the search tree from that node onwards by using randomization. And this was a very powerful idea.
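The random-playout idea can be sketched in a few lines. Go is far too big for a toy example, so this entirely hypothetical sketch uses a take-1-or-2 Nim game, where perfect play is known: with 3 stones the side to move loses, with 4 it wins. Averaging random playouts still ranks the two positions correctly, which is exactly the Monte Carlo evaluation being described (Monte Carlo tree search then applies this estimate at every node of the search tree).

```python
import random

def random_playout(stones):
    """Play uniformly random moves (take 1 or 2 stones) to the end.
    Returns True if the side to move in the root position takes the
    last stone, i.e. wins this playout."""
    turn = 0
    while stones > 0:
        stones -= random.choice((1, 2)) if stones >= 2 else 1
        if stones == 0:
            return turn % 2 == 0   # even turn index = root side to move
        turn += 1

def evaluate(stones, n_playouts=5000):
    """Monte Carlo value: fraction of random playouts the mover wins."""
    wins = sum(random_playout(stones) for _ in range(n_playouts))
    return wins / n_playouts

random.seed(0)
v_lost, v_won = evaluate(3), evaluate(4)   # perfect play: loss vs. win
```

Even though each individual playout is nonsense, the averages (about 0.25 for 3 stones versus about 0.625 for 4) correctly prefer the winning position, with no hand-written evaluation rules at all.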
And I've seen this in other spaces too: randomized algorithms are somehow magically able to do exceptionally well while simplifying the problem, and it makes you wonder about the fundamental nature of randomness in our universe. It seems to be a useful thing. But so, from that moment, can you maybe tell the origin story and the journey of AlphaGo?
Yeah. So programs based on Monte Carlo tree search were a first revolution, in the sense that they suddenly led to programs that could play the game to a reasonable level, but they plateaued. It seemed that no matter how much effort people put into those techniques, they couldn't exceed the level of amateur dan-level Go players: strong players, but not anywhere near the level of professionals, never mind the world champion. And so that brings us to the birth of AlphaGo, which happened in the context of a startup company known as DeepMind, where a project was born, and the project was really a scientific investigation where myself, Aja Huang, and an intern, Chris Maddison, were exploring a scientific question. That scientific question was really: is there another, fundamentally different approach to this key challenge of Go, of how you can build that intuition, how you can have a system that could look at a position and understand what move to play, or how well you're doing in that position and who's going to win? The deep learning revolution had just begun: benchmarks like ImageNet had suddenly been won by deep learning techniques back in 2012, and following that it was natural to ask, well, if deep learning is able to scale up so effectively with images, to understand them well enough to classify them, why not Go? Why not take the black and white stones of the Go board and build a system which can understand for itself what that position means, in terms of what move to pick or who's going to win the game,
black or white? And so that was our scientific question, which we were probing and trying to understand, and as we started to look at it, we discovered that we could build a system. In fact, our very first paper on AlphaGo was actually a pure deep learning system, which was trying to answer this question, and we showed that a pure deep learning system, with no search at all, was able to reach human dan level, master level, at the full game of Go on 19-by-19 boards. So without any search at all, suddenly we had systems which were playing at the level of the best Monte Carlo tree search systems, the ones with randomized rollouts.
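That kind of supervised learning, predicting the expert's move from the position, can be sketched in miniature (a hypothetical toy, nothing like AlphaGo's actual deep convolutional network trained on millions of human positions): a softmax classifier trained by gradient descent to imitate a scripted "expert".

```python
import math, random

random.seed(0)
N_FEATURES, N_MOVES = 4, 3

# Synthetic "expert games": the hidden expert rule picks the move whose
# feature is largest; the model must recover this from examples alone.
def make_example():
    x = [random.random() for _ in range(N_FEATURES)]
    return x, max(range(N_MOVES), key=lambda m: x[m])

data = [make_example() for _ in range(500)]

W = [[0.0] * N_FEATURES for _ in range(N_MOVES)]   # one weight row per move

def move_probs(x):
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Cross-entropy training: push probability toward the expert's move.
for _ in range(200):
    for x, move in data:
        p = move_probs(x)
        for m in range(N_MOVES):
            grad = p[m] - (1.0 if m == move else 0.0)
            for j in range(N_FEATURES):
                W[m][j] -= 0.1 * grad * x[j]

accuracy = sum(max(range(N_MOVES), key=lambda m: move_probs(x)[m]) == move
               for x, move in data) / len(data)
```

Scaled up by many orders of magnitude, with a deep network in place of this linear model and real Go positions in place of the toy features, this move-prediction setup is the supervised-learning idea the early AlphaGo work built on.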
I'm sorry to interrupt, but that's kind of a groundbreaking notion: basically a definitive step away from a couple of decades of search essentially dominating AI. How did that make you feel? Was it surprising from a scientific perspective? In general, how did it make you feel?

I found this to be profoundly
surprising. In fact, it was so surprising that we had a bet back then, and like many good projects, you know, bets are quite motivating. The bet was whether it was possible for a system based purely on deep learning, with no search at all, to beat a dan-level human player. And so we had someone who joined our team who was a dan-level player; he came in, and we had this first match against him.

Which side of the bet were you on, by the way? How do you handle losing?

I tend to be an optimist about the power of deep learning and reinforcement learning. So the system won: we were able to beat this human dan-level player, and for me that was the moment where it was like, okay, something special is afoot here. We have a system which, without search, is able to just look at a position and understand things as well as a strong human player. And from that point onwards, I really felt that reaching the top levels of human play, you know, professional level, world champion level, was actually an inevitability; and if it was an inevitable outcome, I was rather keen that it would be us that
achieved it. So we scaled up. I had lots of conversations back then with Demis Hassabis, the head of DeepMind, who was extremely excited, and we made the decision to scale up the project and brought more people on board. And so AlphaGo became something where we had a clear goal, which was to try and crack this outstanding challenge of AI: to see if we could beat the world's best players. This led, within the space of not so many months, to playing against the European champion Fan Hui, in a match which became memorable in history as the first time a Go program had ever beaten a professional player. At that time we had to make a judgment as to when, and whether, we should go and challenge the world champion, and this was a difficult decision to make. Again, we were basing our predictions on our own progress, and had to estimate, based on the rapidity of that progress, when we thought we would exceed the level of the human world champion. We tried to make an estimate and set up a match, and that became the AlphaGo versus Lee Sedol match in 2016.

And we should say, spoiler alert, that AlphaGo was able to defeat Lee Sedol.

That's right, yeah.

So maybe we
could take an even broader view. AlphaGo involved both learning from expert games and, as far as I remember, a self-play component too, where it learns by playing against itself. In your sense, what was the role of learning from experts there? And in terms of your self-evaluation of whether you could take on the world champion, what was the thing you were trying to do more of: train more on expert games, or was there... I'm asking so many poorly phrased questions, but did you have a hope, a dream, that self-play would be the key component at that moment?

Yes. So,
in the early days of AlphaGo we used human data to explore the science of what deep learning could achieve. And so when we had our first paper, which showed that it was possible to predict the winner of the game and to suggest moves, that was done using human data.

Solely human data?

Yes. And the reason we did it that way was that, at that time, we were exploring the deep learning aspect separately from the reinforcement learning aspect. That was the part which was new and unknown to me at that time: how far could it be stretched? Once we had that, it then became natural to try to use that same representation and see if we could learn for ourselves, using that same representation. And so right from the beginning, actually, our goal had been to build a system using self-play, and to use the human data right from th