Juergen Schmidhuber: Godel Machines, Meta-Learning, and LSTMs | Lex Fridman Podcast #11
3FIo6evmweo • 2018-12-23
Kind: captions
Language: en
The following is a conversation with Jürgen Schmidhuber. He's the co-director of the IDSIA lab and a co-creator of long short-term memory networks. LSTMs are used in billions of devices today for speech recognition, translation, and much more. Over 30 years, he has proposed a lot of interesting, out-of-the-box ideas in meta-learning, adversarial networks, computer vision, and even a formal theory of, quote, creativity, curiosity, and fun. This conversation is part of the MIT course on artificial general intelligence and the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, iTunes, or simply connect with me on Twitter at Lex Fridman, spelled F-R-I-D. And now, here's my conversation with Jürgen Schmidhuber.
Early on, you dreamed of AI systems that self-improve recursively. When was that dream born? When I was a baby. No, that's not true. I mean, as a teenager. And what was the catalyst for that birth? What was the thing that first inspired you? When I was a boy, I was thinking about what to do in my life, and then I thought the most exciting thing is to solve the riddles of the universe, and that means you have to become a physicist. However, then I realized that there's something even grander: you can try to build a machine that isn't really a machine any longer, that learns to become a much better physicist than I could ever hope to be. And that's how I thought maybe I can multiply my tiny little bit of creativity into infinity. But ultimately, that creativity will be multiplied to understand the universe around us. That's the curiosity for that mystery that drove you? Yes. So if you can build a machine that learns to solve more and more complex problems, and more and more general problems, then you basically have solved all the problems, at least all the solvable problems. So,
how do you think, what does the mechanism for that kind of general solver look like? Obviously, we don't quite yet have one, or know how to build one, but we have ideas, and you have had, throughout your career, several ideas about it. So how do you think about that mechanism? So in the '80s, I thought about how to build this machine that learns to solve all these problems that I cannot solve myself. And I thought it is clear that it has to be a machine that not only learns to solve this problem here and this problem here, but it also has to learn to improve the learning algorithm itself. So it has to have the learning algorithm in a representation that allows it to inspect it and modify it, such that it can come up with a better learning algorithm. So I called that meta-learning, learning to learn, and recursive self-improvement, that is really the pinnacle of that, where you then not only learn how to improve on that problem and on that one, but you also improve the way the machine improves, and you also improve the way it improves the way it improves itself. And that was my 1987 diploma thesis, which was all about that hierarchy of meta-learners that have no computational limits except for the well-known limits that Gödel identified in 1931, and for the limits of physics.
In recent years, meta-learning has gained popularity in a specific kind of form. You've talked about how that's not really meta-learning with neural networks, that's more basic transfer learning. Can you talk about the difference between the big, general meta-learning and the more narrow sense of meta-learning, the way it's used today, the way it's talked about today? Let's take the example of a deep neural network that has learned to classify images. Maybe you have trained that network on 100 different databases of images, and now a new database comes along, and you want to quickly learn the new thing as well. So one simple way of doing that is: you take the network, which already knows 100 types of databases, and then you just take the top layer of that, and you retrain it using the new labeled data that you have in the new image database. And then it turns out that it really quickly can learn that, too, one-shot, basically, because from the first 100 data sets it has already learned so much about computer vision that it can reuse that, and that is then almost good enough to solve the new task, except you need a little bit of adjustment on the top. So that is transfer learning, and it has been done, in principle, for many decades; people have done similar things for decades. True meta-learning is about having the learning algorithm itself open to introspection by the system that is using it, and also open to modification, such that the learning system has an opportunity to modify any part of the learning algorithm, and then evaluate the consequences of that modification, and then learn from that to create a better learning algorithm, and so on, recursively.
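The transfer-learning recipe described above, freezing a pretrained base and retraining only the top layer on the new labeled data, can be sketched in a few lines. Everything here (the shapes, the random stand-in for a "pretrained" base, the toy labels) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a base network pretrained on 100 image databases.
# Its weights are FROZEN: they are never updated below.
W_base = rng.normal(size=(64, 16))

def base_features(x):
    return np.tanh(x @ W_base)        # frozen feature extractor

# The "new image database": a small labeled set. The toy labels are chosen
# to be linearly decodable from the frozen features.
X_new = rng.normal(size=(200, 64))
y_new = (base_features(X_new).sum(axis=1) > 0).astype(float)

# Transfer learning: train ONLY a fresh top layer (logistic regression).
F = base_features(X_new)
w_top, b_top, lr = np.zeros(16), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w_top + b_top)))    # sigmoid output
    w_top -= lr * F.T @ (p - y_new) / len(y_new)      # only these weights move
    b_top -= lr * float(np.mean(p - y_new))

acc = float(np.mean(((F @ w_top + b_top) > 0) == (y_new > 0.5)))
print(f"accuracy after retraining just the top layer: {acc:.2f}")
```

With the base frozen, only 17 parameters are trained here, which is the sense in which the new task can be picked up almost one-shot when the reused features are already good.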
So that's a very different animal, where you are opening the space of possible learning algorithms to the learning system itself. Right. So, like in this 2004 paper, you described Gödel machines, programs that rewrite themselves. Yeah. Philosophically, and even, in your paper, mathematically, these are really compelling ideas, but practically, do you see these self-referential programs being successful in the near term, having an impact, where they sort of demonstrate to the world that this direction is a good one to pursue in the near term? Yes. We had these
two different types of fundamental research on how to build a universal problem solver: one basically exploiting proof search and things like that, which you need to come up with asymptotically optimal, theoretically optimal, self-improvers and problem solvers. However, one has to admit that through this proof search comes an additive constant, an overhead, an additive overhead that vanishes in comparison to what you have to do to solve large problems. However, for many of the small problems that we want to solve in our everyday life, we cannot ignore this constant overhead, and that's why we have also been doing other things, non-universal things, such as recurrent neural networks, which are trained by gradient descent and local search techniques, which aren't universal at all, which aren't provably optimal at all, like the other stuff that we did, but which are much more practical, as long as we only want to solve the small problems that we are typically trying to solve in this environment here. Yes. So the universal
problem solvers, like the Gödel machine, but also Marcus Hutter's fastest way of solving all possible problems, which he developed around 2002 in my lab, they are associated with these constant overheads for proof search, which guarantees that the thing that you're doing is optimal. For example, there is this fastest way of solving all problems with a computable solution, which is due to Marcus Hutter. To explain what's going on there, let's take traveling salesman problems. With traveling salesman problems, you have a number of cities, n cities, and you try to find the shortest path through all these cities without visiting any city twice. Nobody knows the fastest way of solving traveling salesman problems, TSPs, but let's assume there is a method of solving them within n^5 operations, where n is the number of cities. Then the universal method of Marcus is going to solve the same traveling salesman problem also within n^5 steps, plus O(1), plus a constant number of steps that you need for the proof searcher, which you need to show that this particular class of problems, the traveling salesman problems, can be solved within a certain time bound, within order n^5 steps, basically. And this additive constant doesn't depend on n, which means that, as n is getting larger and larger, as you have more and more cities, the constant overhead pales in comparison. And that means that almost all large problems are solved in the best possible way already today: we already have a universal problem solver like that. However, it's not practical, because the overhead, the constant overhead, is so large for the small kinds of problems that we want to solve in this little biosphere. By the way, when you say small, you're talking about things that fall within the constraints of our computational systems; they can seem quite large to us mere humans, right? That's right, yeah. So they seem large, and even unsolvable in a practical sense today, but they are still small compared to almost all problems, because almost all problems are large problems, which are much larger than any constant. Do you find it useful, as a person who has dreamed of creating a general learning system, has worked on creating one, and has had a lot of interesting ideas there, to think about P versus NP, this formalization of how hard problems are, how they scale, this kind of worst-case-analysis type of thinking? Do you find that useful, or is it just a mathematical, a set of mathematical techniques, to give you intuition about what's good and bad?
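As an aside, the additive-constant argument above is easy to make concrete with a toy cost model. The constant C is invented; the point is only how the ratio behaves as n grows:

```python
# Hypothetical problem class solvable in n^5 steps by a specialized method,
# versus a universal solver needing n^5 + C steps, where C is the (huge)
# fixed cost of the proof search.
C = 10 ** 12                      # made-up stand-in for the proof-search overhead

def specialized_cost(n):
    return n ** 5

def universal_cost(n):
    return n ** 5 + C             # same order, plus the constant overhead

for n in (10, 100, 1000, 10_000):
    ratio = universal_cost(n) / specialized_cost(n)
    print(f"n = {n:>6}: universal / specialized = {ratio:,.4f}")
```

For n = 10 the universal solver looks hopeless; by n = 10,000 the overhead is invisible, which is exactly the sense in which it is optimal only for large problems.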
Mm-hmm. So, P versus NP, that's super interesting from a theoretical point of view, and in fact, as you are thinking about that problem, you can also get inspiration for better practical problem solvers. On the other hand, we have to admit that, at the moment, the best practical problem solvers, for all kinds of problems that we are now solving through what is called AI at the moment, they are not of the kind that is inspired by these questions. There we are using general-purpose computers, such as recurrent neural networks, but we have a search technique which is just local search, gradient descent, to try to find a program that is running on these recurrent networks, such that it can solve some interesting problems, such as speech recognition or machine translation and something like that. And there is very little theory behind the best solutions that we have at the moment that can do that. Do you think that needs to change? Do you think that will change? Or can we create general intelligence systems without ever really proving that that system is intelligent in some kind of mathematical way, solving machine translation perfectly, or something like that, within some kind of syntactic definition of a language? Or can we just be super impressed by the thing working extremely well, and that's sufficient? There's an
old saying, and I don't know who brought it up first, which says there's nothing more practical than a good theory. And a good theory of problem-solving under limited resources, like here in this universe, or on this little planet, has to take into account these limited resources. And so, probably, what is lacking at the moment is a theory, related to what we already have, these asymptotically optimal problem solvers, which tells us what we need in addition to that to come up with a practically optimal problem solver. So I believe we will have something like that, and maybe just a few little tiny twists are necessary to change what we already have, to come up with that as well. As long as we don't have that, we must admit that we are taking suboptimal ways, and we can, for example, use long short-term memory equipped with local search techniques, and we are happy that it works better than any competing method, but that doesn't mean that we think we are done. You've said that an AGI system will ultimately be a simple one: a general intelligence system will ultimately be a simple one, maybe a pseudocode of a few lines being able to describe it.
Can you talk through your intuition behind this idea, why you feel that, at its core, intelligence is a simple algorithm? Experience tells us that the stuff that works best is really simple. So the asymptotically optimal ways of solving problems, if you look at them, they are just a few lines of code, it's really true. Although they have these amazing properties, they are just a few lines of code. Then the most promising and most useful practical things maybe don't have this proof of optimality associated with them; however, they are also just a few lines of code. The most successful recurrent neural networks, you can write them down in five lines of pseudocode. That's a beautiful, almost poetic, idea. But what you're describing there, the lines of pseudocode, are sitting on top of layers and layers of abstractions, in a sense. So you're saying at the very top it'll be a beautifully written sort of algorithm, but do you think there are many layers of abstractions we have to first learn to construct? Yeah, of course. We are building on all these great abstractions that people have invented over the millennia, such as matrix multiplications, and real numbers, and basic arithmetic, and calculus, and derivatives of error functions, and stuff like that.
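The "five lines of pseudocode" claim is easy to make literal: the core update of a plain recurrent network is a couple of lines (LSTM adds gating on top of this recurrence; the sizes below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# A vanilla recurrent network, in essentially five lines.
Wx = rng.normal(scale=0.5, size=(3, 8))   # input -> hidden
Wh = rng.normal(scale=0.5, size=(8, 8))   # hidden -> hidden (the recurrence)
Wy = rng.normal(scale=0.5, size=(8, 2))   # hidden -> output

def rnn(xs):
    h = np.zeros(8)
    for x in xs:                          # one step per input symbol
        h = np.tanh(x @ Wx + h @ Wh)      # the entire network update
    return h @ Wy                         # read out from the final state

out = rnn(rng.normal(size=(5, 3)))        # a sequence of 5 inputs of size 3
print("output:", out)
```

Training it, of course, sits on top of those borrowed abstractions: matrix multiplication, derivatives of error functions, gradient descent.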
So without that language, which greatly simplifies our thinking about these problems, we couldn't do anything. So in that sense, as always, we are standing on the shoulders of the giants, who, in the past, simplified the problem of problem-solving so much that now we have a chance to do the final step. The final step will be a simple one. If we take a step back through all of human civilization, and just the universe in general, how do you think about evolution? And what if creating a universe is required to achieve this final step? What if going through the very painful and inefficient process of evolution is needed to come up with this set of abstractions that ultimately leads to intelligence? Do you think there's a shortcut, or do you think we have to create something like our universe in order to create something like human-level intelligence? So far, the only example we have is this one, this universe, and we live in it and are part of this whole process, right? So, apparently, it might be that the key is that the code that runs the universe is really, really simple. Everything points to that possibility, because gravity and other basic forces are really simple laws that can be easily described, also in just a few lines of code, basically. And then there are these other events, the apparently random events in the history of the universe, which, as far as we know at the moment, don't have a compact code. But who knows? Maybe somebody in the near future is going to figure out the pseudo-random generator which is computing whether the measurement of that spin-up-or-down thing here is going to be positive or negative, underlying quantum mechanics.
Yes. So you ultimately think quantum mechanics is a pseudo-random number generator? Deterministic, there's no randomness in our universe? Does God play dice? So, a couple of years ago, a famous quantum physicist, Anton Zeilinger, wrote an essay in Nature, and it started more or less like that: one of the fundamental insights of the 20th century was that the universe is fundamentally random on the quantum level, and that whenever you measure spin up or down, or something like that, a new bit of information enters the history of the universe. And while I was reading that, I was already typing the response, and they had to publish it, because I was right that there's no evidence, no physical evidence, for that. So there's an alternative explanation, where everything that we consider random is actually pseudo-random, such as the decimal expansion of pi.
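Pi makes the point concrete: its digit stream passes simple randomness checks, yet the whole stream comes from a program of a few lines. A sketch using Gibbons' streaming spigot algorithm for pi:

```python
from collections import Counter
from itertools import islice

def pi_digits():
    # Gibbons' unbounded spigot algorithm: streams the decimal digits of pi.
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

digits = list(islice(pi_digits(), 1000))
print("first digits:", digits[:8])     # 3, 1, 4, 1, 5, 9, 2, 6
counts = Counter(digits)
print("digit frequencies over 1000 digits:", dict(sorted(counts.items())))
```

The frequencies come out roughly uniform, as a "random" stream would, even though the true description length of the stream is just the short program above.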
Pi is interesting, because every three-digit sequence, every sequence of three digits, appears roughly one in a thousand times, and every five-digit sequence appears roughly one in a hundred thousand times, which is what you would expect if it were truly random. But there's a very short algorithm, a short program, that computes all of that, so it's extremely compressible. And who knows, maybe tomorrow somebody, some grad student at CERN, goes back over all these data points, beta decay and whatever, and figures out: oh, it's the second billion digits of pi, or something like that. We don't have any fundamental reason at the moment to believe that this is truly random and not just a deterministic video game. If it were a deterministic video game, it would be much more beautiful, because beauty is simplicity, and many of the basic laws of the universe, like gravity and the other basic forces, are very simple, so very short programs can explain what they are doing. And it would be awful and ugly, the universe would be ugly, the history of the universe would be ugly, if, for the extra things, the seemingly random data points that we get all the time, we really needed a huge number of extra bits to store all this extra information. So as long as we don't have evidence that there is no short program that computes the entire history of the entire universe, we are, as scientists, compelled to keep looking for that short program. Your intuition says there exists a short program that can backtrack to the creation of the universe? To the creation, yes, including all the entanglement things and all the spin-up-and-down measurements that have taken place since 13.8 billion years ago. And so, yeah, we don't have a proof that it is random, and we don't have a proof that it is compressible to a short program, but as long as we don't have that proof, we are obliged as scientists to keep looking for that simple explanation. Absolutely. So you said simplicity is beautiful, or beauty is simple; either one works. But you also work on curiosity, discovery, you know, the romantic notion of randomness, of serendipity, of being surprised by things around you, the kind of poetic notion of reality that we, as humans, think requires randomness. So you don't find randomness beautiful? You find simple determinism beautiful? Yeah. Okay, so why? Because the explanation becomes shorter. A universe that is compressible to a short program is much more elegant and much more beautiful than another one which needs an almost infinite number of bits to be described. As far as we know, many things that are happening in this universe are really simple in terms of short programs that compute gravity, and the interaction between elementary particles, and so on. So all of that seems to be very, very simple. Every electron seems to reuse the same subprogram all the time, as it is interacting with other elementary particles. If we now require an extra oracle injecting new bits of information all the time for these extra things which are currently not understood, such as beta decay, then the whole description length of the data that we can observe of the history of the universe would become much longer, and therefore uglier, and uglier.
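The description-length intuition can be toyed with directly. A generic compressor is a crude stand-in for "number of bits needed": a simple law compresses enormously, while a seeded pseudo-random stream does not, even though its true description (the generating program plus the seed) is also tiny; the compressor just cannot find that short program:

```python
import random
import zlib

# Data generated by a simple law: highly regular.
structured = bytes(i % 10 for i in range(10_000))

# Data from a seeded pseudo-random generator: looks like noise to zlib,
# yet its true description is these two lines plus the seed.
random.seed(42)
noise_like = bytes(random.randrange(256) for _ in range(10_000))

print("structured stream ->", len(zlib.compress(structured)), "bytes")
print("noise-like stream ->", len(zlib.compress(noise_like)), "bytes")
```

That gap between what a generic compressor finds and the true shortest program is exactly why "it looks random" is not evidence that it is random.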
Again, simplicity is elegant and beautiful. All the history of science is a history of compression progress? Yes. So you've described, sort of, how we build up abstractions, and you've talked about the idea of compression. How do you see the history of science, the history of humanity, our civilization, and life on Earth, as some kind of path towards greater and greater compression? What do you mean by that? How do you think of it? Indeed, the history of science is a history of compression progress. What does that mean? Hundreds of years ago, there was an astronomer whose name was Kepler, and he looked at the data points that he got by watching planets move. And then he had all these data points, and suddenly it turned out that he could greatly compress the data by predicting it through an ellipse law. So it turns out that all these data points are more or less on ellipses around the Sun. And another guy came along, whose name was Newton, and before him, Hooke, and they said: the same thing that is making these planets move like that is what makes the apples fall down, and it also holds for stones and for all kinds of other objects. And suddenly, many of these observations became much more compressible, because as long as you can predict the next thing, given what you have seen so far, you can compress it; you don't have to store that data extra. This is called predictive coding. And then there was still something wrong with that theory of the universe: you had deviations from the predictions of the theory, and 300 years later, another guy came along, whose name was Einstein, and he was able to explain away all these deviations from the predictions of the old theory through a new theory, which was called the general theory of relativity, which at first glance looks a little bit more complicated, and you have to warp space and time, but you can phrase it within one single sentence, which is: no matter how fast you accelerate, and how hard you decelerate, and no matter what is the gravity in your local framework, light speed always looks the same. And from that, you can calculate all the consequences. So it's a very simple thing, and it allows you to further compress all the observations, because suddenly there are hardly any deviations any longer that you can measure from the predictions of this new theory. So all of science is a history of compression
progress. You never arrive immediately at the shortest explanation of the data, but you're making progress. Whenever you are making progress, you have an insight. You see: at first I needed so many bits of information to describe the data, to describe my falling apples, my video of falling apples; so much data, so many pixels, have to be stored. But then suddenly I realize: no, there is a very simple way of predicting the third frame in the video from the first two. And maybe not every little detail can be predicted, but more or less, most of these orange blobs that are coming down, they accelerate in the same way, which means that I can greatly compress the video. And the amount of compression progress, that is the depth of the insight that you have at that moment. That's the fun that you have, the scientific fun, the fun in that discovery. And we can build artificial systems that do the same thing: they measure the depth of their insights as they are looking at the data, which is coming in through their own experiments, and we give them a reward, an intrinsic reward, in proportion to this depth of insight. And since they are trying to maximize the rewards they get, they are suddenly motivated to come up with new action sequences, with new experiments, that have the property that the data coming in as a consequence of these experiments allows them to learn something, to see a pattern in there which they hadn't seen yet before. So there's this idea of PowerPlay. You've described training a general
problem solver in this kind of way, of looking for the unsolved problems. Yeah. Can you describe that idea a little further? It's another very simple idea. So, normally, what you do in computer science: you have some guy who gives you a problem, and then there is a huge search space of potential solution candidates, and you somehow try them out, and you have more or less sophisticated ways of moving around in that search space until you finally find a solution which you consider satisfactory. That's what most of computer science is about. PowerPlay just goes one little step further and says: let's not only search for solutions to a given problem, but let's search through pairs of problems and their solutions, where the system itself has the opportunity to phrase its own problem. So we are suddenly looking at pairs of problems and their solutions, or modifications of the problem solver that are supposed to generate a solution to that new problem. And this additional degree of freedom allows us to build curious systems that are like scientists, in the sense that they not only try to find answers to existing questions; no, they are also free to pose their own questions. So if you want to build an artificial scientist, we have to give it that freedom, and PowerPlay is exactly doing that. So that's a dimension of freedom that's important to have. But how hard do you think, how multi-dimensional and difficult, is the space of coming up with new questions? As it's one of the things that, as human beings, we consider to be what makes us special; the intelligence that makes us special is that brilliant insight that can create something totally new. Yes. So now let's look at the extreme case. Let's look at the set of all possible problems that you can formally describe, which is infinite. Which should be the next problem that a scientist, or PowerPlay, is going to solve? Well, it should be the easiest problem that goes beyond what you already know. So it should be the simplest problem that the current problem solver, which can already solve 100 problems, cannot solve yet just by generalizing. So it has to be new: it has to require a modification of the problem solver, such that the new problem solver can solve this new thing, but the old problem solver cannot do it. And in addition to that, we have to make sure that the problem solver doesn't forget any of the previous solutions.
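The selection criterion just described can be caricatured in a few lines. This is a deliberately trivial stand-in (a lookup table instead of a real solver, integers instead of formally described problems), not the actual PowerPlay algorithm, which searches over problem and solver-modification pairs with a time-minimization criterion:

```python
import itertools

solver = {}                      # the learned "solver": problem -> answer

def ground_truth(n):             # the environment's hidden answer
    return n % 3 == 0

solved = []                      # repertoire of problems already mastered
for _ in range(5):
    # 1. Invent the EASIEST problem the current solver gets wrong.
    new_problem = next(n for n in itertools.count()
                       if solver.get(n) != ground_truth(n))
    # 2. Modify the solver so it handles the new problem...
    candidate = dict(solver)
    candidate[new_problem] = ground_truth(new_problem)
    # 3. ...while verifying no previously solved problem is forgotten.
    assert all(candidate[p] == ground_truth(p) for p in solved)
    solver = candidate
    solved.append(new_problem)

print("problems added, easiest first:", solved)
```

The three numbered steps mirror the three requirements in the text: simplest problem beyond the current horizon, a solver modification that handles it, and no forgetting of the old repertoire.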
Right. And so, by definition, PowerPlay is always trying to search, in the set of pairs of problems and problem-solver modifications, for a combination that minimizes the time to achieve these criteria. So it's always trying to find the problem which is easiest to add to the repertoire. So, just like grad students and academics and researchers can spend a whole career in a local minimum, stuck trying to come up with interesting questions but ultimately doing very little, do you think it's easy, in this approach of looking for the simplest unsolvable problem, to get stuck in a local minimum, never really discovering the new, you know, really jumping outside of the hundred problems they've already solved, in a genuinely creative way? No, because that's the nature of PowerPlay: it's always trying to break its current generalization abilities by coming up with a new problem which is beyond the current horizon, just shifting the horizon of knowledge a little bit out there, breaking the existing rules, such that the new thing becomes solvable, but wasn't solvable by the old thing. So, like adding a new axiom, like what Gödel did when he came up with these new sentences, new theorems, that didn't have a proof in the formal system, which means you can add them to the repertoire, hoping that they are not going to damage the consistency of the whole thing. So, in the paper with the amazing title, Formal Theory of Creativity, Fun, and Intrinsic Motivation, you talk about discovery as intrinsic reward. So if you view humans as intelligent agents, what do you think is the purpose and meaning of life for us humans? You've talked about this: discovery. Do you see humans as instances of PowerPlay agents? Yeah. So humans are curious,
that means they behave like scientists
not only the official scientists but
even the babies behave like scientists
and they play around with toys to figure
out how the world works and how it is
responding to their actions and that's
how they learn about gravity and
everything and yeah in 1990 we had the
first systems like the hand would just
try to to play around with the
environment and come up with situations
that go beyond what they knew at that
time and then get a reward for creating
these situations and then becoming more
general problem solvers and being able
to understand more of the world so yeah
I think in principle that that that
curiosity strategy or sophisticated
versions of whether chess is quiet they
are what we have built-in as well
because evolution discovered that's a
good way of exploring the unknown world
and a guy who explores the unknown world
has a higher chance of solving problems
that he needs to survive in this world
on the other hand those guys who were
too curious they were weeded out as well
so you have to find this trade-off
evolution found a certain trade-off
apparently in our society there are as a
certain percentage of extremely
exploitive guy
and it doesn't matter if they die
because many of the others are more
conservative and and and so yeah it
would be surprising to me if if that
principle of artificial curiosity
wouldn't be present and almost exactly
the same form here in our brains
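The intrinsic-reward scheme described earlier, reward in proportion to how much the agent's predictive model improves (its compression progress), can be sketched with a toy learner; all quantities here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy world: observations follow a fixed rule the agent does not know yet.
true_w = np.array([2.0, -1.0])
xs = rng.normal(size=(50, 2))
ys = xs @ true_w + 0.01 * rng.normal(size=50)

w = np.zeros(2)                  # the agent's predictive model of the world

def prediction_error(w):
    return float(np.mean((xs @ w - ys) ** 2))

curiosity_rewards = []
for _ in range(20):
    err_before = prediction_error(w)
    w -= 0.05 * 2 * xs.T @ (xs @ w - ys) / len(xs)   # one learning step
    err_after = prediction_error(w)
    # Intrinsic reward = compression progress: how much the model just improved.
    curiosity_rewards.append(err_before - err_after)

print("intrinsic rewards over time:",
      [round(r, 4) for r in curiosity_rewards])
```

The reward stream starts high and decays as the data become predictable; that decay is boredom, and it is what would push such an agent toward new experiments where progress is still possible.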
So, you're a bit of a musician and an artist. Continuing on this topic of creativity, what do you think is the role of creativity in intelligence? So you've kind of implied that it's essential for intelligence, if you think of intelligence as a problem-solving system, as an ability to solve problems. But do you think it's essential, this idea of creativity? We never have a program, a subprogram, that is called creativity or something; it's just a side effect of what our problem solvers do. They are searching a space of problems, or a space of solution candidates, until they hopefully find a solution to a problem they've been given. But then there are these two types of creativity, and both of them are now present in our machines. The first one has been around for a long time, which is: human gives problem to machine, machine tries to find a solution to that. And this has been happening for many decades, and for many decades machines have found creative solutions to interesting problems, where humans were not aware of these particularly creative solutions, but then appreciated that the machine found them. What I just mentioned, I would call the applied creativity, like applied art, where somebody tells you: now make a nice picture of this Pope, and you will get money for that. Okay, so here is the artist, and he makes a convincing picture of the Pope, and the Pope likes it, and gives him the money. And then there is the pure creativity, which is more like the PowerPlay and artificial-curiosity thing, where you have the freedom to select your own problem, like a scientist who defines his own question to study. So that is the pure creativity, if you will, as opposed to the applied creativity, which serves another. And in that distinction there are almost echoes of narrow AI versus general AI. So this kind of constrained painting of a Pope seems like the approaches of what people are calling narrow AI, and pure creativity seems to be, maybe I'm just biased as a human, but it seems to be an essential element of human-level intelligence. Is that what you're implying?
To a degree. If you zoom back a little bit and you just look at a general problem-solving machine, which is trying to solve arbitrary problems, then this machine will figure out, in the course of solving problems, that it's good to be curious. So all of what I said just now about this prewired curiosity, and this will to invent new problems that the system doesn't know how to solve yet, should be just a byproduct of the general search. However, apparently, evolution has built it into us because it turned out to be so successful: a prewiring, a bias, a very successful exploratory bias that we are born with. And you've also said that consciousness, in the same kind of way, may be a byproduct of problem-solving. Do you find it an interesting byproduct? Do you think it's a useful byproduct? What are your thoughts on consciousness in general? Or is it simply a byproduct of greater and greater capabilities of problem-solving, similar to creativity in that sense? Yeah, we never have a procedure called consciousness in our machines; however, we get, as side effects of what these machines are doing, things that seem to be closely related to what people call consciousness. So, for example, in 1990 we had simple systems which were basically recurrent networks, and therefore universal computers, trying to map incoming data into actions that lead to success:
maximizing reward in a given environment: always finding the charging station in time whenever the battery is low and negative signals are coming from the battery, always finding the charging station in time without bumping against painful obstacles on the way. So complicated things, but very easily
motivated. And then we gave these systems a separate recurrent network which is just predicting what's happening: if I do this and that, what will happen as a consequence of these actions that I'm executing? And it's just trained on the long history of interactions with the world, so it becomes a predictive model of the world, basically, and therefore also a compressor of the observation history, because whatever you can predict you don't have to store extra. So compression is a side effect of prediction. And how does this recurrent network compress? Well, it's
inventing little subprograms, little sub-networks that stand for everything that frequently appears in the environment, like bottles and microphones and faces. There are maybe lots of faces in my environment, so I'm learning to create something like a prototype face, and when a new face comes along, all I have to encode are the deviations from the prototype. So it's compressing all the time the stuff that frequently appears. And there's one thing that appears all the time, that is present all the time when the agent is interacting with its environment: the agent itself.
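The predict-and-store-only-the-deviations idea can be sketched in a few lines. This is an illustrative toy, not code from Schmidhuber's systems; the "faces" are made-up 4-dimensional vectors standing in for observations:

```python
# Illustrative sketch only: compress observations by storing deviations
# from a learned prototype, so frequently recurring patterns cost
# almost nothing to encode.

def learn_prototype(observations):
    """Average many similar observations into one prototype vector."""
    n = len(observations)
    return [sum(col) / n for col in zip(*observations)]

def encode(obs, prototype):
    """Store only the deviation from the prototype."""
    return [o - p for o, p in zip(obs, prototype)]

def decode(residual, prototype):
    """Exact reconstruction: prototype plus stored deviation."""
    return [p + r for p, r in zip(residual, prototype)]

faces = [[5.0, 5.1, 4.9, 5.0],   # many similar "faces" in the environment
         [5.1, 5.0, 5.0, 4.9],
         [4.9, 4.9, 5.1, 5.1]]
proto = learn_prototype(faces)

new_face = [5.0, 5.05, 5.0, 4.95]
residual = encode(new_face, proto)
recon = decode(residual, proto)

# Reconstruction matches, yet the residual values are tiny compared to
# the raw observation, so an entropy coder would need far fewer bits.
print(max(abs(r) for r in residual) < 0.2)  # → True
```

The same logic is why a sub-network standing for the ever-present agent itself pays off: the agent is the most frequently recurring "pattern" in the data stream.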
So just for data compression reasons, it is extremely natural for this recurrent network to come up with little sub-networks that stand for the properties of the agent: the hand, the other actuators, and all the stuff that you need to better encode the data which is influenced by the actions of the agent. So just as a side effect of data compression during problem-solving, you have internal self-models. Now you can use this model of the world to plan your future, and that's what we have done since 1990. So the recurrent network which is the controller, which is trying to maximize reward, can use this model network, this predictive model of the world, to plan ahead and say: let's not do this action sequence, let's do this action sequence instead, because it leads to more predicted reward. And whenever it's waking up these little sub-networks that stand for itself, it's thinking about itself, it's exploring mentally the consequences of its own actions. And now you tell me, what is still missing? The gap to consciousness? There isn't one.
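The planning idea described here, a controller mentally rolling out action sequences in its predictive model and picking the one with the most predicted reward, can be sketched as follows. The toy `model`, the states, and the charging-station reward are invented for illustration, not taken from the 1990 systems:

```python
import itertools

def model(state, action):
    """Stand-in learned world model: predicted next state and reward.
    Here state is a position on a line; the charging station is at x = 3."""
    next_state = state + (1 if action == "right" else -1)
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward

def plan(state, horizon=3):
    """Mentally try every action sequence in the model (no real actions)
    and return the one with the highest predicted cumulative reward."""
    best_seq, best_ret = None, float("-inf")
    for seq in itertools.product(["left", "right"], repeat=horizon):
        s, ret = state, 0.0
        for a in seq:
            s, r = model(s, a)   # rollout happens inside the model only
            ret += r
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq

print(plan(0))  # → ('right', 'right', 'right')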
that's a really beautiful idea that you
know if life is a collection of data and
in life is a process of compressing that
data to act efficiently you in that data
you yourself appear very often so it's
useful to form compressions of yourself
and it's a really beautiful formulation
of what consciousness is a necessary
side-effect it's actually quite
compelling to me you've described our
nen's developed LST aims long short-term
memory networks the there type of
recurrent neural networks they have
gotten a lot of success recently so
these are networks that model the
temporal aspects in the data temporal
patterns in the data and you've called
them the deepest of the Newell networks
right so what do you think is the value
of depth in the models that we use to
learn since you mentioned the long
short-term memory and the lsdm I have to
mention the names of the brilliant
students
of course. First of all, my first student ever, Sepp Hochreiter, who had fundamental insights already in his diploma thesis. Then Felix Gers, who made additional important contributions. Alex Graves is a guy from Scotland who is mostly responsible for the CTC algorithm, which is now often used to train the LSTM to do speech recognition on all the Google Android phones and whatever, and Siri and so on.
So without these guys, I would be nothing. It's a lot of incredible work. Now, what is the importance of depth? Well,
most problems in the real world are deep
in the sense that the current input
doesn't tell you all you need to know
about the environment. So instead you have to have a memory of what happened in the past, and often important parts of that memory are dated; they are pretty old. So when you're doing speech recognition, for example, and somebody says "eleven", that takes about half a second or so, which means it's already fifty time steps. And another guy, or the same guy, says "seven". The ending is the same, "-even", but now the system has to see the distinction between "seven" and "eleven", and the only way it can see the difference is to store that fifty steps ago there was an "s" or an "el". So there you already have a problem of depth fifty, because for each time step you have something like a virtual layer in the expanded, unrolled version of this recurrent network which is doing the speech recognition. So these long time lags translate into problem depth, and most problems in this world are such that you really have to look far back in time to understand what the problem is and to solve it. But just like with our LSTMs, when you look back in time you don't necessarily need to remember every aspect; you just need to remember the important aspects. That's right, the network has to learn to put the important stuff into memory and to ignore the unimportant noise. So in that sense, is deeper and deeper better? Or is there a limitation?
I mean, LSTM is one of the great examples of architectures that do something beyond just deeper and deeper networks: there are clever mechanisms for filtering data, for remembering and forgetting. Do you think that kind of thinking is necessary? If you think about the LSTM as a big leap forward over traditional vanilla RNNs, what do you think is the next leap, within this context? So the LSTM is a very clever improvement, but LSTMs still don't have the same kind of ability to see far back into the past as we humans do. The credit assignment problem reaches way back, not just 50 time steps or a hundred or a thousand, but millions and billions. It's not clear what the practical limits of the LSTM are when it comes to looking back. Already in 2006, I
think we had examples where it not only looked back tens of thousands of steps but really millions of steps. Juan Antonio Pérez-Ortiz in my lab, I think, was the first author of a paper, in 2006 or so, where we had examples where it learned to look back for more than 10 million steps. For most problems of speech recognition it's not necessary to look that far back, but there are examples where it does. Now, looking back, that is rather easy, because there is only one past, but there are many possible
futures. And so a reinforcement learning system which is trying to maximize its future expected reward, and doesn't know yet which of these many possible futures it should select given this one single past, is facing problems that the LSTM by itself cannot solve. The LSTM is good for coming up with a compact representation of the history so far, of the history of observations and actions so far. But now, how do you plan in an efficient and good way? How do you select one of these many possible action sequences that a reinforcement learning system has to consider to maximize reward in this unknown future? So again it's this basic setup where you have one recurrent network, which gets in the video and the speech and whatever, and it's executing actions, and it's trying to maximize reward, so there is no teacher who tells it what to do at which point in time. And then there's the other network, which is just predicting what's going to happen if I do this and that, and that could be an LSTM network, and it is able to look back all the way to make better predictions of the next time step. So essentially, although it's only predicting the next time step, it is motivated to learn to put into memory something that happened maybe a million steps ago, because it's important to memorize that if you want to predict the next event at the next time step. Now, how can a model of the world like that, a predictive model of the world, be used by the first guy? Let's call them the controller and the model. How can the model be used by the controller to efficiently select among these many possible futures? The naive way we had about 30 years ago was:
let's just use the model of the world as
a stand-in, a simulation of the world, and millisecond by millisecond we plan the future. That means we have to roll it out really in detail, and it will work only if the model is really good, and it will still be inefficient because we have to look at all these possible futures, and there are so many of them. So instead, what we do now,
since 2015, in our CM systems, controller-model systems, is we give the controller the opportunity to learn by itself how to use the potentially relevant parts of M, of the model network, to solve new problems more quickly. And if it wants to, it can learn to ignore M, and sometimes it's a good idea to ignore M because it's a really bad predictor in this particular situation of life where the controller is currently trying to maximize reward. However, it can also learn to address and exploit some of the subprograms that came about in the model network through compressing the data by predicting it. So it now has an opportunity to reuse that code, the algorithmic information in the model network, trying to reduce its own search space such that it can solve a new problem more quickly than without the model. So you're ultimately optimistic and excited about the power of RL, of reinforcement learning, in the context of real systems? Absolutely, yeah.
So you see RL as potentially having a huge impact, beyond the supervised learning methods that are more often developed? You see RL for problems of self-driving cars, or any kind of applied robotics, as the correct, interesting direction for research, in your view? I do think so. We have a company called NNAISENSE which has applied reinforcement learning to little Audis,
Audis which learn to park without a teacher. The same principles were used, of course. These little Audis are small, maybe like that, so much smaller than the real Audis, but they have all the sensors that you find in the real Audis: you find the cameras, the lidar sensors. They go up to 120 kilometres an hour if they want to. And they have pain sensors, basically, and they don't want to bump against obstacles and other Audis, and so they must learn like little babies to park: take the raw vision input and translate that into actions that lead to successful parking behavior, which is a rewarding thing. And yes, they learn that by themselves. We have examples like that,
and it's only the beginning. This is just the tip of the iceberg, and I believe the next wave of AI is going to be all about that. So at the moment the current wave of AI is about passive pattern observation and prediction, and that's what you have on your smartphone and what the major companies on the Pacific Rim are using to sell you ads, to do marketing. That's the current sort of profit in AI, and that's only one or two percent of the world economy, which is big enough to make these companies pretty much the most valuable companies in the world. But there's a much, much bigger fraction of the economy going to be affected by the next wave, which is really about machines that shape the data through their own actions. And you think simulation is
ultimately the biggest way that those methods will be successful in the next 10 to 20 years? We're not talking about a hundred years from now; we're talking about the near-term impact of RL. Do you think really good simulation is required, or are there other techniques, like imitation learning, observing other humans operating in the real world? Where do you think this success will come from?
so at the moment we have a tendency of
using physics simulations to learn
behavior for machines that learn to
solve problems that humans also do not
know how to solve however this is not
the future, because the future is about what little babies do. They don't use a physics engine to simulate the world; no, they learn a predictive model of the world, which maybe sometimes is wrong in many ways, but captures all kinds of important abstract high-level predictions which are really important to be successful. And that's what was the future thirty years ago, when you started that type of research? But it's still the future, and now we know much better how to go there, to move forward, and to really make working systems based on that, where you have a learning model of the world, a model of the world that learns to predict what's going to happen if I do this and that. And then the controller uses that model to more quickly learn successful action sequences. And then of course there's always this crazy thing: in the beginning the model is stupid, so the controller should be motivated to come up with experiments, with action sequences, that lead to data that improve the model. Do you think
improving the model, constructing an understanding of the world, is the way to go? The now-popular approaches have been successful, grounded in ideas of neural networks, but in the 80s, with expert systems, there were symbolic AI approaches, which to us humans are more intuitive, in the sense that it makes sense to build up knowledge in this kind of knowledge representation. What kind of lessons can we draw into our current approaches from expert systems, from symbolic AI? So I became aware of all of that in
the 80s, and back then logic programming was a huge thing. Was it inspiring to you? Did you find it compelling? Because a lot of your work was not so much in that realm; it was more in learning systems. Yes and no,
but we did all of that. So my first publication ever, in 1987, was the implementation of a genetic algorithm, of a genetic programming system, in Prolog. That's what you learned back then, Prolog, which is a logic programming language, and the Japanese then had this huge fifth-generation AI project, which was mostly about logic programming, although neural networks existed and were well known back then, and deep learning has existed since 1965, since this guy in Ukraine, Ivakhnenko, started it. But
the Japanese and many other people focused really on this logic programming, and I was influenced to the extent that I said: okay, let's take these biologically inspired ideas, like evolutionary programs, and implement them in the language which I knew, which was Prolog, for example, back then. And in many ways this came back later, because the Gödel machine, for example, has a proof searcher on board, and without that it would not be optimal. Marcus Hutter's universal algorithm for solving all well-defined problems has a proof search on board; that's very much logic programming, and without it the algorithm would not be asymptotically optimal. But then on the other hand, because we are very pragmatic guys, we also focused on recurrent neural networks and suboptimal stuff such as gradient-based search in program space, rather than
provably optimal things. Logic programming certainly has a usefulness when you're trying to construct something provably optimal, or provably good, or something like that, but is it useful for practical problems? It's really useful for automatic theorem
proving. The best theorem provers today are not neural networks, right? No, they are logic programming systems, and they are much better theorem provers than most math students in their first or second semester. But for reasoning, for
playing games of Go or chess, or for robots, autonomous vehicles that operate in the real world, or object manipulation, you think learning... Yeah, as long as the problems have little to do with theorem proving themselves, then as long as that is not the case, you just want to have better pattern recognition. So to build a self-driving car, you want to have better pattern recognition, and pedestrian recognition, and all these things, and you want to minimize the number of false positives, which currently is slowing down self-driving cars in many ways, and all of that has very little to do with logic programming. What are you most
excited about in terms of directions of
artificial intelligence at this moment
in the next few years in your own
research and in the broader community so
I think in the not so distant future we
will have for the first time
little robots that learn like kids and I
will be able to say to the robot: look here, robot, we are going to assemble a smartphone. Take a slab of plastic and a screwdriver, and let's screw in the screw like that. No, no, not like that; like so. Not like that, like that. And I don't have a data glove or something;
he will see me and he will hear me and
he will try to do something with his own
actuators which will be really different
from mine but he will understand the
difference and will learn to imitate me
but not in the supervised way where a
teacher is giving target signals for all
his muscles all the time
No, by doing this high-level imitation, where he first has to learn to imitate me and then to interpret these additional noises coming from my mouth as helpful signals to do that. And then it will by itself come
up with faster ways and more efficient
ways of doing the same thing, and finally I stop its learning algorithm and make a million copies and sell them. So
at the moment this is not possible but
we already see how we are going to get
there and you can imagine to the extent
that this works economically and cheaply
it's going to change everything almost
all our production is going to be
affected by that, and a much bigger AI wave is coming than the one that we are currently witnessing, which is mostly about passive pattern recognition on your smartphone. This is about active machines that shape the data through the actions they are executing, and they learn to do that in a good way. So many of the