Yann LeCun: Dark Matter of Intelligence and Self-Supervised Learning | Lex Fridman Podcast #258
SGzMElJ11Cc • 2022-01-22
The following is a conversation with Yann LeCun, his second time on the podcast. He is the Chief AI Scientist at Meta, formerly Facebook, a professor at NYU, a Turing Award winner, one of the seminal figures in the history of machine learning and artificial intelligence, and someone who is brilliant and opinionated in the best kind of way, and so is always fun to talk to. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, here's my conversation with Yann LeCun.
You co-wrote the article "Self-Supervised Learning: The Dark Matter of Intelligence" (great title, by the way) with Ishan Misra. So let me ask: what is self-supervised learning, and why is it the dark matter of intelligence?
I'll start with the dark matter part. There is obviously a kind of learning that humans and animals are doing that we currently cannot reproduce properly with machines, with AI. The most popular approaches to machine learning today, or paradigms I should say, are supervised learning and reinforcement learning, and they are extremely inefficient. Supervised learning requires many samples to learn anything, and reinforcement learning requires a ridiculously large number of trials and errors for a system to learn anything. And that's why we don't have self-driving cars.
That's a big leap from one to the other. Okay, so to solve difficult problems, you have to have a lot of human annotation for supervised learning to work, and to solve those difficult problems with reinforcement learning, you have to have some way to simulate the problem, such that you can do the large-scale kind of learning that reinforcement learning requires.

Right. So how is it that most teenagers can learn to drive a car in about 20 hours of practice, whereas even with millions of hours of simulated practice, a self-driving car can't actually learn to drive itself properly?
So obviously we're missing something, right? And it's quite obvious to a lot of people. The immediate response you get from many people is, well, humans use their background knowledge to learn faster, and they're right. Now, how was that background knowledge acquired? That's the big question. So now you have to ask: how do babies, in the first few months of life, learn how the world works? Mostly by observation, because they can hardly act in the world, and they learn an enormous amount of background knowledge about the world that may be the basis of what we call common sense. This type of learning is not learning a task, it's not being reinforced for anything; it's just observing the world and figuring out how it works. Building world models, learning world models. How do we do this, and how do we reproduce it in machines? Self-supervised learning is one instance, or one attempt, at trying to reproduce this kind of learning.
Okay, so you're looking at just observation, not even the interacting part of a child; it's just sitting there watching mom and dad walk around, pick up stuff, all of that. That's what you mean by background knowledge?

Perhaps not even watching mom and dad, just watching the world go by.

Just having eyes open, or having eyes closed, or the very act of opening and closing eyes, that the world appears and disappears; all that basic information.
And you're saying, in order to learn to drive, the reason humans are able to learn to drive quickly, some faster than others, is because of the background knowledge: they were able to watch cars operate in the world in the many years leading up to it, the physics of basic objects, all that kind of stuff.

That's right, the basic physics of objects. You don't even need to know how a car works, because that you can learn fairly quickly. The example I use very often is: you're driving next to a cliff, and you know in advance, because of your understanding of intuitive physics, that if you turn the wheel to the right, the car will veer to the right, run off the cliff, fall off the cliff, and nothing good will come out of this, right?
But if you are a sort of tabula rasa reinforcement learning system that doesn't have a model of the world, you have to repeat falling off this cliff thousands of times before you figure out it's a bad idea, then a few more thousand times before you figure out how not to do it, and then a few more million times before you figure out how not to do it in every situation you ever encounter.
So self-supervised learning still has to have some source of truth being told to it by somebody, and you have to figure out a way, without human assistance, or without a significant amount of human assistance, to get that truth from the world. So the mystery there is: how much signal is there, how much truth does the world give you, whether it's the human world, like when you watch YouTube, or the more natural world? How much signal is there?
So here's the trick: there is way more signal in a self-supervised setting than there is in either a supervised or reinforcement setting. And this goes back to my analogy of the cake, the "LeCake" as someone has called it, where you try to figure out how much information you ask the machine to predict, and how much feedback you give the machine at every trial. In reinforcement learning, you give the machine a single scalar: you tell the machine "you did good" or "you did bad," and you only tell this to the machine once in a while. When I say "you," it could be the universe telling the machine, right? But it's just one scalar, so as a consequence, you cannot possibly learn something very complicated without many, many trials where you get many, many feedbacks of this type. In supervised learning, you give a few bits to the machine at every sample. Let's say you're training a system to recognize images on ImageNet: there are 1,000 categories, so that's a little less than 10 bits of information per sample (log2 of 1,000 is about 9.97).
But self-supervised learning, here is the setting. Ideally, we don't know how to do this yet, but ideally you would show a machine a segment of video, then stop the video and ask the machine to predict what's going to happen next. So you let the machine predict, then you let time go by and show the machine what actually happened, and hope the machine will learn to do a better job at predicting next time around. There's a huge amount of information you give the machine, because the target is an entire video clip of the future after the video clip you fed it in the first place.
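As an illustration of that setting, here is a minimal sketch, in PyTorch, of a predict-what-happens-next objective. All the specifics (a GRU encoder, precomputed 512-dimensional per-frame features, a squared-error loss) are assumptions made for illustration, not anything described in the conversation:

```python
import torch
import torch.nn as nn

# Minimal sketch of the setting described above, not a real video model:
# show the machine a segment of video, let it predict what comes next,
# then compare with what actually happened. Names and sizes are placeholders.

encoder = nn.GRU(input_size=512, hidden_size=256, batch_first=True)
predictor = nn.Linear(256, 512)  # guess the embedding of the next frame
opt = torch.optim.Adam([*encoder.parameters(), *predictor.parameters()], lr=1e-3)

def training_step(frame_embeddings):
    # frame_embeddings: (batch, time, 512) precomputed per-frame features
    past, future = frame_embeddings[:, :-1], frame_embeddings[:, -1]
    _, h = encoder(past)                  # summarize the observed segment
    pred = predictor(h[-1])               # the machine's prediction
    loss = ((pred - future) ** 2).mean()  # compare with what actually happened
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

The point is only the shape of the supervision: the target is the future of the clip itself, so no human annotation is needed.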
So for both language and vision, there's a subtle, seemingly trivial construction, but maybe it's representative of what is required to create intelligence, which is filling in the gaps. It sounds dumb, but it's possible you can solve all of intelligence this way. For language, you give a sentence and continue it, or you give a sentence with a gap in it, some words blanked out, and you fill in what words go there. For vision, you give a sequence of images and predict what's going to happen next, or you fill in what happened in between.
Do you think it's possible that that formulation alone, as a signal for self-supervised learning, can solve intelligence for vision and language?

I think that's our best shot at the moment. Whether it will take us all the way to human-level intelligence, or just cat-level intelligence, is not clear, but among all the approaches people have proposed, I think it's our best shot.
So I think this idea of an intelligent system filling in the blanks, either predicting the future, inferring the past, or filling in missing information... I'm currently filling in the blank of what is behind your head, and what your head looks like from the back, because I have basic knowledge about how humans are made. And I don't know what you're going to say, at which point you're going to speak, whether you're going to move your head this way or that way, or which way you're going to look, but I know you're not going to just dematerialize and reappear three meters down the hall, because I know what's possible and what's impossible according to intuitive physics.

So you have a model of what's possible and what's impossible, and you'd be very surprised if the impossible happens; then you'd have to reconstruct your model.
Right, so that's the model of the world. It's what tells you what fills in the blanks. Given the partial information about the state of the world that your perception gives you, your model of the world fills in the missing information, and that includes predicting the future, retrodicting the past, and filling in things you don't immediately perceive.

And that doesn't have to be purely generic visual information or generic language; you can go to specifics, like predicting what control decision you make when you're driving in a lane. You have a sequence of images from a vehicle, and you have information, if you recorded it on video, about where the car ended up going, so you can go back in time and predict where the car went based on the visual information. But that's very domain-specific.
Right, but the question is whether we can come up with a generic method for training machines to do this kind of prediction, or filling in the blanks. Right now, this type of approach has been unbelievably successful in the context of natural language processing. Every modern natural language processing system is pre-trained in a self-supervised manner to fill in the blanks: you show it a sequence of words, you remove 10 percent of them, and then you train some gigantic neural net to predict the words that are missing. Once you've pre-trained that network, you can use its internal representation as input to something you train with supervision, or whatever. That's been incredibly successful. It's not so successful in images, although it's making progress, and there it's based on sort of manual data augmentation, which we can go into later. What has not been successful yet is training from video: getting a machine to learn to represent the visual world, for example, by just watching video. Nobody has really succeeded in doing this.
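For concreteness, here is a toy sketch of that fill-in-the-blanks objective. The masking rate, vocabulary size, and the stand-in model are all assumptions; real systems such as BERT use large transformers, and this only shows the shape of the training signal:

```python
import torch

# Toy sketch of fill-in-the-blank pretraining: remove a fraction of tokens
# and train a network to predict the missing ones from the corrupted input.

vocab_size, mask_id = 30000, 0
tokens = torch.randint(1, vocab_size, (8, 128))      # a batch of "sentences"
mask = torch.rand(tokens.shape) < 0.15               # hide ~15% of the words
inputs = tokens.masked_fill(mask, mask_id)

model = torch.nn.Sequential(                         # stand-in for a big net
    torch.nn.Embedding(vocab_size, 256),
    torch.nn.Linear(256, vocab_size),
)
logits = model(inputs)                               # (8, 128, vocab_size)
loss = torch.nn.functional.cross_entropy(
    logits[mask], tokens[mask]                       # score only masked slots
)
```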
Okay, let's give a high-level overview. What's the difference, in kind and in difficulty, between vision and language? You said people haven't been able to really crack the problem of vision in terms of self-supervised learning, but that may not necessarily be because it's fundamentally more difficult. Maybe, when we're talking about passing the Turing test in the full spirit of the Turing test, language is harder than vision; that's not obvious. So in your view, which is harder? Or are they perhaps just the same problem, where the farther we get toward solving each, the more we realize it's all the same thing?
It's all the same cake. I think what I'm looking for are methods that make them look essentially like the same cake, but currently they're not. The main issue with learning world models, or learning predictive models, is that the prediction is never a single thing, because the world is not entirely predictable. It may be deterministic or stochastic, and we can get into the philosophical discussion about that, but even if it's deterministic, it's not entirely predictable. So if I play a short video clip and then ask you to predict what's going to happen next, there are many plausible continuations for that video clip, and the number of continuations grows with the interval of time over which you're asking the system to make a prediction. So one big question with self-supervision is how you represent this uncertainty: how you represent multiple discrete outcomes, how you represent a continuum of possible outcomes, et cetera.
If you are a classical machine learning person, you say, oh, you just represent a distribution, right? And that we know how to do when we're predicting missing words in text, because you can have a neural net give a score for every word in a dictionary. It's a big list of numbers, maybe a hundred thousand or so, and you can turn them into a probability distribution. When I say a sentence, "the cat is chasing the blank in the kitchen," there are only a few words that make sense there: it could be a mouse, or a laser spot, or something like that. And if I say "the blank is chasing the blank in the savannah," you also have a bunch of plausible options for those two words, because you have an underlying reality that you can refer to in order to fill in those blanks.
So you cannot say for sure, in the savannah, whether it's a lion or a cheetah or whatever; you cannot know whether it's a zebra or a gnu or a wildebeest, same thing. But you can represent the uncertainty by just a long list of numbers.
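To make that "long list of numbers" concrete, here is what querying a pretrained masked language model looks like, assuming the Hugging Face transformers library and a standard BERT checkpoint (not anything from the conversation itself):

```python
from transformers import pipeline

# A masked language model returns a scored distribution over candidate words
# for the blank, i.e. the "long list of numbers" over the vocabulary.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for cand in unmasker("The cat is chasing the [MASK] in the kitchen."):
    print(f"{cand['token_str']:>12}  p={cand['score']:.3f}")
```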
Now, if I do the same thing with video and ask you to predict a video clip, it's not a discrete set of potential frames: you have to have some way of representing an infinite number of plausible continuations of multiple frames in a high-dimensional continuous space, and we just have no idea how to do this properly.

An infinite, high-dimensional space. So, unlike with words, you can't get it down to a small finite set of, like, under a million?

Something like that.

I mean, it's kind of ridiculous that we're doing a distribution over every single possible word for language, and that it works. It feels like a really dumb way to do it; it seems like there should be some more compressed representation of the distribution of the words.

You're right about that, and I agree.

Do you have any interesting ideas about how to represent all of reality in a compressed way, such that you can form a distribution over it?

That's one of the big questions: how do you do that? And another thing that is, I shouldn't say stupid, but simplistic about current approaches to self-supervision in NLP, in text, is that not only do you represent a giant distribution over words, but for multiple missing words, those distributions are essentially independent of each other, and you don't pay too much of a price for this. So in the sentence I gave earlier, the system gives a certain probability to lion and cheetah, and a certain probability to gazelle, wildebeest, and zebra, and those two distributions are independent of each other. But it's not the case that those things are independent: lions actually attack bigger animals than cheetahs do. So there's a huge independence hypothesis in this process which is not actually true. The reason is that we don't know how to properly represent distributions over combinatorial sequences of symbols, because their number grows exponentially with the length of the sequence, so we have to use tricks, and those techniques just don't deal with it. So the big question is: could there be some sort of abstract latent representation of text that would say that when I switch lion for cheetah, I also have to switch zebra for gazelle?
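A tiny worked example of that independence assumption: if the two masked slots get separate distributions, every predator-prey pairing is scored as equally likely, so the consistency just described (lion with zebra, cheetah with gazelle) cannot be expressed. The numbers below are made up for illustration:

```python
# With factorized predictions, the joint score of a pair is p1[w1] * p2[w2],
# so consistent and inconsistent pairs are indistinguishable.
p_predator = {"lion": 0.5, "cheetah": 0.5}
p_prey = {"zebra": 0.5, "gazelle": 0.5}

for w1, p1 in p_predator.items():
    for w2, p2 in p_prey.items():
        print(f"p({w1}, {w2}) = {p1 * p2:.2f}")  # all pairs equally likely
```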
Yeah. So, on this independence assumption, let me throw some criticism at you that I often hear, and see how you respond. This kind of filling in the blanks is just statistics; you're not learning the deep underlying concepts, you're just mimicking stuff from the past, you're not learning anything new such that you can use it to generalize about the world. Or, let me just say the crude version: it's just statistics, it's not intelligence. What do you usually say when you hear this kind of thing?

I don't get into those discussions, because they're kind of pointless. First of all, it's quite possible that intelligence is just statistics, just statistics of a particular kind.

Yes, that's the philosophical question: is it possible that intelligence is just statistics?

Yeah, but what kind of statistics?
If you are asking whether the models of the world that we learn have some notion of causality, then yes. So if the criticism comes from people who say current machine learning systems don't care about causality, which by the way is wrong, then I agree with them: your model of the world should have your actions as one of its inputs, and that will drive you to learn causal models of the world, where you know what intervention in the world will cause what result. Or you can do this by observing other agents acting in the world and observing the effects, other humans, for example. So I think at some level of description, intelligence is just statistics, but that doesn't mean you won't have models with deep mechanistic explanations for what goes on. The question is how you learn them; that's the question I'm interested in. A lot of people who voice this criticism say those mechanistic models have to come from someplace else: from human designers, from I don't know what. And obviously we learn them, or if we don't learn them as individuals, nature learned them for us using evolution. Regardless of what you think, those processes have been learned somehow.
So if you look at the human brain, when we humans introspect about how the brain works, it seems like when we think about what intelligence is, we think about the high-level stuff: the models we've constructed, concepts from cognitive science like memory and reasoning modules, almost like high-level modules. Is that a good analogy? Or are we ignoring the dark matter, the basic low-level mechanisms, just like we ignore the way the operating system works when we're using the high-level software? We're ignoring that, at the low level, the neural network might be doing something like statistics, sorry to use this word probably incorrectly and crudely, but doing this kind of fill-in-the-gap learning, constantly updating the model in order to take in the raw sensory information, predict it, and then adjust when the prediction is wrong. But when we look at our brain at the high level, it feels like we're playing chess: we're playing with high-level concepts, stitching them together, putting them into long-term memory. But really, what's going on underneath is something we're not able to introspect, which is this kind of simple, large neural network that's just filling in the gaps?
Right. Well, there are a lot of questions and answers there. First of all, there's a whole school of thought in neuroscience, computational neuroscience in particular, that likes the idea of predictive coding, which is really related to the idea I was talking about, self-supervised learning. Everything is about prediction: the essence of intelligence is the ability to predict, and everything the brain does is trying to predict everything from everything else. That's really the underlying principle, if you want, and self-supervised learning is trying to reproduce this idea of prediction. It's an essential mechanism of task-independent learning, if you want.
The next step is: what kind of intelligence are you interested in reproducing? Of course, we all think about trying to reproduce high-level cognitive processes in humans, but with machines we're not even at the level of reproducing the learning processes in a cat brain. The most intelligent of our intelligent systems don't have as much common sense as a house cat. So how is it that cats learn? Cats don't do a whole lot of reasoning, but they certainly have causal models: many cats can figure out how to act on the world to get what they want. They certainly have a fantastic model of intuitive physics, certainly of the dynamics of their own bodies, but also of prey and things like that. So they're pretty smart, and they do this with only about 800 million neurons. We are not anywhere close to reproducing this kind of thing. So to some extent I could say: let's not even worry about the high-level cognition, long-term planning, and reasoning that humans can do until we figure out whether we can even reproduce what cats are doing.
Now, that said, this ability to learn world models, I think, is the key to the possibility of learning machines that can also reason. Whenever I give a talk, I say there are three main challenges in machine learning. The first is getting machines to learn to represent the world, and for that I'm proposing self-supervised learning. The second is getting machines to reason in ways that are compatible with essentially gradient-based learning, because that is what deep learning is all about, really. And the third is something we have no idea how to solve, at least I have no idea how to solve: can we get machines to learn hierarchical representations of action plans? We know how to train them to learn hierarchical representations of perception, with convolutional nets and things like that, and transformers, but what about action plans? Can we get them to spontaneously learn good hierarchical representations of actions?

Also gradient-based?

Yeah, all of that needs to be somewhat differentiable so that you can apply gradient-based learning, which is really what deep learning is about.
So it's background knowledge; an ability to reason in a way that is differentiable, that is somehow deeply integrated with that background knowledge or builds on top of it; and then, given that background knowledge, being able to make hierarchical plans in the world.

Right.
If you take classical optimal control, there's something there called model predictive control, and it's been around since the early 1960s; NASA uses it to compute trajectories of rockets. The basic idea is that you have a predictive model of the rocket, or whatever system you intend to control, which, given the state of the system at time t and given an action you're taking on the system (for a rocket, the thrust and all the controls you have), gives you the state of the system at time t plus delta t. So basically a differential equation, something like that. And if you have this model in the form of some sort of neural net, or some set of formulas you can back-propagate gradients through, you can do what's called model predictive control, or gradient-based model predictive control. You can unroll the model in time: you feed it a hypothesized sequence of actions, and then you have some objective function that measures how well, at the end of the trajectory, the system has succeeded or matched what you wanted it to do. If it's a robot arm, have you grasped the object you want to grasp? If it's a rocket, are you at the right place near the space station? Things like that. And by back-propagation through time, and again this was invented in the 1960s by optimal control theorists, you can figure out the optimal sequence of actions that will get the system to the best final state. So that's a form of reasoning: it's basically planning, and a lot of planning systems in robotics are actually based on this, and you can think of it as a form of reasoning.
To take the example of the teenager driving a car again: you have a pretty good dynamical model of the car. It doesn't need to be very accurate, but you know, again, that if you turn the wheel to the right and there is a cliff, you're going to run off the cliff, right? You don't need a very accurate model to predict that, and you can run this in your mind and decide not to do it for that reason, because you can predict in advance that the result is going to be bad. So you can imagine different scenarios, take the first step in the scenario that is most favorable, and then repeat the process of planning. That's called receding-horizon model predictive control. All these things have names, going back decades.
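Here is a minimal sketch of gradient-based model predictive control in PyTorch, with a toy hand-built dynamics model standing in for the predictive model; the dynamics, horizon, and cost function are all illustrative assumptions, not any particular system LeCun describes:

```python
import torch

# Gradient-based MPC sketch: a known differentiable dynamics model
# f(state, action) -> next state, a hypothesized action sequence, and
# backprop through the unrolled trajectory to refine the actions.

def dynamics(state, action):
    # Toy point-mass: state = (position, velocity), action = force.
    pos, vel = state[0], state[1]
    dt = 0.1
    return torch.stack([pos + dt * vel, vel + dt * action])

def plan(state0, goal, horizon=20, steps=100, lr=0.1):
    actions = torch.zeros(horizon, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        state = state0
        for t in range(horizon):
            state = dynamics(state, actions[t])   # unroll the model in time
        loss = (state[0] - goal) ** 2 + 0.01 * (actions ** 2).sum()
        opt.zero_grad()
        loss.backward()      # back-propagation through the whole rollout
        opt.step()
    return actions.detach()

actions = plan(torch.tensor([0.0, 0.0]), goal=1.0)
# In receding-horizon MPC you execute only the first action, then re-plan.
```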
In classical optimal control, though, the model of the world is not generally learned. There are sometimes a few parameters you have to identify, which is called system identification, but generally the model is mostly deterministic and mostly built by hand. So the big question of AI, I think the big challenge of AI for the next decade, is how we get machines to learn predictive models of the world that deal with uncertainty and with the real world in all its complexity. It's not just the trajectory of a rocket, which you can reduce to first principles; it's not even just the trajectory of a robot arm, which, again, you can model with careful mathematics. It's everything else: everything we observe in the world, people's behavior, physical systems that involve collective phenomena like water, or the branches of a tree, complex things that humans have no trouble developing abstract representations and predictive models for, but that we still don't know how to do with machines.
Where do you put, in these three, maybe in the planning stage, the game-theoretic nature of this world, where your actions not only respond to the dynamic nature of the environment but also affect it? If there are other humans involved, is this point number four, or is it somehow integrated into the hierarchical representation of actions, in your view?

I think it's integrated; it's just that now your model of the world has to deal with it. It just makes the model more complicated. The fact that humans are complicated and not easily predictable makes your model of the world much more complicated.

Well, I suppose chess is an analogy: Monte Carlo tree search, I go, you go, I go, you go. Andrej Karpathy recently gave a talk at MIT about car doors. I think there was some machine learning in it too, but mostly car doors. And there's a dynamic nature to that, like the person opening the door, checking. He wasn't talking about that; he was talking about the perception problem, the ontology of what defines a car door, this big philosophical question. But to me it was interesting because it's obvious that a person opening a car door is trying to get out, like here in New York, trying to get out of the car. You slowing down signals something, you speeding up signals something, and that's a dance, an asynchronous chess game, I don't know. So it feels like... I guess you can integrate all of it into one giant model, the entirety of these little interactions, because it's not as complicated as chess; it's just a little dance we do together, and then we figure it out.
Well, in some ways it's way more complicated than chess, because it's continuous, it's uncertain in a continuous manner. It doesn't feel more complicated to us because that's what we are: this is the kind of problem we've evolved to solve, so we're good at it, because nature has made us good at it. Nature has not made us good at chess; we completely suck at chess.

Yeah, in fact that's why we designed it as a game: to be challenging.

And if there is something that recent progress in chess and Go has made us realize, it's that humans are really terrible at those things, really bad. There was a story, right before AlphaGo, that the best Go players thought they were maybe two or three stones behind an ideal player that they would call God. In fact, no, they're more like nine or ten stones behind. We're just bad. And it's because we have limited working memory; we're not very good at doing the tree exploration that computers are much better at than we are. But we are much better at learning differentiable models of the world.
I said "differentiable," but I should say not differentiable in the sense that we can back-propagate through it, but in the sense that our brain has some mechanism for estimating gradients of some kind, and that's what makes us efficient. So suppose you have an agent that consists of a model of the world, which in the human brain is basically the entire front half of your brain; and an objective function, which in humans is a combination of two things. There is your intrinsic motivation module, which is in the basal ganglia, at the base of your brain; that's the thing that measures pain and hunger and things like that, immediate feelings and emotions. And then there is the equivalent of what people in reinforcement learning call a critic, a module that predicts ahead of time what the outcome of a situation will be; so it's not an objective function itself, but a trained predictor of the ultimate objective function, and it too is differentiable. If all of this is differentiable, your cost function, your critic, your world model, then you can use gradient-based methods to do planning, to do reasoning, to do learning, to do all the things you'd like an intelligent agent to do.
And gradient-based learning, what's your intuition? That's probably at the core of what can solve intelligence, so you don't need logic-based reasoning, in your view?

I don't know how to make logic-based reasoning compatible with efficient learning. There is a big question, perhaps a philosophical question, though it's not that philosophical, that we can ask: all the learning algorithms we know from engineering and computer science proceed by optimizing some objective function. So one question is: does learning in the brain minimize an objective function? It could be a composite of multiple objective functions, but it's still an objective function. Second, if it does optimize an objective function, does it do so by some sort of gradient estimation? It doesn't need to be backprop, but some way of estimating the gradient in an efficient manner, whose complexity is on the same order of magnitude as actually running the inference. Because you can't afford to do things like perturbing a weight in your brain to figure out what the effect is. You can estimate gradients by perturbation, but it seems very implausible to me that the brain uses some sort of zeroth-order, black-box, gradient-free optimization, because it's so much less efficient than gradient optimization. So it has to have a way of estimating gradients.
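A small illustration of why zeroth-order estimation is implausible at brain scale: perturbation needs one extra function evaluation per parameter, while an analytic (backprop-style) gradient comes at roughly the cost of one more pass. This is a toy quadratic loss, for illustration only:

```python
import numpy as np

def loss(w):
    return float((w ** 2).sum())

def grad_by_perturbation(w, eps=1e-4):
    # One extra loss evaluation per parameter: cost grows with dimension.
    g = np.zeros_like(w)
    base = loss(w)
    for i in range(len(w)):
        w2 = w.copy()
        w2[i] += eps
        g[i] = (loss(w2) - base) / eps
    return g

w = np.random.randn(5)
print(grad_by_perturbation(w))  # ~2w, but needed len(w)+1 evaluations
print(2 * w)                    # analytic gradient, one "backward" pass
```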
Is it possible that some kind of logic-based reasoning emerges in pockets as a useful thing? Like you said, if the brain is optimizing an objective function, maybe it's a mechanism for creating objective functions, or a mechanism for creating knowledge bases, for example, that can then be queried. Maybe it's an efficient representation of knowledge that's learned in a gradient-based way, or something like that?
that well so i think there is a lot of
different types of
intelligence so first of all i think the
type of logical reasoning that we think
about
that we are
you know maybe stemming from
you know sort of classical ai of the
1970s and 80s
i think humans use that relatively
rarely
and are not particularly good at it but
we judge each other based on our ability
to uh
solve those rare problems it's called an
iq test i think so like i'm i'm not very
good at chess
yes i'm judging you this whole time
because
well we we actually with your with your
uh you know heritage i'm sure you're
good at chess
no stereotypes not all stereotypes are
true
well i'm terrible at chess so um
But I think perhaps another type of intelligence that I have is this ability to build models of the world from reasoning, obviously, but also from data. And those models are generally more analogical: it's reasoning by simulation and by analogy, where you take one model and apply it to a new situation. Even though you've never seen that situation, you can connect it to a situation you've encountered before, and your reasoning is more akin to some sort of internal simulation. So you're simulating what's happening when you're building, I don't know, a box out of wood or something: you can imagine in advance what the result of cutting the wood in a particular way would be, whether you're going to use screws or nails, or whatever. When you are interacting with someone, you also have a model of that person, and you interact with that person with this model in mind, to tell the person what you think will be useful to them. So I think this ability to construct models of the world is basically the essence of intelligence, and the ability to use those models to plan actions that will fulfill a particular criterion, of course, is necessary as well.
So I'm going to ask you a series of impossible questions, as I've been doing. If that's the fundamental dark matter of intelligence, this ability to form a background model, what's your intuition about how much knowledge is required? With dark matter, you can put a percentage on the composition of the universe: how much of it is dark matter, how much is dark energy. How much information do you think is required to be a house cat? You have to be able to, when you see a box, get in it; when you see a human, compute the most evil action; if there's a thing near an edge, knock it off. All of that, plus the extra stuff you mentioned, which is a great self-awareness of the physics of your own body and of the world. How much knowledge is required, do you think, to solve it?

I don't even know how to measure an answer to that question. But whatever it is, it fits in about 800 million neurons.

800 million neurons, and the representation does everything? All knowledge, everything?

Right.
It's less than a billion; a dog is two billion, but a cat is less than one billion. Multiply that by a thousand and you get the number of synapses. And I think almost all of it is learned through this sort of self-supervised learning, although a tiny flavor is learned through reinforcement learning, and certainly very little through classical supervised learning, although it's not even clear how supervised learning actually works in the biological world. So I think almost all of it is self-supervised learning, but it's driven by the ingrained objective functions that a cat or a human has at the base of its brain, which drive behavior. Nature tells us "you're hungry"; it doesn't tell us how to feed ourselves. That's something the rest of our brain has to figure out.
Well, it's interesting, because there might be deeper objective functions underlying the whole thing. Hunger may be, if you go into neurobiology, just the brain trying to maintain homeostasis; hunger is just one of the human-perceivable symptoms of the brain being unhappy with the way things currently are. It could be just one really dumb objective function at the core.

Right, but that's how behavior is driven. The fact that the basal ganglia drive us to do things that are different from, say, an orangutan, or certainly a cat, is what makes human nature versus orangutan nature versus cat nature.
So, for example, our basal ganglia drive us to seek the company of other humans, and that's because nature has figured out that we need to be social animals for our species to survive, and it's true of many primates. It's not true of orangutans: orangutans are solitary animals. They don't seek the company of others; in fact, they avoid them. In fact, they scream at them when they come too close, because they're territorial, because for their survival, evolution has figured out that's the best thing. They're occasionally social, of course, for reproduction and things like that, but they're mostly solitary. So all of those behaviors are not part of intelligence. People say, oh, you're never going to have intelligent machines, because human intelligence is social. But then you look at orangutans, you look at the octopus. The octopus never knows its parents, barely interacts with any others, and gets to be really smart in less than a year, in like half a year. In a year they're adults; in two years they're dead.
So there are things that we as humans think are intimately linked with intelligence, like social interaction, like language. And I think we give way too much importance to language as a substrate of intelligence, as humans, because we think our reasoning is so linked with language.

So to solve the house-cat intelligence problem, you think you could do it on a desert island? You could just have a cat sitting there, looking at the ocean waves, and it could figure a lot of it out?

It needs to have the right set of drives, to get it to do the thing and learn the appropriate things, right? For example, baby humans are driven to learn to stand up and walk. That desire is kind of hardwired; how to do it precisely is not, that's learned. But the desire to walk, move around, and stand up, that's probably hardwired. It's very simple to hardwire this kind of stuff.
Oh, like the desire to... well, that's interesting, that you're hardwired to want to walk. There's got to be a deeper need for walking. I would have thought it was socially imposed, that you need to walk like all the other bipedal...

A lot of simple animals would probably walk without ever watching any other member of the species.

It seems like a scary thing to have to do, because you suck at bipedal walking at first. Crawling seems much safer, much more... like, why are you in a hurry?

Well, because you have this thing that drives you to do it, which is part of human development.

Is that understood, actually?

Not entirely, no.

What's the reason to get on two feet? It's really hard. Most animals don't get on two feet.

They get on four feet. Many mammals get on four feet very quickly, some of them extremely quickly.

But from the last time I interacted with a table, that's much more stable than two legs. It's just a really hard problem.

Yeah, and how many birds have figured it out with two feet?

Well, technically, we could go into ontology; they have four... I guess they have two feet.

They have two feet. Chickens. And dinosaurs had two feet, many of them.

Allegedly. I'm just now learning that T. rex was eating grass, not other animals. T. rex might have been a friendly pet.
What do you think about, I don't know if you've looked at it, the test for general intelligence that François Chollet put together? I don't know if you got a chance to look at that kind of thing. What's your intuition about how to solve an IQ-type of test?

I don't know. It's so far outside my radar screen that I don't think it's really relevant in the short term.

Well, one way to ask it, perhaps closer to your work: how do you solve MNIST with very little example data?

Right, and the answer to that probably is self-supervised learning: just learn to represent images, and then learning to recognize handwritten digits on top of that will only require a few samples. We observe this in humans, right? You show a young child a picture book with a couple of pictures of an elephant, and that's it: the child knows what an elephant is.
And we see this today with practical systems. We train image recognition systems with enormous amounts of images, either completely self-supervised or very weakly supervised. For example, you can train a neural net to predict whatever hashtags people type on Instagram. You can do this with billions of images, because there are billions showing up per day, so the amount of training data is essentially unlimited. Then you take the output representation, a couple of layers down from the output of what the system learned, and feed it as input to a classifier for any object in the world that you want, and it works pretty well. So that's transfer learning.
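A sketch of the transfer recipe just described, assuming PyTorch and torchvision: freeze a pretrained backbone, take the features a couple of layers from the output, and train a small classifier head on top. The checkpoint and class count here are placeholders, not the actual Instagram-hashtag model:

```python
import torch
import torchvision

# Freeze a pretrained trunk and train only a small head on its features.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()          # expose the 2048-d features
for p in backbone.parameters():
    p.requires_grad = False                # freeze the pretrained trunk

classifier = torch.nn.Linear(2048, 10)     # tiny head for 10 new classes
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def train_step(images, labels):
    with torch.no_grad():
        feats = backbone(images)           # features from the pretrained net
    logits = classifier(feats)
    loss = torch.nn.functional.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```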
Or weakly supervised transfer learning, yes. People are making very fast progress using self-supervised learning in this kind of scenario as well, and my guess is that that's going to be the future.

For self-supervised learning, how much cleaning do you think is needed for filtering out malicious signal, or whatever the better term is? A lot of people use hashtags on Instagram to get good SEO, which doesn't fully represent the contents of the image. They'll put a picture of a cat and hashtag it with "science," "awesome," "fun," all kinds of things.

Why would you put "science"? That's not very good SEO. The way my colleagues who worked on this project at Facebook, now Meta, dealt with this a few years ago is that they only selected something like 17,000 tags that correspond to kinds of physical things or situations, something that has visual content. So you wouldn't have tags like #tbt or anything like that.

So they keep a very select set of hashtags, is what you're saying?

Yeah, but it's still on the order of 10,000 to 20,000, so it's fairly large.
Okay. Can you tell me about data augmentation? What the heck is data augmentation, and how is it used, maybe in contrastive learning, for video? What are some cool ideas here?

Right. So first, data augmentation is the idea of artificially increasing the size of your training set by distorting the images you have in ways that don't change the nature of the image. You can do data augmentation on MNIST, and people have done this since the 1990s: you take a digit and shift it a little bit, change its size, rotate it, skew it, add noise, et cetera. And it works: if you train a supervised classifier with augmented data, you get better results.
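In code, standard image augmentation looks something like the following torchvision pipeline; the particular distortions and parameters are just common choices, not the ones from any specific paper:

```python
import torchvision.transforms as T

# Random distortions that change the pixels but not the semantic content.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])
# view1, view2 = augment(img), augment(img)  # two "views" of the same image
```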
Now, it's become really interesting over the last couple of years, because a lot of self-supervised learning techniques for pre-training vision systems are based on data augmentation. The basic technique was originally inspired by techniques that I worked on in the early 90s, and that Geoff Hinton also worked on in the early 90s; it was sort of parallel work. I used to call this a Siamese network. Basically, you take two identical copies of the same network, sharing the same weights, and you show them two different views of the same object. Those two views may have been obtained by data augmentation, or maybe they're two views of the same scene from a camera that you moved, or at different times, or two pictures of the same person, things like that. Then you train those two identical copies of the neural net to produce output representations, vectors, that are as close to each other as possible, as identical as possible, because you want the system to learn a function that is invariant, whose output will not change when you transform the inputs in those particular ways.
That part is easy to do. What's complicated is making sure that when you show two images that are different, the system produces different things. Because if you don't have a specific provision for this, the system will just ignore the input: when you train it, it will end up ignoring the input and producing a constant vector that is the same for every input. That's called a collapse. Now, how do you avoid collapse? There are two sets of ideas. One idea, which I proposed in the early 90s with my colleagues at Bell Labs, Jane Bromley and a couple of other people, and which we now call contrastive learning, is to have negative examples: you have pairs of images that you know are different, you show them to the network, those two copies, and you push the two output vectors away from each other. That will eventually guarantee that things that are semantically similar produce similar representations, and things that are different produce different representations.
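A minimal sketch of that contrastive idea, as a classic pair-based margin loss rather than any specific paper's exact formulation: positive pairs are pulled together, negative pairs are pushed apart up to a margin:

```python
import torch
import torch.nn.functional as F

# z1, z2: (batch, dim) embeddings from the two weight-sharing copies.
def contrastive_loss(z1, z2, same: bool, margin: float = 1.0):
    d = F.pairwise_distance(z1, z2)
    if same:                      # positive pair: make embeddings close
        return (d ** 2).mean()
    # negative pair: only penalize pairs closer than the margin
    return (torch.clamp(margin - d, min=0) ** 2).mean()
```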
We actually came up with this idea for a signature verification project. We would collect multiple signatures from the same person, train a neural net to produce the same representation for them, and force the system to produce different representations for different signatures. The problem was actually brought to us by people from what was at the time a subsidiary of AT&T, called NCR. They were interested in storing a representation of the signature in the 80 bytes of the magnetic strip of a credit card, so we came up with the idea of a neural net with 80 outputs that we would quantize into bytes, so that we could encode the signature.

And that encoding was then used to compare whether a signature matches or not?

That's right. You would sign, the signature would run through the neural net, and then you would compare the output vector to whatever is stored on your card. It actually worked, but they ended up not using it, because nobody cares, actually. The American financial payment system is incredibly lax in that respect compared to Europe.

Oh, with the signatures. What's the purpose of signatures anyway? Nobody looks at them, nobody cares.
Yeah. So that's contrastive learning: you need positive and negative pairs. And the problem with it is that, even though I wrote the original paper on this, I'm actually not very positive about it, because it doesn't work in high dimensions. If your representation is high-dimensional, there are just too many ways for two things to be different, so you would need lots and lots of negative pairs. There is a particular implementation of this which is relatively recent, from the Google Toronto group, where Geoff Hinton is the senior member; it's called SimCLR, and it's basically a particular way of implementing this idea of contrastive learning, with a particular objective function.
Now, what I'm much more enthusiastic about these days is non-contrastive methods: other ways to guarantee that the representations will be different for different inputs. It's actually based on an idea that Geoff Hinton proposed in the early 90s with a student at the time, Sue Becker, and it's based on the idea of maximizing the mutual information between the outputs of the two systems. You only show positive pairs, pairs of images that you know are somewhat similar, and you train the two networks to be informative, but also to be as informative of each other as possible; basically, one representation has to be predictable from the other, essentially. He proposed that idea, had a couple of papers on it in the early 90s, and then nothing was done about it for decades. I kind of revived this idea together with my postdocs at FAIR, particularly a postdoc called Stéphane Deny.
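A hedged sketch of a non-contrastive criterion in this spirit, loosely along the lines of the Barlow Twins work from LeCun's group, but illustrative rather than the exact published method: only positive pairs are used, and collapse is avoided by decorrelating the embedding dimensions so the two representations stay mutually informative:

```python
import torch

# z1, z2: (batch, dim) embeddings of two views of the same images.
def redundancy_reduction_loss(z1, z2, lam=5e-3):
    z1 = (z1 - z1.mean(0)) / z1.std(0)        # normalize each dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    n, d = z1.shape
    c = (z1.T @ z2) / n                       # cross-correlation matrix
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()               # match the pair
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()   # decorrelate dims
    return on_diag + lam * off_diag
```

The diagonal term pulls the two views of the same image together; the off-diagonal term prevents the collapse that contrastive methods handle with negative pairs.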