Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94
13CZPWmke6A • 2020-05-08
The following is a conversation with Ilya Sutskever, co-founder and chief scientist of OpenAI, one of the most cited computer scientists in history, with over 165,000 citations, and to me one of the most brilliant and insightful minds ever in the field of deep learning. There are very few people in this world who I would rather talk to and brainstorm with about deep learning, intelligence, and life in general than Ilya, on and off the mic. This was an honor and a pleasure.

This conversation was recorded before the outbreak of the pandemic. For everyone feeling the medical, psychological, and financial burden of this crisis, I'm sending love your way. Stay strong. We're in this together. We'll beat this thing.

This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at lexfridman, spelled F-R-I-D-M-A-N.
As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience.
This show is presented by Cash App, the number one finance app in the App Store. When you get it, use code LEXPODCAST. Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with as little as one dollar. Since Cash App allows you to buy Bitcoin, let me mention that cryptocurrency in the context of the history of money is fascinating. I recommend The Ascent of Money as a great book on this history; both the book and the audiobook are great. Debits and credits on ledgers started around 30,000 years ago. The US dollar was created over 200 years ago. And Bitcoin, the first decentralized cryptocurrency, was released just over 10 years ago. So given that history, cryptocurrency is still very much in its early days of development, but it's still aiming to, and just might, redefine the nature of money.

So again, if you get Cash App from the App Store or Google Play and use the code LEXPODCAST, you get ten dollars, and Cash App will also donate ten dollars to FIRST, an organization that is helping advance robotics and STEM education for young people around the world.

And now, here's my conversation with Ilya Sutskever.
You were one of the three authors, with Alex Krizhevsky and Geoff Hinton, of the famed AlexNet paper that is arguably the paper that marked the big catalytic moment that launched the deep learning revolution. Take us back to that time. What was your intuition about neural networks, about the representational power of neural networks? And maybe you could mention how that evolved over the next few years, up to today, over the 10 years.
Yeah, I can answer that question. At some point in about 2010 or 2011, I connected two facts in my mind. Basically, the realization was this. At some point, we realized that we can train very large (I shouldn't say very, you know, they're tiny by today's standards, but large and deep) neural networks end-to-end with backpropagation. At some point, different people obtained this result; I obtained this result. The first moment in which I realized that deep neural networks are powerful was when James Martens invented the Hessian-free optimizer in 2010, and he trained a 10-layer neural network end-to-end, without pre-training, from scratch. And when that happened, I thought, this is it. Because if you can train a big neural network, a big neural network can represent a very complicated function. Because if you have a neural network with 10 layers, it's as though you allow the human brain to run for some number of milliseconds. Neuron firings are slow, and so in maybe 100 milliseconds your neurons only fire 10 times, so it's also kind of like 10 layers. And in 100 milliseconds you can perfectly recognize any object. So I already had the idea then that we need to train a very big neural network on lots of supervised data, and then it must succeed, because we can find the best neural network.
And then there's also theory that if you have more data than parameters, you won't overfit. Today we know that actually this theory is very incomplete, and you won't overfit even when you have less data than parameters, but definitely, if you have more data than parameters, you won't overfit.

So the fact that neural networks were heavily over-parameterized wasn't discouraging to you? You were thinking about the theory, that the fact that there's a huge number of parameters is okay, it's going to be okay?

I mean, there was some evidence before that it was okay-ish, but the theory was that if you had a big data set and a big neural net, it was going to work. The over-parameterization just didn't really figure much as a problem. I thought, well, with images, you're just going to add some data augmentation, and it's going to be okay.
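The data augmentation Ilya alludes to for images can be sketched minimally: random crops and horizontal flips, the kind used in the AlexNet era. This toy sketch is illustrative only; the image representation and crop sizes are assumptions, not anything from the conversation:

```python
import random

random.seed(0)

def augment(img, crop_h, crop_w):
    """Randomly crop and maybe horizontally flip a toy image.

    img is a list of rows (lists of pixel values); sizes are illustrative.
    """
    h, w = len(img), len(img[0])
    top = random.randint(0, h - crop_h)          # random crop position
    left = random.randint(0, w - crop_w)
    crop = [row[left:left + crop_w] for row in img[top:top + crop_h]]
    if random.random() < 0.5:
        crop = [row[::-1] for row in crop]       # horizontal flip
    return crop

img = [[r * 10 + c for c in range(8)] for r in range(8)]  # 8x8 toy "image"
out = augment(img, crop_h=6, crop_w=6)
print(len(out), len(out[0]))  # 6 6
```

Each call yields a slightly different view of the same image, which is the cheap way to multiply an image dataset that he is referring to.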
So where was any doubt coming from?

The main doubt was: can we train a bigger... will we have enough compute to train a big enough neural net with backpropagation? Backpropagation I thought would work. The thing that wasn't clear was whether there would be enough compute to get a very convincing result. And then at some point, Alex Krizhevsky wrote these insanely fast CUDA kernels for training convolutional neural nets, and that was, bam, let's do this, let's get ImageNet, and it's going to be the greatest thing.
Was your intuition, most of your intuition, from empirical results, by you and by others? Like, just actually demonstrating that a piece of program can train a 10-layer neural network? Or was there some pen-and-paper, or marker-and-whiteboard, thinking, intuition? Because you just connected a 10-layer large neural network to the brain. You just mentioned the brain. In your intuition about neural networks, does the human brain come into play as an intuition builder?
Definitely. I mean, you know, you've got to be precise with these analogies between artificial neural networks and the brain, but there is no question that the brain is a huge source of intuition and inspiration for deep learning researchers, all the way from Rosenblatt in the 60s. If you look at the whole idea of a neural network, it is directly inspired by the brain. You had people like McCulloch and Pitts who were saying, hey, you've got these neurons in the brain, and hey, we recently learned about the computer and automata. Can we use some ideas from the computer and automata to design some kind of computational object that's going to be simple, computational, and kind of like the brain? And they invented the neuron. So they were inspired by it back then. Then you had the convolutional neural network from Fukushima, and then later Yann LeCun, who said, hey, if you limit the receptive fields of a neural network, it's going to be especially suitable for images, as it turned out to be true. So there was a very small number of examples where analogies to the brain were successful. And I thought, well, probably an artificial neuron is not that different from the brain if you squint hard enough, so let's just assume it is and roll with it.
So now we're at a time where deep learning is very successful. So let us squint less, let's open our eyes and ask: what to you is an interesting difference between the human brain and artificial neural networks? Now, I know you're probably not an expert, neither a neuroscientist nor a biologist, but loosely speaking, what's the difference between the human brain and artificial neural networks that's interesting to you, for the next decade or two?

That's a good question to ask. What is an interesting difference between the brain and our artificial neural networks? So, I feel like today we all agree that there are certain dimensions in which the human brain vastly outperforms our models, but I also think that there are some ways in which artificial neural networks have a number of very important advantages over the brain. Looking at the advantages versus disadvantages is a good way to figure out what is the important difference.
So the brain uses spikes, which may or may not be important.

Yeah, that's a really interesting question. Do you think it's important or not? That's one big architectural difference between artificial neural networks and the brain.
It's hard to tell, but my prior is not very high, and I can say why. You know, there are people who are interested in spiking neural networks, and basically, what they figured out is that they need to simulate the non-spiking neural networks in spikes, and that's how they're going to make them work. If you don't simulate the non-spiking neural networks in spikes, it's not going to work, because the question is: why should it work? And that connects to questions around backpropagation and questions around deep learning. You've got this giant neural network: why should it work at all? Why should the learning rule work at all? It's not a self-evident question, especially if, let's say, you were just starting in the field and you read the very early papers. You can say, hey, people are saying, let's build neural networks. That's a great idea, because the brain is a neural network, so it would be useful to build neural networks. Now let's figure out how to train them. It should be possible to train them, probably, but how?
And so the big idea is the cost function. That's the big idea. The cost function is a way of measuring the performance of the system according to some measure.

By the way, that is a big... actually, let me think. Is that a difficult idea to arrive at? And how big of an idea is it, that there's a single cost function?
Sorry, let me take a pause. Is supervised learning a difficult concept to come to?

I don't know. All concepts are very easy in retrospect.

Yeah, it seems trivial now. The reason I ask that, and we'll talk about it, is: are there other things? Are there things that don't necessarily have a cost function, maybe have many cost functions, or maybe have dynamic cost functions, or maybe totally different kinds of architectures? Because we have to think like that in order to arrive at something new, right?
So the good examples of things which don't have clear cost functions are GANs. In a GAN, you have a game. So instead of thinking of a cost function which you want to optimize, where you know that you have an algorithm, gradient descent, which will optimize the cost function, and then you can reason about the behavior of your system in terms of what it optimizes, with a GAN, you say: I have a game, and I'll reason about the behavior of the system in terms of the equilibrium of the game. But it's all about coming up with these mathematical objects that help us reason about the behavior of our system.
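The first picture Ilya describes, a single cost function that gradient descent optimizes, can be illustrated with a minimal sketch. The quadratic cost here is a made-up toy, not anything from the conversation:

```python
# Minimal sketch of the "cost function + gradient descent" picture:
# define a cost that measures the system's performance, then follow
# its gradient downhill. Toy one-parameter example.

def cost(w):
    # Cost function: measures how bad parameter w is (minimum at w = 3).
    return (w - 3.0) ** 2

def grad(w):
    # Analytic gradient of the cost with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0           # start from an arbitrary parameter value
lr = 0.1          # learning rate
for _ in range(100):
    w -= lr * grad(w)   # one gradient descent step

print(round(w, 3))  # converges toward the minimizer, 3.0
```

The point of the contrast with GANs is that here you can reason about where the system ends up purely in terms of what the cost function's minimum is; in a game there is no single such quantity, only an equilibrium.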
Right, that's really interesting. Is the GAN the only one? It's kind of... the cost function is emergent from the comparison?

I don't know if it has a cost function. I don't know if it's meaningful to talk about the cost function of a GAN. It's kind of like the cost function of biological evolution, or the cost function of the economy. You can talk about regions toward which it will go, but I don't think the cost function analogy is the most useful.

That's really interesting. So if evolution doesn't really have a cost function, like a cost function based on something akin to our mathematical conception of a cost function, then do you think cost functions in deep learning are holding us back? You just kind of mentioned that the cost function is a nice, first, profound idea. Do you think that's a good idea? Do you think it's an idea we'll go past?
Self-play starts to touch on that a little bit, in reinforcement learning systems.

That's right. Self-play, and also ideas around exploration, where you're trying to take actions that surprise a predictor. I'm a big fan of cost functions. I think cost functions are great and they serve us really well, and I think that whenever we can do things with cost functions, we should. And, you know, maybe there is a chance that we will come up with yet another profound way of looking at things that will involve cost functions in a less central way. But I don't know. I would not bet against cost functions.
Are there other things about the brain that pop into your mind that might be different and interesting for us to consider in designing artificial neural networks? So we talked about spiking a little bit.

I mean, one thing which may potentially be useful: I think neuroscientists figured out something about the learning rule of the brain. I'm talking about spike-timing-dependent plasticity, and it would be nice if some people were to study that in simulation.

Wait, sorry, spike-timing-dependent plasticity? What's that?
STDP. It's a particular learning rule that uses spike timing to determine how to update the synapses. So it's kind of like: if a synapse fires into the neuron shortly before the neuron fires, then it strengthens the synapse, and if the synapse fires into the neuron shortly after the neuron fires, then it weakens the synapse. Something along those lines. I'm 90 percent sure it's right, so if I said something wrong here, don't get too angry.
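The rule described above can be sketched as a toy pairwise update. The constants and the exponential time window are illustrative assumptions (and, as Ilya himself hedges, the verbal description is approximate):

```python
import math

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Toy pairwise STDP rule. Constants are illustrative, not from the episode.

    t_pre:  time (ms) the presynaptic spike arrives at the synapse
    t_post: time (ms) the postsynaptic neuron fires
    """
    dt = t_post - t_pre
    if dt > 0:
        # Pre fires shortly before post: strengthen the synapse (potentiation).
        w += a_plus * math.exp(-dt / tau)
    elif dt < 0:
        # Pre fires shortly after post: weaken the synapse (depression).
        w -= a_minus * math.exp(dt / tau)
    return w

w = 0.5
w = stdp_update(w, t_pre=10.0, t_post=12.0)  # pre before post: w increases
w = stdp_update(w, t_pre=15.0, t_post=12.0)  # pre after post: w decreases
print(w)
```

The exponential factor captures the "shortly before / shortly after" part of the description: the closer the two spike times, the larger the change.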
But you sounded brilliant while saying it. So the timing, that's one thing that's missing. The temporal dynamics is not captured. I think that's, like, a fundamental property of the brain, the timing of the signals.

Well, you have recurrent neural networks.

But you think of that as... I mean, that's a very crude, simplified... what's that called? There's a clock, I guess, to recurrent neural networks. It seems like the brain is the general, the continuous version of that, the generalization, where all possible timings are possible, and then within those timings is contained some information. Do you think recurrent neural networks, the recurrence in recurrent neural networks, can capture the same kind of phenomena as the timing that seems to be important for the brain, in the firing of neurons in the brain?

I mean, I think, regarding neurons, recurrent neural networks are amazing, and I think they can do anything we'd want a system to do. Right now, recurrent neural networks have been superseded by transformers, but maybe one day they'll make a comeback. Maybe they'll be back. We'll see.
Let me, on a small tangent, ask: do you think they'll be back? So much of the breakthroughs recently that we'll talk about, in natural language processing and language modeling, has been with transformers, which don't emphasize recurrence. Do you think recurrence will make a comeback?

Well, some kind of recurrence, I think, very likely. Recurrent neural networks as they're typically thought of, for processing sequences, I think it's also possible.

What is, to you, a recurrent neural network? Generally speaking, what is a recurrent neural network?
You have a neural network which maintains a high-dimensional hidden state, and then, when an observation arrives, it updates its high-dimensional hidden state through its connections in some way.
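The definition Ilya gives, a high-dimensional hidden state updated through the network's connections as each observation arrives, can be sketched as a minimal vanilla RNN step. The dimensions and random weights here are illustrative assumptions, not from the conversation:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_dim, obs_dim = 8, 3
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden
W_x = rng.normal(scale=0.1, size=(hidden_dim, obs_dim))     # observation-to-hidden
b = np.zeros(hidden_dim)

def step(h, x):
    # When an observation x arrives, update the hidden state h
    # through the network's connections.
    return np.tanh(W_h @ h + W_x @ x + b)

h = np.zeros(hidden_dim)                  # initial hidden state
for x in rng.normal(size=(5, obs_dim)):   # a sequence of 5 observations
    h = step(h, x)

print(h.shape)  # (8,)
```

An LSTM or GRU replaces `step` with a gated update, but the shape of the idea, state in, observation in, new state out, is exactly this.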
So, you know, that's what expert systems did, right? Symbolic AI, the knowledge base. Growing a knowledge base is maintaining a hidden state, which is its knowledge base, and it's growing it by sequential processing. Do you think of it more generally in that way, or is it simply the more constrained form of a hidden state, with certain kinds of gating units, that we think of today with LSTMs?

I mean, the hidden state is technically what you described there, the hidden state that goes inside the LSTM or the RNN or something like this. But then, what should be contained? If you want to make the expert system analogy, I mean, you could say that the knowledge is stored in the connections, and then the short-term processing is done in the hidden state.

Yes, could you say that?

Yeah.

So, sort of, do you think there's a future of building large-scale knowledge bases within the neural networks?

Definitely.
We're going to pause on that confidence, because I want to explore that. Well, let me zoom back out and ask, back to the history of ImageNet: neural networks have been around for many decades, as you mentioned. What do you think were the key ideas that led to their success, that ImageNet moment and beyond, the success in the past 10 years?

Okay, so the question is, just to make sure I didn't miss anything: the key ideas that led to the success of deep learning over the past 10 years?

Exactly. Even though the fundamental thing behind deep learning has been around for much longer.
So, the key idea about deep learning, or rather, the key fact about deep learning before deep learning started to be successful, is that it was underestimated. People who worked in machine learning simply didn't think that neural networks could do much. People didn't believe that large neural networks could be trained. There was a lot of debate going on in machine learning about what the right methods are, and so on, and people were arguing because there was no way to get hard facts. And by that I mean there were no benchmarks which were truly hard, such that if you do really well on them, then you can say: look, here is my system. That's when this field becomes a little bit more of an engineering field. So, in terms of deep learning, to answer the question directly: the ideas were all there. The thing that was missing was a lot of supervised data and a lot of compute. Once you have a lot of supervised data and a lot of compute, then there is a third thing which is needed as well, and that is conviction. Conviction that if you take the right stuff, which already exists, and mix it with a lot of data and a lot of compute, it will in fact work. And so that was the missing piece: you needed the data, you needed the compute, which showed up in terms of GPUs, and you needed the conviction to realize that you need to mix them together.
That's really interesting. So I guess the presence of compute and the presence of supervised data allowed the empirical evidence to do the convincing of the majority of the computer science community. I guess there was a key moment with Jitendra Malik and Alyosha Efros, who were very skeptical, right? And then there's Geoffrey Hinton, who was the opposite of skeptical. And there was a convincing moment, and I think ImageNet served as that moment.

That's right.

And they represented the big pillars of the computer vision community. Kind of the wizards got together, and then all of a sudden there was a shift. And it's not enough for the ideas to all be there and the compute to be there; it has to convince the cynicism that existed. It's interesting that people just didn't believe for a couple of decades.
Yeah, well, it's more than that. The way it's been put, it sounds like: well, you know, those silly people who didn't believe, what were they missing? But in reality, things were confusing, because neural networks really did not work on anything, and they were not the best method on pretty much anything as well. And it was pretty rational to say: yeah, this stuff doesn't have any traction. And that's why you need to have these very hard tasks which produce undeniable evidence, and that's how we make progress. And that's why the field is making progress today, because we have these hard benchmarks which represent true progress. And this is why we are able to avoid endless debate.
So, incredibly, you've contributed to some of the biggest recent ideas in AI: in computer vision, language, natural language processing, reinforcement learning, sort of everything in between, maybe not GANs. There may not be a topic you haven't touched, and of course the fundamental science of deep learning. What is the difference to you between vision, language, and, as in reinforcement learning, action, as learning problems? And what are the commonalities? Do you see them as all interconnected, or are they fundamentally different domains that require different approaches?
Okay, that's a good question. Machine learning is a field with a lot of unity, a huge amount of unity.

What do you mean by unity? Like, overlap of ideas?

Overlap of ideas, overlap of principles. In fact, there are only one or two or three principles, which are very, very simple, and then they apply in almost the same way to the different modalities, to the different problems. And that's why today, when someone writes a paper on improving optimization of deep learning in vision, it improves the different NLP applications, and it improves the different reinforcement learning applications.
So I would say that computer vision and NLP are very similar to each other. Today they differ in that they have slightly different architectures: we use transformers in NLP and we use convolutional neural networks in vision. But it's also possible that one day this will change and everything will be unified with a single architecture. Because if you go back a few years ago in natural language processing, there were a huge number of architectures; every different tiny problem had its own architecture. Today, there's just one transformer for all those different tasks. And if you go back in time even more, you had even more fragmentation, and every little problem in AI had its own little sub-specialization and its own little collection of skills, people who would know how to engineer the features. Now it's all been subsumed by deep learning; we have this unification. And so I expect vision to become unified with natural language as well. Or rather, I shouldn't say expect; I think it's possible. I don't want to be too sure, because I think the convolutional neural net is very computationally efficient. RL is different. RL does require slightly different techniques, because you really do need to take actions, you really do need to do something about exploration, and your variance is much higher. But I think there is a lot of unity even there, and I would expect, for example, that at some point there will be some broader unification between RL and supervised learning, where somehow the RL will be making decisions to make the supervised learning go better. And it will be, I imagine, one big black box, and you just shovel things into it, and it just figures out what to do with whatever you shovel at it.
I mean, reinforcement learning has some aspects of language and vision combined, almost. There are elements of a long-term memory that you should be utilizing, and there are elements of a really rich sensory space. So it seems like it's the union of the two, or something like that.

I'd say something slightly different. I'd say that reinforcement learning is neither, but it naturally interfaces and integrates with the two of them.

Do you think action is fundamentally different? So, yeah, what is interesting, what is unique about the policy of learning to act?

Well, so one example, for instance, is that when you learn to act, you are fundamentally in a non-stationary world, because as your actions change, the things you see start changing. You experience the world in a different way. And this is not the case for the more traditional static problem, where you have some distribution and you just apply a model to that distribution.
Do you think it's a fundamentally different problem, or is it just a more difficult generalization of the problem of understanding?

I mean, it's a question of definitions, almost. There is a huge amount of commonality, for sure. You take gradients, you try to approximate gradients in both cases. In the case of reinforcement learning, you have some tools to reduce the variance of the gradients, and you do that. There's lots of commonality: you use the same neural net in both cases, you compute the gradient, you apply Adam in both cases. So, I mean, there's lots in common, for sure, but there are some small differences which are not completely insignificant. It's really just a matter of your point of view, what frame of reference, how much you want to zoom in or out as you look at these problems.
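The variance-reduction tools Ilya mentions for reinforcement learning gradients can be illustrated with the classic baseline trick for score-function (REINFORCE-style) gradient estimates. This toy Gaussian policy and reward are made up for illustration, not from the conversation:

```python
import random

random.seed(0)

# Toy score-function gradient estimate for a Gaussian policy with mean
# theta = 0, sigma = 1, and reward(a) = a + 5. Here grad log p(a) = a,
# so each sample's estimate is (reward - baseline) * a. Subtracting a
# baseline leaves the expectation unchanged but can cut the variance.

def estimates(baseline, n=100_000):
    out = []
    for _ in range(n):
        a = random.gauss(0.0, 1.0)        # sampled action
        reward = a + 5.0                   # toy reward, offset on purpose
        out.append((reward - baseline) * a)
    return out

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

m_no, v_no = mean_var(estimates(baseline=0.0))  # high-variance estimate
m_bl, v_bl = mean_var(estimates(baseline=5.0))  # baseline = mean reward
print(v_bl < v_no)  # True: same gradient in expectation, far less variance
```

Both estimators target the same true gradient (1.0 in this toy setup); only the spread of the samples changes, which is exactly the kind of tool that matters when, as he says, "your variance is much higher."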
Which problem do you think is harder? People like Noam Chomsky believe that language is fundamental to everything, so it underlies everything. Do you think language understanding is harder than visual scene understanding, or vice versa?
I think that asking if a problem is hard is slightly wrong. I think the question is a little bit wrong, and I want to explain why. What does it mean for a problem to be hard? Okay, the non-interesting, dumb answer to that is: there's a benchmark, there's a human-level performance on that benchmark, and there's the effort required to reach the human level.

Okay, so from the perspective of how much effort until we get to human level on a very good benchmark?

Yeah. I understand what you mean by that. So what I was going to say is that a lot of it depends on... you know, once you solve a problem, it stops being hard, and that's always true. And so, whether something is hard or not depends on what our tools can do today. So, you know, you can say: today, true human-level language understanding and visual perception are hard, in the sense that there is no way of solving the problem completely in the next three months. So I agree with that statement. Beyond that, my guess would be as good as yours. I don't know.
Oh, okay. So you don't have a fundamental intuition about how hard language understanding is?

I think... I know, I changed my mind. Let's say language is probably going to be harder. I mean, it depends on how you define it. Like, if you mean absolute, top-notch, 100 percent language understanding, I'll go with language.

But then, if I show you a piece of paper with letters on it... you see what I mean? You have a vision system, you say it's the best human-level vision system. I open a book and I show you letters. Will it understand how these letters form into words and sentences and meaning? Is this part of the vision problem? Where does vision end and language begin?
Yeah, so Chomsky would say it starts at language. So vision is just a little example of the kind of structure and fundamental hierarchy of ideas that's already represented in our brain somehow, that's represented through language. But where does vision stop and language begin?

That's a really interesting question. So, one possibility is that it's impossible to achieve really deep understanding in either images or language without basically using the same kind of system, so you're going to get the other for free. I think it's pretty likely that, yes, if we can get one, our machine learning is probably that good that we can get the other. But I'm not 100 percent sure. And also, I think a lot of it really does depend on your definitions, definitions of, like, perfect vision. Because, really, you know, reading is vision, but should it count?
Yeah, to me, my definition is: if a system looked at an image, and then the system looked at a piece of text, and then told me something about that, and I was really impressed...

That's relative. You'll be impressed for half an hour, and then you're going to say: well, I mean, all the systems do that, but here's the thing they don't do.

Yeah, but I don't have that with humans. Humans continue to impress me.

Is that true?

Well, okay, so I'm a fan of monogamy. I like the idea of marrying somebody, being with them for several decades. So I believe in the fact that, yes, it's possible to have somebody continuously giving you pleasurable, interesting, witty, new ideas.

Friends, yeah.

I think so. They continue to surprise you.

The surprise, you know, that injection of randomness, seems to be a nice source of continued inspiration, like the wit, the humor. I think that would be... it's a very subjective test, but I think if you have enough humans in the room...

Yeah, I understand what you mean. I feel like I misunderstood what you meant by impressing you. I thought you meant to impress you with its intelligence, with how well it understands an image. I thought you meant something like: I'm going to show it a really complicated image, and it's going to get it right, and you're going to say, wow, that's really cool. Systems of, you know, January 2020 have not been doing that.

Yeah, no, I think it all boils down to, like, the reason people click "like" on stuff on the internet, which is that it makes them laugh. So it's, like, humor, or wit, or insight.

I'm sure we'll get that as well.
So, forgive the romanticized question, but looking back, to you, what is the most beautiful or surprising idea in deep learning, or AI in general, that you've come across?

I think the most beautiful thing about deep learning is that it actually works. And I mean it, because you got these ideas, you got the little neural network, you got the backpropagation algorithm, and then you got some theories as to, you know, this is kind of like the brain, so maybe if you make it large, if you make the neural network large and you train it on a lot of data, then it will do the same function the brain does. And it turns out to be true. That's crazy. And now we just train these neural networks, and you make them larger, and they keep getting better. And I find it unbelievable. I find it unbelievable that this whole AI stuff with neural networks works.
Have you built up an intuition of why? Are there little bits and pieces of intuitions, of insights, of why this whole thing works?

I mean, some, definitely. We now have huge amounts of empirical reasons to believe that optimization should work on most problems we care about.

So you just said empirical evidence. Is it mostly empirical evidence that convinces you? It's like evolution is empirical: it shows you that, look, this evolutionary process seems to be a good way to design organisms that survive in their environment. But it doesn't really get you to the insights of how the whole thing works.
I think a good analogy is physics. You know how you say: hey, let's do some physics calculation and come up with some new physics theory and make some prediction? But then you've got to run the experiment. You know, you've got to run the experiment; it's important. So it's a bit the same here, except that maybe sometimes the experiment came before the theory. But it still is the case: you know, you have some data, and you come up with some prediction. You say: yeah, let's make a big neural network, let's train it, and it's going to work much better than anything before it, and it will in fact continue to get better as you make it larger. And it turns out to be true. That's amazing, when a theory is validated like this. You know, it's not a mathematical theory; it's more of a biological theory, almost. So I think there are not-terrible analogies between deep learning and biology. I would say it's like the geometric mean of biology and physics. That's deep learning.

The geometric mean of biology and physics. I think I'm going to need a few hours to wrap my head around that. Because just to find the set of what biology represents...
well in biology things are
really complicated
it's really hard to have good predictive
theories and in physics the theories
are too good
in physics people make these
super precise theories which make these
amazing predictions
and machine learning is
kind of in between but
it'd be nice if machine learning somehow
helped us discover the unification of
the two as opposed to some of the
in-between
but you're right that's you're you're
kind of trying to juggle both
so do you think there's still beautiful
and mysterious properties in neural
networks that are yet to be discovered
definitely i think that we are still
massively underestimating deep learning
what do you think it will look like
if i knew i would have done it
yeah
but if you look at all the progress from
the past 10 years i would say
there have been a few cases where
things that
felt like really new ideas showed up but
by and large it was
every year we thought okay deep learning
goes this far nope it actually goes
further
and then the next year okay now
this is peak deep learning we
are really done nope
goes further it just keeps going further
each year so that means that we keep
underestimating we keep not
understanding it
has surprising properties all the time do
you think it's getting harder and harder
to make progress
it depends on what we mean i think the
field will continue to make
very robust progress for quite a while
i think for individual researchers
especially people who are doing
research it can be harder because
there is a very large number of
researchers right now
i think that if you have a lot of
compute then you can make
a lot of very interesting discoveries
but then you have to deal with
the challenge of managing
a huge compute cluster to run your
experiments so it's a little bit harder
so i'm asking all these questions that
nobody knows the answer to
but you're one of the smartest people i
know so i'm going to keep asking
so let's imagine all the
breakthroughs that happen in the next 30
years in deep learning
do you think most of those breakthroughs
can be done by one person
with one computer sort of in the space
of breakthroughs do you think
compute and large efforts will be
necessary
i mean i can't be sure when you say one
computer you mean
how large
you're clever i mean
one gpu
i see i think it's pretty unlikely
i think it's pretty unlikely i think
that there are many
the stack of deep learning is starting
to be quite deep
if you look at it you've got all the way
from
the ideas the systems to build the data
sets
the distributed programming the building
the actual cluster
the gpu programming putting it all
together so now the stack is getting
really deep and i think it becomes
it can be quite hard for a single person
to be world class in every
single layer of the stack
what about what vladimir vapnik
really insists on which is taking
mnist and trying to learn from very few
examples
so being able to learn more efficiently
do you think there'll be
breakthroughs in that space that
may not need the huge compute
i think there will be a large number of
breakthroughs in general that will not
need a huge amount of compute
so maybe i should clarify that i think
that some breakthroughs will require a
lot of compute
and i think building systems which
actually do things will require a huge
amount of compute
that one is pretty obvious if you want
to do x
and x requires a huge neural net
you've got to get a huge neural net
but i think there is lots of room for
very important work being done by small
groups and individuals
so maybe on the topic of
the science of deep learning
let's talk about one of the recent papers
that you released
sure
deep double descent where
bigger models
and more data hurt i think it's a really
interesting paper can you
describe the main idea
yeah definitely so what happened is that
over the years some small number of
researchers noticed that
it is kind of weird that when you make
the neural network larger it works
better and it seems to go in
contradiction with statistical ideas
and then some people made an analysis
showing that actually you got this
double descent bump
and what we've done was to show that
double descent occurs
for pretty much all practical
deep learning systems
and that it also
so can you step back
what's the x-axis and the y-axis of a
double descent plot
okay great so you can
do things like take a neural
network
and you can start increasing its size
slowly while keeping your data set fixed
so if you increase the size of the
neural network slowly
and if you don't do early stopping
that's a pretty important
detail then
when the neural network is really small
you make it larger you get a very rapid
increase in performance
then you continue to make it large and
at some point performance will get worse
and it gets the worst
exactly at the point at which it
achieves
zero training error precisely zero
training loss
and then as you make it large it starts
to get better again and it's kind of
counter-intuitive because you'd expect
deep learning phenomena to be
monotonic and
it's hard to be sure what it means but
it also occurs in the case of linear
classifiers and the intuition basically
boils down to the following
when you have
a large data set and a small model
then tiny random
so basically what
is overfitting
overfitting is when your model
is somehow very sensitive to the small
random
unimportant stuff in your
training data set
precisely
so if you have a small model and you
have a big data set
and there may be some random thing you
know some training cases are randomly in
the data set and others may not be there
but the small model is
kind of insensitive to this randomness
because
there is pretty much
no uncertainty about the model
when the data is that large
so okay so at
the very basic level to me
the most surprising thing is that
neural networks don't overfit every time
very quickly
before ever being able to learn anything
given the huge number of parameters
so there is one way okay so
let me try to give an
explanation
that might work so
suppose
you have a huge neural network
with a huge number of parameters
and now let's pretend everything is
linear which it is not let's just pretend
then there is this big subspace where the
neural network achieves zero error
and sgd is going to find approximately
the point
the point
that's right approximately the point
with the smallest norm in that subspace
okay and that can also be proven to be
insensitive to
the small randomness in the data when
the dimensionality is high
but when the dimensionality of the data
is equal to the dimensionality of the
model
then there is a one-to-one
correspondence between all the data sets
and the models so small changes in the
data set actually lead to large changes
in the model and that's why performance
gets worse
so this is the best explanation more or
less
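ilya's minimum-norm intuition above can be reproduced numerically. the following is a minimal sketch using misspecified linear regression, where "model size" is the number of input features the model is allowed to see; all dimensions, seeds, and constants are illustrative choices, not anything from the deep double descent paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 1000, 100

# ground truth uses all 100 input features, so a model that only
# sees the first k features is misspecified: the rest acts as noise
true_w = rng.normal(size=d)
x_train = rng.normal(size=(n_train, d))
x_test = rng.normal(size=(n_test, d))
y_train = x_train @ true_w + rng.normal(size=n_train)
y_test = x_test @ true_w + rng.normal(size=n_test)

errors = {}
for k in [5, 10, 20, 40, 70, 100]:  # "model size" = features used
    # minimum-norm least-squares fit: the point sgd would converge to
    # (approximately) in this linear setting
    coef = np.linalg.pinv(x_train[:, :k]) @ y_train
    errors[k] = np.mean((x_test[:, :k] @ coef - y_test) ** 2)

# test error typically peaks near k == n_train (40), the point where
# the model first reaches exactly zero training error, and then
# descends again as k grows past it
```

the peak at k == n_train is the one-to-one correspondence ilya describes: with as many parameters as data points, small changes in the data cause large changes in the fitted model.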
so then it would be good for the model
to have more parameters
so to be bigger than the data
that's right but
only if you don't early stop if you
introduce early stopping or
regularization you can make the double
descent bump
almost completely disappear
what is early stopping
early stopping is when
you train your model and you monitor
your validation performance
and then if at some point validation
performance starts to get worse you say
okay let's stop training
you're good enough
so the magic happens after that
moment so you don't want to do the early
stopping
well if you don't do the early stopping
you get a very
pronounced double descent
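the procedure just described can be sketched as a training loop that watches validation loss and stops once it stops improving; the toy data, the patience of 5, and the 1e-9 tolerance are all illustrative choices, not anything from the episode:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 30))
true_w = np.zeros(30)
true_w[:3] = [2.0, -1.0, 0.5]        # only 3 of 30 inputs matter
y = x @ true_w + rng.normal(size=100)

x_train, y_train = x[:60], y[:60]
x_val, y_val = x[60:], y[60:]

w = np.zeros(30)
best_val, best_w = np.inf, w.copy()
patience, bad_epochs = 5, 0
for epoch in range(10_000):
    grad = 2 * x_train.T @ (x_train @ w - y_train) / len(y_train)
    w -= 0.01 * grad
    val_loss = np.mean((x_val @ w - y_val) ** 2)
    if val_loss < best_val - 1e-9:    # still improving: keep going
        best_val, best_w, bad_epochs = val_loss, w.copy(), 0
    else:                             # validation worse or flat
        bad_epochs += 1
        if bad_epochs >= patience:    # okay let's stop training
            break
w = best_w                            # keep the best checkpoint
```

keeping the best checkpoint rather than the final weights is the standard way to implement "okay let's stop training, you're good enough".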
do you have any intuition why this
happens double descent
oh sorry early stopping
no the double descent
oh yeah so
let's see the intuition is basically
this
that when the data set has as many
degrees of freedom
as the model then there is a one-to-one
correspondence between them
and so small changes to the data set
lead to noticeable changes
in the model so your model is very
sensitive to all the randomness it is
unable to discard it
whereas it turns out that when you have
a lot more data than parameters or a lot
more parameters than data
the resulting solution will be
insensitive to small changes in the data
set
so it's able to
that's nicely put
discard
the small changes the randomness
exactly the
spurious correlations which you
don't want
jeff hinton suggested we need to throw
away back propagation we already kind of
talked about this a little bit but
he suggested that we just throw away
back propagation and start over
i mean of course some of that is a
little bit of humor
but what do you think what
could be an alternative method of
training neural networks
well the thing that he said precisely is
that to the extent you can't find back
propagation in the brain
it's worth seeing if we can learn
something from how the brain
learns but back propagation is very
useful and we should keep using it
oh you're saying that once we discover
the mechanism of learning in the brain
or any aspects of that mechanism we
should
also try to implement that in neural
networks if it turns out that we can't
find back propagation in the brain
if we can't find back propagation in the
brain
well so i guess your answer to that is
back propagation is pretty damn useful
so why are we complaining i mean i
personally am a big fan of back
propagation i think it's a great
algorithm because it solves an extremely
fundamental problem which is
finding a neural circuit
subject to some constraints and i don't
see that problem going away so that's
why i
really think it's pretty unlikely
that we'll have anything which is going
to be
dramatically different it could happen
but i wouldn't bet on it right now
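the "finding a neural circuit subject to constraints" framing can be made concrete by writing back propagation out by hand for a tiny two-layer network learning xor; the architecture, sizes, and learning rate below are illustrative, not anything discussed in the episode:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])           # xor targets

def with_bias(a):
    return np.hstack([a, np.ones((len(a), 1))])  # append a bias input

w1 = rng.normal(size=(3, 8))   # input (+bias) -> 8 hidden units
w2 = rng.normal(size=(9, 1))   # hidden (+bias) -> 1 output

for step in range(20_000):
    # forward pass: the "circuit" being searched for
    h = np.tanh(with_bias(x) @ w1)
    pred = with_bias(h) @ w2
    err = pred - y                       # d(mse)/d(pred), up to a constant
    # backward pass: chain rule applied one layer at a time
    grad_w2 = with_bias(h).T @ err
    grad_h = (err @ w2.T)[:, :-1]        # drop the bias column
    grad_w1 = with_bias(x).T @ (grad_h * (1 - h ** 2))  # tanh' = 1 - tanh^2
    w2 -= 0.03 * grad_w2
    w1 -= 0.03 * grad_w1
```

the constraint here is the squared-error loss on the four xor cases; gradient descent through the chain rule adjusts the circuit until it satisfies them.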
so let me ask a sort of big picture
question
do you think can do you think neural
networks can be made to reason
why not well if you look for example at
alphago or alpha zero
the neural network of alpha zero plays
go
which we all agree is a game that
requires reasoning
better than 99.9 percent of all humans
just the neural network without this
search just the neural network itself
doesn't that give us an existence proof
that neural networks can reason
to push back and disagree a little bit
on the we all agree that
go is reasoning part
i don't think it's trivial so
obviously reasoning like intelligence
is a loose gray area term
a little bit maybe you disagree with
that but
yes i think it has some of the same
elements of
reasoning is almost akin
to search
right there's a sequential element of
stepwise consideration of possibilities
and sort of building on top of those
possibilities in a sequential manner
until you arrive at some insight
so yeah i guess playing go is kind of
like that and when you have a single
neural network doing that without search
that's kind of like that so there's an
existence proof in a particular
constrained environment
that a process akin to what
many people call reasoning exists but
for more general kinds of reasoning so off
the board there is one other existence
proof
oh boy which one
us humans yes okay all right so
do you think the architecture
that will allow neural networks to
reason
will look similar to the neural network
architectures we have today
i think it will well i don't
want to make
overly definitive statements i think
it's definitely possible that
the neural networks that will produce
the reasoning breakthroughs of the
future will be
very similar to the architectures that
exist today maybe
a little bit more recurrent maybe a little
bit deeper but
these neural nets are so
insanely powerful
why wouldn't they be able to learn to
reason humans can reason
so why can't neural networks so do you
think the kind of stuff we've seen
neural networks do is a kind of just
weak reasoning so it's not a
fundamentally different process
again this is stuff nobody
knows the answer to
so when it comes to our neural networks
what i would say is that neural
networks are capable of reasoning
but if you train a neural network on a
task which doesn't require reasoning
it's not going to reason this is a
well-known effect where the neural
network will solve
exactly the problem
that you pose in front of it
in the easiest way possible
right that takes us to
one of the brilliant ways you
describe neural networks which is
you refer to neural networks as the
search for small circuits
and maybe general intelligence
as the search for small programs
which i found to be a very
compelling metaphor can you elaborate on
that difference
yeah so the thing which i said precisely
was that
if you can find the shortest program
that outputs the data at your
disposal
then you will be able to use it to make
the best prediction possible
and that's a theoretical statement which
can be proven mathematically
now you can also prove mathematically
that
finding the shortest program which
generates some data
is not a computable operation
no finite amount of compute can do
this
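the two statements ilya refers to are standard results from algorithmic information theory (kolmogorov complexity and solomonoff induction), not something introduced in this episode; written out, with U a universal computer and |p| the length of program p:

```latex
% length of the shortest program that outputs the data x
K(x) = \min \{\, |p| \;:\; U(p) = x \,\}

% solomonoff's prior: weight every program that reproduces the
% observed data, shorter programs exponentially more heavily
M(x) = \sum_{p \,:\, U(p) \text{ starts with } x} 2^{-|p|}

% prediction with M is essentially optimal, but computing K or M
% requires solving the halting problem, so no finite amount of
% compute can do it in general
```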
so then with neural networks
neural networks are the next best thing
that actually works in practice
we are not able to find the
shortest program which generates our
data
but we are able to find you know a small
but now
that statement should be amended
even a large circuit
which fits our data in some way
well i think what you meant by this small
circuit is the smallest
needed circuit
well i see the thing
which i would change now back
then i hadn't fully
internalized the
overparameterized results the things
we know about overparameterized
neural nets
now i would phrase it as a large circuit
whose weights contain a small
amount of information
which i think is what's going on if you
imagine the training process of a neural
network as you slowly transmit entropy
from the data set to the parameters
then somehow the amount of information
in the weights
ends up being not very large which would
explain why they generalized so well
so the large circuit might
be one that's
helpful for the regularization for the
generalization yeah some of this
but do you see it important to be able to
try to learn something like programs
i mean definitely i think
the answer is
kind of yes if we can do it we should
do it
it's the reason we are pushing on
deep learning
the fundamental reason the
root cause
is that we are able to train them so in
other words training comes first
we've got our pillar which is the
training pillar
and now we are trying to contort our
neural networks around the training
pillar we've got to stay trainable this
is an invariant we cannot
violate
and so being trainable means
starting from scratch knowing nothing
you can actually pretty quickly converge
towards knowing a lot
or even slowly but it means that given
the resources at your disposal
you can train the neural net and get it
to achieve
useful performance yeah that's a pillar
we can't move away from that's right
because whereas if you say
hey let's find the shortest program
we can't do that so it doesn't
matter how useful
that would be we can't do it
so do you think you kind of mentioned
that neural networks are good at
finding small circuits or large circuits
do you think then the matter of finding
small programs
is just the data
no
sorry not the size or character
the type of data sort of asking
it to give programs
well i think the thing is that right now
there are no good precedents of
people successfully finding
programs really well and so the way
you'd find programs is you'd
train a deep neural network to do it
basically right
which is the right way to go
about it but there are no good
illustrations yet it hasn't been
done yet but
in principle it should be possible
can you elaborate a little bit
what's your insight in principle
put another way you don't see why
it's not possible
well it's more
a statement of
i think that it's
unwise to bet against deep learning and
if it's a cognitive function
that humans seem to be able to do
then it doesn't take too long for
some deep neural net to pop up that can
do it too
yeah i'm there with you
i've stopped betting against neural
networks
at this point because they continue to
surprise us
what about long-term memory can neural
networks have long-term memory or
something like
knowledge bases so being able to
aggregate
important information over long periods
of time
that would then serve as useful
sort of representations of state
that you can make decisions by so
have a long-term context based on which
you make the decision
so in some sense the parameters already
do that
the parameters are an aggregation of
the entirety of the neural net's
experience and so they count as
long-term knowledge
and people have trained various neural
nets to act as knowledge bases and
people
have investigated language models as
knowledge bases so
there is work there yeah
but in some sense
do you think
it's all just a matter of coming
up with a better mechanism of forgetting
the useless stuff
and remembering the useful stuff because
right now i mean there haven't
been mechanisms that remember really
long-term information
what do you mean by that precisely
i like the word precisely so
i'm thinking of the kind of compression
of information the knowledge bases