Transcript
13CZPWmke6A • Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94
Kind: captions
Language: en
the following is a conversation with
Ilya Sutskever,
co-founder and chief scientist of open
ai one of the most cited computer
scientists in history
with over 165,000 citations,
and to me one of the most brilliant and
insightful minds
ever in the field of deep learning there
are very few people in this world
who i would rather talk to and
brainstorm with about deep learning
intelligence and life in general than
Ilya,
on and off the mic this was an honor
and a pleasure this conversation was
recorded before the outbreak of the
pandemic
for everyone feeling the medical
psychological and financial burden of
this crisis
i'm sending love your way stay strong
we're in this together
we'll beat this thing this is the
artificial intelligence podcast
if you enjoy it subscribe on youtube
review it with five stars on Apple
Podcasts,
support it on patreon or simply connect
with me on twitter
at Lex Fridman, spelled F-R-I-D-M-A-N.
as usual, i'll do a few minutes of ads now
and never any ads in the middle that can
break the flow of the conversation
i hope that works for you and doesn't
hurt the listening experience
this show is presented by cash app the
number one finance app in the app store
when you get it use code lex podcast
cash app lets you send money to friends
buy bitcoin
invest in the stock market with as
little as one dollar
since cash app allows you to buy bitcoin
let me mention that cryptocurrency in
the context of the history of money
is fascinating. i recommend The Ascent of
Money as a great book on this history;
both the book and audiobook are great
debits and credits on ledgers
started around 30,000 years ago, the US
dollar
created over 200 years ago and bitcoin
the first decentralized cryptocurrency
released just over 10 years ago
so given that history cryptocurrency is
still very much in its early days of
development
but it's still aiming to and just might
redefine
the nature of money. so again, if you get
Cash App from the App Store or Google Play
and use the code lex podcast, you get ten
dollars,
and Cash App will also donate ten dollars
to first
an organization that is helping advance
robotics and stem education
for young people around the world and
now
here's my conversation with ilya
you were one of the three authors, with
Alex Krizhevsky and
Geoff Hinton, of the famed AlexNet paper
that is arguably the paper that marked
the big
catalytic moment that launched the deep
learning revolution
at that time take us back to that time
what was your intuition about
neural networks about the
representational power of neural
networks
and maybe you could mention how did that
evolve over
the next few years up to today over the
10 years
yeah i can answer that question at some
point in about 2010 or 2011
i connected two facts in my mind
basically
the realization was this at some point
we realized that we can train
very large i shouldn't say very you know
they're tiny by today's standards but
large and deep neural networks end to
end with back propagation
at some point different people obtained
this result i obtained this result
the first the first moment in which i
realized that
deep neural networks are powerful was
when james martens invented the
hessian-free optimizer
in 2010 and he trained a 10-layer neural
network
end-to-end without pre-training
from scratch and when that happened i
thought this is it
because if you can train a big neural
network a big neural network can
represent
very complicated functions, because if you
have a neural network with 10 layers
it's as though you allow the human brain
to run for
some number of milliseconds neuron
firings are slow
and so in maybe 100 milliseconds your
neurons only fire 10 times so it's also
kind of like 10 layers
and in 100 milliseconds you can
perfectly recognize any object
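the back-of-envelope arithmetic behind that analogy, as a worked equation (the ~100 Hz figure is an assumption implied by the numbers used here, not something stated outright):

```latex
% ~10 firings in ~100 ms implies a firing rate on the order of 100 Hz:
0.1\,\mathrm{s} \times 100\,\mathrm{firings/s}
  = 10 \text{ sequential firings} \approx 10 \text{ layers}
```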
so i thought so i already had the idea
then that we need to train a very big
neural network
on lots of supervised data and then it
must succeed because we can find the
best neural network
and then there's also theory that if you
have more data than parameters
you won't overfit today we know that
actually this theory is very incomplete
and you won't overfit even when you have
less data than parameters, but definitely
if you have more data than parameters
you won't overfit. so the fact that neural
networks were heavily
overparameterized wasn't discouraging to
you
so you were thinking about the
theory that the number of parameters —
the fact there's a huge number of
parameters — is okay, it's gonna be okay? i
mean there was some evidence before that
it was okayish but the theory was most
the theory was that if you had a big
data set and a big neural net it was
going to work
the overparameterization just didn't
really figure much as a problem. i
thought well with images you're just
going to add some data augmentation it's
going to be okay
so where was any doubt coming from? the
main doubt was: will we have enough
compute to train a big
enough neural net with backpropagation?
backpropagation i thought would work.
the thing that wasn't clear was
whether there would be enough compute
to get a very convincing result and then
at some point Alex Krizhevsky wrote these
insanely fast CUDA kernels for
training convolutional neural nets and
that was: bam, let's do this, let's get
ImageNet, and it's going to be the
greatest thing
was your intuition most of your
intuition from empirical results
by you and by others so like just
actually demonstrating that a piece of
program can train a 10-layer neural
network
or was there some pen and paper or
marker and white board
thinking intuition like because you just
connected a
10 layer large neural network to the
brain so you just mentioned the brain so
in your intuition about neural networks
does the human brain
come into play as an intuition builder?
definitely
i mean, you know, you've got to be
precise with these analogies between
artificial neural networks and the
brain,
but there is no question that the brain
is a huge source
of intuition and inspiration for deep
learning researchers since
all the way from rosenblatt in the 60s
like
if you look at the the whole idea of a
neural network is directly inspired by
the brain
you had people like McCulloch and Pitts
who were saying hey you got this these
neurons in the brain and hey we recently
learned about the computer and automata
can we use some ideas from the computer
and automata to design
some kind of computational object that's
going to be
simple computational and kind of like
the brain and they invented the neuron
so they were inspired by it back then
then you had the convolutional neural
network from Fukushima,
and then later Yann LeCun, who said, hey, if
you limit the receptive fields of a
neural network it's going to be
especially
suitable for images as it turned out to
be true so there was
there was a very small number of
examples where analogies
to the brain were successful and i
thought well probably an artificial
neuron is not
that different from the brain if you
squint hard enough, so let's just
assume it is and roll with it. so
we're now at a time where deep learning
is very successful
so let us squint less,
let's open our eyes, and say: what to
you is an interesting
difference between the human brain — now i
know you're probably not an expert
neither a neuroscientist nor a
biologist, but loosely speaking,
what's the difference between the human
brain and artificial neural networks
that's interesting to you
for the next decade or two that's a good
question to ask what is in what is an
interesting difference between the
neurons between
the brain and our artificial neural
networks so i feel like today
artificial neural networks so we all
agree that there are certain
dimensions in which the human brain
vastly outperforms our
models but i also think that there are
some ways in which artificial neural
networks
have a number of very important
advantages over the brain
look looking at the advantages versus
disadvantages is a good way to figure
out what is the important difference
so the brain uses spikes which may or
may not be important
yeah that's a really interesting
question do you think it's important or
not
that's one big architectural difference
between artificial neural networks and the brain?
it's hard to tell, but my prior is not
very high and i can
i can say why you know there are people
who are interested in spiking neural
networks and basically
what they figured out is that they need
to simulate the
non-spiking neural networks in spikes
and that's how they're gonna make them
work. if you don't simulate the non-spiking
neural networks in spikes, it's not
going to work because the question is
why should it work
and that connects to questions around
back propagation and questions around
deep learning you got this giant neural
network why should it work at all
why should the learning rule work at all
it's not a self-evident question
especially if you let's say if you were
just starting in the field and you read
the very early papers
you can say hey people are saying let's
build neural networks
that's a great idea because the brain is
a neural network so it would be useful
to build neural networks
now let's figure out how to train them
it should be possible to train them
properly but how
and so the big idea is the cost function
that's the big idea the cost function
is a way of measuring the performance of
the system according to some
measure. by the way, that is a big —
actually, let me think: is
that a difficult idea to
arrive at?
and how big of an idea is that that
there's a single cost function
let me sorry let me take a pause is
supervised learning
a difficult concept to come to i don't
know
all concepts are very easy in retrospect
yeah that's what it seems trivial now
but i
so because because the reason i asked
that and we'll talk about it because is
there other
things is there things that don't
necessarily have
a cost function maybe have many cost
functions or maybe have
dynamic cost functions or maybe a
totally different kind of architectures
because we have to think like that in
order to arrive at something new right
so the good examples of
things which don't have clear cost
functions are GANs.
in a GAN, you have a game. so instead of
thinking of a cost function
where you want to optimize where you
know that you have an algorithm gradient
descent which will optimize the cost
function
and then you can reason about the
behavior of your system in terms of what
it optimizes
with a GAN, you say: i have a game, and
i'll reason
about the behavior of the system in
terms of the equilibrium of the game
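for reference, this is the standard GAN minimax game from Goodfellow et al. (2014) — a sketch added here for concreteness, not a formula stated in the conversation. the trained system is characterized by the equilibrium of this two-player game rather than by the minimum of a single cost:

```latex
\min_G \max_D \; V(D,G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)]
  + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
```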
but it's all about coming up with these
mathematical objects that help us reason
about
the behavior of our system right that's
really interesting. is a GAN the only
one? it's kind of —
the cost function is emergent from the
competition?
it's i don't i don't know if it has a
cost function i don't know if it's
meaningful to talk about the cost
function of a GAN.
it's kind of like the cost function of
biological evolution or the cost
function of the economy
it's you can talk about
regions to which it will go towards but
i don't think
i don't think the cost function analogy
is the most useful so if evolution
doesn't
that's really interesting so if
evolution doesn't really have a cost
function
like a cost function based on its
something akin to our mathematical
conception of a cost function
then do you think cost functions in deep
learning are holding us back
yeah i so you just kind of mentioned
that cost function is a nice first
profound idea do you think that's a good
idea
do you think it's an idea will go past
so self-play starts to touch on that a
little bit uh in reinforcement learning
systems that's right self-play and also
ideas around exploration where you're
trying to
take actions that surprise a predictor.
i'm a big fan of cost functions. i think
cost functions are great and they serve
us really well, and i think that whenever
we can do things with cost
functions, we should.
and you know maybe there is a chance
that we will come up with some
yet another profound way of looking at
things that will involve cost functions
in a less central way
but i don't know i think cost functions
are i mean
i would not bet against cost
functions
is there other things about the brain
that pop into your mind
that might be different and interesting
for us to consider
in designing artificial neural networks
so we talked about spiking a little bit
i mean one one thing which may
potentially be useful: i think
neuroscientists have figured out something
about the learning rule of the brain —
i'm talking about spike-timing-dependent
plasticity, and it would be nice if some
people were to study that in simulation
wait, sorry — spike-timing-dependent
plasticity? yeah. what's that?
STDP. it's a particular learning rule that
uses spike timing
to determine how to update the
synapses so it's kind of like if the
synapse fires into the neuron before the
neuron fires
then it strengthens the synapse, and if
the synapse fires into the neuron
shortly after the neuron fired, then it
weakens the synapse — something along those
lines.
i'm 90% sure it's right, so if i said
something wrong here,
don't get too angry.
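a minimal sketch of an STDP-style update, hedged the same way — the exponential window, time constants, and magnitudes here are illustrative assumptions drawn from the computational-neuroscience literature, not a claim about what the brain does:

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Illustrative spike-timing-dependent plasticity rule.

    w      -- current synaptic weight
    t_pre  -- time (ms) of the presynaptic spike
    t_post -- time (ms) of the postsynaptic spike

    If the synapse fires into the neuron shortly before the neuron
    fires (t_pre < t_post), the synapse is strengthened; if it fires
    shortly after (t_pre > t_post), it is weakened.
    """
    dt = t_post - t_pre
    if dt > 0:  # pre before post: potentiation
        return w + a_plus * np.exp(-dt / tau)
    else:       # pre after (or with) post: depression
        return w - a_minus * np.exp(dt / tau)
```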
but you sounded brilliant while saying
it but the timing that's one thing
that's missing
the the temporal dynamics is not
captured
i think that's like a fundamental
property of the brain is the timing of
this
of the signals. well, you have recurrent
neural networks.
but you think of that as — i mean,
that's a very crude, simplified —
what's that called — there's a clock,
i guess, to recurrent neural networks.
it seems like the brain is
the general, the continuous version of that,
the generalization where all possible
timings are possible, and then within
those timings it contains some
information.
you think recurrent neural networks the
recurrence
in recurrent neural networks can capture
the same kind of phenomena
as the timing that seems to be important
for the brain
in the firing of neurons in the
brain? i mean, i think —
recurrent neural networks are amazing,
and they can do
i think they can do anything we'd want
a system to do.
right now recurrent neural networks have
been superseded by transformers but
maybe
one day they'll make a comeback maybe
they'll be back we'll see
let me uh in a small tangent say do you
think they'll be back
so so much of the breakthroughs recently
that we'll talk about on
uh natural language processing and
language modeling has been with
transformers that don't emphasize
recurrence.
do you think recurrence will make a
comeback well
some kind of recurrence, i think, very
likely. recurrent neural networks,
as they're typically thought of, for
processing sequences — i think it's also
possible.
what is to you a recurrent neural
network and generally speaking i guess
what is a recurrent neural network
you have a neural network which
maintains a high dimensional hidden
state
and then when an observation arrives it
updates its high dimensional hidden
state through
its connections in some way
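that definition maps almost line for line onto code; a minimal sketch (plain numpy; the tanh nonlinearity and the names are illustrative choices, not part of the definition):

```python
import numpy as np

def rnn_step(h, x, W_hh, W_xh, b):
    """One step of a vanilla recurrent network: the high-dimensional
    hidden state h is updated through the connections when an
    observation x arrives."""
    return np.tanh(W_hh @ h + W_xh @ x + b)

# usage: carry the hidden state across a sequence of observations
# h = np.zeros(hidden_dim)
# for x in observations:
#     h = rnn_step(h, x, W_hh, W_xh, b)
```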
so do you think you know that's what
like expert systems did
right? symbolic AI, the knowledge base —
growing a knowledge base is
maintaining a
hidden state which is its knowledge base
and is growing it by sequential
processing do you think of it more
generally
in that way or is it simply
is it the more constrained form that of
of a hidden state with certain kind of
gating units that we think of as today
with lstms and that
i mean the hidden state is technically
what you described there the hidden
state that goes inside the lstm or the
rnn or something like this
but then what should be contained you
know if you want to make the expert
system
um analogy i'm not i mean you could say
that
the knowledge is stored in the
connections and then the short term
processing is done in the
in the hidden state yes
could you say that yeah so sort of do
you think there's a future of building
large
scale knowledge bases within the neural
networks
definitely
so we're going to pause on that
confidence because i want to explore
that
well let me zoom back out and ask
back to the history of imagenet neural
networks have been around for many
decades as you mentioned
what do you think were the key ideas
that led to their success at that ImageNet
moment,
and beyond the success in the past 10
years
okay so the question is to make sure i
didn't miss anything the key ideas that
led to the success of deep learning over
the past 10 years
exactly even though the fundamental
thing behind deep learning has been
around for much longer so
the key idea about deep learning
or rather the key fact about deep
learning before
deep learning started to be successful
is that it was underestimated
people who worked in machine learning
simply didn't think that neural networks
could do much
people didn't believe that large neural
networks could be trained
people thought that — well, there was a
lot of debate going on in
machine learning about what are the
right methods and so on, and
people were arguing because there
was no
way to get hard facts.
and by that i mean there were no
benchmarks which were truly hard
that if you do really well in them then
you can say look
here is my system. that's when
this field becomes a little
bit more of an engineering field. so in
terms of deep learning to answer the
question
directly the ideas were all there the
thing that was missing was
a lot of supervised data and a lot of
compute
once you have a lot of supervised data
and a lot of compute then there is a
third thing which is needed as well
and that is conviction conviction that
if you take
the right stuff which already exists and
apply and mix it with a lot of data and
a lot of compute
that it will in fact work and so that
was the
missing piece. you needed the data,
you needed the compute, which showed up
in terms of gpus
and you needed the conviction to realize
that you need to mix them together
so that's really interesting. so i
guess the
presence of compute and the presence of
supervised data
allowed the empirical evidence to do the
convincing of the majority of the
computer science community
so i guess there was a key moment with
Jitendra Malik and
Alyosha Efros, who were very skeptical,
right? and then there's Geoffrey Hinton,
who was
the opposite of skeptical. and there was
a convincing moment, and i think ImageNet
served as that moment.
that's right and they represented this
kind of were the big
pillars of computer vision community
kind of the
the wizards got together and then all of
a sudden there was a shift
and it's not enough for the ideas to all
be there and the computer to be there
it's
for it to convince the cynicism that
existed that
it's interesting that people just didn't
believe for a couple of decades
yeah, well, it's more than that.
when you put it this way, it sounds like, well,
you know, those silly people who didn't
believe —
what were they missing?
but in reality things were confusing
because neural networks really did not
work on anything
and they were not the best method on
pretty much anything as well
and it was pretty rational to say yeah
this stuff doesn't have any traction
and that's why you need to have these
very hard tasks which are which produce
undeniable evidence and that's how we
make progress
and that's why the field is making
progress today because we have these
hard benchmarks
which represent true progress and so
and this is why we are able to avoid
endless debate
so incredibly you've contributed some of
the biggest recent ideas in ai
in computer vision, language, natural
language processing,
reinforcement learning — sort of
everything in between,
maybe not GANs.
there may not be a topic you haven't
touched and of course the the
fundamental science of deep learning
what is the difference to you between
vision
language and as in reinforcement
learning action
as learning problems and what are the
commonalities do you see them as all
interconnected
are they fundamentally different domains
that require
different approaches
okay that's a good question machine
learning is a field with a lot of unity
a huge amount of unity what do you mean
by unity
like overlap of ideas overlap of ideas
overlap of principles in fact there is
only
one or two or three principles which are
very very simple
and then they apply in almost the same
way in
almost the same way to the different
modalities to the different problems
and that's why today when someone writes
a paper on improving optimization
of deep learning and vision it improves
the different nlp applications and it
improves the different reinforcement
learning applications
so i would say that
computer vision
and NLP are very similar to each other;
today they differ in that they have
slightly different architectures we use
transformers in nlp and use
convolutional neural networks
in vision but it's also possible that
one day this will change and
everything will be unified with a single
architecture because if you go back a
few years ago in
natural language processing, there were a
huge number of architectures — every
different tiny problem had its own
architecture.
today, there is just one transformer for
all those different tasks.
and if you go back in time even more you
had even more and more fragmentation and
every little problem
in AI had its own little
sub-specialization and its own
little collection of
skills — people who would know how to
engineer the features.
now it's all been subsumed by deep
learning we have this unification
and so i expect vision to become unified
with
natural language as well or rather i
shouldn't say expect i think it's
possible i don't want to be too sure
because
i think the convolutional neural net is
very computationally efficient.
rl is different rl does require slightly
different techniques because you really
do need to take action
you really do need to do something about
exploration your variance is much higher
but i think there is a lot of unity even
there
and i would expect for example that at
some point there will be some
broader unification between rl and
supervised learning where somehow the rl
will be making decisions to make the
supervised learning go better and it
will be
i imagine one big black box where you just,
you know, shovel
things into it, and it just
figures out what to do with whatever you
shovel at it.
i mean reinforcement learning has some
aspects of
language and vision combined
almost there's elements of a long-term
memory that you should be utilizing and
there's elements of a
really rich sensory space so it seems
like the
it's like the union of the two or
something like that
i'd say something slightly differently
i'd say that reinforcement learning is
neither but it naturally interfaces and
integrates with the two of them
do you think action is fundamentally
different so yeah what is interesting
about
what is unique about the policy of
learning to act? well, so one example, for
instance is that
when you learn to act you are
fundamentally in a non-stationary world
because as your actions change the
things you see
start changing you you experience the
world in a different way and this is not
the case for
the more traditional static problem
where you have at least some
distribution and you just apply a model
to that distribution
do you think it's a fundamentally
different problem or is it just a more
difficult
general it's a generalization of the
problem of understanding
i mean, it's a question of
definitions almost. there is a huge
amount of
commonality for sure: you take gradients,
we try to approximate
gradients in both cases. in
the case of reinforcement learning you
have
some tools to reduce the variance of the
gradients you do that
there's lots of commonality use the same
neural net in both cases
you compute the gradient, you apply Adam
in both cases
so i mean there's lots in common for
sure but
there are some small differences which
are not
completely insignificant. it's really
just a matter of your point of view, what
frame of reference, how much do
you want to zoom in or out
as you look at these problems which
problem do you think
is harder? so, people like Noam Chomsky
believe that language is fundamental to
everything
so it underlies everything do you think
language
understanding is harder than visual
scene understanding or vice versa
i think that asking if a problem is hard
is slightly wrong
i think the question is a little bit
wrong and i want to explain why
so what does it mean for a problem to be
hard
okay, the non-interesting, dumb answer to
that is:
there's a benchmark,
and there's human-level performance on
that benchmark, and
there's the effort required to reach the
human level. okay. so from the
perspective of how much effort until we
get to human level on a very good
benchmark?
yeah. i understand what you
mean by that. so what i'm
going to say is that a lot of it depends on,
you know — once you solve a problem, it
stops being hard, and that's
always true. and so
whether something is hard or not depends
on what our tools can do today so you
know you say today
true human level language understanding
and visual perception are hard in the
sense that there is no
way of solving the problem completely in
the next three months right
so i agree with that statement. beyond
that, my guess would
be as good as yours, i don't know.
oh okay so you don't have a fundamental
intuition about
how hard language understanding is?
i think — okay, i changed my mind. let's
say language is probably going to be
harder. i mean, it depends on how you
define it. like, if you mean
absolute top-notch, 100% language
understanding, i'll go with language.
so but then if i show you a piece of
paper with letters on it
is that — you see what i mean? you
have a vision system. you say it's the
best
human-level vision system. i show you — i
open a book
and i show you letters will it
understand how these letters form into
words and sentences and meaning
is this part of the vision problem where
does vision end and language begin
yeah, so Chomsky would say it starts at
language so vision is just a little
example of the kind of
uh structure and you know fundamental
hierarchy of ideas that's already
represented in our brain somehow
that's represented through language but
where does vision stop and language
begin
that's a really interesting
question
it so one possibility is that it's
impossible to achieve
really deep understanding in either
images
or language without basically using the
same kind of system
so you're going to get the other for
free
i think it's pretty likely that,
yes, if we can get one — our
machine learning is probably that good —
that we can get the other. but i'm not
100% sure. and also,
i think a lot of it really does
depend on your definitions,
definitions of, like, perfect vision.
because really, you know, reading is
vision, but should it count?
yeah to me so my definition is if a
system looked at an image
and then the system looked at a piece of
text
and then told me something about that
and i was really impressed that's
relative
you'll be impressed for half an hour and
then you're gonna say well i mean all
the systems do that but here's the thing
they don't do
yeah but i don't have that with humans
humans continue to impress me
is that true well the ones okay so
i'm a fan of monogamy so i like the idea
of marrying somebody being with them for
several decades
so i i believe in the fact that yes it's
possible to have somebody
continuously giving you uh pleasurable
interesting witty new ideas friends yeah
i think so. they continue to
surprise you. the surprise —
you know, that injection of
randomness
seems to be a nice
source of continued
inspiration, like the wit, the
humor. i think
that would be —
it's a very subjective test, but i
think if you have enough humans
in the room — yeah, i understand what you
mean
yeah, i feel like i misunderstood what
you meant by impressing you. i thought
you meant to impress you with its
intelligence, with how
well it understands an image.
i thought you meant something like i'm
going to show it a really complicated
image and it's going to get it right and
you're going to say wow
that's really cool systems of you know
january 2020 have not been doing that
yeah, no — i think it all boils down to,
like,
the reason people click like on stuff on
the internet, which is that it makes them
laugh. so it's like humor or wit.
yeah, or insight. i'm sure we'll
get that as well.
so forgive the romanticized question but
looking back to you what is the most
beautiful or surprising idea in deep
learning
or ai in general you've come across so i
think the most beautiful thing about
deep learning is that it actually works
and i mean it because you got these
ideas you got the little neural network
you got the back propagation algorithm
and then you got some theories as to you
know this is kind of like the brain so
maybe if you make it large,
if you make the neural network large,
and you train it on a lot of data, then it will
do the same function the brain does.
and it turns out to be true that's crazy
and now we just train these neural
networks and you make them larger and
they keep getting better
and i find it unbelievable i find it
unbelievable that this whole ai stuff
with neural networks works
have you built up an intuition of why
are there little
bits and pieces of intuitions of
insights of
why this whole thing works? i mean, some,
definitely. we know that
optimization — we now have
huge amounts of
empirical reasons to believe that
optimization should work
on most problems we care about.
did you have insights of what so you
just said empirical evidence
is most of your
sort of empirical evidence kind of
convinces you
it's like evolution is empirical it
shows you that look this
evolutionary process seems to be a good
way to design
organisms that survive in their
environment but it doesn't really
get you to the insights of how the whole
thing works
i think a good analogy is physics.
you know, you say, hey, let's do some
physics calculation and come up with
some new physics theory and make some
predictions.
but then you've got to run the experiment —
you know, you've got to run the experiment.
it's important.
so it's a bit the same here, except that
sometimes
the experiment came before the theory.
but it still is the case — you know, you
have some
data and you come up with some
prediction you say yeah let's make a big
neural network let's train it and it's
going to work
much better than anything before it and
it will in fact continue to get better
as you make it larger
and it turns out to be true that's
that's amazing when a theory is
validated like this you know
it's not a mathematical theory it's more
of a biological theory almost
so i think there are not terrible
analogies between deep learning and
biology
i would say it's like the geometric mean
of biology and physics that's deep
learning
the geometric mean of biology and
physics?
i think i'm going to need a few hours to
wrap my head around that
because — just to find
the set of what biology represents —
well, in biology things are
really complicated. theories are really —
it's really hard to have a good predictive
theory. and in physics, the theories
are too good:
in physics, people make these
super precise theories which make these
amazing predictions.
and machine learning is kind of in
between.
it'd be nice if machine learning somehow
helped us discover the unification of
the two as opposed to some of the
in-between
but you're right that's you're you're
kind of trying to juggle both
so do you think there's still beautiful
and mysterious properties in your
networks that are yet to be discovered
definitely i think that we are still
massively underestimating deep learning
what do you think it will look like?
if i knew, i would have done it.
yeah.
but if you look at all the progress from
the past 10 years i would say most of it
i would say there have been a few cases
where things that
felt like really new ideas showed up, but
by and large it was
every year we thought okay deep learning
goes this far nope it actually goes
further
and then the next year: okay, now
this is peak deep learning, we
are really done. nope, it
goes further. it just keeps going further
each year so that means that we keep
underestimating we keep not
understanding it
it has surprising properties all the time. do
you think it's getting harder and harder
to make progress?
it depends on what we mean i think the
field will continue to make
very robust progress for quite a while
i think for individual researchers
especially people who are doing
um research it can be harder because
there is a very large number of
researchers right now
i think that if you have a lot of
compute then you can make
a lot of very interesting discoveries
but then you have to deal with
the challenge of managing a
huge compute cluster to run your
experiments — so it's a little bit harder.
so i'm asking all these questions that
nobody knows the answer to
but you're one of the smartest people i
know so i'm going to keep asking
the so let's imagine all the
breakthroughs that happen in the next 30
years in deep learning
do you think most of those breakthroughs
can be done by one person
with one computer? sort of, in the space
of breakthroughs, do you think
compute
and large efforts will be necessary?
i mean, i can't be sure. when you say one
computer, you mean
how large?
you're clever. i mean, one
GPU.
i see i think it's pretty unlikely
i think it's pretty unlikely i think
that there are many
the stack of deep learning is starting
to be quite deep
if you look at it you've got all the way
from
the ideas the systems to build the data
sets
the distributed programming the building
the actual cluster
the gpu programming putting it all
together so now the stack is getting
really deep and i think it becomes
it can be quite hard for a single person
to be world-class in every
single layer of the stack.
what about what, like, Vladimir Vapnik
really insists on, which is taking
MNIST and trying to learn from very few
examples?
so being able to learn more efficiently
do you think there'll be
breakthroughs in that space that
may not need huge compute? i think —
i think there will be a large number of
breakthroughs in general that will not
need a huge amount of compute
so maybe i should clarify that i think
that some breakthroughs will require a
lot of compute
and i think building systems which
actually do things will require a huge
amount of compute
that one is pretty obvious: if you want
to do x,
and x requires a huge neural net,
you've got to get a huge neural net.
but i think there is lots of room for
very important work being done by small
groups and individuals.
maybe, sort of on the topic of the
science of deep learning,
let's talk about one of the recent papers that
you released,
"Deep Double Descent: Where Bigger Models
and More Data Hurt". i think it's a really
interesting paper. can you
describe the main idea?
yeah, definitely. so what happened is that
over the years, some small number of
researchers noticed that
it is kind of weird that when you make
the neural network larger it works
better and it seems to go in
contradiction with statistical ideas
and then some people made an analysis
showing that actually you got this
double descent bump
and what we've done was to show that
double descent occurs
for all for pretty much all practical
deep learning systems
and that it — so, can you step
back:
what's the x-axis and the y-axis of a
double descent plot?
okay, great. so you can
do things like — you can take a neural
network
and you can start increasing its size
slowly while keeping your data set fixed
so if you increase the size of the
neural network slowly
and if you don't do early stopping
that's a pretty important
detail then
when the neural network is really small
you make it larger you get a very rapid
increase in performance
then you continue to make it larger, and
at some point performance will get worse —
and it gets the worst
exactly at the point at which it
achieves
zero training error, precisely zero
training loss.
and then as you make it large it starts
to get better again and it's kind of
counter-intuitive because you'd expect
deep learning phenomena to be
monotonic and
it's hard to be sure what it means but
it also occurs in in the case of linear
classifiers and the intuition basically
boils down to the following
when you have
a large data set and a small model —
so, basically, what is overfitting?
overfitting is when your model
is somehow very sensitive to the small,
random,
unimportant stuff in your data set,
in the training data set,
precisely.
so if you have a small model and you
have a big data set
and there may be some random things, you
know — some training cases are randomly in
the data set and others may not be there.
but the small model is
kind of insensitive to this randomness
because
there is pretty much
no uncertainty about the model
when the data set is large. so, okay, so at
the very basic level to me
it is the most surprising thing that
neural networks don't overfit every time
very quickly uh
before ever being able to learn anything
the huge number of parameters
okay, so
let me try to give the
explanation —
maybe that will work. suppose
you have a huge neural network;
you have a huge number of parameters
and now let's pretend everything is
linear — which it's not, but let's just pretend.
then there is this big subspace where the
neural network achieves zero error,
and SGD is going to find approximately
the point —
that's right, approximately the point
with the smallest norm in that subspace.
okay and that can also be proven to be
insensitive to
the small randomness in the data when
the dimensionality is high
but when the dimensionality of the data
is equal to the dimensionality of the
model
then there is a one-to-one
correspondence between all the data sets
and the models so small changes in the
data set actually lead to large changes
in the model and that's why performance
gets worse
so this is the best explanation more or
less
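the linear case described here is easy to reproduce; a minimal sketch (numpy; the sizes are hypothetical). np.linalg.pinv returns the least-squares solution below the interpolation threshold and the minimum-norm zero-training-error solution above it, and the test error should peak where the number of parameters used equals the number of training points:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 1000, 120
w_true = rng.normal(size=d) / np.sqrt(d)

X_tr = rng.normal(size=(n_train, d))
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)  # noisy labels
X_te = rng.normal(size=(n_test, d))
y_te = X_te @ w_true

for p in [5, 10, 20, 35, 40, 45, 60, 120]:  # number of features (parameters) used
    w = np.linalg.pinv(X_tr[:, :p]) @ y_tr  # min-norm solution once p >= n_train
    mse = np.mean((X_te[:, :p] @ w - y_te) ** 2)
    print(f"p={p:4d}  test MSE={mse:8.3f}")  # expect a peak near p = n_train = 40
```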
so then it would be good for the model
to have more parameters,
so to be bigger than the data? that's right, but
only if you don't early stop. if you
introduce early stopping as
regularization, you can make the double
descent bump
almost completely disappear. what is
early stopping? early stopping is when
you train your model and you monitor
your validation performance,
and then if at some point validation
performance starts to get worse, you say:
okay, let's stop training.
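a minimal sketch of the procedure just described — train_epoch, validate, and the model-copy interface are hypothetical placeholders for whatever training loop surrounds this, not a specific library's API:

```python
def fit_with_early_stopping(model, train_epoch, validate, patience=3):
    """Train until validation performance stops improving.

    train_epoch(model) -- hypothetical: runs one epoch of training
    validate(model)    -- hypothetical: returns a validation loss
    """
    best_loss, bad_epochs, best_model = float("inf"), 0, None
    while bad_epochs < patience:
        train_epoch(model)
        loss = validate(model)
        if loss < best_loss:                  # still improving
            best_loss, bad_epochs = loss, 0
            best_model = model.copy()         # assumes a copyable model
        else:                                 # got worse: count toward stopping
            bad_epochs += 1
    return best_model
```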
you're good enough. so the
magic happens after that
moment? so you don't want to do the early
stopping.
well, if you don't do the early stopping,
you get a very
pronounced double descent.
do you have any intuition why this
happens — double descent?
early stopping? no, the
double descent. oh yeah — so
the intuition basically is
this:
that when the data set has as many
degrees of freedom
as the model then there is a one-to-one
correspondence between them
and so small changes to the data set
lead to noticeable changes
in the model so your model is very
sensitive to all the randomness it is
unable to discard it
whereas it turns out that when you have
a lot more data than parameters or a lot
more parameters than data
the resulting solution will be
insensitive to small changes in the data
set
so it's able to — that's nicely put —
discard
the small changes, the randomness.
exactly:
the spurious correlations, which you
don't want.
Geoff Hinton suggested we need to throw
away backpropagation. we already kind of
talked about this a little bit, but
he suggested that we just throw away
backpropagation and start over.
i mean, of course, some of that is a
little bit of
humor. but what do you think — what
could be an alternative method of
training neural networks
well the thing that he said precisely is
that to the extent you can't find back
propagation in the brain
it's worth seeing if we can learn
something from how the brain
learns but back propagation is very
useful and we should keep using it
oh you're saying that once we discover
the mechanism of learning in the brain
or any aspects of that mechanism we
should
also try to implement that in neural
networks if it turns out that we can't
find back propagation in the brain
if we can't find backpropagation in the
brain?
well so i guess your answer to that is
back propagation is pretty damn useful
so why are we complaining i mean i i
personally am a big fan of back
propagation i think it's a great
algorithm because it solves an extremely
fundamental problem which is
finding a neural circuit
subject to some constraints and i don't
see that problem going away so that's
why i
really think it's pretty unlikely
that we'll have anything which is going
to be
dramatically different it could happen
but i wouldn't bet on it right now
so let me ask a sort of big picture
question
do you think neural
networks can be made to reason?
why not? well, if you look, for example, at
AlphaGo or AlphaZero —
the neural network of AlphaZero plays
go,
which we all agree is a game that
requires reasoning,
better than 99.9% of all humans.
just the neural network, without the
search — just the neural network itself.
doesn't that give us an existence proof
that neural networks can reason
to push back and disagree a little bit:
we all agree that
go is reasoning? i think —
i agree, i don't think it's trivial. so
obviously reasoning, like intelligence,
is a loose, gray-area term,
a little bit. maybe you disagree with
that. but
yes i think it has some of the same
elements of
reasoning reasoning is almost like akin
to search
right there's a sequential element of
stepwise consideration of possibilities
and sort of building on top of those
possibilities in a sequential manner
until you arrive at some insight
so yeah i guess playing go is kind of
like that and when you have a single
neural network doing that without search
that's kind of like that. so there's an
existence proof in a particular
constrained environment
that a process akin to what
many people call reasoning exists. but
what about more general kinds of reasoning, off
the board? there is one other existence
proof.
oh boy which one
us humans yes okay all right so
do you think the architecture
that will allow neural networks to
reason
will look similar to the neural network
architectures we have today
i think it will — well, i don't
want to make too
overly definitive a statement. i think
it's definitely possible that
the neural networks that will produce
the reasoning breakthroughs of the
future will be
very similar to the architectures that
exist today. maybe
a little bit more recurrent, maybe a little
bit deeper. but
these neural nets are so
insanely powerful —
why wouldn't they be able to learn to
reason? humans can reason.
so why can't neural networks so do you
think the kind of stuff we've seen
neural networks do is a kind of just
weak reasoning so it's not a
fundamentally different process
again, this is stuff
nobody knows the answer to.
so when it comes to our neural networks,
what i would say is that neural
networks are capable of reasoning,
but if you train a neural network on a
task which doesn't require reasoning
it's not going to reason this is a
well-known effect where the neural
network will solve
exactly the problem
that you pose in front of it
in the easiest way possible
right that takes us to the
to one of the brilliant sort of ways you
describe neural networks which is uh
you refer to neural networks as the
search for small circuits
and maybe general intelligence
as the search for small programs
which i found is a metaphor very
compelling can you elaborate on that
difference
yeah so the thing which i said precisely
was that
if you can find the shortest program
that outputs the data at your
disposal,
then you will be able to use it to make
the best prediction possible
and that's a theoretical statement which
can be proven mathematically
now, you can also prove mathematically
that finding the shortest program which
generates some data
is not a computable operation:
no finite amount of compute can do
this.
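the theoretical statement being referenced is the Kolmogorov-complexity / Solomonoff-induction view; roughly, as a sketch (this notation is supplied here for concreteness, not quoted from the conversation):

```latex
% Kolmogorov complexity: the length of the shortest program p that
% outputs the data x on a universal machine U.
K_U(x) = \min \{\, |p| \;:\; U(p) = x \,\}
% K_U is uncomputable, so no finite amount of compute can, in
% general, find the shortest program for arbitrary data.
```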
so then, with neural networks —
neural networks are the next best thing
that actually works in practice.
we are not able to find the
shortest program which generates our
data,
but we are able to find, you know, a small —
but now
that statement should be amended —
even a large circuit
which fits our data in some way. well, i
think what you meant by this small
circuit is the smallest
needed circuit. well, i see — the thing
which i would change now is: back then i
hadn't fully
internalized the
overparameterized results, the
things we know about overparameterized
neural nets.
now i would phrase it as a large circuit
whose weights contain a small
amount of information,
which i think is what's going on if you
imagine the training process of a neural
network as you slowly transmit entropy
from the data set to the parameters
then somehow the amount of information
in the weights
ends up being not very large which would
explain why they generalized so well
so the large circuit might
be one that's
helpful for the regularization, for the
generalization? yeah, something like this.
but do you see it as
important to be able to
try to learn something like programs?
i mean, definitely. i think —
the answer is
kind of yes: if we can do it, we should
do it.
the reason we are pushing on
deep learning —
the fundamental reason, the
root cause —
is that we are able to train them. so in
other words, training comes first.
we've got our pillar, which is the
training pillar,
and now we are trying to contort our
neural networks around the training
pillar. we've got to stay trainable. this is
an invariant we cannot
violate.
and so being trainable means
starting from scratch knowing nothing
you can actually pretty quickly converge
towards knowing a lot
or even slowly but it means that given
the resources at your disposal
you can train the neural net and get it
to achieve
useful performance yeah that's a pillar
we can't move away from that's right
because if you can whereas if you say
hey let's find the shortest program
but we can't do that so it doesn't
matter how useful
that would be we can't do it so we want
so — you kind of mentioned
that neural networks are good at
finding small circuits, or large circuits.
do you think then the matter of finding
small programs
is just the data? no —
sorry, not the size; the character,
the type of data — sort of,
giving it programs?
well, i think the thing is that right now,
there are no good precedents of
people successfully finding
programs really well. and so the way
you'd find programs is you'd
train a deep neural network to do it,
basically, right —
which is the right way to go
about it. but there are no good
illustrations of it — it hasn't been
done yet. but
in principle it should be possible.
can you elaborate a little bit —
what's your insight, "in principle"?
put another way, you don't see why
it's not
possible? well, it's
more a statement of:
i think that it's
unwise to bet against deep learning, and
if it's a cognitive function
that humans seem to be able to do
then it doesn't take too long for
some deep neural net to pop up that can
do it too
yeah, i'm there with you.
i've stopped betting against neural
networks
at this point because they continue to
surprise us
what about long-term memory can neural
networks have long-term memory or
something like
knowledge bases so being able to
aggregate
important information over long periods
of time
that would then serve as useful
sort of representations of state
that you can make decisions by — so,
have a long-term context based on which
you make the decision?
so, in some sense, the parameters already
do that.
the parameters are an aggregation of
the entirety of the neural net's
experience, and so they count as
long-term knowledge.
and people have trained various neural
nets to act as knowledge bases, and,
you know, people
have investigated language models as
knowledge bases. so
there is work there, yeah.
but in some sense,
do you think,
it's all just a matter of coming
up with a better mechanism of forgetting
the useless stuff
and remembering the useful stuff? because
right now, i mean, there have not
been mechanisms that remember really
long-term information.
what do you mean by that precisely
i like the word precisely. so,
i'm thinking of the kind of compression
of information that knowledge bases
represent —
sort of creating a —
now i apologize for my sort of
human-centric thinking about
what knowledge is because neural
networks aren't
interpretable necessarily with the kind
of knowledge they have discovered
but a good example for me is knowledge
bases being able to build up over time
something like
the knowledge that wikipedia represents
it's a really compressed
structured
knowledge base obviously not the actual
wikipedia or the language
but like a semantic web the dream that
semantic web represented
so it's a really nice compressed
knowledge base or something
akin to that in the non-interpretable
sense as
neural networks would have well the
neural networks would be
non-interpretable if you look at their
weights but
their outputs should be very
interpretable okay so yeah how do
you make very smart neural networks like
language models interpretable
well you ask them to generate some text
then the text will generally be
interpretable
do you find that the epitome of
interpretability? like,
can you do better?
because — okay, i'd like to know
what it knows and what it doesn't know.
i would like
the neural network to come up with
examples where
it's completely dumb and examples where
it's completely brilliant.
and the only way i know how to do that
now is to generate a lot of examples and
use my human judgment.
but it would be nice if a neural net had
some self-awareness
about it.
about it yeah 100 i'm a big believer in
self-awareness and i think that
i think i think neural net
self-awareness will allow for things
like
the capabilities like the ones you
describe like for them to know what they
know and what they don't know
and for them to know where to invest to
increase their skills most optimally
and to your question of interpretability
there are actually two answers to that
question
one answer is you know we have the
neural net so we can
analyze the neurons and we can try to
understand what the different neurons
and different layers mean
and you can actually do that and openai
has done some work on that
but there is a different answer which is
that i would say this is the
human-centric answer where you say
you know, you look at a human being — you
can't read —
how do you know what a human being
is thinking? you ask them. you say: hey,
what do you think about this? what do you
think about that?
and you get some answers the answers you
get are sticky, in the sense that you already
have a mental model —
you already have a mental model
of that human being,
you already have an understanding of,
like, a
big conception of that
human being, how they think,
what they know how they see the world
and then everything you ask you're
adding on to that and that stickiness
seems to be that's one of the really
interesting qualities of the the human
being is that information is sticky
you don't you seem to remember the
useful stuff aggregate it well
and forget most of the information
that's not useful
that process — but that's also pretty
similar to the process that neural
networks do;
it's just that neural networks are so much
crappier at it at this time.
it doesn't seem to be fundamentally that
different. but
just to stick on reasoning for a little
longer:
you said, why not? why can't they reason?
what's a good, impressive
feat, a benchmark, to you, of reasoning
that you'd be impressed by, if neural
networks were able to do it?
is that something you already have in
mind? well, i think writing
really good code. i think
really good code i think
proving really hard theorems solving
open-ended problems with out-of-the-box
solutions
and uh sort of theorem type mathematical
problems
yeah i think though those ones are a
very natural example as well
you know if you can prove an unproven
theorem, then it's hard to argue it doesn't
reason.
and so, by the way, this comes back to
the point about the hard results. you
know,
machine learning —
deep learning as a field is very
fortunate, because we have the ability to
sometimes produce these unambiguous
results.
and when they happen, the debate
changes, the conversation changes.
we have the ability to produce
conversation-changing results.
and then, of course, just
like you said, people kind of take that
for granted and say that wasn't actually
a hard problem.
well, i mean, at some point we'll probably
run out of hard problems.
yeah that whole mortality thing is kind
of
kind of a sticky problem that we haven't
quite figured out maybe we'll solve that
one
i think one of the fascinating things in
your entire body of work but also the
work at open ai recently
one of the conversation changers has
been in the world of language models
can you briefly kind of try to describe
the recent history of using neural
networks
in the domain of language and text? well,
there's been lots of history.
i think the Elman network was a small,
tiny recurrent neural network applied to
language back in the 80s.
so the history is really you know fairly
long at least
and the thing that started the thing
that changed
the trajectory of neural networks and
language is
the thing that changed the trajectory of
deep learning and that's data and
compute
so suddenly you move from small language
models, which
learn a little bit. and with language
models in particular,
there's a very clear explanation for why
they need to be large
to be good: because they're trying to
predict the next word.
so when you don't know anything,
you'll notice very
broad-stroke, surface-level patterns,
like
sometimes there are characters and there
is a space between those characters
you'll notice this pattern
and you'll notice that sometimes there
is a comma and then the next character
is a capital letter you'll notice that
pattern
eventually you may start to notice that
certain words occur often. you
may notice that
spellings are a thing you may notice
syntax and when you get
really good at all these you start to
notice the semantics
you start to notice the facts but for
that to happen the language model needs
to be larger
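the objective driving all of this is just next-token (here, next-word or next-character) prediction; a minimal sketch of the cross-entropy loss being minimized (numpy; `logits` from some unspecified model is an assumption for illustration):

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of predicting each next token.

    logits  -- (T, V) unnormalized scores at T positions over a vocab of size V
    targets -- (T,) the token that actually came next at each position
    """
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```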
so let's linger on that, because that's
where you
and Noam Chomsky would disagree.
so you think we're actually taking uh
incremental steps a sort of larger
network larger compute will be able to
get to the semantics to be able to
understand language
without what Noam likes to sort of
think of as a
fundamental understandings of the
structure of language
like imposing your theory of language
onto the
learning mechanism so you're saying the
learning
you can learn from raw data the
mechanism that underlies language
well, i think it's pretty likely.
but i also want to say that i don't
really
know precisely what Chomsky means
when he talks about it. you said
something about imposing
your structure on language — i'm not 100%
sure what he means, but
empirically it seems that when you
inspect those larger language models
they exhibit signs of understanding the
semantics whereas the smaller language
models do not
we've seen that a few years ago when we
did work on the sentiment neuron. we
trained a
smallish lstm to predict
the next character
in amazon reviews and we noticed that
when you increase the size of the lstm
from 500
lstm cells to 4000 lstm cells then one
of the neurons
starts to represent the sentiment of the
text,
of the review. now why is that
sentiment is a pretty semantic
attribute it's not a syntactic attribute
and for people who might not know i
don't know if that's a standard term but
sentiment is whether it's a positive or
negative review that's right like this
is the person happy with something is
the person unhappy with something
and so here we had very clear evidence
that a small
neural net does not capture sentiment
while a large neural net does
and why is that? well, our theory is that
at some point
you run out of syntax to model, you've
got to focus on something else
and with size you quickly run out
of syntax to model and then you really
start to focus on the semantics, would
be the idea
that's right and so i don't want
to imply that our models have complete
semantic understanding because that's
not true
but they definitely are showing signs of
semantic understanding partial semantic
understanding but
the smaller models do not show that
those signs
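roughly, the sentiment-neuron setup he's describing looks like this: train a character-level lstm to predict the next character, then read one coordinate of its hidden state as a probe. a minimal sketch; beyond the cell counts quoted above, everything here (including the unit index) is an assumed stand-in, not the original code:

```python
import torch
import torch.nn as nn

# character-level lstm in the spirit of the sentiment-neuron work
class CharLSTM(nn.Module):
    def __init__(self, n_cells=4000, n_chars=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, 64)
        self.lstm = nn.LSTM(64, n_cells, batch_first=True)
        self.head = nn.Linear(n_cells, n_chars)  # predicts the next character

    def forward(self, chars):
        hidden, _ = self.lstm(self.embed(chars))
        return self.head(hidden), hidden

model = CharLSTM()
review = torch.randint(0, 256, (1, 100))   # stand-in for an amazon review
logits, hidden = model(review)

# after training, one coordinate of the hidden state was found to track
# sentiment; probing it is just reading that unit's activation at the end
sentiment_unit = 2048                       # hypothetical index of that neuron
print(float(hidden[0, -1, sentiment_unit]))
```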
can you take a step back and say what is
gpt-2, which is
one of the big language models that was
the conversation
changer in the past couple of years? yes,
so gpt-2
is a transformer with one and a half
billion parameters
that was trained on about 40
billion
tokens of text which were obtained
from web pages that were linked to from
reddit articles with more than three
upvotes and what's the transformer
the transformer is the most important
advance in neural network architectures
in recent history
what is attention maybe too because i
think that's the interesting
idea not necessarily sort of technically
speaking but the idea of attention
versus maybe what recurring neural
networks represent
yeah so the thing is the transformer is
a combination
of multiple ideas simultaneously which
attention is one
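for concreteness, the attention operation at the heart of the transformer can be written in a few lines. a minimal single-head sketch of scaled dot-product attention, following the published formulation rather than any particular codebase:

```python
import math
import torch

def attention(q, k, v):
    # scaled dot-product attention: each position mixes the values v
    # according to how well its query q matches every key k
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

x = torch.randn(2, 10, 64)     # (batch, sequence, features)
out = attention(x, x, x)       # self-attention: q, k, v all come from x
print(out.shape)               # torch.Size([2, 10, 64])

# note what is *not* here: no recurrence. every position attends to every
# other in one matrix multiply, which is why it maps so well onto a gpu
```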
do you think attention is the key no
it's a key but it's not the key
the transformer is successful because it
is the simultaneous combination of
multiple ideas and if you were to remove
either idea it would be much less
successful
so the transformer uses a lot of
attention but attention existed for a
few years
so that can't be the main innovation the
transformer
is designed in such a way that it runs
really fast on the gpu
and that makes a huge amount of
difference this is one thing
the second thing is the transformer is
not recurrent
and that is really important too because
it is more shallow and therefore much
easier to optimize
so in other words it uses attention it
is it is a really great fit to the gpu
and it is not recurrent so therefore
less deep and easier to optimize
and the combination of those factors
make it successful. so now it
makes great use of your gpu, it allows
you to achieve
better results for the same amount of
compute
and that's why it's successful were you
surprised how well transformers worked
and gpt2 worked so you worked on
language
you've had a lot of great ideas before
transformers came about in language
so you got to see the whole set of
revolutions before and after
were you surprised? yeah, a little
yeah i mean it's hard to
remember because
you adapt really quickly but it
definitely was surprising it definitely
was. in fact
you know what, i'll retract my
statement
it was pretty amazing, it was just
amazing to see it generate this text
and you know you've got to keep in
mind that at that time we'd
seen all this progress in gans
in improving, you know, the samples
produced by gans were just amazing
you have these realistic faces but text
hadn't really moved that much
and suddenly we moved from you know
whatever gans were in 2015
to the best most amazing gans in one
step right and it was really stunning
even though theory predicted yeah you
train a big language model of course you
should get this
but then to see it with your own eyes
it's something else
and yet we adapt really quickly and now
there's
uh sort of
some cognitive scientists write articles
saying that gpt-2 models don't truly
understand
language. so we adapt quickly to how
amazing
the fact that they're able to model
language so well is
so what do you think is the bar
for impressing us
i don't know do you think that bar will
continuously be moved
definitely i i think when you start to
see really
dramatic economic impact that's when i
think that's in some sense
the next barrier because right now if
you think about the work in ai
it's really confusing it's really hard
to know what to make of all these
advances
it's kind of like okay you got an
advance and now you can do
more things and you got another
improvement and you got another cool
demo
at some point i think
people who are outside of ai can no
longer distinguish this progress
so we were talking offline
about translating russian to english and
how there's a lot of brilliant work in
russian that
the rest of the world doesn't know
about that's true for chinese that's
true for a lot of
for a lot of scientists and just
artistic work in general
do you think translation is the place
where we're going to see sort of
economic
big impact? i don't know, i think
there is a huge number of things
i mean first of all i
want to point out that translation
already today is huge i think billions
of people interact with
big chunks of the internet primarily
through translation so
translation is already huge and it's
hugely hugely positive too
i think self-driving is going to be
hugely impactful
and that's, you know, it's unknown
exactly when it happens but again
i would not bet against deep learning
so that's deep learning in general but
you mean
deep learning for self-driving? yes
deep learning for self-driving but
i was talking about sort of language
models. let's see,
just to veer off a little bit, just to
check, you're not seeing a connection
between driving and language? no no
okay all right they both use neural nets
that'd be a poetic connection i think
there might be some
like you said there might be some kind
of unification towards uh
a kind of multi-task transformers
that can take on both language and
vision tasks
that'd be an interesting unification
now let's see, what can i ask about gpt-2
more
it's simple, there's not much to ask:
you take a transformer, you make it bigger
you give it more data and suddenly it
does all those amazing things
yeah, one of the beautiful things is that
gpt, the transformers, are
fundamentally simple to explain, to train
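for readers who want to poke at this themselves, the released gpt-2 weights can be sampled in a few lines, shown here via the hugging face transformers library as one common route (an assumption about tooling, not how openai ran the model):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # the small, first-released model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "deep learning is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs, max_length=40, do_sample=True, top_k=40,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```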
do you think bigger will continue to
show better results
in language probably
sort of like what are the next steps
with gpt2 do you think
i mean i think for sure seeing
what
larger versions can do is one
direction
also i mean there are many
questions there's one question which i'm
curious about and that's the following
so right now with gpt-2 we feed all this
data from the internet, which means that
it needs to memorize all those
random facts about everything on the
internet
and it would be nice if
the model could somehow use its own
intelligence
to decide what data it wants to
accept and what data it wants to reject
just like people people don't learn all
data indiscriminately we are
super selective about what we learn and
i think this kind of active learning i
think would be very nice to have
yeah listen i love active learning so
let me ask about the selection of data
can you just elaborate on that a little bit
more do you think the selection of data
is... like i have this kind of sense
that the optimization of how you select
data so
the active learning process is going to
be
a place for a lot of breakthroughs even
in the near future
because there haven't been many
breakthroughs there that are public i
feel like there might be
private breakthroughs that companies
keep to themselves because the
fundamental problem has to be solved
if you want to solve self-driving if you
want to solve a particular
task but do you what do you think about
the space in general
yeah so i think that for something like
active learning or in fact for
any kind of capability like active
learning the thing that it really needs
is a problem
it needs a problem that requires it
it's very hard to do research about the
capability if you don't have a task
because then what's going to happen is
you will come up with an artificial task
get good results
but not really convince anyone right
like we're now past the stage where
getting a result on mnist
with some clever formulation of mnist
will convince people. that's right, in
fact you could
quite easily come up with a simple
active learning scheme on mnist and
get a 10x
speedup, but then so what. and i think
that
with active learning, active learning will naturally
arise
as problems that require it
pop up
that's my take on
it
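for what it's worth, the kind of "simple active learning scheme" being waved at here is usually uncertainty sampling: label only the examples the current model is least sure about. a toy sketch with synthetic stand-ins for the model and the unlabeled pool:

```python
import torch
import torch.nn as nn

# toy uncertainty-sampling loop; the model and data are stand-ins
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
unlabeled = torch.randn(10000, 1, 28, 28)   # pretend pool of unlabeled mnist digits

with torch.no_grad():
    probs = torch.softmax(model(unlabeled), dim=-1)
    # entropy of the predictive distribution: high means the model is uncertain
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)

# ask a human (or the dataset) to label only the 100 most uncertain digits,
# then fine-tune on those instead of learning from everything indiscriminately
query_idx = entropy.topk(100).indices
to_label = unlabeled[query_idx]
print(to_label.shape)   # torch.Size([100, 1, 28, 28])
```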
there's another interesting thing that
openai has brought up with gpt-2, which is
when you create a powerful artificial
intelligence system, it was unclear,
once you
release gpt-2,
what kind of detrimental effect it would
have, because if you have
a model that can generate pretty
realistic text
you can start to imagine that you know
it would be used by bots in some
way that we can't even imagine, so like
there's this nervousness about what it's
possible to do
so you you did a really kind of brave
and i think profound thing which you
started a conversation about this like
how do we release
powerful artificial intelligence models
to the public
if we do it at all, how do we privately
discuss
with other even competitors about
how we manage the use of the systems and
so on
so from that this whole experience you
released a report on it
but in general are there any insights
that you've gathered
from just thinking about this about how
you release models like this
i mean i think that my take on this is
that the field of ai
has been in a state of childhood and now
it's exiting that state
and it's entering a state of maturity
what that means is that ai is very
successful
and also very impactful and its impact
is not only large but it's also growing
and so for that reason it seems wise to
start thinking
about the impact of our systems before
releasing them
maybe a little bit too soon rather than
a little bit too late
and in the case of gpt-2, like i
mentioned earlier
the results really were stunning and it
seemed
plausible it didn't seem certain it
seemed plausible that
something like gpt-2 could easily be used to
reduce the cost of
disinformation, and so
there was a question of what's the best
way to release it and staged release
seemed logical a small model was
released
and there was time to see how
many people used these models in lots of
cool ways. there have been lots of really
cool applications
there haven't been any negative
applications we know of
and so eventually it was released but
also other people replicated similar
models
that's an interesting question though
that we know of so
in your view staged release is
at least part of the answer to the
question of
what do we do once we create a
system like this it's part of the answer
yes
is there any other insights like say you
don't want to release the model at all
because it's useful to you for whatever
the business is
well, plenty of people
don't release models already
right of course but is there some
moral ethical responsibility when you
have a very powerful model to sort of
communicate like just as you said
when you had gpt-2 it was unclear how
much it could be used for misinformation
it's an open question and getting an
answer to that
might require that you talk to other
really smart people that are
outside your particular group
have you please tell me there's some
optimistic pathway
for people across the world to
collaborate on these kinds of cases
or is it still really difficult from
from one company to talk to another
company
so it's definitely possible it's
definitely possible to
discuss these kind of models
with colleagues elsewhere and to
get their take on what to
do. how hard is it though
i mean do you see that happening
i think that's that's a place where it's
important to gradually build trust
between companies
because ultimately all the ai developers
are building technology which is going
to be increasingly more powerful
and so
the way to think about it is that
ultimately we're all in it together
yeah, i tend to believe in
the better angels of our nature but i do
hope
that when you build a really
powerful ai system in a particular
domain
that you also think about the potential
negative consequences of it
it's an interesting and scary
possibility that it'll be a race
for ai development that would push
people to close
that development and not share ideas
with others
i don't love this i've been like a pure
academic for 10 years i really like
sharing ideas and it's fun it's exciting
what do you think it takes to let's talk
about agi a little bit
what do you think it takes to build a
system of human level intelligence we
talked about reasoning
we talked about long-term memory but in
general what does it take you think
well i can't be sure
but i think deep learning plus maybe
another
small idea do you think self-play will
be involved
so like you've spoken about the powerful
mechanism of self-play where
systems learn by sort of uh
exploring the world in a competitive
setting against
other entities that are similarly
skilled as them
and so incrementally improve in this way
do you think self-play will be a
component of
building an agi system yeah so what i
would say
to build agi i think is going to be
deep learning plus some ideas and i
think self-play will be one of those
ideas
i think that self-play
has this amazing property that
it can surprise us
in truly novel ways for example
like, i mean, pretty much every
self-play system,
both our dota bot, and i don't know if you saw openai's
release about
multi-agent, where you had two little
agents who were playing hide and seek
and of course also alpha zero they were
all
surprising behaviors they all produce
behaviors that we didn't expect they are
creative solutions to problems
and that seems like an important part of
agi that our systems don't exhibit
routinely right now
and so that's why i like this area i
like this direction because of its
ability to surprise us
to surprise us. and an agi system would
surprise us fundamentally? yes, and to
be precise, not just
not just a random surprise but to find a
surprising solution to a problem that's
also useful
right now a lot of the self-play
mechanisms have been used
in the game context or at least in the
simulation context
how much, how far along
the path
to agi do you think can be done in
simulation? how much faith
or promise do you have in simulation
versus having to have a system that
operates
in the real world whether it's the real
world of digital
real world data or real world like
actual physical world of robotics
i don't think it's an either or i think
simulation is a tool
and it helps it has certain strengths
and certain weaknesses and we should
use it. yeah but okay, i understand
that's true, but one of the criticisms of
self-play, one of the criticisms of
reinforcement learning, is that
its current power,
its current results, while amazing, have
been demonstrated in simulated
environments
or very constrained physical
environments do you think it's possible
to escape them
escape the simulated environments and be
able to learn in non-simulated
environments
or do you think it's possible to also
just
simulate in the photorealistic and
physics realistic way the real world in
a way that we can solve real problems
with self-play
in simulation so i think that
transfer from simulation to the real
world is definitely possible
and has been exhibited many times by
many different groups. it's been
especially successful in vision
also openai in the summer
demonstrated a robot hand which was
trained entirely in simulation
in a certain way that allowed for
sim-to-real transfer to occur
is this for the rubik's cube? that's
right. and i wasn't aware that was
trained in simulation. it was trained in
simulation entirely
really, so it wasn't in the physical...
the hand wasn't trained?
no, 100% of the training was done in
simulation
and the policy that was learned in
simulation was trained to be very
adaptive
so adaptive that when you transfer it
could very quickly adapt to the physical
to the physical world so the kind of
perturbations with the
giraffe or whatever the heck it was
were those part of the
simulation? well, the simulation was
general
so the policy was trained to be
robust to many different things, but not
the kind of perturbations we had in
the video so
it's never been trained with a glove
it's never been trained with a
stuffed giraffe. so in theory these are
novel perturbations? correct
it's not in theory, in practice
those are novel
perturbations. well that's okay
that's a small-scale but clean
example of a transfer from the simulated
world to the to the physical world
yeah and i will also say that i expect
the transfer capabilities of deep
learning to increase
in general and the better the transfer
capabilities are
the more useful simulation will become
because then you could
experience something in simulation
and then learn the moral of the story
which you could then carry with you to
the real world
right as humans do all the time when
they play computer games
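the usual name for the recipe behind that result is domain randomization: resample the simulator's physics every episode so the policy has to adapt online, and the real robot becomes just one more draw from the distribution. a schematic sketch, with a plain dictionary standing in for a real physics engine and the rl loop omitted:

```python
import random

# schematic domain randomization; the "simulator" here is a trivial stand-in
# for a real physics engine, and the policy update itself is omitted
def make_randomized_sim():
    # sample new physics for every training episode so the policy can never
    # overfit one fixed world; the real robot becomes just one more sample
    return {
        "friction": random.uniform(0.5, 1.5),
        "object_mass": random.uniform(0.8, 1.2),
        "motor_delay": random.uniform(0.0, 0.05),
    }

for episode in range(3):
    sim_params = make_randomized_sim()
    print(f"episode {episode}: train in sim with {sim_params}")
    # a policy with memory (e.g. an lstm) learns to infer whichever parameters
    # it finds itself under, which is what transfers to the physical hand
```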
so let me ask sort of an
embodied question staying on agi for a
sec
do you think an agi system needs to
have a body? do we need to have some of
those human elements of
self-awareness, consciousness, sort of
fear of mortality or self-preservation
in the physical space
which comes with having a body i think
having a body will be useful
i don't think it's necessary but i think
it's very useful to have a body for sure
because you can learn
a whole lot, you can learn things
which cannot be learned without a body
but at the same time i think that you
can if you don't have a body you could
compensate for it and still succeed you
think so
yes. well, there is evidence for this
for example there are many people who
were born deaf and
blind and they were able to compensate
for the lack of
modalities i'm thinking about helen
keller specifically
so even if you're not able to physically
interact with the world
and if you're not able to... i mean, what i
actually was
getting at, maybe let me ask
on the more particular i'm not sure if
it's connected to having a body or not
but
the idea of consciousness and a more
constrained version of that is
self-awareness
do you think an agi system should have
consciousness,
which we can't define, kind of,
whatever the heck you think
consciousness is
yeah, a hard question to answer given how
hard it is to define it
do you think it's useful to think about
i mean it's definitely interesting
it's fascinating i think it's definitely
possible that our systems will be
conscious
do you think that's an emergent thing
that just comes from do you think
consciousness could emerge from the
representation that's
stored within neural networks, so that
it naturally just emerges as you
become
able to represent
more and more of the world
well i'd say i'd make the following
argument which is
humans are conscious and if you believe
that
artificial neural nets are sufficiently
similar to the brain
then there should at least exist
artificial neural nets which should be
conscious too
you're leaning on that existence proof
pretty heavily okay
but that's
the best answer i can give
no, i know, i know
there's still an open question whether
there's not some magic in the brain
that we're not aware of. i mean i don't mean a
non-materialistic
magic, but that the brain might
be a lot more complicated and
interesting than we give it credit for
if that's the case then it should show
up, and
at some point we will find out that we
can't continue to make progress, but i
think it's unlikely
so we talked about
consciousness but let me talk about
another poorly defined concept of
intelligence
again we've talked about reasoning we've
talked about memory
what do you think is a good test of
intelligence for you
are you impressed by the test that alan
turing formulated
with the imitation game of that with
natural language is there something
in your mind that you will be deeply
impressed by
if a system was able to do i mean lots
of things
there are certain
frontiers, there is a certain frontier of
capabilities today
yeah and there exists things outside of
that frontier
and i would be impressed by any such
thing for example
i would be impressed by a deep learning
system
which solves a very pedestrian, you know,
pedestrian task like machine translation
or a computer vision task, or
something which never makes a mistake
a human wouldn't make under any
circumstances
i think that is something which has not
yet been demonstrated and i would find
it very
impressive. yeah, so right now they make
mistakes, and
they might be more accurate than human
beings but they still make a
different set of mistakes
so i would guess that a lot of the
skepticism that some people have about
deep learning
is when they look at their mistakes and
they say well those mistakes
they make no sense like if you
understood the concept you wouldn't make
that mistake
and i think that changing that would
inspire me. that would be: yes, this
is progress
yeah that's that's a really nice way to
put it but i also just
don't like that human instinct to
criticize a model as not intelligent
that's the same instinct as we do when
we criticize
any group of creatures as the other
because it's very possible that
gpt-2 is much smarter than human beings
at many things
that's definitely true it has a lot more
breadth of knowledge yes
breadth of knowledge, and even
perhaps
depth on certain topics it's kind of
hard to judge what
depth means but there's definitely a
sense in which
humans don't make mistakes that these
models do
yes, and the same applies to autonomous
vehicles
the same is probably going to continue
being applied to a lot of artificial
intelligence systems
we find... this is the annoying
process:
in the 21st century the process of
analyzing the progress of ai
is the search for one case where the
system fails
in a big way where humans would not
and then many people writing articles
about it
and then broadly the
public generally gets convinced that the
system is not intelligent
and we like pacify ourselves by thinking
it's not intelligent because of this one
anecdotal case, and this seems to
continue happening
yeah, i mean there is truth to that
though, and i'm sure
that plenty of people are also extremely
impressed by the systems that exist
today
but i think this connects to the earlier
point we discussed that
it's just confusing to judge progress in
ai
yeah and you know you have a new robot
demonstrating something
how impressed should you be and i think
that
people will start to be impressed once
ai starts to really move the needle on
the gdp
so you're one of the people that might
be able to create an agi system here, not
you alone, but you and openai
if you do create an agi system and you
get to spend sort of
an evening with it, him, her
what would you talk about do you think
the very first time? well, the
first time i would just
ask all kinds of questions
and try to get it to make a
mistake, and i would be amazed that it
doesn't make mistakes, and just keep
asking broad questions. okay
what kind of questions do you think
would they be factual or would they be
personal emotional psychological what do
you think
all of the above
would you ask for advice definitely
i mean, why would i limit myself
talking to a system like this
now again let me emphasize the fact that
you truly are one of the people that
might be in the room where this happens
so let me ask a sort of a profound
question
about power. i've just talked to a stalin
historian
i've been talking to a lot of people who
are studying power
abraham lincoln said nearly all men can
stand adversity
but if you want to test a man's
character give him power
i would say the power of the 21st
century maybe
the 22nd but hopefully the 21st would be
the creation of an agi system and the
people who
have control direct possession and
control of the agi system
so what do you think after spending that
evening
having a discussion with the agi system
what do you think you would do
well, the ideal world i'd like to
imagine
is one where humanity are like
the board members of a company
where the agi is the ceo
so
the picture which i would
imagine is you have some kind of
different
entities, different countries or cities
and the people that live there vote
for what the agi that represents them
should do, and then the agi that represents
them goes and does it
i think a picture like that
i find very appealing. you could have
multiple, you would have an agi for a
city, for a country, and
it would be trying to, in effect,
take the democratic process to the next
level and the board can always fire the
ceo
essentially press the reset button and
say re-randomize the parameters here
well let me...
that's actually, okay, that's a
beautiful vision
i think, as long as it's possible to
press the reset button. do you think it
will always be possible to press the
reset button
so i think that it will definitely
be possible to build...
so what you're asking, the question as i
really understand it,
is will humans, will
people, have control over the ai
systems that they build
yes and my answer is it's definitely
possible to build ai systems which
will want to be controlled by their
humans. wow, that's part of their...
so it's not just that they can't help
but be controlled, but
one of the objectives
of their existence is to be controlled
in the same way that human parents
generally want to help their children
they want their children to succeed
it's not a burden for them they are
excited to
help the children and to feed them and
to dress them and to
take care of them and i believe
with highest conviction that the same
will be possible
for an agi it will be possible to
program an agi to design it in such a
way that it will have a similar
deep drive that it will be delighted to
fulfill
and the drive will be to help humans
flourish
but let me take a step back to that
moment where you create the agi system i
think this is a really crucial moment
and between that moment and
the democratic board members with
the agi at the head
there has to be a relinquishing of power
take george washington
despite all the bad things he did one of
the big things he did is he relinquished
power
he first of all didn't want to be
president and
even when he became president he
didn't keep just
serving, as most dictators do,
indefinitely
do you see yourself being able to
relinquish control over an agi system
given how much power you can have over
the world
at first financial, just make a lot of
money
right, and then control by having
possession of the agi system
i'd find it trivial to do that, i'd
find it trivial to
relinquish this kind of power. i mean, you
know, the
the kind of scenario you are describing
sounds terrifying to me
that's all i would absolutely not want
to be in that position
do you think you represent the majority
or the minority
of people in the ai community well i
mean
open question an important one are most
people good is another way to ask it
so i don't know if most people are good
but
i think that when it really counts
people can be better than we think
that's beautifully put. are there
specific mechanisms you can think of
for aligning agi values to human
values
do you think about these
problems of continued alignment
as we develop ai systems? yeah
definitely
in some sense the kind of question which
you are asking
is, if you were to translate that
question to today's terms
yes it would be a question about
how to get an rl agent
that's optimizing a value function which
itself is learned
and if you look at humans humans are
like that because the
reward function the value function of
humans is not external
it is internal that's right and
there are definite ideas of how to train
a value function: basically an objective,
you know,
an as-objective-as-possible perception
system
that would be trained separately
to recognize, to internalize human
judgments
on different situations, and then that
component would then be integrated
as the base value function
for some more capable
rl system. you could imagine a process
like this. i'm not saying this is
the process, i'm saying this is an
example of the kind of thing you could
do
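one concrete reading of that "separately trained perception system that internalizes human judgments" is a learned reward model: a network fit to human scores of situations, then frozen and used as the value signal for an rl agent. a minimal sketch, with synthetic stand-ins for the human labels:

```python
import torch
import torch.nn as nn

# a small network trained to imitate human judgments of situations; the data
# here is synthetic, whereas in practice it would come from human labels
reward_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

situations = torch.randn(256, 32)    # stand-in state features
human_scores = torch.randn(256, 1)   # stand-in human judgments

for _ in range(200):                 # supervised phase: internalize the judgments
    loss = nn.functional.mse_loss(reward_model(situations), human_scores)
    opt.zero_grad()
    loss.backward()
    opt.step()

# rl phase: the frozen reward model now plays the role of the value signal
reward_model.requires_grad_(False)
new_state = torch.randn(1, 32)
print(float(reward_model(new_state)))   # reward the agent would optimize
```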
so on that topic of the objective
functions of human existence
what do you think is the objective
function that
is implicit in human existence what's
the meaning of life
oh
i think the question is wrong in some
way. i think that
the question implies
there is an objective answer, which is an
external answer, you know, your meaning of
life is x
right i think what's going on is that we
exist and
that's amazing and we should try to make
the most of it and try to
maximize our own value and enjoyment of
our very short time while we do exist
it's funny because action does require
an objective function it's definitely
there in some form but it's difficult
to make it explicit
and maybe impossible to make it explicit
i guess is what you're getting at and
that's an interesting
fact of an rl environment well but i was
making a slightly different point, which is
that
humans want things and their wants create
the drives that cause them to act, you know
our wants are our objective functions
our individual objective functions we
can later decide that we want to change
that what we wanted before is no longer
good and we want something else yeah but
they're so dynamic, there's
got to be some underlying, sort
of freudian
things, there's like sexual stuff
there are people who think it's the
fear of death, and there's also the
desire for knowledge and you know all
these kinds of things
procreation the sort of all the
evolutionary arguments
it seems there might be some kind
of fundamental objective function
from which everything else emerges
because that seems very
important. i think that probably
is an evolutionary objective function
which is to survive and procreate and
make sure your children succeed
that would be my guess but it doesn't
give an answer to the question what's
the meaning of life
i think you can see how humans are
part of this big process this ancient
process
we exist on a small planet
and that's it so given that we exist try
to make the most of it and try to
enjoy more and suffer less as much as we
can
let me ask two silly questions about
life
one do you have regrets moments
that if you uh went back you would do
differently and two
are there moments that you're especially
proud of that made you truly happy
so i can answer both
questions. of course
there's a huge number of
choices and decisions that i've made
that
with the benefit of hindsight i wouldn't
have made them and i do experience some
regret but
you know i try to take solace in the
knowledge that at the time i did the
best i could
and in terms of things that i'm proud of,
i'm very fortunate to have
done things i'm
proud of
and they made me happy for
some time, but i don't think that
that is the source of happiness so your
academic accomplishments all the
papers, you're one of the most cited
people in the world
all the breakthroughs i mentioned in
computer vision and language and so on
is what is the source of happiness
and pride for you? i mean all those
things are a source of pride for sure
i'm very
grateful for having done all those
things and it was very fun to do them
but happiness comes from... well,
my current view is that happiness comes,
to a very large degree, from the
way we look at things
you know you can have a simple meal and
be quite happy as a result or you can
talk to someone and
be happy as a result as well or
conversely you can have a meal and be
disappointed that the meal wasn't a
better meal
so i think a lot of happiness comes from
that but i'm not sure i don't want to be
too confident
being humble in the face of the
uncertainty seems to be also a part
of this whole happiness thing well i
don't think there's a better way to end
it than
uh meaning of life and discussions of
happiness so ilya
thank you so much you've given me a few
incredible ideas you've given the world
many incredible ideas i really
appreciate it and thanks for talking
today
yeah thanks for stopping by i
really enjoyed it
thanks for listening to this
conversation with ilya sutskever and
thank you to our presenting sponsor
cash app please consider supporting the
podcast by downloading cash app
and using code lex podcast if you enjoy
this podcast
subscribe on youtube review it with 5
stars on apple podcast
support on patreon or simply connect
with me on twitter
at lex fridman and now
let me leave you with some words from
alan turing on machine learning
instead of trying to produce a program
to simulate the adult mind
why not rather try to produce one which
simulates the child's
if this were then subjected to an
appropriate course of education
one would obtain the adult brain
thank you for listening and hope to see
you next time