Transcript
xoVibFYi1Gs • Language or Vision - What's Harder? (Ilya Sutskever) | AI Podcast Clips
Kind: captions
Language: en
so incredibly you've contributed some of
the biggest recent ideas in AI in
computer vision language natural
language processing reinforcement
learning sort of everything in between
maybe not GANs is there anything there
may not be a topic you haven't touched
and of course the fundamental
science of deep learning what is the
difference to you between vision
language and as in reinforcement
learning action as learning problems and
what are the commonalities do you see
them as all interconnected are they
fundamentally different domains that
require different approaches ok that's a
good question machine learning is a
field with a lot of unity a huge amount
of unity and what I mean by unity is like
overlap of ideas
overlap of principles in fact there is
only one or two or three principles
which are very very simple and then they
apply in almost the same way in almost
the same way to the different modalities
to the different problems and that's why
today when someone writes a paper on
improving optimization of deep learning
in vision it improves the different NLP
applications and it improves the
different reinforcement learning
applications reinforcement learning so I
would say that computer vision and NLP
are very similar to each other today
they differ in that they have slightly
different architectures we use
transformers in NLP and we use
convolutional neural networks in vision
but it's also possible that one day this
will change and everything will be
unified with a single architecture
because if you go back a few years ago
in natural language processing there
were a huge number of architectures for
every different tiny problem had its own
architecture today this is just one
transformer for all those different
tasks and if you go back in time even
more you had even more and more
fragmentation and every little problem
in AI had its own little
subspecialization and its own little
collection of skills people who would
know how to engineer the features now
it's all been subsumed by deep learning we have
this unification and so I expect
vision to become unified with natural
language as well or rather I expect I think
it's possible
I don't want to be too sure because I
think the convolutional neural net is
very computationally efficient RL is
different
RL does require slightly different
techniques because you really do need to
take action you really do need to do
something about exploration your
variance is much higher but I think
there is a lot of unity even there and I
would expect for example that at some
point there will be some broader
unification between RL and supervised
learning where somehow the RL will be
making decisions to make the supervised
learning go better and it'll be I
imagine one big black box and you just
throw every you know you shovel
things into it and it just figures out
what to do with whatever you shovel in
it I mean reinforcement learning has
some aspects of language and vision
combined almost there's elements of a
long term memory that you should be
utilizing and there's elements of a
really rich sensory space so it seems
like the it's like the union of the two
or something like that but I'd say
something slightly differently I'd say
that reinforcement learning is neither
but it naturally interfaces and
integrates with the two of them do you
think action is fundamentally different
so yeah what is interesting about what
is unique about policy of learning to
act well so one example for instance is
that when you learn to act you're
fundamentally in a non-stationary world
because as your actions change the
things you see start changing you you
experience the world in a different way
and this is not the case for the more
traditional static problem you have at
least some distribution and you just
apply a model to that distribution you
think it's a fundamentally different
problem or is it just more difficult
generally it's a generalization of the
problem of understanding I mean it's
it's it's a question of definitions
almost there is a huge you know there's
a huge amount of commonality for sure
there you take
gradients you try to approximate
gradients in both cases in
the case of reinforcement learning you
have some tools to reduce the variance
of the gradients you do that there's
lots of commonality use the same neural
net in both cases you compute the
gradient you apply Adam in both cases so
I mean there's lots in common for sure
but there are some small differences
which are not completely insignificant
it's really just a matter of your of
your point of view what frame of
reference you what how much do you
want to zoom in or out as you look at
these problems which problem do you
think is harder so people like Noam
Chomsky believe that language is
fundamental to everything so it
underlies everything do you think
language understanding is harder than
visual scene understanding or vice versa
I think that asking if a problem is
hard is slightly wrong I think the
question is a little bit wrong and I
want to explain why so what does it mean
for a problem to be hard
okay the uninteresting dumb answer to
that is there's a there's a benchmark
and there's a human level performance on
that benchmark and how is the effort
required to reach the human level
okay benchmark so from the perspective
of how much until you get to human level
on a very good benchmark yeah like some
and I I understand what you mean
by that
so what I was going to say is
that a lot of it depends on you know
once you solve a problem it stops being
hard and that's that's always
true and so whether something is hard or
not depends on what our tools can do today
so you know I'd say today true human
level language understanding and visual
perception are hard in the sense that there
is no way of solving the problem
completely in the next three months
right so I agree with that statement
beyond that I'm just I'd be my guess
would be as good as yours I don't know
oh okay so you don't have a fundamental
intuition about how hard language
understanding is I think I know I
changed my mind I'd say language is
probably going to be harder I mean it
depends on how you define it
like if you mean absolute top notch 100
percent language understanding I'll go
with language and so but then if I show
you a piece of paper with letters on it
is that if you see what I mean it's um
you have a vision system you say it's
the best human level vision system I
show you I open a book and I show you
letters will it understand how
these letters form into words and
sentences and meaning is this part of
the vision problem where does the vision
end and language begin
yeah so Chomsky would say it starts at
language so vision is just a little
example of the kind of structure and you
know fundamental hierarchy of ideas
that's already represented in our brain
somehow that's represented through
language but where does vision stop and
language begin that's a really
interesting question it so one
possibility is that it's impossible to
achieve really deep understanding in
either images or language without
basically using the same kind of system
so you're going to get the other for
free I think I think it's pretty likely
that yes if we can get one our
machine learning is probably that good
that we can get the other but I'm not
100% I'm not 100% sure and also but
I think a lot a lot of it really does
depend on your definitions definitions
of like perfect vision because you know
reading is vision but should it count
yeah to me so my definition is if a system
looked at an image and then a system
looked at a piece of text and then told
me something about that and I was really
impressed that's relative you'll be
impressed for half an hour and then
you're gonna say well I mean all the
systems do that but here's the thing
they don't do yeah but I don't have that
with humans humans continue to impress
me is that true well the ones okay so
I'm a fan of monogamy so I like the idea
of marrying somebody being with them for
several decades so I believe in the fact
that yes it's possible to have somebody
continuously giving you pleasurable
interesting witty new ideas friends yeah
I think I think so they continue to
surprise you the surprise it's a you
know that injection of randomness seems
to be a it seems to be a nice source of
yeah continued inspiration like the the
wit the humor I think yeah that that
would be it's a very subjective test but
I think if you have enough humans in
the room yeah I I understand what you
mean yeah I feel like I misunderstood
what you meant by impressing you I
thought you meant to impress you with
its intelligence with how how with how
well it understands an image I
thought you meant something like I'm
gonna show you a really complicated image
and it's gonna get it right and you're
gonna say wow that's really cool our
systems of you know January 2020 have
not been doing that yeah no I I think it
all boils down to like the reason people
click like on stuff on the internet
which is like it makes them laugh so
it's like humor or wit yeah or insight
I'm sure we'll get it as get that as
well