Transcript

LRYkH-fAVGE • Jitendra Malik: Computer Vision | Lex Fridman Podcast #110
Kind: captions
Language: en
the following is a conversation with
jitendra malik
a professor at berkeley and one of the
seminal figures in the field of computer
vision
the kind before the deep learning
revolution and
the kind after he has been cited
over 180 thousand times and has mentored
many world-class researchers in computer
science
quick summary of the ads two sponsors
one new one
which is better help and an old goody
expressvpn please consider supporting
this podcast by going to betterhelp.com/lex
and signing up at expressvpn.com/lexpod
click the links buy the stuff
it really is the best way to support
this podcast and the journey i'm on
if you enjoy this thing subscribe on
youtube review it with 5 stars on apple
podcast
support it on patreon or connect with me
on twitter
at lex fridman however the heck you
spell that
as usual i'll do a few minutes of ads
now and never any ads in the middle that
can break the flow of the conversation
this show is sponsored by better help
spelled h-e-l-p help
check it out at betterhelp.com/lex
they figure out what you need and match
you with a licensed professional
therapist
in under 48 hours it's not a crisis line
it's not self-help it's professional
counseling done securely
online i'm a bit from the david goggins
line of creatures
as you may know and so have some demons
to contend with
usually on long runs or all nights
working
forever and possibly full of self-doubt
it may be because i'm russian
but i think suffering is essential for
creation
but i also think you can suffer
beautifully in a way that doesn't
destroy you
for most people i think a good therapist
can help in this
so it's at least worth a try check out
their reviews
they're good it's easy private
affordable
available worldwide you can communicate
by text anytime
and schedule weekly audio and video
sessions
i highly recommend that you check them
out at betterhelp.com/lex
this show is also sponsored by
expressvpn get it at expressvpn.com
to support this podcast and to get an
extra three months
free on a one-year package i've been
using expressvpn for many years
i love it i think expressvpn is the best
vpn out there they told me to say it
but it happens to be true it doesn't log
your data
it's crazy fast and it's easy to use
literally just
one big sexy power on button
again for obvious reasons it's really
important that they don't log your data
it works on linux and everywhere else
too but really
why use anything else shout out to my
favorite flavor of linux ubuntu mate
20.04 once again get it at
expressvpn.com/lexpod
to support this podcast and to get an
extra three months free on a one-year
package
and now here's my conversation with
jitendra
in 1966 seymour papert
at mit wrote up a proposal called the
summer vision project to be given
as far as we know to 10 students to work
on and solve that summer
so that proposal outlined many of the
computer vision tasks we still work on
today
why do you think we underestimate and
perhaps we did underestimate and perhaps
still underestimate
how hard computer vision is because
most of what we do in vision we do
unconsciously or subconsciously
in human vision in human vision so that
gives us
this that effortlessness gives us the
sense that oh
this must be very easy to implement on a
computer
now this is why
the early researchers in ai got it so
wrong
however if you go into neuroscience or
psychology
of human vision then the complexity
becomes very clear
the fact is that a very large part of
the
the cerebral cortex is devoted to visual
processing
i mean and this is true in other
primates as well
so once we looked at it from a
neuroscience or psychology perspective
it it becomes quite clear that the
problem is very challenging and it will
take some time
you said the higher level parts are the
harder parts
i think vision appears to be easy
because
most of visual processing
is subconscious or unconscious right
so we underestimate the difficulty
whereas
uh when you are
like proving a mathematical theorem or
playing chess
the difficulty is much more evident so
because it is your conscious brain which
is processing
uh various aspects of the
problem-solving
behavior whereas in vision all this is
happening but it's not in your
awareness it's in your it's operating
below that
but it's it still seems strange yes
that's true but it seems strange that
as computer vision researchers for
example
the community broadly time and time
again makes the mistake of
thinking the problem is easier than it
is or maybe it's not a mistake
we'll talk a little bit about autonomous
driving for example how hard of a vision
task that is
it do do you think i mean what
is it just human nature or is there
something fundamental to the vision
problem that we we underestimate
we're still not able to be cognizant of
how hard the problem is
yeah i think in the early days it could
have been excused because
in the early days all aspects of ai were
regarded as too easy
but i think today it is much less
excusable
and i think why people
fall for this is because of what i call
the fallacy of the successful first
step there are many problems in
vision where getting 50 percent
of the solution you can get in one
minute getting to 90 percent
can take you a day getting to 99 percent
may take you five years and
99.99 percent may be not in your lifetime
i wonder if that's unique to vision
that it seems that language people are
not so
confident about so natural language
processing people are a little bit more
cautious about our ability
to to solve that problem
i think for language people intuit that
we have to be able to do
natural language understanding for
vision
it seems that we're not cognizant or we
don't think about how much
understanding is required it's probably
still an open problem
but in your sense how much understanding
is required to solve vision like
this put another way how much
something called common sense reasoning
is
required to really be able to interpret
even static scenes yeah so vision
operates at
uh at all levels and there are parts
which are
which can be solved with what we could
call maybe peripheral processing
so in the in the human vision literature
there used to be these terms
sensation perception and cognition
which roughly speaking referred to like
the front end of processing
middle stages of processing and higher
level of processing
and i think they made a big deal out of
out of this and they wanted to just
study only perception and then
dismiss certain certain problems as
being quote cognitive
but really i think these are artificial
divides
the problem is continuous at all levels
and there are challenges at all levels
the techniques that we have today
they work better at the lower and mid
levels of the problem
i think the higher levels of the problem
quote the cognitive levels of the
problem
are there and we
in many real applications we have to
confront them
now how much that is necessary will
depend on the application
for some problems it doesn't matter for
some problems it matters a lot
so i am for example
a pessimist on fully autonomous driving
in the near future
and the reason is because i think there
will be
that 0.01 percent of the cases
where quite sophisticated cognitive
reasoning is called for
however there are tasks where you can
first of all they are much more they are
robust so in the sense that
error rates error is not so much of a
problem
for example uh uh let's say we are
you're doing uh
image search you're trying to get images
based on some
some some description some visual
description
we are very tolerant of errors there
right i mean when google image search
gives you some images back and a few of
them are
wrong it's okay it doesn't hurt anybody
there's no
there's not a matter of life and death
but
making mistakes when you're
driving at 60 miles per hour and you
could potentially kill somebody
is much more important so just for the
for the fun of it since you mentioned
let's go there briefly
about autonomous vehicles so one of the
companies in the space tesla
with andrej karpathy and elon
musk are working on
a system called autopilot which is
primarily a vision-based system with
eight cameras
and uh basically a single neural network
a multi-task neural network
they call it hydranet multiple
heads
so it does multiple tasks but is forming
the same representation
at the core do you think driving can be
converted
in this way to uh purely a vision
problem
and then solved with learning
or even more specifically in the current
approach
what do you think about what tesla
autopilot team is doing
so the way i think about it is that
there are certainly
subsets of the vision-based
driving problem which are quite solvable
so for example driving in freeway
conditions
is quite a solvable problem i think
there were demonstrations of that going
back to the 1980s by
someone called ernst dickmanns in munich
in the 90s there were approaches from
carnegie mellon there were approaches
from
our team at berkeley in the 2000s there
were approaches from stanford
and so on so autonomous driving in
certain settings is very doable
the challenge is to have an autopilot
work
under all kinds of driving conditions
at that point it's not just a question
of vision
or perception but really also of control
and dealing with all the edge cases
so where do you think most of the
difficult cases
to me even the highway driving is an
open problem because
uh it applies the same 50 90 95 99 rule
or the first step the fallacy of the
first step i forget how you put
it we fall victim to i think even
highway driving has a lot of elements
because to solve autonomous driving you
have to completely relinquish
the the fat help of a human being
you're always in control so that you're
really going to feel the edge cases
so i i think even highway driving is
really difficult but
in terms of the general driving task do
you think
vision is the fundamental problem or is
it
also your action the the interaction
with the environment
the ability to uh and then like the
middle ground i don't know if you put
that under vision which is
trying to predict the behavior of others
which is a little bit
in the world of understanding the scene
but it's also trying to form a model of
the actors in the scene
and predict their behavior yeah i
include that in vision because
to me perception blends into cognition
and building predictive models of other
agents in the world
which could be other agents could be
people other agents could be other cars
that is part of the task of perception
because
perception always has to uh not tell us
what is now but what will happen
because what's now is boring it's done
it's over with
okay yeah we care about the future
because we
act in the future and we care about the
past
in as much as it informs what's going
to happen in the future
so i think we have to build predictive
models of of
of behaviors of people and and those can
get quite
complicated so uh uh
i mean uh i i've seen examples of this
in
uh actually i mean i own a tesla
and it has various safety features built
in
and uh what i see are these examples
where
let's say there is some uh skateboarder
i mean
this i and i i don't want to be too
critical because
obviously this is these are the systems
are always being improved
and any specific criticism i have
maybe the system six months from now
will not have that
that that particular failure mode
so uh it
it had it it had the wrong response and
it's because it couldn't predict
what what this skateboarder was going to
do
okay and because it really required that
higher level cognitive understanding
of what skateboarders typically do as
opposed to a normal pedestrian
so what might have been the correct
behavior for a pedestrian
a typical behavior for pedestrian was
not the typical behavior for a
skateboarder right yeah
and uh so so therefore
to do a good job there you need to have
enough data where
you have pedestrians you also have
skateboarders
you've seen enough skateboarders to see
what
uh what kinds of patterns of behavior
they have
so it is it is in principle with enough
data that problem could be solved
but uh i think our current
systems computer vision systems they
need far
far more data than humans do for
learning those
same capabilities so say that there is
going to be a system that solves
autonomous driving
do you think it will look similar to
what we have today
but have a lot more data perhaps more
compute but the fundamental
architectures involved
like neuro well in the case of tesla
autopilot is
neural networks do you think it will
look similar
in that regard and we'll just have more
data that's a
scientific hypothesis as which way is it
going to go
uh i will tell you what i would bet on
uh
so and this is my general
philosophical position on how these
uh learning systems have been
uh what we have found currently very
effective in
computer vision uh with in in the deep
learning paradigm is
sort of tabula rasa learning and tabula
rasa learning
in a supervised way with lots and lots
of what's going on
in the sense that blank slate we just
have the system which is
given a series of experiences in this
setting and then it learns there
now if let's think about human driving
it is not tabula rasa learning
so at the age of 16 in high school
uh a teenager goes into uh
goes into driver ed class right and now
at that point they learn but at the age
of 16 they are already visual geniuses
because from 0 to 16 they have built a
certain repertoire of vision
in fact most of it has probably been
achieved by
age 2 right in in this period of age
up to age 2 they know that the world is
three-dimensional they know how
objects look like from different
perspectives
they know about occlusion they
know about common dynamics of humans and
other bodies
they have some notion of intuitive
physics so they
they built that up from their
observations and interactions
in early childhood and of course
reinforced through
their their growing up to age 16. so
then
at age 16 when they go into driver ed
what are they learning they're not
learning afresh the visual world
they have a mastery of the visual world
what they are learning
is control okay they are learning how to
be smooth
about control about steering and brakes
and so forth
they're learning a sense of typical
traffic situations
now the the that education process
can be quite short because they are
coming in as visual geniuses
and of course in their future they're
going to encounter situations which are
very novel
right so during my driver ed class
that i may not have had to deal with a
skateboarder i may not have had to deal
with a truck
driving in front of me who's from
who's where the back opens up and some
junk gets dropped from the truck
and i have to deal with it right but i
can deal with this
as a driver even though i did not
encounter this in my driver ed
class and the reason i can deal with it
is because i have all this
general visual knowledge and expertise
and uh do you think the learning
mechanisms we have today
can do that kind of long-term
accumulation of knowledge
or do we have to uh do some kind of
you know in the the the work that led up
to expert systems with knowledge
representation
you know the broader field of what of
artificial intelligence
uh worked on this kind of accumulation
of knowledge
do you think neural networks can do the
same i think uh
i don't see any in principle problem
with neural networks doing it
but i think the learning techniques
would need to evolve significantly
so the current
learning techniques that we have
are supervised learning you're given
lots of examples
x y pairs and you learn the
functional mapping between them
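As a minimal sketch of what "given x, y pairs, learn the functional mapping" can mean in the simplest case, the snippet below fits a straight line to a handful of example pairs by least squares. The data points and the helper name fit_line are made up for illustration; the systems discussed here use neural networks with far more parameters, not a line.

```python
# Illustrative sketch only: supervised learning reduced to its simplest form.
# We are given (x, y) example pairs and learn a mapping y ~ slope * x + intercept.

def fit_line(pairs):
    """Least-squares fit of a line to a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training examples (assumed data, not from the conversation).
pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
slope, intercept = fit_line(pairs)


def predict(x):
    """Apply the learned mapping to a new input."""
    return slope * x + intercept


print(slope, intercept, predict(4.0))
```

The learner here never chooses its own examples; it only sees the pairs it is handed.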
i think that human learning is far
richer than that
it includes many different components
there are
there is a a child explores the world
and sees as for example a child
takes an object and manipulates it
in his or her hand and therefore gets to
see the object from different points of
view
and the child has commanded the movement
so that's a kind of learning data but
the learning data has been
arranged by the child and this is a very
rich
kind of data the child can do various
experiments with the world so
so there are many aspects of sort of
human learning and these have been
studied in
in child development by psychologists
and what they tell us is that
supervised learning is a very small part
of it
there are many different aspects of
learning
and what we would need to do is to
develop models of
all of these and then
train our systems in that with that kind
of
uh protocol so new new methods of
learning
yes some of which might imitate the
human brain but you also
in your talks have mentioned some of the
compute side of things
in terms of the difference in the
human brain referencing moravec
hans moravec so
do you do you think there's something
interesting valuable to consider about
the difference
in the computational power of the human
brain versus
the computers of today in terms of
instructions
per second yes so if we go back
uh so so this is a point i've been
making for 20 years now
and i think once upon a time the way i
used to
argue this was that we just didn't have
the computing power of the human brain
our computers were uh were not
quite there and i mean there is a
well well-known trade-off which we know
that
the that neurons are slow compared to
transistors but uh but we have a lot of
them and they have a very high
connectivity
whereas in silicon you have much faster
devices transistors switch at
on the order of nanoseconds but the
connectivity is usually smaller
right at this point in time i mean we
are now talking about
2020 we do have if you consider the
latest gpus and so on
amazing computing power and if we look
back at hans moravec's type of
calculations which he did in the 1990s
we may be there today in terms of
computing power comparable to the brain
but it's not of the same style
it's of a very different style
so i mean for example the the style of
computing that we have in our gpus
is far far more power hungry than
the style of computing that is there in
the human brain or other
biological uh entities
yeah and that the efficiency part is uh
we're gonna have to solve that in order
to build actual real world systems
of large scale let me ask sort of
the high level question step taking a
step back
how would you articulate the general
problem of computer vision
does such a thing exist so if you look
at the computer vision conferences and
the work that's been going on
it's often separated into different
little segments
breaking the problem of vision apart
into whether segmentation
3d reconstruction object detection
i don't know image captioning whatever uh
there's benchmarks for each
but if you were to sort of
philosophically say what is
the big problem of computer vision does
such a thing exist
yes but it's not in isolation so
if we have to so for all
intelligence tasks i
always go back to sort of biology or
humans and if we think about
vision or perception in that setting we
realize that
perception is always to guide action
perception
for a biological system does not
give any benefits
unless it is coupled with action so we
can go back
and think about the first multicellular
animals
which arose in the cambrian era you know
500 million years ago
and uh these animals could move
and they could see in some ways and
these two
activities helped each other because uh
uh how does movement help movement
helps that because you can get food in
different places
but you need to know where to go and
that's really about
perception or seeing i mean i mean
vision is
perhaps the single most important perception sense
but
all the others are equally are also
important so
uh so perception and action kind of grow
go together
so earlier it was in these very simple
feedback loops
which were about uh finding food
or avoiding becoming food if there's a
predator running uh
trying to you know eat you up
and and so forth so so we must at the
fundamental level connect
perception to action then
as we evolved uh perception became more
and more sophisticated
because it served many more purposes and
uh so today we have what seems like a
fairly general purpose capability
which can look at the external world and
build and
a model of the external world inside the
head
we do have that capability that model is
not perfect
and psychologists have great fun in
pointing out the ways in which
the model in your head is not a perfect
model of the external world
and they have created various illusions
to
show the ways in which it is imperfect
but
it's amazing how far it has come from a
very simple
perception action loop that exists
in you know
an animal 500 million years ago once we
have this
these very sophisticated visual systems
we can then
impose a structure on them it's we as
scientists who are imposing that
structure
where we have chosen to characterize
this part of the system as this
quote module of object detection or quote
this module of 3d reconstruction
what's going on is really all of these
processes are running
simultaneously and uh
and and they are running simultaneously
because originally their purpose was
in fact to help guide action so
as a guiding general statement of a
problem do you think
we can say that the the general problem
of computer vision
you said in humans it was tied to action
do you think we should also say that
ultimately the the goal
the problem of computer vision is to
sense the world
in the way that helps you act in the
world
yes i think that's the most fundamental
uh
that's the most fundamental purpose
we have by now hyper evolved
so we have this visual system which can
be used for other things
for example judging the aesthetic value
of a painting
and this is not guiding action maybe
it's guiding action in terms of how much
money you will put in your auction bid
but that's a bit stretched
but the basics are in fact in terms of
action
but we have we've evolved
really this hyper uh we have hyper
evolved our visual system
actually just too uh sorry to interrupt
but perhaps it is
fundamentally about action you kind of
jokingly said about spending
but perhaps the capitalistic
uh drive that drives a lot of the
development in this world
is is about to exchange your money and
the fundamental action is money if you
watch
netflix if you enjoy watching movies
you're using your perception system to
interpret the movie
ultimately your enjoyment of that movie
means you'll subscribe to netflix
so the action is this uh
this extra layer that we've developed in
modern society perhaps this is
fundamentally tied to the action of
spending money
well certainly with respect to uh
you know interactions with firms so so
in this homo economicus role
when you're interacting with firms it
does become
uh it does become that that's what else
is there
uh that was a rhetorical question okay
so
to to linger on the division between the
static and the dynamic
so much of the work in computer vision
so many of the breakthroughs that you've
been a part of
have been in the static world in
looking at static images and then you've
also
worked on starting but it's a much
smaller degree the community is looking
at dynamic and video
at dynamic scenes and then there is
robotic vision
which is dynamic but also where you
actually have a robot in the physical
world
interacting based on that vision
which problem is harder
the the the intuit sort of the the
trivial first answers
well of course one image is harder but
so if you look at a deeper question
there
are we um what's the term cutting
ourselves
cutting ourselves at the knees or like
making the problem harder by focusing on
the images that's a fair question i
think
sometimes we we can simplify our problem
so much
that we essentially lose
part of the juice that could enable us
to solve the problem
and one could reasonably argue that to
some extent this happens when we go from
video to single images
now historically uh you have to consider
the limits
imposed by the computation capabilities
we had
so if we many of the choices made in the
computer vision community
uh through the 70s 80s 90s
can be understood as
choices which were forced upon us by
the fact that we just didn't have access
to compute
enough compute not enough memory not enough
hard drives not
exactly not enough not enough compute
not enough storage
so so think of these choices so one of
the choices is
focusing on single images rather than
video okay
clear questions storage and compute
we had to focus on we did we
used to detect edges and throw away the
image right so you have an image
which is say 256 by 256 pixels and
instead of keeping around the grayscale
value what we did was we detected edges
find the places where the brightness
changes a lot
so now that and now and then throw away
the rest
so this was a major compression device
and the hope was that this makes it
that you can still work with it and the
logic was humans can interpret a line
drawing
and uh and yes and this will save us
computation so many of the choices were
dictated by that
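A rough sketch of the "detect edges and throw away the image" idea described above: mark pixels where brightness changes a lot between neighbors and keep only that binary map. The threshold value, the synthetic test image, and the function names are assumptions for illustration. Edge detectors of that era were considerably more refined than a plain gradient threshold.

```python
# Illustrative sketch only: gradient-threshold edge detection on a small
# grayscale image stored as a list of rows (brightness values 0..255).

def detect_edges(image, threshold=50):
    """Return a binary edge map: 1 where brightness changes sharply."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height - 1):
        for x in range(width - 1):
            dx = image[y][x + 1] - image[y][x]  # horizontal brightness change
            dy = image[y + 1][x] - image[y][x]  # vertical brightness change
            if abs(dx) + abs(dy) >= threshold:
                edges[y][x] = 1
    return edges


def make_test_image(size=256):
    """Hypothetical test image: dark left half, bright right half."""
    half = size // 2
    return [[0 if x < half else 200 for x in range(size)] for _ in range(size)]


image = make_test_image()
edges = detect_edges(image)

# After this step only the edge map is kept and the grayscale values are
# discarded, which is the compression being described.
edge_pixels = sum(sum(row) for row in edges)
print(edge_pixels, "edge pixels out of", 256 * 256)
```

Most of the 65,536 pixels end up as zeros, which is what made the representation so cheap to store and process.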
i think uh today
we are no longer detecting edges right
we
process images with convnets because we
don't need to we don't have that
those compute restrictions anymore now
video is still
under studied because video compute is
still quite challenging
if you are a university researcher i
think
video computing is not so challenging if
you are at
google or facebook or amazon still super
challenging i've
just spoke with the vp of engineering
google head of
the youtube search and discovery and
they still struggle doing stuff on
video it's very difficult except doing
except using techniques that are
essentially the techniques you used in
in the 90s some very basic computer
vision techniques
no that's when you want to do things at
scale so if
you want to operate at the scale of all
the content of youtube it's very
challenging and there's similar issues
in
facebook but as a researcher you
you have you have more uh you know
opportunities
you can train large neural networks
with relatively large
uh video data sets yeah yes so i think
that
this is part of the reason why we have
so emphasized static images
i think that this is changing and over
the next few years
i see a lot more progress happening in
in video so i have this generic
statement that
to me video recognition feels like 10
years behind
object recognition and you can quantify
that because
you can take some of the challenging
video data sets and
their performance on action
classification is like say 30 percent
which is kind of what we used to have
around
2009 in object detection you know so
it's like about 10 years behind
and uh whether it'll take 10 years to
catch up is a different question
hopefully it will take less than that
let me ask a similar question i've
already asked but once again so for
dynamic scenes
do you think do you think some kind of
injection
of knowledge bases and reasoning is
required
to help improve like action recognition
like if if if um
if we solve the general action
recognition problem
what do you think the solution would
look like it's another way yeah
so i i completely
agree that knowledge is called for and
that knowledge can be
quite sophisticated so the way i would
say it is that
perception blends into cognition and
cognition brings in
issues of memory and
this notion of a schema from psychology
which is
uh let me use the classic example which
is
you go to a restaurant right now the
things that
happen in a certain order you walk in
somebody takes you to a table
a waiter comes gives you a menu
takes the order food arrives eventually
a
bill arrives etc etc this is a classic
example of ai from the 1970s
uh it was called there was the term
frames and
scripts and schemas these are all quite
similar ideas
okay in the 70s the way
the ai of the time dealt with it was by
hand coding this
so they hand coded in this notion of a
script and the various
stages and the actors and so on and so
forth
and use that to interpret for example
language
i mean if there's a description of a of
a story involving
some people eating at a restaurant there
are all these
inferences you can make because you know
what happens typically at a restaurant
so i think this kind of uh
this kind of knowledge is absolutely
essential so i think
that when we are going to do long-form
video understanding
we are going to need to do this i think
the kinds of technology that we have
right now with
3d convolutions over a couple of seconds
of clip or video
it's very much tailored towards
short-term video understanding
not that long-term understanding
long-term understanding
requires a notion of
this notion of schemas that i talked
about perhaps some notions of
goals intentionality functionality
and so on and so forth now
how will we bring that in so we could
either revert back to the 70s and say
okay i'm going to hand code in
a script or we might
try to learn it so i
tend to believe that we have to find
learning ways of doing this
because i think learning ways land up
being more robust
and there must be a learning version of
the story because
uh children acquire a lot of this
knowledge
by uh sort of just observation so
at no moment in a child's life there's a
it's possible but i think it's not so
typical that
somebody that a mother coaches a child
through all the stages of what happens
in a restaurant
they just go as a family they they
they go to the restaurant they eat come
back and the child goes through 10 such
experiences
and the child has has got a schema of
what happens when you go to a restaurant
so we somehow need to we need to provide
that capability to our systems
you mentioned the following line from
the end of the alan turing paper
uh computing machinery and intelligence
that many people
like you said many people know and very
few have read
where he proposes the turing test this
is this is how you know because it's
towards the end of the paper
instead of trying to produce a program
to simulate the adult mind
why not rather try to produce one which
simulates the child's
so that's a really interesting point if
i think about the benchmarks we have
before us the the tests
of our computer vision systems they're
often kind of trying to
get to the adult so what kind of
benchmarks should we have
what kind of tests for computer vision
do you think we should have
that mimic the child's in computer
vision yeah
i think we should have those and we
don't have those today
and i think uh the part of that
the challenge is that we should really
be collecting data
of the type that a child uh that the
child experiences
right so that gets into issues of you
know privacy and so on and so forth
but there are attempts in this direction
to
sort of try to collect the kind of data
that a child
encounters growing up so what's the
child's linguistic
environment what's the child's visual
environment
so if we could collect that kind of data
and then develop learning schemes based
on that data
that would be one way to do it i
i think that's a very promising
direction myself there might be people
who would argue
that we could just short circuit this in
some way and
uh sometimes we have
imitated uh we have not
we have had success by not imitating
nature in detail so
the usual example is airplanes right we
don't build
flapping wings so uh
yes that's uh that's one of the points
of debate
uh in my mind i i i would i would bet on
this this learning like a child approach
so one of the fundamental aspects of
learning like a child is the
interactivity
so the child gets to play with the data
set it's learning from
yes it gets to select i mean you
can call that active learning you can
you know in the machine learning world
you can call it a lot of terms
what are your thoughts about this whole
space of being able to play with the
data set or select what you're learning
yeah so i think that uh i
i believe in that and i think that we
could achieve it in in two ways and i
think we should use both
so one is uh actually
real robotics right so real uh
you know physical embodiments of agents
who are interacting with the world and
they have a physical body with
dynamics and mass and moment of inertia
and friction and all the rest and you
learn your body the robot learns its
body by
doing a series of
actions the second is that simulation
environments
so i think simulation environments are
getting much much better
in my in my life in
facebook ai research our group has
worked on something called habitat
which is a simulation environment
which is a visually photorealistic
environment of
you know places like houses or interiors
of
various urban spaces and so forth and as
you move
you get a picture which is a pretty
accurate picture
so uh i i can now uh you can imagine
that
subsequent generations of these
simulators will be accurate not just
visually but with respect to
you know forces and masses and
haptic interactions and so on
and uh then then we have that
environment to play with
i think that let me state one reason why
i think
this active being able to act in the
world is
important i think that this is one way
to break
the correlation versus causation barrier
so this is something which is of a great
deal of interest these days i mean
people like judea pearl have
talked a lot about uh why
that we are neglecting causality and he
describes the entire set of successes of
deep learning as just curve fitting
right because it's uh but i i don't
quite agree
about as a troublemaker he is but uh
causality
is important but causality is not
is not like a single silver bullet it's
not like one single principle there are
many different aspects here
and one of the ways in which uh
one of our most reliable ways of
establishing causal links and this is
the way
for example the the medical community
does this is
randomized controlled trials so you have
you
you pick some situation and now in some
situation you perform an action and
for certain others you don't
right so so you have a control
experiment well the child is in fact
performing controlled experiments all
the time
right right right okay small scale and
in a small scale and
but but that is a way that the child
gets to
build and refine its causal models of
the world
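
The contrast drawn here between watching and acting can be sketched in a few lines. This is only an illustrative simulation under made-up assumptions, not anything described in the conversation: a hypothetical hidden cause fully determines the outcome, the action has no real effect, and the names `simulate` and `effect_estimate` are invented for the example. Passive observation sees the action and the outcome go together; randomly assigning the action, as in a controlled trial, makes the apparent effect vanish.

```python
import random

def mean(values):
    return sum(values) / len(values)

def effect_estimate(samples):
    """Difference in mean outcome between cases with and without the action."""
    acted = [outcome for action, outcome in samples if action == 1]
    not_acted = [outcome for action, outcome in samples if action == 0]
    return mean(acted) - mean(not_acted)

def simulate(n, randomized, rng):
    """Generate (action, outcome) pairs; the action never causes the outcome."""
    samples = []
    for _ in range(n):
        hidden = rng.random() < 0.5  # hypothetical hidden cause of the outcome
        if randomized:
            # controlled experiment: the action is assigned by a coin flip
            action = rng.random() < 0.5
        else:
            # passive observation: the action usually co-occurs with the hidden cause
            action = hidden if rng.random() < 0.9 else not hidden
        outcome = 1 if hidden else 0  # outcome depends only on the hidden cause
        samples.append((int(action), outcome))
    return samples

rng = random.Random(0)
observed = effect_estimate(simulate(10_000, randomized=False, rng=rng))
controlled = effect_estimate(simulate(10_000, randomized=True, rng=rng))
print(f"observational estimate: {observed:.2f}")
print(f"randomized estimate:    {controlled:.2f}")
```

The observational estimate comes out large even though the action does nothing, while the randomized estimate stays near zero, which is the sense in which acting on the world can separate correlation from causation.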
and my colleague alison gopnik has
together with a couple of authors
co-authors has this book called the
scientist in the crib
referring to children so i like the part
that i like about that is
the scientist wants to do wants to build
causal models
and the scientist does control
experiments and i think the child is
doing that
so to enable that we will need to
have these these active experiments
and i think this could be done some in
the real world and some in simulation so
you have hope for simulation
i have hope for simulation that's an
exciting possibility if we can get to
not just photo realistic but what's that
called
life realistic yeah uh simulation
so you don't see any fundamental
blocks to why we can't eventually
simulate
the the principles of what it means to
exist in the world
as a physical i i don't see any
fundamental problems there i mean
and look the computer graphics community
has come a long way
right so the in the early days back
going back to the 80s and 90s they were
they were focusing on visual realism
right and then they could do the easy
stuff but they couldn't do stuff like
hair or fur and so on
okay well they managed to do that then
they couldn't do physical
actions right like there's a bowl of
glass and it falls down and it shatters
but then they could start to do pretty
realistic models of that
and so on and so forth so the graphics
people have shown that they can do
this forward direction not just for
optical interactions but also for
physical interactions
so i think uh of course some of that is
very compute intensive but
i think by and by we will find ways of
making our models ever more realistic
you break vision apart into in one of
your presentations
early vision static scene understanding
dynamic scene understanding
and raise a few interesting questions i
thought i could just throw some
some at you just to see if you want to
talk about them
so early vision so it's what is it
you said um sensation
perception and cognition so is this a
sensation yes
what can we learn from image statistics
that we don't already know
so at the lowest level what um
what can we make from just this the the
statistic the basics so there were the
variations in the raw pixels the
textures and so on
yeah so what we seem to have learned is
uh
uh uh is that there's a lot of
redundancy in these images and
as a result we are able to do a lot of
compression
and and this compression is very
important in biological settings right
so you might have ten to the eight
photoreceptors and only ten to the six
fibers in the optic nerve so you have to
do this compression by
a factor of a hundred to one and
uh and uh so there are analogs of that
which are happening in
in our neural net artificial neural
network that's the early layer so you
think
there's a lot of compression that can be
done in the beginning
yeah just just the statistics yeah
um how much
how much well so i mean the the way to
think about it is
just how successful is image compression
right and we we and there are and that's
been done with
older technologies but it can be done
with there are
several companies which are trying to
use
sort of these more advanced neural
network type techniques for compression
both for static images as well as for
for video
one of my former students has a company
which is trying to do
stuff like this and
i think i think that they are showing
quite
interesting results and i think that
that's all
the success of that's really about image
statistics and video statistics but
that's still not doing
compression of the kind when i see a
picture of a cat
all i have to say is it's a cat that's
another semantic kind of compression
yeah so this is this is at the lower
level right so we are we are we as i
said yeah
that's focusing on low level statistics
so to linger on that for a little bit
uh you mentioned how far can bottom-up
image segmentation go
and in general what you mentioned
that the central question for scene
understanding is the interplay of
bottom-up and top-down information maybe
this is a good time
to elaborate on that maybe define what
is
what is bottom-up what is top-down
in the context of computer vision
uh right that's uh
so today what we have are a are very
interesting systems because they work
completely bottom up
how are they what does bottom bottom-up
mean sorry so bottom-up means in this
case means a feed-forward net neural
network
so starting from the raw pixels yeah
they start from the raw pixels and they
they end up with some something like cat
or not a cat
right so our our systems are running
totally feed forward
they're trained in a very top-down way
so they're trained by saying okay this
is a cat there's a cat there's a dog
there's a zebra etc
and i'm not happy with either of these
choices fully
we have gone into uh because we have
completely separated these processes
right so there is a so i would like the
uh the process uh
so what do we know compared to biology
so in biology what we know is that the
processes
in at test time at run time
those processes are not purely feed
forward but they involve feedback
so and they involve much shallower
neural networks
so the kinds of neural networks we are
using in computer vision say a resnet 50
has 50 layers
well in in the brain in the visual
cortex
going from the retina to IT maybe we
have like seven
right so they're far shallower but we
have the possibility of feedback so
there are backward connections
and this might enable us to uh
to deal with the more ambiguous stimuli
for example
so the the biological solution seems to
involve feedback
the solution in in artificial
vision seems to be just feed forward but
with a much deeper network
and the two are functionally equivalent
because if you have a feedback network
which just has like three rounds of
feedback
you can just unroll it and make it three
times the depth
and create it in a totally feed forward
way
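
The unrolling argument can be checked on a toy example. This is a minimal sketch under assumed details: the tanh layer, the small weight matrix, and the names `feedback_net` and `feedforward_net` are all hypothetical choices for illustration. One layer applied three times through a feedback loop gives exactly the same output as three stacked feed-forward layers that reuse the same weights.

```python
import math

def layer(weights, bias, state):
    """One layer: new_state = tanh(weights @ state + bias)."""
    return [
        math.tanh(sum(w * s for w, s in zip(row, state)) + b)
        for row, b in zip(weights, bias)
    ]

def feedback_net(weights, bias, features, rounds):
    """A single layer whose output is fed back into itself `rounds` times."""
    state = features
    for _ in range(rounds):
        state = layer(weights, bias, state)
    return state

def feedforward_net(layers, features):
    """A purely feed-forward stack: each (weights, bias) pair is its own layer."""
    state = features
    for weights, bias in layers:
        state = layer(weights, bias, state)
    return state

# Hypothetical toy parameters and input features.
W = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.1]
features = [1.0, -1.0]

recurrent_out = feedback_net(W, b, features, rounds=3)
# Unroll: three rounds of feedback become three layers sharing the same weights.
unrolled_out = feedforward_net([(W, b)] * 3, features)
print(recurrent_out == unrolled_out)
```

The unrolled stack here is three times as deep but ties its weights across layers; a trained deep network such as a ResNet is free to use different weights in every layer.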
so this is something which i mean we
have written some papers on this
theme but i really feel that this should
this theme should be pursued further
have some kind of recurrence mechanism
yeah
okay the other uh so that so that's uh
so i
so i want to have a little bit more top
down in the
at test time okay then at training time
we make use of a lot of top-down
knowledge right now
so basically to learn to segment an
object we have to have all these
examples of this is the boundary of a
cat and this is the boundary of a chair
and this is the boundary of a horse and
so on and this is
too much top-down knowledge how do
humans do this we manage to we manage
with far less supervision
and we do it in a sort of bottom-up way
because for example
we're looking at a video stream and the
horse moves
and that enables me to say that all
these pixels are together
yeah so the gestalt psychologists used
to call this
the principle of common fate so there
was a bottom-up
process by which we were able to segment
out these objects
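
A toy version of grouping by common fate, offered as a sketch rather than an actual segmentation method: it assumes a per-pixel motion field between two frames is already available (the hypothetical `flow` dictionary below) and simply groups pixels that share the same displacement, with no object labels involved.

```python
from collections import defaultdict

def group_by_common_fate(flow):
    """Group pixel coordinates by their motion vector.

    `flow` maps (row, col) -> (dy, dx), the displacement of that pixel
    between two consecutive frames (assumed to be given).
    """
    groups = defaultdict(list)
    for pixel, motion in flow.items():
        groups[motion].append(pixel)
    return dict(groups)

# Toy 3x4 frame: a 2x2 patch moves one pixel to the right, the background is still.
flow = {(r, c): (0, 0) for r in range(3) for c in range(4)}
for pixel in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    flow[pixel] = (0, 1)

segments = group_by_common_fate(flow)
moving = sorted(segments[(0, 1)])
print(moving)  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

The moving patch comes out as one segment purely because its pixels move together, which is the bottom-up cue being described; real video would need estimated, noisy motion and a tolerance rather than exact equality.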
and we have totally focused on this
top-down training signal
so in my view we have currently solved
it
in machine vision this top-down
bottom-up interaction
but i don't find the solution fully
satisfactory
and i would rather have a bit of both in
at both stages
for all computer vision problems which
is not just segmentation
and and and and the question that you
can ask is
so for me i'm inspired a lot by human
vision and i care about that
you could be a just a hard-boiled
engineer not give a damn
so to you i would then argue that uh you
would need far less training data
if you could make my uh research agenda
you know fruitful okay so
maybe taking a step into uh segmentation
static scene understanding
what is the interaction between
segmentation and recognition
you mentioned the movement of objects
so for people who don't know computer
vision
segmentation is this weird activity that
we
that computer vision folks have all
agreed is very important
uh of drawing outlines around objects
versus a bounding box or
and then classifying that object
what's what's the value of segmentation
what is it
as a problem in computer vision how is
it fundamentally different from
detection recognition any other problems
yeah so i think
uh so so segmentation
enables us to say that
some set of pixels are an object without
necessarily even being able to name that
object or knowing properties of that
object
oh so you mean segmentation purely as
as as the act of separating an object
from its background a blob of uh
of that's united in some way from its
background yeah so entification if you
will
making an entity out of it
entification yeah beautifully
so so i think that we have that
capability
and that is that enables us
to uh as we are growing up to
acquire uh names of objects
with very little supervision so suppose
the child
let's posit that the child has this
ability to separate out
objects in the world then when the
there's a
the mother says pick up your bottle or
the cat's behaving funny today
[Laughter]
the word cat suggests some object and
then the child sort of does the mapping
right right the mother doesn't have to
teach
a specific object labels by pointing to
them
weak supervision works in the context
that you have
the ability to create objects so
i think that uh so to me that's that's a
very fundamental capability
uh there are applications where this is
very important uh
for example medical diagnosis so in
medical diagnosis
uh you have some uh brain scan i mean
some
this is some work that we did in my
group where you have ct scans of people
who have
had traumatic brain injury and what uh
what the radiologist needs to do is to
precisely delineate various
places where there might be bleeds for
example
and there's there are clear needs like
that
so they're certainly very practical
applications of computer vision where
segmentation is necessary
but philosophically segmentation
enables the task of recognition
to proceed with much weaker supervision
than we require today
and you think of segmentation as this
kind of task that takes on
a visual scene and breaks it apart
into into interesting entities yeah
that might be useful for whatever the
task is yeah
and and it is not semantics free so i
think i
i mean it it blends into it involves
perception and cognition it is not it is
not
i i think the mistake that we used to
make in the early days of computer
vision
was to treat it as a purely bottom-up
perceptual task it is not just that
because we do revise our notion of
segmentation with more experience right
because
for example there are objects which are
non-rigid like animals
or humans and uh i think
understanding that all the pixels of a
human are one entity is actually quite a
challenge
because the parts of the human they can
move independently
and the human wears clothes so they
might be differently colored
so it's all sort of a challenge you
mentioned the three r's of computer
vision
are recognition reconstruction
reorganization
can you describe these three r's sure
how they interact
yeah so uh so recognition is the easiest
one
because that's uh what i think
people generally think of as computer
vision
achieving these days which is uh labels
so is this a cat is this a dog is this a
chihuahua i mean you know it could be
very fine-grained like
you know specific breed of a dog or a
specific species of bird
or it could be very abstract like animal
but given a part of an image or a whole
image say
put a label on that yeah so that's
that's recognition
reconstruction is uh
essentially it you can think of it as
inverse
graphics i mean that's one way
to think about it so graphics is your
you have some internal computer
representation
and uh you have a computer
representation of some objects arranged
in a scene
and what you do is you produce a picture
you produce the pixels corresponding to
a rendering of that scene
so uh so let's
do the inverse of this we are given an
image and we try to
we we we say oh this image
arises from some objects in a scene
looked at with a camera from this
viewpoint and we might have more
information about the objects like their
shape maybe their textures maybe
you know color et cetera et cetera so
that's the reconstruction problem in a
way
that you are in your head creating a
model of the external world
okay reorganization is to do with
essentially finding these entities so
uh so it's uh organization or
the word organization implies structure
so uh that in in uh perception
in psychology we use the term perceptual
organization
that uh the the world is not just
an image is not just seen as is not
internally represented as just a
collection of pixels but we
make these entities we create these
entities
objects whatever you want to call them the
relationship between the entities as
well or is it purely about the entities
it could be about the relationships but
mainly we focus on the fact that there
are entities
sometimes i'm trying to pinpoint what
the organization means
so organization is that instead of like
a
uniform grid we have the structure of
objects
so segmentation is a small part of that
so segmentation gets us going towards
that
yeah and you kind of have this triangle
where they all interact together
yes so how do you see that interaction
in uh sort of uh
reorganization is yes defining the
entities in the world
the recognition is labeling those
entities
and then reconstruction is what filling
in the gaps
well to for example see
impute some 3d objects corresponding to
each of these
entities that would be part of adding
more information that's not
there in the raw data correct
i mean i started pushing this kind of a
view in the around 2010 or something
like that
because at that time in computer vision
the distinction that
people were were just
working on many different problems but
they treated each of them as a separate
isolated problem with each with its own
data set and then you try to solve that
and get good numbers on it
so i wasn't i didn't like that approach
because i wanted to see
the connection between these and
if people divided up vision into
into various modules the way they would
do it is as low level mid-level and
high-level vision
corresponding roughly to the
psychologist's notion of sensation
perception and cognition
and i didn't that didn't map to tasks
that people cared about
okay so therefore i tried to promote
this particular framework
as a way of considering the problems
that people in computer vision were
actually working on
and trying to be more explicit about the
fact that they actually
are connected to each other and i was at
that time
just doing this on the basis of
information flow
now it turns out in the last five years
or so
in the post the deep learning revolution
that this this architecture has turned
out to be
very conducive to that
because basically in these neural
networks we are trying to
build multiple representations
there can be multiple output heads
sharing common representations
so in a certain sense today given the
reality of what solutions people have to
these
i i i i do not need to preach this
anymore
it is it is just there it's part of the
solution space
so speaking of neural networks how much
of
this uh problem of computer vision
of reorganization recognition
can be um reconstruction
how much of it can be learned end to end
do you think
instead of uh set it and forget it just
plug and play
have a giant data set multiple perhaps
multi-modal
and then just learn the entirety of it
well so i i think that currently what
that end-to-end learning means nowadays
is end-to-end supervised learning
and and that i would argue is too narrow
a view of the problem
i would i like this child development
view
this lifelong learning view one where
there are certain capabilities that are
built up and then there are certain
capabilities which are built up
on top of that so uh
that's that's what i i believe in
so i think uh
end-to-end learning in the supervised
setting
for a very precise task to me is
a kind of is uh
it's sort of a limited view of the of
the learning process
got it so if we think about beyond
purely supervised look at back to
children you mentioned six lessons
that we can learn from children uh of
be multimodal be incremental be physical
explore be social use language can you
speak to these perhaps picking one
that you find most fundamental to our
time today
yeah so i mean i should say to give due
credit this is from a paper by
smith and gasser and it reflects
essentially i would say common wisdom
among
child development people it's just that
these are this is not common wisdom
among people
in computer vision and ai and machine
learning so
i view my role as uh trying to
bridge the worlds bridge the two worlds
so uh so let's take an example of a
multi-modal i like that
so multi-modal canonical example is uh
a child interacting with uh with an
object
so then the child so the child holds a
ball and plays with it
so at that point it's getting a touch
signal
so the touch signal is
is getting at the notion of 3d shape but
it is sparse
and then the child is also seeing a
visual signal right
and and these two so imagine these are
two in totally different spaces
right so one is the space of receptors
on the skin
of the fingers and the thumb and the
palm
right and then these map on to these
neuronal fibers are
getting activated somewhere right these
lead to some activation in somatosensory
cortex
i mean a similar thing will happen if we
have a robot
hand okay and then we have the pixels
corresponding to the
visual view but we know that they
correspond to the same
object right so that's
a very very strong cross calibration
signal
and it is self-supervisory which is
beautiful right
there's nobody assigning a label the
mother doesn't have to
come and assign a label the child
doesn't even have to
know that this object is called a ball
okay but the obj the child is learning
something about the three-dimensional
world
from this signal uh
i think tactile and visual there is some
work on
there is a lot of work currently on
audio and visual
okay an audio visual so there is some
event that happens in the world
and that event has a visual signature
and it has a
auditory signature so there is this
glass bowl on the table and it falls and
breaks and i hear the
smashing sound and i see the pieces of
glass
okay i've built that connection between
the two
right we have people uh i mean this has
become a hot topic in computer vision in
the last couple of years
there is there are problems like uh
separating out multiple speakers right
which was a classic problem in in
audition they call this the problem of
source separation or the
cocktail party effect and so on but just
try to do it visually
when you also have it becomes so much
easier and so much more useful
so the the multimodal i mean there's so
much more
signal with multimodal and you can use
that
for some kind of weak supervision as
well yes
because they are occurring at the same
time in time yeah so you have time
which links the two right so at a
certain moment t1
you've got a certain signal in the
auditory domain and a certain signal in
the visual domain
but they must be causally related yeah
it's an exciting area not well studied
yet
not yeah i mean we have a little bit of
work at this but uh but
but so much more needs to be done yeah
so so so
so this this is this is a good example
be physical
that's to do with uh like the one thing
we talked about
earlier that that there's a embodied
world
to mention language use language so
noam chomsky believes that language may be
at the core of cognition at the core of
everything in the human mind
what is the connection between language
and vision to you
like what's more fundamental are they
neighbors
is one the parent and the child the
chicken and the egg
oh it's very clear it is vision which is
the parent the fundament the
vision is the fundamental
ability okay well so
uh it comes before you think vision is
more fundamental than language
correct and and and it and yeah
you can think of it either in phylogeny
or in ontogeny
so phylogeny means if you look at
evolutionary time
right so you we have vision that
developed 500 million years ago
okay then something like when we get to
maybe like
five million years ago you have the
first bipedal primate so when we started
to
walk then the hands became free and so
then
manipulation the ability to manipulate
objects and build tools and
so on and so forth so you said 500 000
years ago no no sorry
the the first multicellular animals
which you can say
had some intelligence arose 500 million
years ago
okay and now let's fast forward to say
the last
seven million years which is the
development of the hominid line right
where from the other primates we have
the branch which leads on to modern
humans
now there are many of these hominids
but the the ones which
you know people talk about lucy because
that's like a skeleton from three
million years ago and we know that lucy
walked okay so at this stage you have
that the hand is free for manipulating
objects
and then the ability to manipulate
objects build
tools and the brain size
grew in this era so okay so now you have
manipulation
now we don't know exactly when language
arose
but after that but after that because no
apes have i mean so i mean chomsky is
correct in that that it is a uniquely
human capability
and we primates
other primates don't have that but so
it developed somewhere in this era
but it developed i would
i mean uh argue that it probably
developed after we had this stage of
uh uh humans or i mean the
human species already able to manipulate
and a hands-free much bigger brain size
and for that there's a lot of vision
has already had had to have developed
yeah so
the sensation and the perception may be
some of the cognition
yeah so we we so those
so so that so the world so there
so so these ancestors of us
you know three four million years ago
they had
uh they had spatial intelligence so they
knew that the world consists of objects
they knew that the objects were in
certain relationships to each other
they had observed causal
interactions among objects they could
move in space so they had space and time
and all
of that so language
builds on that substrate so language has
a lot of
i mean i mean the all human languages
have constructs which depend on
a notion of space and time where did
that notion of space and time come from
it had to come from perception and
action in the world we live in
yeah what you refer to as the spatial
intelligence yeah yeah
to linger a little bit we mentioned
turing and his uh mention of
we should learn from children
nevertheless language is
the fundamental piece of the test of
intelligence that turing proposed
what do you think is a good test of
intelligence are you
what would impress the heck out of you
is it fundamentally
natural language or is there something
in vision
i i think uh i i wouldn't i
i don't think we should have created a
single test of intelligence
so just like i don't believe in iq as a
single number
i think generally there can be many
capabilities
which are correlated perhaps
so i think that there will be
uh there will be accomplishments which
are visual accomplishments
accomplishments which are
uh accomplishments in manipulation or
robotics and then accomplishments in
language
i do believe that language will be the
hardest nut to crack
really yeah so what's what's harder to
pass
the spirit of the turing test or like
whatever formulation will make it
natural language convincingly in natural
language
like somebody you would want to have a
beer with hang out and have a chat with
or the general natural scene
understanding
you think language is the type i think
i'm not a fan of the
i think i think turing test that turing
as he proposed the test in 1950
was trying to solve a certain problem
yeah imitation
yeah and and i think it made a lot of
sense then
where we are today 70 years later
i think i think we
we should not worry about that i mean i
think the turing test is no
longer the right way to uh to
to channel research in in ai because
that it takes us down this path of this
chat bot which can fool us for five
minutes or whatever
okay i think i would rather have a list
of 10 different tasks i mean i think
there are tasks which there are tasks in the
manipulation domain tasks in navigation
tasks in visual scene understanding
tasks in reading a story and
answering questions based on that i mean
so my favorite
language understanding task would be
you know reading a novel and being able
to answer arbitrary questions from it
okay right i i think that to me
uh and this is not an exhaustive list by
any means
so i would uh i think that that's what
we
where we need to be going to and each of
these
on each of these axes there's a fair
amount of work to be done
so on the visual understanding side in
this intelligence olympics that we've
set up yeah what's a good
test for one of many
of visual scene understanding
uh do you think such benchmarks exist
sorry to interrupt no there
there aren't any i i think i think
essentially
to me a really uh good
aid to the blind so suppose there was a
blind person
and i needed to assist the blind person
so ultimately like we said vision that
aids in the action
in the survival in this world yeah
maybe in a simulated world
maybe easier to to measure performance
in a simulated world
what we are ultimately after is
performance in the real world
so david hilbert in 1900 proposed 23
open problems in mathematics some of
which are still unsolved
most important famous of which is
probably the riemann hypothesis
you've thought about and presented about
the hilbert problems of computer vision
so let me ask what to you today
i don't know when the last year you
presented that 2015 but versions of it
yeah you're kind of the the face and the
spokesperson for computer vision
yeah it's your job to just to state what
the problem
the open problems are for the field so
what today
are the hilbert problems of computer
vision do you think
let me pick pick one to which i regard
as
uh clearly clearly unsolved
which is what i would call long-form
video understanding
so so we have a video clip and we want
to
understand the behavior
in there in terms of
agents their goals
intentionality and uh
make predictions about what might happen
you know so so that that kind of
understanding which goes away from
atomic visual action so
so in the short range the question is
are you sitting are you standing are you
catching a ball
right that we can do now or we even if
we can't do it fully accurately
if we can do it at 50 percent maybe next
year we'll do it at 65 and so forth
but i think the long range video
understanding
i don't think we we we can do today well
today and that means so long and it
blends into cognition that's the reason
why it's challenging
and so you have to track you have to
understand the entities
you have to understand the entities you have
to track them
and you have to have some kind of model
of their behavior
correct and their and if their behavior
might be
these are these are agents so they are
not just like passive
objects but the agent so therefore we
they might they would exhibit goal
directed behavior
okay so this is this is one area then i
will talk about
say understanding the world in 3d now
this may seem
paradoxical because in a way we have
been able to do 3d understanding even
like
30 years ago right but i don't think we
currently have the richness of
3d understanding in our computer vision
system that we would like
because ah so let me elaborate on that a
bit
so currently we have two kinds of
techniques which are
not fully unified so there are the kinds
of techniques from
multi-view geometry that you have
multiple pictures of a scene and you do
a
reconstruction using stereoscopic vision
or structure from motion
but these techniques do not
they totally fail if you just have a
single view because they are relying
on this this multiple view geometry
okay then we have some techniques that
we have developed in the computer vision
community which try to
guess 3d from single views and these
techniques are based
on on supervised learning
and they are based on having at training
time
3d models of objects available
and this is completely unnatural
supervision
right that's not cad models are not
injected into your brain
okay so what would i like what i would
like would be a kind of
uh learning as you
move around the world uh notion of 3d
so so we we have our
succession of visual experiences
and from those we
so in as part of that i might see a
chair from different viewpoints
or a table from viewpoint different
viewpoints and so on
now as part that enables me to build
some internal representation and then
next time i just see
a single photograph and it may not even
be of that chair it's of some other
chair
and i have a guess of what its 3d shape
is like
so you're almost learning the cad model
kind of
yeah implicitly i mean implicitly i mean
the cad model need not be in the same
form as
used by computer graphics hidden in the
representation
it's hidden in the representation the
ability to predict new views
and what i would see if i
went to such and such position by the
way and
on a small tangent on that are you
uncomforta are you
okay or comfortable with
neural networks that do achieve visual
understanding that do for example
achieve this kind of 3d understanding
and you don't know how they you don't
know
the rep you're not able to introspect but
you're not able to
visualize or understand or interact with
the representation
so the fact that they're not or may not
be explainable
yeah i think that's fine i to me that is
uh
so so let me put some caveats on that
so it depends on the setting so first of
all i think
uh uh the
uh humans are not explainable
so yeah that's a really good point yeah
so we we
one human to another human is not fully
explainable
i think there are settings where
explainability matters
and these might these are these might be
for example questions on medical
diagnosis
so i'm in a setting where
maybe the doctor maybe a computer
program has made a certain diagnosis
and then depending on the diagnosis
perhaps i should have treatment a or
treatment b
right so now is the computer program's
diagnosis based on data
which was data collected of
for american males who are in their 30s
and 40s
and maybe not so relevant to me
maybe it is relevant you know et cetera
et cetera and we i mean in
medical diagnosis we have major issues
to do with the reference class
so we may have acquired statistics from
one group of people and applying it to
a different group of people who may not
share all the same characteristics
the data might have there might be error
bars in the prediction
so that prediction should really be
taken with
a huge grain of salt and but this has an
impact on what treatments
should be picked right so
so there are settings where i want to
know more than just
this is the answer but what i
acknowledge is that
so so so so i in that sense
explainability and interpretability may
matter
it's about giving error bounds and a
better sense of the quality of the
decision
where what i where i'm willing to
sacrifice interpretability is that
i believe that there can be systems
which can be highly performant but which
are internally
black boxes and and that seems to be
where it's headed some of the best performing
systems are essentially black boxes yeah
uh
fundamentally by their construction you
and i are
black boxes to each other yeah so the
nice thing about the black boxes we are
is so we ourselves are black boxes
but we're also those of us who are
charming
are able to convince others like explain
the black
what's going on inside the black box
with narratives with stories
so in some sense uh neural networks
don't have to actually
explain what's going on inside they just
have to come up with stories real or
fake
that convince you that they know what's
going on
and i'm sure we can do that we can
create those neural
those stories neural networks can create
those stories yeah
and the transformer will be involved do
you think we will ever
build a system of human level or
superhuman level intelligence
we've kind of defined what it takes to
try to approach that but do you think
we'll
do you think that's within our reach the
thing that we thought we could do
what turing thought actually we could
do by the year 2000
right what do you think we'll ever be
able to do so
i think there are two answers here one
question one answer is
in principle can we do this at some time
and my answer is yes the second
answer is a pragmatic one do you think
we will be able to do it in the next 20
years
or whatever and to that my answer is no
so and of course that's a wild guess i i
i i think that
you know donald rumsfeld is not a
favorite person of mine but
one of his lines is very good which is
about
known knowns known unknowns and unknown
unknowns
so in the business we are in
there are known unknowns and we have
unknown unknowns
so i think with respect to
a lot of what's the case in
vision and robotics i feel like
we have known unknowns so i have a sense
of where we need to go
and what the problems that need to be
solved are
i feel with respect to natural language
understanding and high level cognition
it's not just known unknowns but also
unknown unknowns
so it is very difficult to put any kind
of uh
time frame to that uh do you think some
of the
unknown unknowns might be positive in
that they'll surprise us and make the
job much easier
so fundamental breakthroughs i think
that is possible because certainly i
have
been very positively surprised by how
effective these deep learning systems
have been because
i certainly would not have believed that
in
2010 i think
what we knew from the mathematical
theory
was that convex optimization works when
there's a single global optima then
these gradient descent techniques would
work now these are
non-linear systems with non-convex
systems
huge number of variables so
over-parameterized
and the people who used to play with
them a lot
the ones who are totally immersed in the
lore and the
black magic they knew that they worked
uh well even though they were really
i thought like everybody no the claim
that
i hear from my friends like yann lecun
and so forth
now yeah that they feel that they were
comfortable with them
well he says but the community as a
whole
was certainly not and i think uh
we were to me that was the surprise that
they actually worked robustly
for a wide range of problems from a wide
range of initializations and so on
and uh so that was that that was
certainly
more rapid progress than uh we expected
but then there are certainly lots of
times in fact
most of the history of ai is when we
have made less
progress at a slower rate than we
expected
so uh we just keep going
i think uh what i regard as
uh really unwarranted are these
these fears of uh you know agi in 10
years and 20 years and
that kind of stuff because that's based
on completely unrealistic models of how
rapidly we will make progress in this
field so i agree with you but i've also
gotten a chance to interact with very
smart people who really worry about the
existential threats of ai
and i as an open-minded person and sort
of taking
and taking it in do you think
if ai systems in some way the unknown
unknowns
not super intelligent ai but in ways we
don't quite understand
uh the nature of superintelligence will
have a detrimental effect on society
do you think this is something we should
be worried about
or we need to first allow the unknown
unknowns to become
known unknowns i think we need to be
worried about ai today
i think that it is not just a worry we
need to have when we get to
agi i think that ai is being used in
many systems today
and there might be settings for example
when it causes
biases or decisions which could
be harmful i mean decisions which could
be unfair to some people
or it could be a self-driving car which
kills a pedestrian
so ai systems are being deployed today
right and they're being deployed in many
different settings maybe in medical
diagnosis maybe in a self-driving car
maybe
in selecting applicants for an interview
so
i would argue that when these systems
make mistakes there are consequences
and we are in a certain sense
responsible for those consequences
so i would argue that this is a
continuous effort
it is we and and this is something that
in a way is not so surprising it's about
all
engineering and scientific progress
which uh
with great power comes great responsibility
so as these systems are deployed we have
to worry about them and
it's a continuous problem i don't think
of it as something
which will suddenly happen on some day
in 2079
for which i need to design some clever
trick
i'm saying that these problems exist
today yeah
and we need to be continuously on the
lookout for
worrying about safety biases risks
right i mean the self-driving car kills
a pedestrian
and they have right i mean the this uber
incident in arizona yeah right it has
happened
right this is not about agi it in fact
it's about a very dumb intelligence
which is also killing people the worry
people have with agi
is the scale and i but i think you're
right is like the thing that worries me
about ai
today and it's happening at a huge
scale is
recommender systems recommendation
systems so if you look at
twitter or facebook or youtube they're
controlling the ideas that we have
access to
the news and so on and that's a
fundamentally machine learning algorithm
behind each of these recommendations
and they i mean my life would not be the
same without
these sources of information i'm a
totally new human being and
the ideas that i know are very much
because of the internet
because of the algorithm that
recommends those ideas and so
as they get smarter and smarter i mean
that is the agi
yeah is that's the the algorithm that's
recommending
the next youtube video you should watch
has control of millions or billions of
people
that that algorithm is already super
intelligent and
has complete control of the population
not a complete but
very strong control for now we can turn
off youtube we can just
go have a normal life outside of that
but the more and more that
gets into our life it's that algorithm
we start
depending on it in the different
companies that are working on the
algorithm so i think it's
you're right it's already it's already
there
and youtube in particular is using
computer vision
doing their hardest to try to understand
the content of videos so they could
be able to connect videos with the
people who would benefit from those
videos the most and so that development
could go in a bunch of different
directions some of which might be
harmful
so yeah you're right the the the threats
of ai are here already we should be
thinking about them
on a philosophical notion
if you could personal perhaps
if you could relive a moment in your
life outside of family
because it made you truly happy or was a
profound moment that impacted the
direction of your life
what would you go to
i don't think of single moments but i
look over the long haul
i feel that i've been very lucky because
i feel that i think that in
scientific research a lot of it is about
being at the right place at the right
time
and you can you can work on problems at
a time when
they're just too premature you know you
butt your head
against them and and nothing happens
because it's
the prerequisites for success are not
there and then there are times when you
are in a field which is all
pretty mature and you can only
solve curlicues upon curlicues i've
been lucky to have been in this field
which
for 34 years 35 well actually 34 years
as a professor at berkeley so
longer than that uh which
when i started in it was just
like some little crazy absolutely
useless field which couldn't really do
anything
to a time when it's really really
solving a lot of practical problems has
a lot
has offered a lot of tools for
scientific research
right because computer vision is
impactful for
images in biology or astronomy and and
so on and so forth
and we have so we have made great
scientific progress which has had
real practical impact in the world and i
feel lucky that
i i got in at a time when the field was
very young and at a time when it is
it's now mature but not fully mature
it's mature but not
done i mean it's really in still in a in
a productive phase yes
yeah yeah i think people 500 years from
now would laugh are you calling this
field mature
yeah that is very possible yeah so but
you're also
lest i forget to mention you've also
mentored
some of the biggest names of computer
vision computer science and ai
today uh there's so many questions i
could ask but really is
what what is it how did you do it what
does it take to be
a good mentor what does it take to be a
good guide
yeah i i think what i feel i've been
lucky to have
had very very smart and hardworking and
creative students i think
some part of the credit just belongs to
being at berkeley
i think those of us who are at top
universities
are blessed because we have
very very smart and capable students
coming on
knocking on our door so so i have to be
humble enough to acknowledge that
but what have i added i think i have
added something
what i have added is uh i think
what i've always tried to teach them is
a sense of picking the right problems
so i think that in science in the short
run
success is always based on technical
competence
your you know you're quick with math or
you are
whatever i mean there's certain
technical capabilities which make for
short-range
progress long-range progress is really
determined
by asking the right questions and
focusing on the right problems
and i feel that
what i've been able to bring to the
table in terms of
advising these students is
some sense of taste of what are good
problems
what are problems that are worth
attacking now as opposed to waiting
10 years what's a good problem if you
could summarize
if is that possible to even summarize
like what what's your sense of a good
problem
i i think uh i think uh i have a sense
of what is a good problem which is
uh there is a british scientist uh
in fact he won a nobel prize peter
medawar who has a
a book on on this and uh basically he
calls
it the research is the art of the
soluble
so we need to sort of find problems
which are
which are not yet solved but which are
approachable
and he sort of refers to this
sense that there is this problem which
isn't quite solved yet but it has a soft
underbelly
there is some place where you can you
know
spear the beast yes and having that
intuition that this problem is ripe is
is a good thing because otherwise you
can just beat your head and not make
progress
so i think that is that is important so
if
if i have that and if i can convey that
to students
it's not just that they do great
research while they're working with me
but that they continue to do great
research so in a sense i'm proud of my
students
and their achievements and their great
research even
20 years after they've ceased being my
student
so it's in part developing helping them
develop that sense that a problem
is not yet solved but it's solvable
correct
the other thing which i have which i i
think i bring to the table
uh is i is a certain
intellectual breadth i i've
spent a fair amount of time studying
psychology
neuroscience relevant areas of applied
math and so forth
so i can probably help them see some
connections
to disparate things which
they might not have otherwise so
so the smart students coming into
berkeley can be
very uh deep in the sense they can think
very deeply meaning very
hard down one particular path but
where i could help them is the the
shallow breadth
but uh whereas they would have the
the narrow depth and uh but
that's that's of some value well it was
beautifully refreshing just to hear you
naturally jump to psychology back to
computer science and this conversation
back and forth
i mean that that's uh that's actually a
rare quality and i think it's
certainly for students empowering to
think about problems in a new way
so for that and for many other reasons i
really enjoyed this conversation thank
you so much it was a huge honor thanks
for talking today
it's been my pleasure thanks for
listening to this conversation
with jitendra malik and thank you to our
sponsors
betterhelp and expressvpn
please consider supporting this podcast
by going to betterhelp.com/lex
and signing up at expressvpn.com/lexpod
click the links buy the stuff
it's how they know i sent you and it
really is the best way to support this
podcast
and the journey i'm on if you enjoy this
thing
subscribe on youtube review 5 stars on
apple podcast
support it on patreon or connect with me
on twitter
at lex fridman don't ask me how to
spell that i don't remember
myself and now let me leave you with
some words from prince myshkin
in the idiot by dostoyevsky beauty
will save the world thank you for
listening
and hope to see you next time