Jitendra Malik: Computer Vision | Lex Fridman Podcast #110
2020-07-21
Lex Fridman: The following is a conversation with Jitendra Malik, a professor at Berkeley and one of the seminal figures in the field of computer vision: the kind before the deep learning revolution, and the kind after. He has been cited over 180,000 times and has mentored many world-class researchers in computer science.

Quick summary of the ads: two sponsors, one new one, which is BetterHelp, and an old goodie, ExpressVPN. Please consider supporting this podcast by going to betterhelp.com/lex and signing up at expressvpn.com/lexpod. Click the links, buy the stuff. It really is the best way to support this podcast and the journey I'm on.

If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or connect with me on Twitter at @lexfridman, however the heck you spell that. As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation.
This show is sponsored by BetterHelp, spelled H-E-L-P, help. Check it out at betterhelp.com/lex. They figure out what you need and match you with a licensed professional therapist in under 48 hours. It's not a crisis line, it's not self-help, it's professional counseling done securely online. I'm a bit from the David Goggins line of creatures, as you may know, and so have some demons to contend with, usually on long runs or all-nighters working, possibly full of self-doubt. It may be because I'm Russian, but I think suffering is essential for creation. But I also think you can suffer beautifully, in a way that doesn't destroy you. For most people, I think a good therapist can help with this, so it's at least worth a try. Check out their reviews, they're good. It's easy, private, affordable, and available worldwide. You can communicate by text anytime and schedule weekly audio and video sessions. I highly recommend that you check them out at betterhelp.com/lex.
This show is also sponsored by ExpressVPN. Get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package. I've been using ExpressVPN for many years; I love it. I think ExpressVPN is the best VPN out there. They told me to say it, but it happens to be true: it doesn't log your data, it's crazy fast, and it's easy to use, literally just one big sexy power-on button. Again, for obvious reasons, it's really important that they don't log your data. It works on Linux and everywhere else too. But really, why use anything else? Shout-out to my favorite flavor of Linux, Ubuntu MATE 20.04. Once again, get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package.

And now, here's my conversation with Jitendra Malik.
Lex Fridman: In 1966, Seymour Papert at MIT wrote up a proposal called the Summer Vision Project, to be given, as far as we know, to 10 students to work on and solve that summer. That proposal outlined many of the computer vision tasks we still work on today. Why do you think we underestimated, and perhaps still underestimate, how hard computer vision is?

Jitendra Malik: Because most of what we do in vision, we do unconsciously or subconsciously.

Lex Fridman: In human vision?

Jitendra Malik: In human vision. That effortlessness gives us the sense that, oh, this must be very easy to implement on a computer. This is why the early researchers in AI got it so wrong. However, if you go into neuroscience or the psychology of human vision, the complexity becomes very clear. The fact is that a very large part of the cerebral cortex is devoted to visual processing, and this is true in other primates as well. Once we looked at it from a neuroscience or psychology perspective, it became quite clear that the problem is very challenging and will take some time.
Lex Fridman: You said the higher-level parts are the harder parts?

Jitendra Malik: I think vision appears to be easy because most of visual processing is subconscious or unconscious, so we underestimate the difficulty. Whereas when you are proving a mathematical theorem or playing chess, the difficulty is much more evident, because it is your conscious brain that is processing the various aspects of the problem-solving behavior. In vision, all of this is happening, but it's not in your awareness; it's operating below that.
Lex Fridman: But it still seems strange.

Jitendra Malik: Yes, that's true.

Lex Fridman: It seems strange that the computer vision research community, broadly, time and time again makes the mistake of thinking the problem is easier than it is. Or maybe it's not a mistake; we'll talk a little bit about autonomous driving, for example, and how hard a vision task that is. Is it just human nature, or is there something fundamental to the vision problem that we underestimate? We're still not able to be cognizant of how hard the problem is.

Jitendra Malik: Yeah, I think in the early days it could have been excused, because in the early days all aspects of AI were regarded as too easy. But today it is much less excusable. I think people fall for this because of what I call the fallacy of the successful first step. There are many problems in vision where you can get 50% of the solution in one minute, getting to 90% can take you a day, getting to 99% may take you five years, and 99.99% may be not in your lifetime.

Lex Fridman: I wonder if that's unique to vision. It seems that language people are not so confident; natural language processing people are a little bit more cautious about our ability to solve that problem.
I think for language, people intuit that we have to be able to do natural language understanding. For vision, it seems that we're not cognizant of, or we don't think about, how much understanding is required. It's probably still an open problem, but in your sense, how much understanding is required to solve vision? Put another way: how much of something called common-sense reasoning is required to really be able to interpret even static scenes?

Jitendra Malik: Vision operates at all levels, and there are parts which can be solved with what we could call maybe peripheral processing. In the human vision literature there used to be these terms: sensation, perception, and cognition, which roughly speaking referred to the front end of processing, the middle stages of processing, and the higher levels of processing. They made a big deal out of this; they wanted to study only perception, and then dismiss certain problems as being, quote, cognitive. But really, I think these are artificial divides. The problem is continuous at all levels, and there are challenges at all levels. The techniques that we have today work better at the lower and mid levels of the problem. The higher levels of the problem, quote, the cognitive levels, are there, and in many real applications we have to confront them. Now, how much of that is necessary will depend on the application. For some problems it doesn't matter; for some problems it matters a lot. So I am, for example, a pessimist on fully autonomous driving in the near future, and the reason is that I think there will be that 0.01% of cases where quite sophisticated cognitive reasoning is called for. However, there are tasks which are much more robust, in the sense that errors are not so much of a problem.
For example, let's say you're doing image search, trying to get images based on some visual description. We are very tolerant of errors there. When Google image search gives you some images back and a few of them are wrong, it's okay; it doesn't hurt anybody, it's not a matter of life and death. But making mistakes when you're driving at 60 miles per hour and could potentially kill somebody is much more important.

Lex Fridman: Just for the fun of it, since you mentioned it, let's go there briefly: autonomous vehicles. One of the companies in the space, Tesla, with Andrej Karpathy and Elon Musk, is working on a system called Autopilot, which is primarily a vision-based system with eight cameras, and basically a single neural network, a multi-task neural network. They call it HydraNet: multiple heads, so it does multiple tasks while forming the same representation at the core. Do you think driving can be converted in this way to purely a vision problem and then solved with learning? Or, even more specifically for the current approach, what do you think about what the Tesla Autopilot team is doing?
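As a rough illustration of the shared-backbone, multiple-heads idea described above (a generic sketch for readers, not Tesla's actual HydraNet; the layer sizes and task heads are invented):

```python
# Generic multi-task network: one shared backbone, several task heads.
# Illustrative only; not Tesla's actual HydraNet architecture.
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared representation computed once per image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Each head reads the same features but solves its own task.
        self.lane_head = nn.Linear(64, 2)     # hypothetical lane parameters
        self.object_head = nn.Linear(64, 10)  # hypothetical object classes
        self.light_head = nn.Linear(64, 3)    # hypothetical light states

    def forward(self, image):
        features = self.backbone(image)       # the common representation
        return {
            "lanes": self.lane_head(features),
            "objects": self.object_head(features),
            "lights": self.light_head(features),
        }

net = MultiHeadNet()
outputs = net(torch.randn(1, 3, 224, 224))  # one pass, all tasks at once
```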
Jitendra Malik: The way I think about it is that there are certainly subsets of the vision-based driving problem which are quite solvable. For example, driving in freeway conditions is quite a solvable problem. There were demonstrations of that going back to the 1980s by Ernst Dickmanns in Munich; in the '90s there were approaches from Carnegie Mellon and from our team at Berkeley; in the 2000s there were approaches from Stanford, and so on. So autonomous driving in certain settings is very doable. The challenge is to have an autopilot work under all kinds of driving conditions. At that point, it's not just a question of vision or perception, but really also of control and dealing with all the edge cases.
Lex Fridman: Where do you think are most of the difficult cases? To me, even highway driving is an open problem, because it obeys the same 50-90-95-99 rule, the fallacy of the successful first step, that we fall victim to. I think even highway driving has a lot of those elements, because to solve autonomous driving you have to completely relinquish the help of a human being who is always in control, so you're really going to feel the edge cases. So I think even highway driving is really difficult. But in terms of the general driving task, do you think vision is the fundamental problem? Or is it also your action, the interaction with the environment, and then the middle ground, and I don't know if you'd put that under vision, of trying to predict the behavior of others, which is a little bit in the world of understanding the scene, but it's also trying to form a model of the actors in the scene
and predict their behavior?

Jitendra Malik: Yeah, I include that in vision, because to me perception blends into cognition, and building predictive models of other agents in the world, which could be people or other cars, is part of the task of perception. Perception always has to tell us not just what is now, but what will happen, because what's now is boring: it's done, it's over with. We care about the future because we act in the future, and we care about the past inasmuch as it informs what's going to happen in the future. So I think we have to build predictive models of the behaviors of people, and those can get quite complicated.
I've seen examples of this. I own a Tesla, and it has various safety features built in. What I see are examples where, let's say, there is some skateboarder. I don't want to be too critical, because obviously these systems are always being improved, and for any specific criticism I have, maybe the system six months from now will not have that particular failure mode. But it had the wrong response, and it's because it couldn't predict what this skateboarder was going to do, because that really required the higher-level cognitive understanding of what skateboarders typically do, as opposed to a normal pedestrian. The typical behavior for a pedestrian was not the typical behavior for a skateboarder. So to do a good job there, you need enough data where you have pedestrians and you also have skateboarders; you've seen enough skateboarders to see what kinds of patterns of behavior they have. So in principle, with enough data, that problem could be solved. But I think our current computer vision systems need far, far more data than humans do for learning those same capabilities.

Lex Fridman: So say there is going to be a system that solves autonomous driving.
Do you think it will look similar to what we have today, but with a lot more data and perhaps more compute, with the fundamental architectures involved, like, in the case of Tesla Autopilot, neural networks, staying the same? Do you think it will look similar in that regard and just have more data? That's a scientific hypothesis: which way is it going to go?

Jitendra Malik: I will tell you what I would bet on. This is my general philosophical position on how these learning systems have developed. What we have found currently very effective in computer vision, in the deep learning paradigm, is essentially tabula rasa learning, tabula rasa learning in a supervised way, with lots and lots of data. Blank slate: we just have a system which is given a series of experiences in this setting, and then it learns.
Now, let's think about human driving. It is not tabula rasa learning. At the age of 16, in high school, a teenager goes into driver ed class, and at that point they learn. But at the age of 16 they are already visual geniuses, because from age zero to 16 they have built up a certain repertoire of vision. In fact, most of it has probably been achieved by age two. In this period up to age two, they learn that the world is three-dimensional, they learn how objects look from different perspectives, they learn about occlusion, they learn about the common dynamics of humans and other bodies, and they have some notion of intuitive physics. They build that up from their observations and interactions in early childhood, and of course it's reinforced as they grow up to age 16. So at age 16, when they go into driver ed, what are they learning? They're not learning afresh the visual world; they have a mastery of the visual world. What they are learning is control. They are learning how to be smooth about control, about steering and brakes and so forth, and they're learning a sense of typical traffic situations. That education process can be quite short, because they are coming in as visual geniuses. Of course, in their future they're going to encounter situations which are very novel. During my driver ed class, I may not have had to deal with a skateboarder; I may not have had to deal with a truck driving in front of me whose back opens up and some junk gets dropped from it. But I can deal with these as a driver, even though I did not encounter them in my driver ed class, and the reason I can deal with them is that I have all this general visual knowledge and expertise.
Lex Fridman: Do you think the learning mechanisms we have today can do that kind of long-term accumulation of knowledge? Or do we have to do something else? The work that led up to expert systems, with knowledge representation, and the broader field of artificial intelligence worked on this kind of accumulation of knowledge. Do you think neural networks can do the same?

Jitendra Malik: I don't see any in-principle problem with neural networks doing it, but I think the learning techniques would need to evolve significantly. The current learning technique that we have is supervised learning: you're given lots of examples, (x, y) pairs, and you learn the functional mapping between them.
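A minimal illustration of the supervised, (x, y)-pair setup he describes; the data, model size, and task are invented for the sketch:

```python
# Supervised learning in miniature: fit a mapping f(x) -> y from labeled pairs.
# The data and labels are synthetic; only the training-loop pattern matters.
import torch
import torch.nn as nn

x = torch.randn(1000, 8)                       # 1000 examples, 8 features
y = (x.sum(dim=1, keepdim=True) > 0).float()   # made-up binary labels

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # compare predictions against labels
    loss.backward()               # gradients of the loss w.r.t. the weights
    optimizer.step()              # nudge the weights toward the mapping
```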
I think human learning is far richer than that. It includes many different components. A child explores the world. For example, a child takes an object and manipulates it in his or her hand, and therefore gets to see the object from different points of view, and the child has commanded the movement. So that's a kind of learning data, but the learning data has been arranged by the child, and this is a very rich kind of data. The child can do various experiments with the world. So there are many aspects of human learning, and these have been studied in child development by psychologists. What they tell us is that supervised learning is a very small part of it. There are many different aspects of learning, and what we would need to do is develop models of all of these, and then train our systems with that kind of protocol.

Lex Fridman: So, new methods of learning, some of which might imitate the human brain. But you have also mentioned in your talks the compute side of things, the difference from the human brain, referencing Hans Moravec. Do you think there's something interesting and valuable to consider about the difference in computational power between the human brain and the computers of today, in terms of instructions per second?

Jitendra Malik: Yes.
This is a point I've been making for 20 years now. Once upon a time, the way I used to argue this was that we just didn't have the computing power of the human brain; our computers were not quite there. There is a well-known trade-off: neurons are slow compared to transistors, but we have a lot of them and they have very high connectivity, whereas in silicon you have much faster devices, transistors switch on the order of nanoseconds, but the connectivity is usually smaller. At this point in time, in 2020, if you consider the latest GPUs and so on, we do have amazing computing power, and if we look back at the Hans Moravec-type calculations, which he did in the 1990s, we may be there today in terms of computing power comparable to the brain. But it's not of the same style; it's of a very different style. For example, the style of computing that we have in our GPUs is far, far more power-hungry than the style of computing in the human brain or other biological entities.
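A back-of-envelope version of this Moravec-style comparison; every number below is a rough, commonly cited order-of-magnitude assumption, not a figure from the conversation:

```python
# Moravec-style back-of-envelope; all numbers are order-of-magnitude guesses.
neurons = 1e11             # ~10^11 neurons in a human brain (assumption)
synapses_per_neuron = 1e3  # ~10^3-10^4 synapses per neuron (assumption)
firing_rate_hz = 1e2       # ~100 Hz as a generous firing rate (assumption)
brain_ops = neurons * synapses_per_neuron * firing_rate_hz   # ~1e16 "ops"/s

gpu_flops = 1e14           # ~100 TFLOP/s, a 2020-era high-end GPU (assumption)
brain_watts, gpu_watts = 20, 300  # rough power budgets (assumptions)

print(f"brain: ~{brain_ops:.0e} ops/s, ~{brain_ops / brain_watts:.0e} ops/J")
print(f"gpu:   ~{gpu_flops:.0e} flop/s, ~{gpu_flops / gpu_watts:.0e} flop/J")
# Raw throughput lands within a couple of orders of magnitude, but per joule
# the brain comes out far ahead, which is the "different style" point.
```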
Lex Fridman: Yeah, and we're going to have to solve that efficiency part in order to build actual real-world systems at large scale. Let me ask a high-level question, taking a step back: how would you articulate the general problem of computer vision? Does such a thing exist? If you look at the computer vision conferences and the work that's been going on, it's often separated into different little segments, breaking the problem of vision apart into segmentation, 3D reconstruction, object detection, image captioning, whatever, and there are benchmarks for each. But if you were to philosophically say what the big problem of computer vision is, does such a thing exist?
Jitendra Malik: Yes, but it's not in isolation. For all intelligence tasks I always go back to biology, or humans, and if we think about vision or perception in that setting, we realize that perception is always there to guide action. Perception, for a biological system, does not give any benefit unless it is coupled with action. We can go back and think about the first multicellular animals, which arose in the Cambrian era, 500 million years ago. These animals could move, and they could see in some way, and the two activities helped each other. How does movement help? Movement helps because you can get food in different places. But you need to know where to go, and that's really about perception, or seeing. Vision is perhaps the most important perceptual sense, but the others are also important. So perception and action go together. Earlier, this was in very simple feedback loops, which were about finding food or avoiding becoming food, if there's a predator trying to eat you up, and so forth. So at the fundamental level, we must connect perception to action.

Then, as we evolved, perception became more and more sophisticated, because it served many more purposes, and today we have what seems like a fairly general-purpose capability which can look at the external world and build a model of the external world inside the head. We do have that capability. That model is not perfect, and psychologists have great fun pointing out the ways in which the model in your head is not a perfect model of the external world; they have created various illusions to show the ways in which it is imperfect. But it's amazing how far it has come from the very simple perception-action loop that existed in an animal 500 million years ago.

Once we have these very sophisticated visual systems, we can then impose a structure on them. It is we as scientists who are imposing that structure, where we have chosen to characterize this part of the system as the, quote, module of object detection, or, quote, the module of 3D reconstruction. What's really going on is that all of these processes are running simultaneously, and they are running simultaneously because originally their purpose was in fact to help guide action.
Lex Fridman: As a guiding general statement of the problem: you said that in humans vision was tied to action. Do you think we should also say that ultimately the goal, the problem of computer vision, is to sense the world in a way that helps you act in it?

Jitendra Malik: Yes, I think that's the most fundamental purpose. We have by now hyper-evolved, so we have this visual system which can be used for other things, for example judging the aesthetic value of a painting, and this is not guiding action. Maybe it's guiding action in terms of how much money you will put in your auction bid, but that's a bit of a stretch. The basics are in fact in terms of action, but we have hyper-evolved our visual system.
Lex Fridman: Sorry to interrupt, but perhaps it is fundamentally about action. You kind of jokingly said it's about spending, but perhaps the capitalistic drive that drives a lot of the development in this world is about the exchange of money, and the fundamental action is money. If you watch Netflix, if you enjoy watching movies, you're using your perception system to interpret the movie, and ultimately your enjoyment of that movie means you'll subscribe to Netflix. So the action is this extra layer that we've developed in modern society; perhaps it is fundamentally tied to the action of spending money.

Jitendra Malik: Well, certainly with respect to interactions with firms. In this homo economicus role, when you're interacting with firms, it does become that. What else is there? That was a rhetorical question.

Lex Fridman: Okay.
So to linger on the division between the static and the dynamic: so much of the work in computer vision, so many of the breakthroughs that you've been a part of, have been in the static world, looking at static images. You've also worked, though to a much smaller degree, and the community is looking at this too, on dynamic scenes and video. And then there is robotic vision, which is dynamic, but where you also actually have a robot in the physical world interacting based on that vision. Which problem is harder? The trivial first answer is, well, of course one image is harder. But if you look at a deeper question there: are we, what's the term, cutting ourselves off at the knees, making the problem harder by focusing on static images?

Jitendra Malik: That's a fair question.
think
sometimes we we can simplify our problem
so much
that we essentially lose
part of the juice that could enable us
to solve the problem
and one could reasonably argue that to
some extent this happens when we go from
video to single images
now historically uh you have to consider
the limits of
imposed by the competition capabilities
we had
so if we many of the choices made in the
computer vision community
uh through the 70s 80s 90s
can be understood as
choices which were forced upon us by
the fact that we just didn't have access
to compute
enough compute not enough memory none of
hard drives not
exactly not enough not enough compute
not enough storage
so so think of these choices so one of
the choices is
focusing on single images rather than
video okay
clear questions storage and compute
we had to focus on we did we
used to detect edges and throw away the
image right so you have an image
which i say 256 by 256 pixels and
instead of keeping around the grayscale
value what we did was we detected edges
find the places where the brightness
changes a lot
so now that and now and then throw away
the rest
so this was a major compression device
and the hope was that this makes it
that you can still work with it and the
logic was humans can interpret a line
drawing
and uh and yes and this will save us a
competition so many of the choices were
dictated by that
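As a toy version of that edge-then-discard strategy (simple finite differences stand in for the classic edge operators of the era; the image and threshold are invented):

```python
# Toy edge detection: keep only where brightness changes a lot, drop the rest.
# Finite differences stand in for classic edge operators; purely illustrative.
import numpy as np

img = np.random.rand(256, 256)     # stand-in 256x256 grayscale image
gy, gx = np.gradient(img)          # brightness change along y and x
edges = np.hypot(gx, gy) > 0.5     # threshold the gradient magnitude

kept = edges.sum()
print(f"kept {kept} edge pixels of {img.size}: "
      f"{100 * kept / img.size:.1f}% of the original data")
```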
Today we are no longer detecting edges. We process images with convnets, because we don't have those compute restrictions anymore. Now, video is still understudied, because video compute is still quite challenging if you are a university researcher. Video computing is not so challenging if you are at Google or Facebook or Amazon.

Lex Fridman: It's still super challenging. I just spoke with a VP of engineering at Google, the head of YouTube search and discovery, and they still struggle doing stuff on video. It's very difficult, except using techniques that are essentially the techniques of the '90s, some very basic computer vision techniques.

Jitendra Malik: That's when you want to do things at scale. If you want to operate at the scale of all the content of YouTube, it's very challenging, and there are similar issues at Facebook. But as a researcher you have more opportunities; you can train large networks with relatively large video data sets. So I think this is part of the reason why we have emphasized static images so much. I think this is changing, and over the next few years I see a lot more progress happening in video. I have this generic statement that, to me, video recognition feels like it's 10 years behind object recognition. You can quantify that, because you can take some of the challenging video data sets, where performance on action classification is, say, 30%, which is kind of what we used to have around 2009 in object detection. So it's about 10 years behind. Whether it'll take 10 years to catch up is a different question; hopefully it will take less than that.
Lex Fridman: Let me ask a similar question I've already asked, but once again for dynamic scenes: do you think some kind of injection of knowledge bases and reasoning is required to help improve action recognition? If we solve the general action recognition problem, what do you think the solution would look like? That's another way to put it.

Jitendra Malik: I completely agree that knowledge is called for, and that knowledge can be quite sophisticated. The way I would say it is that perception blends into cognition, and cognition brings in issues of memory and this notion of a schema from psychology. Let me use the classic example: you go to a restaurant. Things happen in a certain order: you walk in, somebody takes you to a table, a waiter comes, gives you a menu, takes the order, the food arrives, eventually a bill arrives, et cetera. This is a classic example from the AI of the 1970s. There were the terms frames and scripts and schemas, which are all quite similar ideas. In the '70s, the way the AI of the time dealt with this was by hand-coding it: they hand-coded in this notion of a script, with the various stages and the actors and so on and so forth, and used that to interpret, for example, language. If there's a description of a story involving some people eating at a restaurant, there are all these inferences you can make, because you know what typically happens at a restaurant.
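A miniature, hand-coded version of such a script, in the spirit of the 1970s representations he mentions (the structure and helper below are invented for illustration; real systems of that era, such as Schank-style scripts, were far richer):

```python
# A hand-coded "restaurant script" in miniature; representation invented here.
RESTAURANT_SCRIPT = {
    "roles": ["customer", "host", "waiter"],
    "scenes": [
        ("enter", "host seats the customer"),
        ("order", "waiter brings a menu and takes the order"),
        ("eat",   "food arrives, customer eats"),
        ("pay",   "bill arrives, customer pays and leaves"),
    ],
}

def infer_missing_steps(observed):
    """If a later scene was observed, infer the earlier unobserved scenes."""
    names = [name for name, _ in RESTAURANT_SCRIPT["scenes"]]
    last = max(names.index(step) for step in observed)
    return [step for step in names[:last] if step not in observed]

# A story only mentions the eating; the script licenses the inferences.
print(infer_missing_steps({"eat"}))  # -> ['enter', 'order']
```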
So I think this kind of knowledge is absolutely essential, and when we are going to do long-form video understanding, we are going to need it. The kinds of technology that we have right now, with 3D convolutions over a couple of seconds of a video clip, are very much tailored towards short-term video understanding, not long-term understanding.
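For readers unfamiliar with the 3D convolutions he mentions, here is a minimal sketch; the shapes and layer sizes are invented and do not correspond to any specific published model:

```python
# A minimal 3D-convolution block of the kind used on short video clips.
# Shapes and sizes are illustrative, not any particular published model.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, RGB, 16 frames, H, W)

model = nn.Sequential(
    # The (3, 3, 3) kernel convolves over time as well as space, so the
    # features capture short-range motion, not minutes-long structure.
    nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(2),
    nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(128, 400),   # scores over 400 hypothetical action classes
)

logits = model(clip)       # one prediction for the whole few-second clip
```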
Long-term understanding requires this notion of schemas that I talked about, perhaps some notions of goals, intentionality, functionality, and so on and so forth. Now, how will we bring that in? We could either revert back to the '70s and say, okay, I'm going to hand-code in a script, or we might try to learn it. I tend to believe that we have to find learning ways of doing this, because learning ways end up being more robust. And there must be a learning version of the story, because children acquire a lot of this knowledge just by observation. At no moment in a child's life, well, it's possible but it's not typical, does a mother coach a child through all the stages of what happens in a restaurant. They just go as a family, they go to the restaurant, they eat, they come back, and the child goes through ten such experiences, and the child has got a schema of what happens when you go to a restaurant. So we somehow need to provide that capability to our systems.
Lex Fridman: You mentioned the following line from the end of the Alan Turing paper "Computing Machinery and Intelligence", where he proposes the Turing test. As you said, many people know the paper and very few have read it, and this is how you know, because it's towards the end: "Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child's?" That's a really interesting point. If I think about the benchmarks we have before us, the tests of our computer vision systems, they're often kind of trying to get to the adult. What kind of benchmarks, what kind of tests for computer vision should we have, that mimic the child's?

Jitendra Malik: Yeah, I think we should have those, and we don't have them today. Part of the challenge is that we should really be collecting data of the type that a child experiences. That gets into issues of privacy and so on and so forth, but there are attempts in this direction, trying to collect the kind of data that a child encounters growing up: what's the child's linguistic environment, what's the child's visual environment. If we could collect that kind of data and then develop learning schemes based on it, that would be one way to do it. I think that's a very promising direction myself. There might be people who would argue that we could just short-circuit this in some way, and we have had success by not imitating nature in detail. The usual example is airplanes: we don't build flapping wings. So yes, that's one of the points of debate, but in my mind, I would bet on this learning-like-a-child approach.
Lex Fridman: One of the fundamental aspects of learning like a child is interactivity: the child gets to play with the data set it's learning from, gets to select it. You can call that active learning; in the machine learning world you can call it a lot of terms. What are your thoughts about this whole space of being able to play with the data set, or select what you're learning?

Jitendra Malik: I believe in that, and I think we could achieve it in two ways, and we should use both. One is actual real robotics: physical embodiments of agents interacting with the world. They have a physical body, with dynamics and mass and moment of inertia and friction and all the rest, and the robot learns its body by performing a series of actions. The second is simulation environments.
I think simulation environments are getting much, much better. At Facebook AI Research, our group has worked on something called Habitat, which is a simulation environment that is visually photorealistic, covering places like houses or the interiors of various urban spaces, and as you move, you get a picture which is a pretty accurate picture. You can imagine that subsequent generations of these simulators will be accurate not just visually, but with respect to forces and masses and haptic interactions and so on, and then we have that environment to play with.

Let me state one reason why I think this ability to act in the world is important: I think it is one way to break the correlation-versus-causation barrier. This is something of a great deal of interest these days. People like Judea Pearl have talked a lot about how we are neglecting causality, and he describes the entire set of successes of deep learning as just curve fitting. I don't quite agree; he's being a bit of a troublemaker. But causality is important. Causality is not a single silver bullet; it's not one single principle. There are many different aspects here.
One of our most reliable ways of establishing causal links, and this is the way, for example, the medical community does it, is randomized controlled trials: you pick some situations, and in some of them you perform an action and in certain others you don't, so you have a controlled experiment. Well, the child is in fact performing controlled experiments all the time, at a small scale, and that is a way the child gets to build and refine its causal models of the world.
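A toy version of the randomized controlled trial logic he describes; all numbers are invented, and the point is only why randomizing the action licenses a causal conclusion:

```python
# Toy randomized controlled trial: randomize an action, compare outcomes.
# Data and effect size are invented; only the RCT logic is illustrated.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
treated = rng.random(n) < 0.5                  # coin flip assigns the action
outcome = 2.0 * treated + rng.normal(0, 1, n)  # true causal effect: +2.0

effect = outcome[treated].mean() - outcome[~treated].mean()
print(f"estimated causal effect: {effect:.2f}")  # close to 2.0
# Randomization makes treatment independent of everything else, so the
# simple difference in means recovers the causal effect, not a correlation.
```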
My colleague Alison Gopnik, together with a couple of co-authors, has a book called "The Scientist in the Crib", referring to children. The part I like about that is: the scientist wants to build causal models, and the scientist does controlled experiments, and I think the child is doing that too. So to enable that, we will need to have these active experiments, and I think this could be done partly in the real world and partly in simulation.

Lex Fridman: So you have hope for simulation?

Jitendra Malik: I have hope for simulation.

Lex Fridman: That's an exciting possibility, if we can get to not just photorealistic but, what's that called, life-realistic simulation.
So you don't see any fundamental blocks to why we can't eventually simulate the principles of what it means to exist in the world as a physical being?

Jitendra Malik: I don't see any fundamental problems there. Look, the computer graphics community has come a long way. In the early days, back in the '80s and '90s, they were focusing on visual realism, and they could do the easy stuff but couldn't do things like hair or fur. Well, they managed to do that. Then they couldn't do physical actions, like a glass bowl falling down and shattering, but then they could start to do pretty realistic models of that, and so on and so forth. So the graphics people have shown that they can do this forward direction, not just for optical interactions but also for physical interactions. Of course, some of that is very compute-intensive, but I think by and by we will find ways of making our models ever more realistic.
Lex Fridman: In one of your presentations, you break vision apart into early vision, static scene understanding, and dynamic scene understanding, and raise a few interesting questions. I thought I could throw some at you, to see if you want to talk about them. So, early vision, what is it? You said sensation, perception, and cognition; is this sensation?

Jitendra Malik: Yes.

Lex Fridman: What can we learn from image statistics that we don't already know? At the lowest level, what can we make of just the statistics, the basics, the variations in the raw pixels, the textures and so on?
Jitendra Malik: What we seem to have learned is that there's a lot of redundancy in these images, and as a result we are able to do a lot of compression. This compression is very important in biological settings: you might have 10^8 photoreceptors and only 10^6 fibers in the optic nerve, so you have to compress by a factor of a hundred to one. And there are analogs of that happening in our artificial neural networks; that's the early layers.

Lex Fridman: So you think there's a lot of compression that can be done in the beginning, just from the statistics?

Jitendra Malik: Yeah.

Lex Fridman: How much?

Jitendra Malik: Well, the way to think about it is just how successful image compression is. That's been done with older technologies, but there are several companies trying to use these more advanced neural-network-type techniques for compression, both for static images and for video. One of my former students has a company which is trying to do stuff like this, and I think they are showing quite interesting results. The success of that is really about image statistics and video statistics.

Lex Fridman: But that's still not doing compression of the kind where, when I see a picture of a cat, all I have to say is "it's a cat".

Jitendra Malik: That's another, semantic, kind of compression. This is at the lower level; as I said, this is focusing on low-level statistics.
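As a toy illustration of learned, statistics-driven compression: a small autoencoder that squeezes an image through a low-dimensional bottleneck. This is a generic sketch, not any particular company's codec; real neural codecs also add quantization and entropy coding:

```python
# Toy learned compression: an autoencoder with a narrow bottleneck.
# Generic sketch only; real neural codecs also quantize and entropy-code.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, bottleneck=64):
        super().__init__()
        # Redundancy makes this possible: 32*32*3 = 3072 values -> 64 numbers.
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 32 * 3, 512), nn.ReLU(),
            nn.Linear(512, bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 512), nn.ReLU(),
            nn.Linear(512, 32 * 32 * 3),
        )

    def forward(self, x):
        code = self.encoder(x)                # compressed representation
        return self.decoder(code).view_as(x)  # reconstruction from the code

model = AutoEncoder()
images = torch.rand(8, 3, 32, 32)
loss = nn.functional.mse_loss(model(images), images)  # train to reconstruct
loss.backward()
```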
Lex Fridman: To linger on that for a little bit: you mentioned the question of how far bottom-up image segmentation can go, and in general you have said that the central question for scene understanding is the interplay of bottom-up and top-down information. Maybe this is a good time to elaborate on that: what is bottom-up, what is top-down, in the context of computer vision?

Jitendra Malik: Right. Today we have very interesting systems, because they work completely bottom-up.

Lex Fridman: What does bottom-up mean, sorry?

Jitendra Malik: Bottom-up means, in this case, a feed-forward neural network.

Lex Fridman: Starting from the raw pixels?

Jitendra Malik: Yeah, they start from the raw pixels, and they end up with something like "cat" or "not a cat".
So our systems run totally feed-forward, and they're trained in a very top-down way: they're trained by saying, okay, this is a cat, there's a cat, there's a dog, there's a zebra, et cetera. I'm not fully happy with either of these choices, because we have completely separated these processes. So what do we know compared to biology? In biology, what we know is that at test time, at run time, the processes are not purely feed-forward but involve feedback, and they involve much shallower neural networks. The kinds of neural networks we are using in computer vision, say a ResNet-50, have 50 layers; in the brain, in the visual cortex, going from the retina to IT, we have maybe seven. So they're far shallower, but we have the possibility of feedback: there are backward connections, and this might enable us to deal with more ambiguous stimuli, for example. So the biological solution seems to involve feedback; the solution in artificial vision seems to be just feed-forward, but with a much deeper network. And the two are functionally equivalent, because if you have a feedback network which has, say, three rounds of feedback, you can just unroll it, make it three times the depth, and create it in a totally feed-forward way.
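A minimal sketch of that unrolling equivalence: a single shallow layer applied for three rounds of feedback computes exactly the same function as a three-times-deeper feed-forward stack with tied weights (toy sizes, purely illustrative):

```python
# Feedback (recurrent) computation vs. its feed-forward unrolling.
# Toy sizes; the point is only the equivalence, not a realistic model.
import torch
import torch.nn as nn

layer = nn.Linear(16, 16)   # one shared "stage", reused each round
x = torch.randn(1, 16)

# Feedback view: the SAME shallow layer is applied for 3 rounds.
h = x
for _ in range(3):
    h = torch.relu(layer(h))

# Unrolled view: a 3x-deeper feed-forward net with tied weights.
unrolled = nn.Sequential(layer, nn.ReLU(), layer, nn.ReLU(), layer, nn.ReLU())
h_unrolled = unrolled(x)

print(torch.allclose(h, h_unrolled))  # True: identical computation
```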
We have written some papers on this theme, but I really feel this theme should be pursued further.

Lex Fridman: Have some kind of recurrence mechanism?

Jitendra Malik: Yeah. So I want to have a little bit more top-down at test time. Then, at training time, we currently make use of a lot of top-down knowledge. Basically, to learn to segment an object, we have to have all these examples of "this is the boundary of a cat", "this is the boundary of a chair", "this is the boundary of a horse", and so on, and this is too much top-down knowledge. How do humans do this? We manage with far less supervision, and we do it in a sort of bottom-up way, because, for example, we're looking at a video stream and the horse moves, and that enables me to say that all those pixels belong together. The Gestalt psychologists used to call this the principle of common fate. So there was a bottom-up process by which we were able to segment out these objects.
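A crude illustration of that common-fate idea: group the pixels that move together between two frames, with no labels involved. Frame differencing stands in for real optical flow here, and everything about this toy (the scene, the threshold, the morphology step) is an assumption for the sketch, not a method from the conversation:

```python
# Toy "common fate" grouping: pixels that move together form one entity.
# Frame differencing stands in for optical flow; purely illustrative.
import numpy as np
from scipy import ndimage

h, w = 64, 64
frame1 = np.zeros((h, w))
frame1[10:20, 10:20] = 1.0                  # a bright square ("the horse")
frame2 = np.roll(frame1, shift=3, axis=1)   # it moves 3 pixels to the right

motion = np.abs(frame2 - frame1) > 0.5      # pixels whose brightness changed
# The changed pixels hug the moving object's leading and trailing edges;
# closing the small gap between them merges them into one group.
merged = ndimage.binary_closing(motion, structure=np.ones((9, 9)))
labels, num_groups = ndimage.label(merged)
print(num_groups, "moving group(s) found, with zero supervision")  # -> 1
```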
And we have totally focused on this top-down training signal. So, in my view, the way we have currently solved this top-down/bottom-up interaction in machine vision is not fully satisfactory, and I would rather have a bit of both at both stages, for all computer vision problems, not just segmentation. And the question you can ask is: I'm inspired a lot by human vision, and I care about that, but you could be just a hard-boiled engineer and not give a damn. To you I would then argue that you would need far less training data if you could make my research agenda fruitful.

Lex Fridman: Okay.
So, maybe taking a step into segmentation and static scene understanding: what is the interaction between segmentation and recognition? You mentioned the movement of objects. For people who don't know computer vision: segmentation is this weird activity that computer vision folks have all agreed is very important, of drawing outlines around objects, as opposed to a bounding box, and then classifying that object. What's the value of segmentation? What is it as a problem in computer vision? How is it fundamentally different from detection, recognition, and the other problems?

Jitendra Malik: I think segmentation enables us to say that some set of pixels is an object, without necessarily even being able to name that object or know its properties.

Lex Fridman: So you mean segmentation purely as the act of separating an object, a blob that's united in some way, from its background?

Jitendra Malik: Yeah, making an entity out of it.

Lex Fridman: Beautifully put.
Jitendra Malik: So I think we have that capability, and it enables us, as we are growing up, to acquire the names of objects with very little supervision. Suppose the child, let's posit, has this ability to separate out objects in the world. Then when the mother says "pick up your bottle", or "the cat's behaving funny today"...

[Laughter]

...the word "cat" suggests some object, and then the child does the mapping. The mother doesn't have to teach specific object labels by pointing to them; weak supervision works in a context where you have the ability to create objects. So to me that's a very fundamental capability. There are also applications where this is very important, for example medical diagnosis. This is some work that we did in my group: you have CT scans of people who have had traumatic brain injury, and what the radiologist needs to do is precisely delineate the various places where there might be bleeds, for example. There are clear needs like that, so there are certainly very practical applications of computer vision where segmentation is necessary. But philosophically, segmentation enables the task of recognition to proceed with much weaker supervision than we require today.

Lex Fridman: And you think of segmentation as this kind of task that takes in a visual scene and breaks it apart into interesting entities that might be useful for whatever the task is?

Jitendra Malik: Yeah.
And it is not semantics-free. It blends into, it involves, perception and cognition. The mistake that we used to make in the early days of computer vision was to treat it as a purely bottom-up perceptual task. It is not just that, because we do revise our notion of segmentation with more experience. For example, there are objects which are non-rigid, like animals or humans, and I think understanding that all the pixels of a human are one entity is actually quite a challenge, because the parts of the human can move independently, and the human wears clothes, so they might be differently colored. So it's all sort of a challenge.

Lex Fridman: You mentioned the three R's of computer vision: recognition, reconstruction, and reorganization. Can you describe these three R's and how they interact?

Jitendra Malik: Sure.
Recognition is the easiest one, because that's what I think people generally think of computer vision as achieving these days, which is labels. Is this a cat, is this a dog, is this a Chihuahua? It could be very fine-grained, like a specific breed of dog or a specific species of bird, or it could be very abstract, like "animal"; but given a part of an image, or a whole image, put a label on it. That's recognition.

Reconstruction you can think of as inverse graphics; that's one way to think about it. Graphics is: you have some internal computer representation of objects arranged in a scene, and what you do is produce a picture, the pixels corresponding to a rendering of that scene. So let's do the inverse of this. We are given an image, and we say: this image arises from some objects in a scene, looked at with a camera from this viewpoint, and we might have more information about the objects, like their shapes, maybe their textures, maybe their colors, et cetera. That's the reconstruction problem. In a way, you are creating, in your head, a model of the external world.
Reorganization has to do with essentially finding these entities. The word "organization" implies structure: in perception, in psychology, we use the term perceptual organization. The world is not internally represented as just a collection of pixels; we make these entities, we create these entities, objects, whatever you want to call them.

Lex Fridman: The relationships between the entities as well? Or is it purely about the entities?

Jitendra Malik: It could be about the relationships, but mainly we focus on the fact that there are entities.

Lex Fridman: I'm trying to pinpoint what "organization" means.

Jitendra Malik: Organization is that, instead of a uniform grid, we have the structure of objects.

Lex Fridman: So segmentation is a small part of that?

Jitendra Malik: Segmentation gets us going towards that.
Lex Fridman: And you kind of have this triangle where they all interact together.

Jitendra Malik: Yes.

Lex Fridman: How do you see that interaction? Reorganization is defining the entities in the world, recognition is labeling those entities, and then reconstruction is, what, filling in the gaps?

Jitendra Malik: Well, for example, imputing 3D objects corresponding to each of these entities.

Lex Fridman: That would be part of adding more information that's not there in the raw data?

Jitendra Malik: Correct. I started pushing this kind of view around 2010 or so, because at that time in computer vision, people were just working on many different problems, and they treated each of them as a separate, isolated problem, each with its own data set, and then you try to solve that and get good numbers on it. I didn't like that approach, because I wanted to see the connections between them. When people divided up vision into various modules, the way they would do it was low-level, mid-level, and high-level vision, corresponding roughly to the psychologists' notions of sensation, perception, and cognition, and that didn't map to tasks that people cared about. So I tried to promote this particular framework as a way of considering the problems that people in computer vision were actually working on, while being more explicit about the fact that they actually are connected to each other. At that time I was doing this just on the basis of information flow. Now it turns out, in the last five years or so, post the deep learning revolution, that this architecture has turned out to be very conducive to that, because in these neural networks we are trying to build multiple representations, and there can be multiple output heads sharing common representations. So in a certain sense, today, given the reality of the solutions people have to these problems, I do not need to preach this anymore; it is just there, part of the solution space.
Lex Fridman: Speaking of neural networks, how much of this problem of computer vision, the reorganization, the recognition, the reconstruction, can be learned end to end, do you think? Instead of "set it and forget it", just plug and play: have a giant data set, perhaps multi-modal, and then just learn the entirety of it?

Jitendra Malik: Well, I think that what end-to-end learning means nowadays is end-to-end supervised learning, and that, I would argue, is too narrow a view of the problem. I like this child development view