Transcript
LRYkH-fAVGE • Jitendra Malik: Computer Vision | Lex Fridman Podcast #110
Kind: captions Language: en the following is a conversation with jitendra malik a professor at berkeley and one of the seminal figures in the field of computer vision the kind before the deep learning revolution and the kind after he has been cited over 180 thousand times and has mentored many world-class researchers in computer science quick summary of the ads two sponsors one new one which is betterhelp and an old goodie expressvpn please consider supporting this podcast by going to betterhelp.com/lex and signing up at expressvpn.com/lexpod click the links buy the stuff it really is the best way to support this podcast and the journey i'm on if you enjoy this thing subscribe on youtube review it with 5 stars on apple podcast support it on patreon or connect with me on twitter at lex fridman however the heck you spell that as usual i'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation this show is sponsored by betterhelp spelled h-e-l-p help check it out at betterhelp.com/lex they figure out what you need and match you with a licensed professional therapist in under 48 hours it's not a crisis line it's not self-help it's professional counseling done securely online i'm a bit from the david goggins line of creatures as you may know and so have some demons to contend with usually on long runs or all nights working forever and possibly full of self-doubt it may be because i'm russian but i think suffering is essential for creation but i also think you can suffer beautifully in a way that doesn't destroy you for most people i think a good therapist can help in this so it's at least worth a try check out their reviews they're good it's easy private affordable available worldwide you can communicate by text anytime and schedule weekly audio and video sessions i highly recommend that you check them out at betterhelp.com/lex this show is also sponsored by expressvpn get it at expressvpn.com to support this podcast and 
to get an extra three months free on a one-year package i've been using expressvpn for many years i love it i think expressvpn is the best vpn out there they told me to say it but it happens to be true it doesn't log your data it's crazy fast and it's easy to use literally just one big sexy power on button again for obvious reasons it's really important that they don't log your data it works on linux and everywhere else too but really why use anything else shout out to my favorite flavor of linux ubuntu mate 20.04 once again get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package and now here's my conversation with jitendra in 1966 seymour papert at mit wrote up a proposal called the summer vision project to be given as far as we know to 10 students to work on and solve that summer so that proposal outlined many of the computer vision tasks we still work on today why do you think we underestimate and perhaps we did underestimate and perhaps still underestimate how hard computer vision is because most of what we do in vision we do unconsciously or subconsciously in human vision so that effortlessness gives us the sense that oh this must be very easy to implement on a computer now this is why the early researchers in ai got it so wrong however if you go into neuroscience or psychology of human vision then the complexity becomes very clear the fact is that a very large part of the cerebral cortex is devoted to visual processing i mean and this is true in other primates as well so once we looked at it from a neuroscience or psychology perspective it becomes quite clear that the problem is very challenging and it will take some time you said the higher level parts are the harder parts i think vision appears to be easy because most of visual processing is subconscious or unconscious right so we underestimate the difficulty whereas when you are like proving a 
mathematical theorem or playing chess the difficulty is much more evident because it is your conscious brain which is processing various aspects of the problem-solving behavior whereas in vision all this is happening but it's not in your awareness it's operating below that but it still seems strange yes that's true but it seems strange that as computer vision researchers for example the community broadly time and time again makes the mistake of thinking the problem is easier than it is or maybe it's not a mistake we'll talk a little bit about autonomous driving for example how hard of a vision task that is do you think i mean is it just human nature or is there something fundamental to the vision problem that we underestimate we're still not able to be cognizant of how hard the problem is yeah i think in the early days it could have been excused because in the early days all aspects of ai were regarded as too easy but i think today it is much less excusable and i think why people fall for this is because of what i call the fallacy of the successful first step there are many problems in vision where getting 50 percent of the solution you can get in one minute getting to 90 percent can take you a day getting to 99 percent may take you five years and 99.99 percent may be not in your lifetime i wonder if that's unique to vision that it seems that language people are not so confident about so natural language processing people are a little bit more cautious about our ability to solve that problem i think for language people intuit that we have to be able to do natural language understanding for vision it seems that we're not cognizant or we don't think about how much understanding is required it's probably still an open problem but in your sense how much understanding is required to solve vision like put another way how much something called common sense reasoning is required to really be able to interpret even static scenes yeah so 
vision operates at all levels and there are parts which can be solved with what we could call maybe peripheral processing so in the human vision literature there used to be these terms sensation perception and cognition which roughly speaking referred to like the front end of processing middle stages of processing and higher level of processing and i think they made a big deal out of this and they wanted to just study only perception and then dismiss certain problems as being quote cognitive but really i think these are artificial divides the problem is continuous at all levels and there are challenges at all levels the techniques that we have today they work better at the lower and mid levels of the problem i think the higher levels of the problem quote the cognitive levels of the problem are there and in many real applications we have to confront them now how much that is necessary will depend on the application for some problems it doesn't matter for some problems it matters a lot so i am for example a pessimist on fully autonomous driving in the near future and the reason is because i think there will be that 0.01 percent of the cases where quite sophisticated cognitive reasoning is called for however there are tasks where first of all they are much more robust in the sense that error is not so much of a problem for example let's say you're doing image search you're trying to get images based on some description some visual description we are very tolerant of errors there right i mean when google image search gives you some images back and a few of them are wrong it's okay it doesn't hurt anybody it's not a matter of life and death but making mistakes when you're driving at 60 miles per hour and you could potentially kill somebody is much more important so just for the fun of it since you mentioned let's go there briefly about autonomous 
vehicles so one of the companies in the space tesla with andrej karpathy and elon musk is working on a system called autopilot which is primarily a vision-based system with eight cameras and basically a single neural network a multi-task neural network they call it hydranet multiple heads so it does multiple tasks but is forming the same representation at the core do you think driving can be converted in this way to purely a vision problem and then solved with learning or even more specifically in the current approach what do you think about what the tesla autopilot team is doing so the way i think about it is that there are certainly subsets of the visual based driving problem which are quite solvable so for example driving in freeway conditions is quite a solvable problem i think there were demonstrations of that going back to the 1980s by someone called ernst dickmanns in munich in the 90s there were approaches from carnegie mellon there were approaches from our team at berkeley in the 2000s there were approaches from stanford and so on so autonomous driving in certain settings is very doable the challenge is to have an autopilot work under all kinds of driving conditions at that point it's not just a question of vision or perception but really also of control and dealing with all the edge cases so where do you think most of the difficult cases are to me even the highway driving is an open problem because it applies the same 50 90 95 99 rule or the first step the fallacy of the first step i forget how you put it we fall victim to i think even highway driving has a lot of elements because to solve autonomous driving you have to completely relinquish the help of a human being you're always in control so that you're really going to feel the edge cases so i think even highway driving is really difficult but in terms of the general driving task do you think vision is the fundamental problem or is it also your action the 
interaction with the environment the ability to and then like the middle ground i don't know if you put that under vision which is trying to predict the behavior of others which is a little bit in the world of understanding the scene but it's also trying to form a model of the actors in the scene and predict their behavior yeah i include that in vision because to me perception blends into cognition and building predictive models of other agents in the world which could be other agents could be people other agents could be other cars that is part of the task of perception because perception always has to tell us not what is now but what will happen because what's now is boring it's done it's over with okay yeah we care about the future because we act in the future and we care about the past in as much as it informs what's going to happen in the future so i think we have to build predictive models of behaviors of people and those can get quite complicated so i mean i've seen examples of this actually i mean i own a tesla and it has various safety features built in and what i see are these examples where let's say there is some skateboarder and i don't want to be too critical because obviously these systems are always being improved and any specific criticism i have maybe the system six months from now will not have that particular failure mode so it had the wrong response and it's because it couldn't predict what this skateboarder was going to do okay and because it really required that higher level cognitive understanding of what skateboarders typically do as opposed to a normal pedestrian so what might have been the correct behavior for a pedestrian a typical behavior for a pedestrian was not the typical behavior for a skateboarder right yeah and so therefore to do a good job there you need to have enough data where you have pedestrians you also have 
skateboarders you've seen enough skateboarders to see what kinds of patterns of behavior they have so it is in principle with enough data that problem could be solved but i think our current systems computer vision systems they need far far more data than humans do for learning those same capabilities so say that there is going to be a system that solves autonomous driving do you think it will look similar to what we have today but have a lot more data perhaps more compute but the fundamental architectures involved like well in the case of tesla autopilot it is neural networks do you think it will look similar in that regard and we'll just have more data that's a scientific hypothesis as to which way is it going to go i will tell you what i would bet on and this is my general philosophical position on these learning systems what we have found currently very effective in computer vision in the deep learning paradigm is sort of tabula rasa learning and tabula rasa learning in a supervised way with lots and lots of what's going on in the sense that blank slate we just have the system which is given a series of experiences in this setting and then it learns there now let's think about human driving it is not tabula rasa learning so at the age of 16 in high school a teenager goes into driver ed class right and now at that point they learn but at the age of 16 they are already visual geniuses because from 0 to 16 they have built a certain repertoire of vision in fact most of it has probably been achieved by age 2 right in this period up to age 2 they know that the world is three-dimensional they know what objects look like from different perspectives they know about occlusion they know about common dynamics of humans and other bodies they have some notion of intuitive physics so they built that up from their observations and interactions in early childhood and of course 
reinforced through their growing up to age 16 so then at age 16 when they go into driver ed what are they learning they're not learning afresh the visual world they have a mastery of the visual world what they are learning is control okay they are learning how to be smooth about control about steering and brakes and so forth they're learning a sense of typical traffic situations now that education process can be quite short because they are coming in as visual geniuses and of course in their future they're going to encounter situations which are very novel right so during my driver ed class i may not have had to deal with a skateboarder i may not have had to deal with a truck driving in front of me where the back opens up and some junk gets dropped from the truck and i have to deal with it right but i can deal with this as a driver even though i did not encounter this in my driver ed class and the reason i can deal with it is because i have all this general visual knowledge and expertise and do you think the learning mechanisms we have today can do that kind of long-term accumulation of knowledge or do we have to do some kind of you know the work that led up to expert systems with knowledge representation you know the broader field of artificial intelligence worked on this kind of accumulation of knowledge do you think neural networks can do the same i think i don't see any in principle problem with neural networks doing it but i think the learning techniques would need to evolve significantly so the current learning techniques that we have are supervised learning you're given lots of examples (x, y) pairs and you learn the functional mapping between them i think that human learning is far richer than that it includes many different components a child explores the world and sees for example a child takes an object and manipulates it in his or her hand 
and therefore gets to see the object from different points of view and the child has commanded the movement so that's a kind of learning data but the learning data has been arranged by the child and this is a very rich kind of data the child can do various experiments with the world so there are many aspects of sort of human learning and these have been studied in child development by psychologists and what they tell us is that supervised learning is a very small part of it there are many different aspects of learning and what we would need to do is to develop models of all of these and then train our systems with that kind of protocol so new methods of learning yes some of which might imitate the human brain but you also in your talks have mentioned some of the compute side of things in terms of the difference from the human brain referencing moravec hans moravec so do you think there's something interesting valuable to consider about the difference in the computational power of the human brain versus the computers of today in terms of instructions per second yes so if we go back so this is a point i've been making for 20 years now and i think once upon a time the way i used to argue this was that we just didn't have the computing power of the human brain our computers were not quite there and i mean there is a well-known trade-off which is that neurons are slow compared to transistors but we have a lot of them and they have a very high connectivity whereas in silicon you have much faster devices transistors switch on the order of nanoseconds but the connectivity is usually smaller right at this point in time i mean we are now talking about 2020 we do have if you consider the latest gpus and so on amazing computing power and if we look back at hans moravec type of calculations which he did in the 1990s we may be there today in terms of computing power comparable to the brain but 
it's not of the same style it's of a very different style so i mean for example the style of computing that we have in our gpus is far far more power hungry than the style of computing that is there in the human brain or other biological entities yeah and the efficiency part we're gonna have to solve that in order to build actual real world systems of large scale let me ask sort of the high level question taking a step back how would you articulate the general problem of computer vision does such a thing exist so if you look at the computer vision conferences and the work that's been going on it's often separated into different little segments breaking the problem of vision apart into whether segmentation 3d reconstruction object detection i don't know image captioning whatever there's benchmarks for each but if you were to sort of philosophically say what is the big problem of computer vision does such a thing exist yes but it's not in isolation so for all intelligence tasks i always go back to sort of biology or humans and if we think about vision or perception in that setting we realize that perception is always to guide action perception for a biological system does not give any benefits unless it is coupled with action so we can go back and think about the first multicellular animals which arose in the cambrian era you know 500 million years ago and these animals could move and they could see in some ways and these two activities helped each other because how does movement help movement helps because you can get food in different places but you need to know where to go and that's really about perception or seeing i mean vision is perhaps the single most important perception sense but all the others are also important so perception and action kind of go together so earlier it was in these very simple feedback loops which were about finding food or avoiding becoming food 
if there's a predator running trying to you know eat you up and so forth so we must at the fundamental level connect perception to action then as we evolved perception became more and more sophisticated because it served many more purposes and so today we have what seems like a fairly general purpose capability which can look at the external world and build a model of the external world inside the head we do have that capability that model is not perfect and psychologists have great fun in pointing out the ways in which the model in your head is not a perfect model of the external world and they have created various illusions to show the ways in which it is imperfect but it's amazing how far it has come from a very simple perception action loop that exists in you know an animal 500 million years ago once we have these very sophisticated visual systems we can then impose a structure on them it's we as scientists who are imposing that structure where we have chosen to characterize this part of the system as quote this module of object detection or quote this module of 3d reconstruction what's going on is really all of these processes are running simultaneously and they are running simultaneously because originally their purpose was in fact to help guide action so as a guiding general statement of a problem do you think we can say that the general problem of computer vision you said in humans it was tied to action do you think we should also say that ultimately the goal the problem of computer vision is to sense the world in the way that helps you act in the world yes i think that's the most fundamental purpose we have by now hyper evolved so we have this visual system which can be used for other things for example judging the aesthetic value of a painting and this is not guiding action maybe it's guiding action in terms of how much money you will put in your auction bid but that's a bit 
stretched but the basics are in fact in terms of action but we have hyper evolved our visual system actually just to sorry to interrupt but perhaps it is fundamentally about action you kind of jokingly said about spending but perhaps the capitalistic drive that drives a lot of the development in this world is about the exchange of money and the fundamental action is money if you watch netflix if you enjoy watching movies you're using your perception system to interpret the movie ultimately your enjoyment of that movie means you'll subscribe to netflix so the action is this extra layer that we've developed in modern society perhaps this is fundamentally tied to the action of spending money well certainly with respect to you know interactions with firms so in this homo economicus role when you're interacting with firms it does become that what else is there that was a rhetorical question okay so to linger on the division between the static and the dynamic so much of the work in computer vision so many of the breakthroughs that you've been a part of have been in the static world in looking at static images and then you've also worked on but to a much smaller degree the community is looking at dynamic at video at dynamic scenes and then there is robotic vision which is dynamic but also where you actually have a robot in the physical world interacting based on that vision which problem is harder sort of the trivial first answer is well of course one image is harder but if you look at a deeper question are we what's the term cutting ourselves at the knees or like making the problem harder by focusing on the images that's a fair question i think sometimes we can simplify our problem so much that we essentially lose part of the juice that could enable us to solve the problem and one could reasonably argue that 
to some extent this happens when we go from video to single images now historically you have to consider the limits imposed by the computation capabilities we had so many of the choices made in the computer vision community through the 70s 80s 90s can be understood as choices which were forced upon us by the fact that we just didn't have access to enough compute not enough memory not enough hard drives exactly not enough compute not enough storage so think of these choices so one of the choices is focusing on single images rather than video okay clear question of storage and compute we used to detect edges and throw away the image right so you have an image which is say 256 by 256 pixels and instead of keeping around the grayscale value what we did was we detected edges find the places where the brightness changes a lot and then throw away the rest so this was a major compression device and the hope was that you can still work with it and the logic was humans can interpret a line drawing and yes and this will save us computation so many of the choices were dictated by that i think today we are no longer detecting edges right we process images with convnets because we don't need to we don't have those compute restrictions anymore now video is still understudied because video compute is still quite challenging if you are a university researcher i think video computing is not so challenging if you are at google or facebook or amazon still super challenging i just spoke with the vp of engineering at google head of the youtube search and discovery and they still struggle doing stuff on video it's very difficult except using techniques that are essentially the techniques you used in the 90s some very basic computer vision techniques no that's when you want to do things at scale so if you want to operate at the scale of all the content of 
youtube it's very challenging and there's similar issues in facebook but as a researcher you have more you know opportunities you can train large networks with relatively large video data sets yeah yes so i think that this is part of the reason why we have so emphasized static images i think that this is changing and over the next few years i see a lot more progress happening in video so i have this generic statement that to me video recognition feels like 10 years behind object recognition and you can quantify that because you can take some of the challenging video data sets and their performance on action classification is like say 30 percent which is kind of what we used to have around 2009 in object detection you know so it's like about 10 years behind and whether it'll take 10 years to catch up is a different question hopefully it will take less than that let me ask a similar question i've already asked but once again so for dynamic scenes do you think some kind of injection of knowledge bases and reasoning is required to help improve like action recognition like if we solve the general action recognition problem what do you think the solution would look like it's another way yeah so i completely agree that knowledge is called for and that knowledge can be quite sophisticated so the way i would say it is that perception blends into cognition and cognition brings in issues of memory and this notion of a schema from psychology which is let me use the classic example which is you go to a restaurant right now there are things that happen in a certain order you walk in somebody takes you to a table a waiter comes gives you a menu takes the order food arrives eventually a bill arrives etc etc this is a classic example of ai from the 1970s there were the terms frames and scripts and schemas these are all quite similar ideas okay in the 70s the way the ai of the time dealt with it was by hand 
coding this so they hand coded in this notion of a script and the various stages and the actors and so on and so forth and use that to interpret for example language i mean if there's a description of a story involving some people eating at a restaurant there are all these inferences you can make because you know what happens typically at a restaurant so i think this kind of knowledge is absolutely essential so i think that when we are going to do long-form video understanding we are going to need to do this i think the kinds of technology that we have right now with 3d convolutions over a couple of seconds of clip or video it's very much tailored towards short-term video understanding not that long-term understanding long-term understanding requires this notion of schemas that i talked about perhaps some notions of goals intentionality functionality and so on and so forth now how will we bring that in so we could either revert back to the 70s and say okay i'm going to hand code in a script or we might try to learn it so i tend to believe that we have to find learning ways of doing this because i think learning ways land up being more robust and there must be a learning version of the story because children acquire a lot of this knowledge by sort of just observation so at no moment in a child's life it's possible but i think it's not so typical that a mother coaches a child through all the stages of what happens in a restaurant they just go as a family they go to the restaurant they eat come back and the child goes through 10 such experiences and the child has got a schema of what happens when you go to a restaurant so we somehow need to provide that capability to our systems you mentioned the following line from the end of the alan turing paper computing machinery and intelligence that many people like you said many people know and very few have read where he proposes 
the turing test this is how you know because it's towards the end of the paper instead of trying to produce a program to simulate the adult mind why not rather try to produce one which simulates the child's so that's a really interesting point if i think about the benchmarks we have before us the tests of our computer vision systems they're often kind of trying to get to the adult so what kind of benchmarks should we have what kind of tests for computer vision do you think we should have that mimic the child's in computer vision yeah i think we should have those and we don't have those today and i think part of the challenge is that we should really be collecting data of the type that the child experiences right so that gets into issues of you know privacy and so on and so forth but there are attempts in this direction to sort of try to collect the kind of data that a child encounters growing up so what's the child's linguistic environment what's the child's visual environment so if we could collect that kind of data and then develop learning schemes based on that data that would be one way to do it i think that's a very promising direction myself there might be people who would argue that we could just short circuit this in some way and sometimes we have had success by not imitating nature in detail so the usual example is airplanes right we don't build flapping wings so yes that's one of the points of debate in my mind i would bet on this learning like a child approach so one of the fundamental aspects of learning like a child is the interactivity so the child gets to play with the data set it's learning from yes it gets to select i mean you can call that active learning you can you know in the machine learning world you can call it a lot of terms what are your thoughts about this whole space of being able to play with the data set or 
select what you're learning yeah so i believe in that and i think that we could achieve it in two ways and i think we should use both so one is actually real robotics right so real you know physical embodiments of agents who are interacting with the world and they have a physical body with dynamics and mass and moment of inertia and friction and all the rest and you learn your body the robot learns its body by doing a series of actions the second is simulation environments so i think simulation environments are getting much much better in my life at facebook ai research our group has worked on something called habitat which is a simulation environment which is a visually photorealistic environment of you know places like houses or interiors of various urban spaces and so forth and as you move you get a picture which is a pretty accurate picture so you can imagine that subsequent generations of these simulators will be accurate not just visually but with respect to you know forces and masses and haptic interactions and so on and then we have that environment to play with let me state one reason why i think this being able to act in the world is important i think that this is one way to break the correlation versus causation barrier so this is something which is of a great deal of interest these days i mean people like judea pearl have talked a lot about why we are neglecting causality and he describes the entire set of successes of deep learning as just curve fitting right but i don't quite agree about as a troublemaker he is but causality is important but causality is not like a single silver bullet it's not like one single principle there are many different aspects here and one of our most reliable ways of establishing causal links and this is the way for example the medical community does this is randomized
controlled trials so you pick some situation and now in some situations you perform an action and for certain others you don't right so you have a controlled experiment well the child is in fact performing controlled experiments all the time right right okay small scale in a small scale but that is a way that the child gets to build and refine its causal models of the world and my colleague alison gopnik has together with a couple of co-authors this book called the scientist in the crib referring to children so the part that i like about that is the scientist wants to build causal models and the scientist does controlled experiments and i think the child is doing that so to enable that we will need to have these active experiments and i think this could be done some in the real world and some in simulation so you have hope for simulation i have hope for simulation that's an exciting possibility if we can get to not just photorealistic but what's that called life realistic yeah simulation so you don't see any fundamental blocks to why we can't eventually simulate the principles of what it means to exist in the world as a physical i don't see any fundamental problems there i mean and look the computer graphics community has come a long way right so in the early days going back to the 80s and 90s they were focusing on visual realism right and then they could do the easy stuff but they couldn't do stuff like hair or fur and so on okay well they managed to do that then they couldn't do physical actions right like there's a glass bowl and it falls down and it shatters but then they could start to do pretty realistic models of that and so on and so forth so the graphics people have shown that they can do this forward direction not just for optical interactions but also for physical interactions so i think of course some of that is very compute intensive but i think by and by
we will find ways of making our models ever more realistic you break vision apart in one of your presentations into early vision static scene understanding dynamic scene understanding and raise a few interesting questions i thought i could just throw some at you just to see if you want to talk about them so early vision so what is it you said sensation perception and cognition so is this sensation yes what can we learn from image statistics that we don't already know so at the lowest level what can we make from just the statistics the basics so the variations in the raw pixels the textures and so on yeah so what we seem to have learned is that there's a lot of redundancy in these images and as a result we are able to do a lot of compression and this compression is very important in biological settings right so you might have ten to the eight photoreceptors and only ten to the six fibers in the optic nerve so you have to do this compression by a factor of a hundred to one and so there are analogs of that which are happening in our artificial neural networks in the early layers so you think there's a lot of compression that can be done in the beginning yeah just the statistics yeah how much well so i mean the way to think about it is just how successful is image compression right and that's been done with older technologies but there are several companies which are trying to use sort of these more advanced neural network type techniques for compression both for static images as well as for video one of my former students has a company which is trying to do stuff like this and i think that they are showing quite interesting results and i think all the success of that is really about image statistics and video statistics but that's still not doing compression of the kind when i see a picture of a cat
all i have to say is it's a cat that's another semantic kind of compression yeah so this is at the lower level right so as i said yeah that's focusing on low level statistics so to linger on that for a little bit you mentioned how far can bottom-up image segmentation go and in general you mentioned that the central question for scene understanding is the interplay of bottom-up and top-down information maybe this is a good time to elaborate on that maybe define what is bottom-up what is top-down in the context of computer vision right so today what we have are very interesting systems because they work completely bottom up what does bottom-up mean sorry so bottom-up in this case means a feed-forward neural network so starting from the raw pixels yeah they start from the raw pixels and they end up with something like cat or not a cat right so our systems are running totally feed forward they're trained in a very top-down way so they're trained by saying okay this is a cat there's a cat there's a dog there's a zebra etc and i'm not happy with either of these choices fully because we have completely separated these processes right so what do we know compared to biology so in biology what we know is that the processes at test time at run time those processes are not purely feed forward but they involve feedback and they involve much shallower neural networks so the kinds of neural networks we are using in computer vision say a resnet 50 has 50 layers well in the brain in the visual cortex going from the retina to IT maybe we have like seven right so they're far shallower but we have the possibility of feedback so there are backward connections and this might enable us to deal with the more ambiguous stimuli for example so the biological solution seems to involve feedback the
solution in artificial vision seems to be just feed forward but with a much deeper network and the two are functionally equivalent because if you have a feedback network which just has like three rounds of feedback you can just unroll it and make it three times the depth and create it in a totally feed forward way so this is something which i mean we have written some papers on this theme but i really feel that this theme should be pursued further have some kind of recurrence mechanism yeah okay so i want to have a little bit more top down at test time okay then at training time we make use of a lot of top-down knowledge right now so basically to learn to segment an object we have to have all these examples of this is the boundary of a cat and this is the boundary of a chair and this is the boundary of a horse and so on and this is too much top-down knowledge how do humans do this we manage with far less supervision and we do it in a sort of bottom-up way because for example we're looking at a video stream and the horse moves and that enables me to say that all these pixels are together yeah so the gestalt psychologists used to call this the principle of common fate so there was a bottom-up process by which we were able to segment out these objects and we have totally focused on this top-down training signal so in my view we have currently solved it in machine vision this top-down bottom-up interaction but i don't find the solution fully satisfactory and i would rather have a bit of both at both stages for all computer vision problems not just segmentation and the question that you can ask is so for me i'm inspired a lot by human vision and i care about that you could be just a hard-boiled engineer and not give a damn so to you i would then argue that you would need far less training data if you could make my research agenda you know fruitful okay so maybe taking
a step into segmentation static scene understanding what is the interaction between segmentation and recognition you mentioned the movement of objects so for people who don't know computer vision segmentation is this weird activity that computer vision folks have all agreed is very important of drawing outlines around objects versus a bounding box and then classifying that object what's the value of segmentation what is it as a problem in computer vision how is it fundamentally different from detection recognition any other problems yeah so i think segmentation enables us to say that some set of pixels are an object without necessarily even being able to name that object or knowing properties of that object oh so you mean segmentation purely as the act of separating an object from its background a blob that's united in some way from its background yeah so entification if you will making an entity out of it entification yeah beautifully put so i think that we have that capability and that enables us as we are growing up to acquire names of objects with very little supervision so suppose the child let's posit that the child has this ability to separate out objects in the world then when the mother says pick up your bottle or the cat's behaving funny today [Laughter] the word cat suggests some object and then the child sort of does the mapping right right the mother doesn't have to teach specific object labels by pointing to them weak supervision works in the context that you have the ability to create objects so to me that's a very fundamental capability there are applications where this is very important for example medical diagnosis so in medical diagnosis you have some brain scan i mean this is some work that we did in my group where you have ct scans of people who have had traumatic brain injury and what the radiologist needs to
do is to precisely delineate various places where there might be bleeds for example and there are clear needs like that so there are certainly very practical applications of computer vision where segmentation is necessary but philosophically segmentation enables the task of recognition to proceed with much weaker supervision than we require today and you think of segmentation as this kind of task that takes on a visual scene and breaks it apart into interesting entities yeah that might be useful for whatever the task is yeah and it is not semantics free so i mean it blends into it involves perception and cognition i think the mistake that we used to make in the early days of computer vision was to treat it as a purely bottom-up perceptual task it is not just that because we do revise our notion of segmentation with more experience right because for example there are objects which are non-rigid like animals or humans and i think understanding that all the pixels of a human are one entity is actually quite a challenge because the parts of the human they can move independently and the human wears clothes so they might be differently colored so it's all sort of a challenge you mentioned the three r's of computer vision are recognition reconstruction reorganization can you describe these three r's sure how they interact yeah so recognition is the easiest one because that's what i think people generally think of as computer vision achieving these days which is labels so is this a cat is this a dog is this a chihuahua i mean you know it could be very fine grained like you know a specific breed of dog or a specific species of bird or it could be very abstract like animal but given a part of an image or a whole image say put a label on that yeah so that's recognition reconstruction is essentially you can think of it as inverse graphics i mean that's one way to think about it so graphics is
you have some internal computer representation you have a computer representation of some objects arranged in a scene and what you do is you produce a picture you produce the pixels corresponding to a rendering of that scene so let's do the inverse of this we are given an image and we say oh this image arises from some objects in a scene looked at with a camera from this viewpoint and we might have more information about the objects like their shape maybe their textures maybe you know color et cetera et cetera so that's the reconstruction problem in a way you are in your head creating a model of the external world okay reorganization is to do with essentially finding these entities so the word organization implies structure so in perception in psychology we use the term perceptual organization that the world is not internally represented as just a collection of pixels but we make these entities we create these entities objects whatever you want to call them and the relationship between the entities as well or is it purely about the entities it could be about the relationships but mainly we focus on the fact that there are entities so i'm trying to pinpoint what the organization means so organization is that instead of like a uniform grid we have the structure of objects so segmentation is a small part of that so segmentation gets us going towards that yeah and you kind of have this triangle where they all interact together yes so how do you see that interaction in sort of reorganization is yes defining the entities in the world the recognition is labeling those entities and then reconstruction is what filling in the gaps well to for example impute some 3d objects corresponding to each of these entities that would be part of adding more information that's not there in the raw data correct i mean i started pushing this kind of a
view around 2010 or something like that because at that time in computer vision people were just working on many different problems but they treated each of them as a separate isolated problem each with its own data set and then you try to solve that and get good numbers on it so i didn't like that approach because i wanted to see the connection between these and if people divided up vision into various modules the way they would do it is as low level mid-level and high-level vision corresponding roughly to the psychologist's notion of sensation perception and cognition and that didn't map to tasks that people cared about okay so therefore i tried to promote this particular framework as a way of considering the problems that people in computer vision were actually working on and trying to be more explicit about the fact that they actually are connected to each other and i was at that time just doing this on the basis of information flow now it turns out in the last five years or so post the deep learning revolution that this architecture has turned out to be very conducive to that because basically in these neural networks we are trying to build multiple representations there can be multiple output heads sharing common representations so in a certain sense today given the reality of what solutions people have to these i do not need to preach this anymore it is just there it's part of the solution space so speaking of neural networks how much of this problem of computer vision of reorganization recognition and reconstruction how much of it can be learned end to end do you think sort of set it and forget it just plug and play have a giant data set multiple perhaps multi-modal and then just learn the entirety of it well so i think that currently what end-to-end learning means nowadays is end-to-end supervised learning and that i would argue is too narrow a
view of the problem i like this child development view this lifelong learning view one where there are certain capabilities that are built up and then there are certain capabilities which are built up on top of that so that's what i believe in so i think end-to-end learning in the supervised setting for a very precise task to me is sort of a limited view of the learning process got it so if we think about beyond purely supervised look back to children you mentioned six lessons that we can learn from children be multimodal be incremental be physical explore be social use language can you speak to these perhaps picking one that you find most fundamental to our time today yeah so i mean i should say to give due credit this is from a paper by smith and gasser and it reflects essentially i would say common wisdom among child development people it's just that this is not common wisdom among people in computer vision and ai and machine learning so i view my role as trying to bridge the two worlds so let's take an example of multi-modal i like that so the canonical multi-modal example is a child interacting with an object so the child holds a ball and plays with it so at that point it's getting a touch signal so the touch signal is getting at the notion of 3d shape but it is sparse and then the child is also seeing a visual signal right and so imagine these two are in totally different spaces right so one is the space of receptors on the skin of the fingers and the thumb and the palm right and then these map on to neuronal fibers that are getting activated somewhere right these lead to some activation in somatosensory cortex i mean a similar thing will happen if we have a robot hand okay and then we have the pixels corresponding to the visual view but we know that they correspond to the same object right so that's a very very
strong cross calibration signal and it is self-supervisory which is beautiful right there's nobody assigning a label the mother doesn't have to come and assign a label the child doesn't even have to know that this object is called a ball okay but the child is learning something about the three-dimensional world from this signal i think on tactile and visual there is some work there is a lot of work currently on audio and visual okay and audio visual so there is some event that happens in the world and that event has a visual signature and it has an auditory signature so there is this glass bowl on the table and it falls and breaks and i hear the smashing sound and i see the pieces of glass okay i've built that connection between the two right i mean this has become a hot topic in computer vision in the last couple of years there are problems like separating out multiple speakers right which was a classic problem in audition they call this the problem of source separation or the cocktail party effect and so on but just try to do it visually when you also have it becomes so much easier and so much more useful so the multimodal i mean there's so much more signal with multimodal and you can use that for some kind of weak supervision as well yes because they are occurring at the same time yeah so you have time which links the two right so at a certain moment t1 you've got a certain signal in the auditory domain and a certain signal in the visual domain but they must be causally related yeah it's an exciting area not well studied yet yeah i mean we have a little bit of work on this but so much more needs to be done yeah so this is a good example be physical that's to do with like the one thing we talked about earlier that there's an embodied world to mention language use language so noam chomsky believes that language may be at the core of cognition at the core of everything
in the human mind what is the connection between language and vision to you like what's more fundamental are they neighbors is one the parent and the child the chicken and the egg oh it's very clear it is vision which is the parent vision is the fundamental ability okay well so it comes before you think vision is more fundamental than language correct and you can think of it either in phylogeny or in ontogeny so phylogeny means if you look at evolutionary time right so we have vision that developed 500 million years ago okay then something like when we get to maybe like five million years ago you have the first bipedal primate so when we started to walk then the hands became free and so then manipulation the ability to manipulate objects and build tools and so on and so forth so you said 500 000 years ago no no sorry the first multicellular animals which you can say had some intelligence arose 500 million years ago okay and now let's fast forward to say the last seven million years which is the development of the hominid line right where from the other primates we have the branch which leads on to modern humans now there are many of these hominids but the ones which you know people talk about lucy because that's like a skeleton from three million years ago and we know that lucy walked okay so at this stage you have that the hand is free for manipulating objects and then the ability to manipulate objects build tools and the brain size grew in this era so okay so now you have manipulation now we don't know exactly when language arose but after that because no apes have i mean so i mean chomsky is correct in that it is a uniquely human capability and other primates don't have that but so it developed somewhere in this era but i would argue that it probably developed after we had this stage of humans or i mean the human species already able to
manipulate and hands free much bigger brain size and for that a lot of vision had to have already developed yeah so the sensation and the perception maybe some of the cognition yeah so these ancestors of ours you know three four million years ago they had spatial intelligence so they knew that the world consists of objects they knew that the objects were in certain relationships to each other they had observed causal interactions among objects they could move in space so they had space and time and all of that so language builds on that substrate so i mean all human languages have constructs which depend on a notion of space and time where did that notion of space and time come from it had to come from perception and action in the world we live in yeah what you refer to as the spatial intelligence yeah yeah to linger a little bit we mentioned turing and his mention of we should learn from children nevertheless language is the fundamental piece of the test of intelligence that turing proposed what do you think is a good test of intelligence what would impress the heck out of you is it fundamentally natural language or is there something in vision i don't think we should have a single test of intelligence so just like i don't believe in iq as a single number i think generally there can be many capabilities which are correlated perhaps so i think that there will be accomplishments which are visual accomplishments accomplishments which are accomplishments in manipulation or robotics and then accomplishments in language i do believe that language will be the hardest nut to crack really yeah so what's harder to pass the spirit of the turing test or like whatever formulation will make it natural language convincingly in natural language like somebody you would want to have a beer
with hang out and have a chat with or the general natural scene understanding you think language is the type i think i'm not a fan of the i think turing as he proposed the test in 1950 was trying to solve a certain problem yeah imitation yeah and i think it made a lot of sense then where we are today 70 years later i think we should not worry about that i mean i think the turing test is no longer the right way to channel research in ai because it takes us down this path of this chat bot which can fool us for five minutes or whatever okay i think i would rather have a list of 10 different tasks i mean i think there are tasks in the manipulation domain tasks in navigation tasks in visual scene understanding tasks in reading a story and answering questions based on that i mean so my favorite language understanding task would be you know reading a novel and being able to answer arbitrary questions from it okay right and this is not an exhaustive list by any means so i think that's where we need to be going to and on each of these axes there's a fair amount of work to be done so on the visual understanding side in this intelligence olympics that we've set up yeah what's a good test for one of many of visual scene understanding do you think such benchmarks exist sorry to interrupt no there aren't any i think essentially to me a really good aid to the blind so suppose there was a blind person and i needed to assist the blind person so ultimately like we said vision that aids in the action in the survival in this world yeah maybe in a simulated world maybe easier to measure performance in a simulated world what we are ultimately after is performance in the real world so david hilbert in 1900 proposed 23 open problems in mathematics some of which are still unsolved most important famous of which is probably the
riemann hypothesis you've thought about and presented about the hilbert problems of computer vision so let me ask what to you today i don't know when the last year you presented that 2015 but versions of it yeah you're kind of the face and the spokesperson for computer vision yeah it's your job to state what the open problems are for the field so what today are the hilbert problems of computer vision do you think let me pick one which i regard as clearly unsolved which is what i would call long-form video understanding so we have a video clip and we want to understand the behavior in there in terms of agents their goals intentionality and make predictions about what might happen you know so that kind of understanding which goes away from atomic visual actions so in the short range the question is are you sitting are you standing are you catching a ball right that we can do now or even if we can't do it fully accurately if we can do it at 50 percent maybe next year we'll do it at 65 and so forth but i think the long range video understanding i don't think we can do well today and that means so long and it blends into cognition that's the reason why it's challenging and so you have to track you have to understand the entities you have to track them and you have to have some kind of model of their behavior correct and these are agents so they are not just like passive objects but agents so therefore they would exhibit goal directed behavior okay so this is one area then i will talk about say understanding the world in 3d now this may seem paradoxical because in a way we have been able to do 3d understanding even like 30 years ago right but i don't think we currently have the richness of 3d understanding in our computer vision system that we would like so let me elaborate on that a
bit so currently we have two kinds of techniques which are not fully unified so there are the kinds of techniques from multi-view geometry that you have multiple pictures of a scene and you do a reconstruction using stereoscopic vision or structure from motion but these techniques totally fail if you just have a single view because they are relying on this multi-view geometry okay then we have some techniques that we have developed in the computer vision community which try to guess 3d from single views and these techniques are based on supervised learning and they are based on having at training time 3d models of objects available and this is completely unnatural supervision right cad models are not injected into your brain okay so what i would like would be a kind of learning as you move around the world notion of 3d so we have our succession of visual experiences and as part of that i might see a chair from different viewpoints or a table from different viewpoints and so on now that enables me to build some internal representation and then next time i just see a single photograph and it may not even be of that chair it's of some other chair and i have a guess of what its 3d shape is like so you're almost learning the cad model kind of yeah implicitly i mean implicitly i mean the cad model need not be in the same form as used by computer graphics hidden in the representation it's hidden in the representation the ability to predict new views and what i would see if i went to such and such position by the way on a small tangent on that are you okay or comfortable with neural networks that do achieve visual understanding that do for example achieve this kind of 3d understanding and you don't know how you're not able to introspect you're not able to visualize or understand or interact with the representation so the fact
that they're not or may not be explainable yeah i think that's fine to me so let me put some caveats on that so it depends on the setting so first of all i think humans are not explainable so yeah that's a really good point yeah so one human to another human is not fully explainable i think there are settings where explainability matters and these might be for example questions on medical diagnosis so i'm in a setting where maybe the doctor maybe a computer program has made a certain diagnosis and then depending on the diagnosis perhaps i should have treatment a or treatment b right so now is the computer program's diagnosis based on data which was collected for american males who are in their 30s and 40s and maybe not so relevant to me maybe it is relevant you know et cetera et cetera and i mean in medical diagnosis we have major issues to do with the reference class so we may have acquired statistics from one group of people and be applying it to a different group of people who may not share all the same characteristics there might be error bars in the prediction so that prediction should really be taken with a huge grain of salt but this has an impact on what treatments should be picked right so there are settings where i want to know more than just this is the answer so in that sense explainability and interpretability may matter it's about giving error bounds and a better sense of the quality of the decision where i'm willing to sacrifice interpretability is that i believe that there can be systems which can be highly performant but which are internally black boxes and that seems to be where we're headed some of the best performing systems are essentially black boxes yeah fundamentally by their construction you and i are black boxes to each other yeah so the nice thing about the black boxes we are is so we
ourselves are black boxes but we're also those of us who are charming are able to convince others like explain what's going on inside the black box with narratives with stories so in some sense neural networks don't have to actually explain what's going on inside they just have to come up with stories real or fake that convince you that they know what's going on and i'm sure we can do that we can create those stories neural networks can create those stories yeah and the transformer will be involved do you think we will ever build a system of human level or superhuman level intelligence we've kind of defined what it takes to try to approach that but do you think that's within our reach the thing that turing thought actually we could do by the year 2000 right do you think we'll ever be able to do it so i think there are two answers here one answer is in principle can we do this at some time and my answer is yes the second answer is a pragmatic one do you think we will be able to do it in the next 20 years or whatever and to that my answer is no and of course that's a wild guess i think that you know donald rumsfeld is not a favorite person of mine but one of his lines is very good which is about known knowns known unknowns and unknown unknowns so in the business we are in there are known unknowns and we have unknown unknowns so i think with respect to a lot of what's the case in vision and robotics i feel like we have known unknowns so i have a sense of where we need to go and what the problems that need to be solved are i feel with respect to natural language understanding and high level cognition it's not just known unknowns but also unknown unknowns so it is very difficult to put any kind of time frame to that do you think some of the unknown unknowns might be positive in that they'll surprise us and make the job much easier so fundamental breakthroughs i think that
is possible because certainly i have been very positively surprised by how effective these deep learning systems have been because i certainly would not have believed that in 2010 i think what we knew from the mathematical theory was that convex optimization works when there's a single global optimum then these gradient descent techniques would work now these are non-linear non-convex systems with a huge number of variables so over-parameterized and the people who used to play with them a lot the ones who are totally immersed in the lore and the black magic they knew that they worked well even though they were really i thought like everybody no the claim that i hear from my friends like yann lecun and so forth now yeah is that they feel that they were comfortable with them well he says but the community as a whole was certainly not and to me that was the surprise that they actually worked robustly for a wide range of problems from a wide range of initializations and so on and so that was certainly more rapid progress than we expected but then there are certainly lots of times in fact most of the history of ai is when we have made progress at a slower rate than we expected so we just keep going i think what i regard as really unwarranted are these fears of you know agi in 10 years and 20 years and that kind of stuff because that's based on completely unrealistic models of how rapidly we will make progress in this field so i agree with you but i've also gotten a chance to interact with very smart people who really worry about the existential threats of ai and i as an open-minded person am sort of taking it in do you think if ai systems in some way the unknown unknowns not super intelligent ai but in ways we don't quite understand the nature of superintelligence will have a detrimental effect on society do you think this is something we should be worried about or
we need to first allow the unknown unknowns to become known unknowns i think we need to be worried about ai today i think that it is not just a worry we need to have when we get to agi i think that ai is being used in many systems today and there might be settings for example when it causes biases or decisions which could be harmful i mean decisions which could be unfair to some people or it could be a self-driving car which kills a pedestrian so ai systems are being deployed today right and they're being deployed in many different settings maybe in medical diagnosis maybe in a self-driving car maybe in selecting applicants for an interview so i would argue that when these systems make mistakes there are consequences and we are in a certain sense responsible for those consequences so i would argue that this is a continuous effort and this is something that in a way is not so surprising it's about all engineering and scientific progress with great power comes great responsibility so as these systems are deployed we have to worry about them and it's a continuous problem i don't think of it as something which will suddenly happen on some day in 2079 for which i need to design some clever trick i'm saying that these problems exist today yeah and we need to be continuously on the lookout for worrying about safety biases risks right i mean the self-driving car kills a pedestrian and they have right i mean this uber incident in arizona yeah right it has happened right this is not about agi in fact it's about a very dumb intelligence which is also killing people the worry people have with agi is the scale but i think you're right the thing that worries me about ai today and it's happening at a huge scale is recommender systems recommendation systems so if you look at twitter or facebook or youtube they're controlling the ideas that we have access to the news and so on and there's fundamentally a machine learning algorithm
behind each of these recommendations and i mean my life would not be the same without these sources of information i'm a totally new human being and the ideas that i know are very much because of the internet because of the algorithm that recommends those ideas and so as they get smarter and smarter i mean that is the agi yeah the algorithm that's recommending the next youtube video you should watch has control of millions or billions of people that algorithm is already super intelligent and has complete control of the population not complete but very strong control for now we can turn off youtube we can just go have a normal life outside of that but the more and more that gets into our life it's that algorithm we start depending on and the different companies that are working on the algorithm so i think you're right it's already there and youtube in particular is using computer vision doing their hardest to try to understand the content of videos so they could be able to connect videos with the people who would benefit from those videos the most and so that development could go in a bunch of different directions some of which might be harmful so yeah you're right the threats of ai are here already we should be thinking about them on a philosophical notion or perhaps personal if you could relive a moment in your life outside of family because it made you truly happy or was a profound moment that impacted the direction of your life what would you go to i don't think of single moments but i look over the long haul i feel that i've been very lucky because i think that in scientific research a lot of it is about being at the right place at the right time and you can work on problems at a time when they're just too premature you know you butt your head against them and nothing happens because the prerequisites for success are not there and then there are times when you are in a
field which is already pretty mature and you can only solve curlicues upon curlicues i've been lucky to have been in this field for 34 years 35 well actually 34 years as a professor at berkeley so longer than that which when i started in it was just like some little crazy absolutely useless field which couldn't really do anything to a time when it's really solving a lot of practical problems and has offered a lot of tools for scientific research right because computer vision is impactful for images in biology or astronomy and so on and so forth and so we have made great scientific progress which has had real practical impact in the world and i feel lucky that i got in at a time when the field was very young and at a time when it's now mature but not fully mature it's mature but not done i mean it's really still in a productive phase yes yeah yeah i think people 500 years from now would laugh at you calling this field mature yeah that is very possible yeah so but lest i forget to mention you've also mentored some of the biggest names of computer vision computer science and ai today there's so many questions i could ask but really what is it how did you do it what does it take to be a good mentor what does it take to be a good guide yeah i think what i feel is i've been lucky to have had very very smart and hardworking and creative students i think some part of the credit just belongs to being at berkeley i think those of us who are at top universities are blessed because we have very very smart and capable students coming and knocking on our door so i have to be humble enough to acknowledge that but what have i added i think i have added something what i have added is i think what i've always tried to teach them is a sense of picking the right problems so i think that in science in the short run success is always based on technical competence you know you're quick with math or you are
whatever i mean there's certain technical capabilities which make for short-range progress long-range progress is really determined by asking the right questions and focusing on the right problems and i feel that what i've been able to bring to the table in terms of advising these students is some sense of taste of what are good problems what are problems that are worth attacking now as opposed to waiting 10 years what's a good problem if you could summarize is that possible to even summarize like what's your sense of a good problem i think i have a sense of what is a good problem which is there is a british scientist in fact he won a nobel prize peter medawar who has a book on this and basically he calls it research is the art of the soluble so we need to sort of find problems which are not yet solved but which are approachable and he sort of refers to this sense that there is this problem which isn't quite solved yet but it has a soft underbelly there is some place where you can you know spear the beast yes and having that intuition that this problem is ripe is a good thing because otherwise you can just beat your head and not make progress so i think that is important so if i have that and if i can convey that to students it's not just that they do great research while they're working with me but that they continue to do great research so in a sense i'm proud of my students and their achievements and their great research even 20 years after they've ceased being my student so it's in part helping them develop that sense that a problem is not yet solved but it's solvable correct the other thing which i think i bring to the table is a certain intellectual breadth i've spent a fair amount of time studying psychology neuroscience relevant areas of applied math and so forth so i can probably help them see some connections to disparate things which they might not have
otherwise so the smart students coming into berkeley can be very deep in the sense they can think very deeply meaning very hard down one particular path but where i could help them is the shallow breadth whereas they would have the narrow depth but that's of some value well it was beautifully refreshing just to hear you naturally jump to psychology back to computer science in this conversation back and forth i mean that's actually a rare quality and i think it's certainly for students empowering to think about problems in a new way so for that and for many other reasons i really enjoyed this conversation thank you so much it was a huge honor thanks for talking today it's been my pleasure thanks for listening to this conversation with jitendra malik and thank you to our sponsors betterhelp and expressvpn please consider supporting this podcast by going to betterhelp.com lex and signing up at expressvpn.com lexpod click the links buy the stuff it's how they know i sent you and it really is the best way to support this podcast and the journey i'm on if you enjoy this thing subscribe on youtube review it with 5 stars on apple podcast support it on patreon or connect with me on twitter at lex fridman don't ask me how to spell that i don't remember myself and now let me leave you with some words from prince myshkin in the idiot by dostoevsky beauty will save the world thank you for listening and hope to see you next time