Yann LeCun: Dark Matter of Intelligence and Self-Supervised Learning | Lex Fridman Podcast #258
SGzMElJ11Cc • 2022-01-22
Transcript preview
The following is a conversation with Yann LeCun, his second time on the podcast. He is the Chief AI Scientist at Meta, formerly Facebook, a professor at NYU, a Turing Award winner, one of the seminal figures in the history of machine learning and artificial intelligence, and someone who is brilliant and opinionated in the best kind of way, and so is always fun to talk to. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, here's my conversation with Yann LeCun.

Lex Fridman: You co-wrote the article "Self-Supervised Learning: The Dark Matter of Intelligence" (great title, by the way) with Ishan Misra. So let me ask: what is self-supervised learning, and why is it the dark matter of intelligence?

Yann LeCun: I'll start with the dark matter part. There is obviously a kind of learning that humans and animals are doing that we currently are not reproducing properly with machines, with AI. The most popular paradigms in machine learning today are supervised learning and reinforcement learning, and they are extremely inefficient. Supervised learning requires many samples to learn anything, and reinforcement learning requires a ridiculously large number of trials and errors for a system to learn anything. And that's why we don't have self-driving cars.

Lex Fridman: That's a big leap from one to the other. So to solve difficult problems, you have to have a lot of human annotation for supervised learning to work, and to solve those difficult problems with reinforcement learning, you have to have some way to simulate the problem, so that you can do the large-scale kind of learning that reinforcement learning requires.

Yann LeCun: Right. So how is it that most teenagers can learn to drive a car in about 20 hours of practice, whereas even with millions of hours of simulated practice, a self-driving car can't actually learn to drive itself properly? So obviously we're missing something, and
it's quite obvious to a lot of people, because the immediate response you get from many people is: well, humans use their background knowledge to learn faster. And they're right. Now, how was that background knowledge acquired? That's the big question. So now you have to ask: how do babies, in the first few months of life, learn how the world works? Mostly by observation, because they can hardly act in the world. And they learn an enormous amount of background knowledge about the world that may be the basis of what we call common sense. This type of learning is not learning a task, it's not being reinforced for anything; it's just observing the world and figuring out how it works. Building world models, learning world models. How do we do this, and how do we reproduce it in machines? Self-supervised learning is one instance, or one attempt, at trying to reproduce this kind of learning.

Lex Fridman: Okay, so you're looking at just observation, not even the interacting part of a child. It's just sitting there, watching mom and dad walk around, pick up stuff, all of that. That's what you mean by background knowledge?

Yann LeCun: Perhaps not even watching mom and dad, just watching the world go by. Just having eyes open, or having eyes closed, or the very act of opening and closing eyes, so that the world appears and disappears. All that basic information.

Lex Fridman: And you're saying, in order to learn to drive, the reason humans are able to learn to drive quickly, some faster than others, is because of that background knowledge: they were able to watch cars operate in the world in the many years leading up to it, the physics of basic objects, all that kind of stuff.

Yann LeCun: That's right. And the basic physics of objects, you don't even need to know how a car works, because that you can learn fairly quickly. The example I use very often is: you're driving next to a cliff, and you know in advance, because of your understanding of intuitive physics, that if you turn the wheel to the right, the car will veer to the right, will run off the cliff, fall off the cliff, and nothing good will come out of that. But if you are a sort of tabula rasa reinforcement learning system that doesn't have a model of the world, you have to repeat falling off this cliff thousands of times before you figure out it's a bad idea, and then a few thousand more times before you figure out how not to do it, and then a few million more times before you figure out how not to do it in every situation you ever encounter.

Lex Fridman: So self-supervised learning still has to have some source of truth being told to it by somebody, and you have to figure out a way, without a significant amount of human assistance, to get that truth from the world. So the mystery there is: how much signal is there? How much truth does the world give you, whether it's the human world, like watching YouTube, or the more natural world?

Yann LeCun: So here's the trick: there is way more signal in a self-supervised setting than there is in either a supervised or reinforcement setting. And this goes to my analogy of the cake, "le cake," as someone has called it, where you try to figure out how much information you ask the machine to predict, and how much feedback you give the machine at every trial. In reinforcement learning, you give the machine a single scalar. You tell the machine "you did good" or "you did bad," and you only tell this to the machine once in a while. When I say "you," it could be the universe telling the machine. But it's just one scalar, and as a consequence, you cannot possibly learn something very complicated without many, many trials where you get many feedbacks of this type. In supervised learning, you give a few bits to the machine at every sample. Let's say you're training a system on
recognizing images on ImageNet: there are 1,000 categories, so that's a little less than 10 bits of information per sample. But in the self-supervised setting, ideally (we don't know how to do this yet), you would show a machine a segment of video, then stop the video and ask the machine to predict what's going to happen next. You let the machine predict, then you let time go by and show the machine what actually happened, and you hope the machine will learn to do a better job at predicting next time around. There's a huge amount of information you give the machine, because it's an entire video clip of the future, after the clip you fed it in the first place.

Lex Fridman: So for both language and vision there's this subtle, seemingly trivial construction, but maybe it's representative of what is required to create intelligence: filling in the gaps. It sounds dumb, but is it possible you can solve all of intelligence this way? For language: give a sentence and continue it, or give a sentence with some words blanked out and fill in what words go there. For vision: give a sequence of images and predict what's going to happen next, or fill in what happened in between. Do you think it's possible that this formulation alone, as a signal for self-supervised learning, can solve intelligence, for vision and language?

Yann LeCun: I think that's our best shot at the moment. Whether this will take us all the way to human-level intelligence, or just cat-level intelligence, is not clear, but among all the possible approaches that people have proposed, I think it's our best shot. So, this idea of an intelligent system filling in the blanks: predicting the future, inferring the past, filling in missing information. I'm currently filling in the blank of what is behind your head, and what your head looks like from the back, because I have basic knowledge about how humans are made. I don't know what you're going to say, at which point you're going to speak, whether you're going to move your head this way or that way, which way you're going to look. But I know you're not going to dematerialize and reappear three meters down the hall, because I know what's possible and what's impossible according to intuitive physics.

Lex Fridman: So you have a model of what's possible and what's impossible, you'd be very surprised if the impossible happens, and then you'd have to reconstruct your model.

Yann LeCun: Right. That's the model of the world; it's what fills in the blanks. Given your partial information about the state of the world, given by your perception, your model of the world fills in the missing information: predicting the future, retrodicting the past, filling in things you don't immediately perceive.

Lex Fridman: And that doesn't have to be purely generic visual or language information; you can go to specifics, like predicting what control decision you make when you're driving in a lane. You have a sequence of images from a vehicle, and, if you recorded it on video, you have the information of where the car ended up going, so you can go back in time and predict where the car went based on the visual information. That's very domain-specific.

Yann LeCun: Right, but the question is whether we can come up with a generic method for training machines to do this kind of prediction, or filling in the blanks. Right now, this type of approach has been unbelievably successful in the context of natural language processing. Every modern natural language processing system is pre-trained in a self-supervised manner to fill in the blanks: you show it a sequence of words, you remove 10 percent of them, and then you train some gigantic neural net to predict the words that are missing. And once you've
pre-trained that network, you can use the internal representations it has learned as input to something that you train in a supervised manner, or whatever. That's been incredibly successful. It's not so successful in images, although it's making progress, and there it's based on manual data augmentation, which we can get into later. What has not been successful yet is training from video: getting a machine to learn to represent the visual world, for example, by just watching video. Nobody has really succeeded in doing this.

Lex Fridman: Okay, let's give a high-level overview. What's the difference, in kind and in difficulty, between vision and language? You said people haven't really been able to crack the problem of vision in terms of self-supervised learning, but that may not necessarily be because it's fundamentally more difficult. Maybe passing the Turing test, in the full spirit of the Turing test, in language, is harder than vision; that's not obvious. So in your view, which is harder? Or are they just the same problem, where the further we get toward solving each, the more we realize it's all the same thing, all the same cake?

Yann LeCun: What I'm looking for are methods that make them look essentially like the same cake, but currently they're not. The main issue with learning world models, learning predictive models, is that the prediction is never a single thing, because the world is not entirely predictable. It may be deterministic or stochastic (we could get into a philosophical discussion about that), but even if it's deterministic, it's not entirely predictable. So if I play a short video clip and then ask you to predict what's going to happen next, there are many, many plausible continuations for that video clip, and the number of continuations grows with the interval of time you're asking the system to predict over. So one big question with self-supervision is how you represent this uncertainty: how you represent multiple discrete outcomes, how you represent a continuum of possible outcomes, and so on. If you are a classical machine learning person, you say: oh, you just represent a distribution, right? And that we know how to do when we're predicting missing words in text, because you can have a neural net give a score for every word in a dictionary. It's a big list of numbers, maybe 100,000 or so, and you can turn it into a probability distribution. That tells you, when I say a sentence like "the cat is chasing the [blank] in the kitchen," that there are only a few words that make sense there: it could be a mouse, or a laser spot, or something like that. And if I say "the [blank] is chasing the [blank] in the savannah," you also have a bunch of plausible options for those two words, because you have an underlying reality you can refer to to fill in those blanks. You cannot say for sure, in the savannah, whether it's a lion or a cheetah, and you cannot know whether it's a zebra or a gnu or a wildebeest, but you can represent the uncertainty by just a long list of numbers. Now, if I do the same thing with video and ask you to predict a video clip, it's not a discrete set of potential frames. You somehow have to represent an infinite number of plausible continuations of multiple frames in a high-dimensional continuous space, and we just have no idea how to do this properly.

Lex Fridman: A finite, high-dimensional space. Just like the words, can you get it down to a small finite set, like under a million?

Yann LeCun: Something like that.

Lex Fridman: I mean, it's kind of ridiculous that we're doing a distribution over every single possible word for language, and it works. It feels like a really dumb way to do it; it seems like there should be some more compressed representation of the distribution over words.

Yann LeCun: You're right about that, and I agree.

Lex Fridman: Do you have any interesting ideas about how to represent all of reality in a compressed way, such that you can form a distribution over it?

Yann LeCun: That's one of the big questions: how do you do that? Another thing that is simplistic (I shouldn't say stupid) about current approaches to self-supervised learning in NLP, in text, is that not only do you represent a giant distribution over words, but for multiple words that are missing, those distributions are essentially independent of each other, and you don't pay too much of a price for this. So in the sentence I gave earlier, the system gives certain probabilities for lion and cheetah, and then certain probabilities for gazelle, wildebeest, and zebra, and those two distributions are independent of each other. But those things are not actually independent: lions attack bigger animals than cheetahs do. So there's a huge independence hypothesis in this process which is not actually true. The reason for it is that we don't know how to properly represent distributions over combinatorial sequences of symbols, because the number of combinations grows exponentially with the length of the sequence, so we have to use tricks, and those tricks don't even deal with the problem. So the big question is: could there be some sort of abstract latent representation of text that would say that when I switch lion for cheetah, I also have to switch zebra for gazelle?

Lex Fridman: Yeah, so this independence assumption...
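To make that independence point concrete, here is a minimal sketch in plain Python. The five-word vocabulary and the raw scores are made up purely for illustration; the point is that a masked language model turns per-position scores into distributions with a softmax, and that factorizing the joint as a product of independent per-position softmaxes cannot encode "lion goes with zebra, cheetah goes with gazelle":

```python
import math

def softmax(scores):
    """Turn a dict of raw scores into a probability distribution."""
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# Hypothetical per-position scores for
# "the [blank1] is chasing the [blank2] in the savannah".
blank1 = softmax({"lion": 2.0, "cheetah": 2.0, "toaster": -5.0})
blank2 = softmax({"zebra": 1.5, "gazelle": 1.5, "keyboard": -5.0})

# Standard masked-LM training treats the blanks as independent, so the
# joint probability of any word pair is just the product of the marginals:
p_lion_zebra    = blank1["lion"] * blank2["zebra"]
p_cheetah_zebra = blank1["cheetah"] * blank2["zebra"]

# Because the factorization is independent, these two joints are forced
# to be equal whenever lion and cheetah have equal marginals; the model
# has no way to express the correlation between the two blanks.
print(p_lion_zebra == p_cheetah_zebra)  # True
```

The same limitation is what a joint (latent) representation of the two blanks would have to fix.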
Lex Fridman: Let me throw some criticism at you that I often hear, and see how you respond. This kind of filling in the blanks is just statistics. You're not learning the deep underlying concepts; you're just mimicking stuff from the past. You're not learning anything new that you can use to generalize about the world. Let me just say the crude version: it's just statistics, it's not intelligence. What do you usually say to that?

Yann LeCun: I don't usually get into those discussions, because they're kind of pointless. First of all, it's quite possible that intelligence is just statistics, just statistics of a particular kind.

Lex Fridman: That's the philosophical question: is it possible that intelligence is just statistics?

Yann LeCun: Yeah, but what kind of statistics? If you are asking whether the models of the world that we learn have some notion of causality: yes. So if the criticism comes from people who say that current machine learning systems don't care about causality (which, by the way, is wrong), I agree with them that your model of the world should have your actions as one of its inputs, and that will drive you to learn causal models of the world, where you know what intervention in the world will cause what result. Or you can do this by observing other agents acting in the world, other humans for example, and observing the effects. So I think at some level of description, intelligence is just statistics, but that doesn't mean you won't have models with deep, mechanistic explanations for what goes on. The question is how you learn them; that's the question I'm interested in. A lot of people who voice this criticism say that those mechanistic models have to come from someplace else: they have to come from human designers, they have to come from I don't know what. But obviously we learn them, or if we don't learn them as individuals, nature learned them for us through evolution. Regardless of what you think, those models have been learned somehow.

Lex Fridman: So when we humans introspect about how the brain works, we think about intelligence in terms of the high-level stuff: the models we've constructed, concepts from cognitive science like memory and reasoning modules. Is that a good analogy, or are we ignoring the dark matter, the basic low-level mechanisms, just like we ignore how the operating system works while using the high-level software? At the low level, the neural network might be doing something like statistics (sorry to use this word probably incorrectly and crudely), doing this kind of fill-in-the-gaps learning, constantly updating the model in order to predict the raw sensory information, and adjusting when the prediction is wrong. But when we look at our brain at the high level, it feels like we're playing chess, like we're playing with high-level concepts, stitching them together, putting them into long-term memory, while what's really going on underneath is something we're not able to introspect: a simple, large neural network that's just filling in the gaps.

Yann LeCun: Well, there are a lot of questions and answers there. First of all, there's a whole school of thought in neuroscience, computational neuroscience in particular, that likes the idea of predictive coding, which is really related to the idea I was talking about in self-supervised learning: everything is about prediction. The essence of intelligence is the ability to predict, and everything the brain does is trying to predict everything from everything else. That's the underlying principle, if you want, and self-supervised learning is trying to reproduce this idea of prediction as an essential mechanism of task-independent learning. The next step is: what kind of intelligence are you interested in reproducing? Of course, we all think about reproducing the high-level cognitive processes in humans, but with machines we're not even at the level of reproducing the learning processes in a cat brain. The most intelligent of our intelligent systems don't have as much common sense as a house cat. So how is it that cats learn? Cats don't do a whole lot of reasoning, but they certainly have causal models: many cats can figure out how to act on the world to get what they want. They certainly have a fantastic model of intuitive physics, certainly of the dynamics of their own bodies, but also of prey and things like that. So they're pretty smart, and they do all of this with only about 800 million neurons. We are not anywhere close to reproducing that. So to some extent I could say: let's not even worry about the high-level cognition, long-term planning, and reasoning that humans can do until we figure out whether we can even reproduce what cats are doing. That said, this ability to learn world models, I think, is the key to the possibility of building learning machines that can also reason. Whenever I give a talk, I say there are three main challenges in machine learning. The first is getting machines to learn to represent the world, and for that I'm proposing self-supervised learning. The second is getting machines to reason in ways that are compatible with essentially gradient-based learning, because this is what deep learning is all about, really. And the third is something we have no idea how to solve, at least I have no idea how to solve: can we get machines to learn hierarchical representations of action plans? We know how to train them to learn hierarchical representations of perception, with convolutional nets and things like that, and transformers. But what about action plans? Can we get them to spontaneously learn good hierarchical representations of actions?

Lex Fridman: Also gradient-based?

Yann LeCun: Yeah, all of it needs to be somewhat differentiable so that you can apply gradient-based learning, which is really what deep learning is about.

Lex Fridman: So: background knowledge; an ability to reason that is differentiable, and deeply integrated with, or builds on top of, that background knowledge; and then, given that background knowledge, the ability to make hierarchical plans in the world.

Yann LeCun: Right. If you take classical optimal control, there's something called model predictive control, and it's been around since the early 1960s; NASA uses it to compute trajectories of rockets. The basic idea is that you have a predictive model of the rocket, or whatever system you intend to control, which, given the state of the system at time t and given an action you take on the system (for a rocket, the thrust and all the controls you have), gives you the state of the system at time t plus delta t. So, basically, a differential equation, something like that. And if you have this model in the form of some sort of neural net, or some set of formulas that you can backpropagate gradients through, you can do what's called gradient-based model
predictive control: you can unroll that model in time, feed it a hypothesized sequence of actions, and then you have some objective function that measures how well, at the end of the trajectory, the system has succeeded or matched what you wanted it to do. If it's a robot arm, have you grasped the object you want to grasp? If it's a rocket, are you at the right place near the space station? Things like that. And by backpropagation through time (again, this was invented in the 1960s by optimal control theorists) you can figure out the optimal sequence of actions that will get the system to the best final state. So that's a form of reasoning; it's basically planning, and a lot of planning systems in robotics are actually based on this. To take the example of the teenager driving a car again: you have a pretty good dynamical model of the car. It doesn't need to be very accurate, but you know that if you turn the wheel to the right and there is a cliff, you're going to run off the cliff. You don't need a very accurate model to predict that, and you can run this in your mind and decide not to do it, because you can predict in advance that the result is going to be bad. So you can imagine different scenarios, take the first step in the scenario that is most favorable, and then repeat the process of planning. That's called receding-horizon model predictive control; all of these things have names, going back decades. Now, in classical optimal control, the model of the world is not generally learned. There are sometimes a few parameters you have to identify (that's called system identification), but generally the model is mostly deterministic and mostly built by hand.
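The scheme LeCun describes, unrolling a hand-built dynamics model, scoring the final state, and backpropagating through time to improve the action sequence, can be sketched in a few lines. This is a toy 1-D point mass, not any real controller; the dynamics, horizon, step sizes, and target are all made up for illustration:

```python
# Toy gradient-based model predictive control for a 1-D point mass.
# Hand-built "world model":  x' = x + v*dt,  v' = v + a*dt
# We optimize a sequence of accelerations by backpropagating through
# the unrolled dynamics, then could take the first action and replan
# (receding horizon).

DT, HORIZON, TARGET = 0.1, 20, 1.0

def rollout(x, v, actions):
    """Unroll the model in time; return the trajectory of (x, v) states."""
    states = [(x, v)]
    for a in actions:
        x, v = x + v * DT, v + a * DT
        states.append((x, v))
    return states

def grad(actions, x0=0.0, v0=0.0):
    """Gradient of the final cost (x_T - TARGET)^2 w.r.t. each action,
    computed by backpropagation through time."""
    x_T = rollout(x0, v0, actions)[-1][0]
    gx, gv = 2.0 * (x_T - TARGET), 0.0   # dC/dx_T, dC/dv_T
    g = [0.0] * len(actions)
    for t in reversed(range(len(actions))):
        g[t] = gv * DT                   # dv_{t+1}/da_t = DT
        gx, gv = gx, gv + gx * DT        # backprop through one time step
    return g

# Plain gradient descent on the hypothesized action sequence.
actions = [0.0] * HORIZON
for _ in range(500):
    actions = [a - 0.5 * gi for a, gi in zip(actions, grad(actions))]

final_x = rollout(0.0, 0.0, actions)[-1][0]
print(round(final_x, 3))  # close to TARGET = 1.0
```

A real system would replace the two-line dynamics with a learned (and uncertain) model, which is exactly the hard part discussed next.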
Yann LeCun: The big challenge of AI for the next decade is how we get machines to learn predictive models of the world that deal with uncertainty, and deal with the real world in all its complexity. It's not just the trajectory of a rocket, which you can reduce to first principles; it's not even just the trajectory of a robot arm, which again you can model with careful mathematics. It's everything else, everything you observe in the world: people's behavior, physical systems that involve collective phenomena like water, or the branches of a tree, complex things that humans have no trouble developing abstract representations and predictive models for, but that we still don't know how to do with machines.

Lex Fridman: Where do you put, within these three challenges, maybe in the planning stage, the game-theoretic nature of this world, where your actions not only respond to the dynamic nature of the environment but also affect it? If there are other humans involved, is this point number four, or is it somehow integrated into the hierarchical representation of actions?

Yann LeCun: In my view it's integrated; it's just that now your model of the world has to deal with it, which makes it more complicated. The fact that humans are complicated and not easily predictable makes your model of the world that much more complicated.

Lex Fridman: I suppose chess is an analogy, Monte Carlo tree search: I go, you go, I go, you go. Andrej Karpathy recently gave a talk at MIT about car doors. I think there was some machine learning in it too, but mostly car doors. And there's a dynamic nature to it, like the person opening the door and checking, although he wasn't talking about that; he was talking about the perception problem, the ontology of what defines a car door, a big philosophical question. But to me it was interesting because it's obvious that the person opening the car door is trying to get out, like here in New York, trying to get out of the car. You slowing down is going to signal something, you speeding up is going to signal something, and that's a dance, an asynchronous chess game. I guess you can integrate the entirety of these little interactions into one giant model. It's not as complicated as chess; it's just a little dance we do together, and then we figure it out.

Yann LeCun: In some ways it's way more complicated than chess, because it's continuous, it's uncertain in a continuous manner. It doesn't feel more complicated, but that's because this is the kind of problem we've evolved to solve, and so we're good at it, because nature has made us good at it. Nature has not made us good at chess; we completely suck at chess. In fact, that's why we designed it as a game: to be challenging. And if there's something that recent progress in chess and Go has made us realize, it's that humans are really terrible at those things, really bad. There was a story, right before AlphaGo, that the best Go players thought they were maybe two or three stones behind an ideal player they would call God. In fact, no: they were more like nine or ten stones behind. We're just bad. And it's because we have limited working memory; we're not very good at doing the kind of tree exploration that computers are much better at than we are. But we are much better at learning differentiable models of the world. I say "differentiable" not in the sense that we run backpropagation through them, but in the sense that our brain has some mechanism for estimating gradients of some kind.
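The efficiency argument behind "some mechanism for estimating gradients" can be made concrete with a cost comparison. In this sketch (a toy quadratic loss with made-up numbers, not a model of the brain), a black-box scheme that perturbs one parameter at a time needs one loss evaluation per parameter, while the analytic gradient of the same loss comes from a single pass:

```python
# Estimating a gradient by perturbation costs one function evaluation
# per parameter; an analytic (backprop-style) gradient does not.
# Toy quadratic loss over N parameters, purely for illustration.

N, EPS = 1000, 1e-6
target = [0.5] * N
w = [0.0] * N

calls = 0
def loss(w):
    global calls
    calls += 1
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

# Zeroth-order estimate: perturb each weight separately (N + 1 calls).
base = loss(w)
g_fd = []
for i in range(N):
    w2 = list(w)
    w2[i] += EPS
    g_fd.append((loss(w2) - base) / EPS)
fd_calls = calls

# Analytic gradient of the same loss: one pass, no extra evaluations.
g_exact = [2.0 * (wi - ti) for wi, ti in zip(w, target)]

print(fd_calls)  # 1001 loss evaluations for 1000 parameters
print(max(abs(a - b) for a, b in zip(g_fd, g_exact)) < 1e-4)  # True
```

Scaled up to billions of synapses, the per-parameter cost of perturbation is what makes zeroth-order optimization implausible as the brain's learning mechanism.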
that's what you know makes us uh efficient so if you have an agent that consists of a a model of the world which you know in the human brain is basically the entire front half of your brain an objective function which uh in human in in humans is a combination of two things there is your sort of intrinsic motivation module which is in the basal ganglia you know at the base of your brain that's the thing that measures pain and hunger and things like that like immediate feelings and emotions and then there is you know the equivalent of what people in reform spectrum called a critic which is a sort of module that predicts ahead what the outcome of a uh of a situation will be and so it's it's not a cost function but it's sort of not an objective function but it's sort of a you know trained predictor of the ultimate objective function and that also is differentiable and so if all of this is differentiable your cost function your your critic your uh you know your your role model then you can use gradient-based type methods to do planning to the reasoning to do learning uh you know to do all the things that would like an intelligent agent uh to do and the gradient-based learning like what's your intuition that's probably at the core of what can solve intelligence so you don't need like a logic based reasoning uh in your view i don't know how to make logic based reasoning compatible with efficient learning yeah and okay i mean there is a big question perhaps a philosophical question i mean it's not that philosophical but uh that we can ask is is that you know all the learning algorithms we know from engineering and computer science proceed by optimizing some objective function yeah right so one question we may ask is is does learning in the brain minimize an objective function it could be a you know a composite of multiple objective functions but it's still an objective function uh second if it does optimize an objective function does it do does it do it by some sort of 
gradient estimation you know it doesn't need to be back prop but you know some way of estimating the gradient in efficient manner whose complexity is on the same order of magnitude as you know actually running the inference because you can't afford to do things like you know perturbing a weight in your brain to figure out what the effect is and then sort of uh you know you can do sort of estimating gradient by perturbation it's it to me it seems very imp implausible that the brain uses some sort of you know zeroth order black box gradient free optimization because it's so much less efficient than gradient optimization so it has to have a way of estimating gradients is it possible that some kind of logic based reasoning emerges in pockets as a useful like you said if the brain is an objective function maybe it's a mechanism for creating objective functions it's it's a mechanism for creating knowledge bases for example that can then be queried like maybe it's like an efficient representation of knowledge that's learned in a gradient-based way or something like that well so i think there is a lot of different types of intelligence so first of all i think the type of logical reasoning that we think about that we are you know maybe stemming from you know sort of classical ai of the 1970s and 80s i think humans use that relatively rarely and are not particularly good at it but we judge each other based on our ability to uh solve those rare problems it's called an iq test i think so like i'm i'm not very good at chess yes i'm judging you this whole time because well we we actually with your with your uh you know heritage i'm sure you're good at chess no stereotypes not all stereotypes are true well i'm terrible at chess so um you know but i think perhaps uh another type of intelligence that i have is this uh uh you know ability of sort of building models of the world from uh you know reasoning obvious obviously but also also data and those those models generally are more 
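A rough way to see why the zeroth-order, perturbation-style optimization dismissed above is so much less efficient than gradient-based learning: estimating a gradient by perturbing one weight at a time costs one loss evaluation per parameter, versus a single backward-style pass for an analytic gradient. A minimal numpy sketch on a toy quadratic loss (all names and sizes here are illustrative, not anything from the conversation):

```python
import numpy as np

# Toy objective: a quadratic "loss" over a d-dimensional weight vector.
def loss(w):
    return float(np.sum(w ** 2))

def analytic_grad(w):
    # One pass: the gradient of sum(w^2) is 2w.
    return 2 * w

def perturbation_grad(w, eps=1e-5):
    # Zeroth-order estimate: perturb each weight separately.
    # This costs d + 1 evaluations of the loss for d parameters.
    g = np.zeros_like(w)
    base = loss(w)
    for i in range(len(w)):
        wp = w.copy()
        wp[i] += eps
        g[i] = (loss(wp) - base) / eps
    return g

d = 1000
w = np.random.default_rng(0).normal(size=d)
g_exact = analytic_grad(w)
g_est = perturbation_grad(w)  # agrees with g_exact, but needed 1001 loss calls
print(np.max(np.abs(g_exact - g_est)))
```

At brain scale, with on the order of 10^14 synapses, that per-parameter evaluation cost is the implausibility being pointed at.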
And those models are generally more analogical: it's reasoning by simulation and by analogy, where you use one model and apply it to a new situation. Even though you've never seen that situation, you can connect it to a situation you've encountered before, and your reasoning is more akin to some sort of internal simulation. So you're kind of simulating what's happening when you're building, I don't know, a box out of wood or something. You can imagine in advance what the result would be of cutting the wood in this particular way, whether you're going to use screws or nails, or whatever. When you are interacting with someone, you also have a model of that person, and you interact with that person keeping this model in mind, to tell the person what you think is useful to them. So I think this ability to construct models of the world is basically the essence of intelligence, and the ability to use those models to plan actions that will fulfill a particular criterion, of course, is necessary as well.

So I'm going to ask you a series of impossible questions, as we've been doing. If that's the fundamental dark matter of intelligence, this ability to form a background model, what's your intuition about how much knowledge is required? With dark matter you can put a percentage on the composition of the universe: how much of it is dark matter, how much of it is dark energy. How much information do you think is required to be a house cat? You have to be able to, when you see a box, get in it; when you see a human, compute the most evil action; if there's a thing near an edge, knock it off. All of that, plus the extra stuff you mentioned, which is a great self-awareness of the physics of your own body and of the world. How much knowledge is required, do you think, to solve it?

I don't even know how to measure an answer to that question. I'm not sure how to measure it, but whatever it is, it fits in about 800 million neurons, or the representation does. Everything, all the knowledge. It's less than a billion; a dog is two billion, but a cat is less than one billion. Multiply that by a thousand and you get the number of synapses. And I think almost all of it is learned through this sort of self-supervised learning, although a tiny flavor is learned through reinforcement learning, and certainly very little through classical supervised learning, although it's not even clear how supervised learning actually works in a biological world. So I think almost all of it is self-supervised, but it's driven by the ingrained objective functions that a cat or a human has at the base of their brain, which kind of drives their behavior. Nature tells us "you're hungry"; it doesn't tell us how to feed ourselves. That's something the rest of our brain has to figure out.

It's interesting, because there might be deeper objective functions underlying the whole thing. Hunger may be, now you go to neurobiology, just the brain trying to maintain homeostasis, so hunger is just one of the human-perceivable symptoms of the brain being unhappy with the way things currently are.

Right, it could be just one really dumb objective function at the core. But that's how behavior is driven. The fact that our basal ganglia drives us to do things that are different from, say, an orangutan, or certainly a cat, is what makes human nature versus orangutan nature versus cat nature. For example, our basal ganglia drives us to seek the company of other humans, and that's because nature has figured out that we need to be social animals for our species to survive. It's true of many primates; it's not true of orangutans. Orangutans are solitary animals. They don't seek the company of others; in fact, they avoid them. They scream at them when they come too close, because they're territorial, because for their survival, evolution has figured out that's the best thing. I mean, they're occasionally social, of course, for reproduction and things like that, but they're mostly solitary. So all of those behaviors are not part of intelligence. People say, "oh, you're never going to have intelligent machines, because human intelligence is social." But then you look at orangutans, you look at the octopus. Octopuses never know their parents, they barely interact with any other octopus, and they get to be really smart in less than a year. In half a year, really; in a year they're adults, in two years they're dead. So there are things that we as humans think are intimately linked with intelligence, like social interaction, like language. I think we give way too much importance to language as a substrate of intelligence, because we think our reasoning is so linked with language.

So to solve the house cat intelligence problem, you think you could do it on a desert island? You could just have a cat sitting there, looking at the ocean waves, and it would figure a lot of it out?

It needs to have the right set of drives to get it to do the thing and learn the appropriate things, right? For example, baby humans are driven to learn to stand up and walk. That desire is hardwired. How to do it precisely is not; that's learned. But the desire to walk, to move around and stand up, that's probably hardwired. It's very simple to hardwire this kind of stuff.

That's interesting: you're hardwired to want to walk? There's got to be a deeper need for walking. I think it was probably imposed by society, that you need to walk like all the other bipedal...

A lot of simple animals would probably walk without ever watching any other member of the species.

It seems like a scary thing to have to do, because you suck at bipedal walking at first. Crawling seems much safer. Like, why are you in a hurry?

Because you have this thing that drives you to do it, which is part of human development.

Is that understood, actually?

Not entirely, no.

What's the reason to get on two feet? It's really hard. Most animals don't get on two feet.

Well, they get on four feet. Many mammals get on four feet very quickly, some of them extremely quickly.

But from the last time I interacted with a table, that's much more stable than two legs. It's just a really hard problem.

Yeah. How many birds have figured it out, with two feet?

Well, technically, I guess they have two feet.

They have two feet. Chickens, you know. Dinosaurs had two feet, many of them.

Allegedly. I'm just now learning that T. rex was eating grass, not other animals. T. rex might have been a friendly pet.

What do you think about, I don't know if you've looked at it, the test for general intelligence that François Chollet put together? What's your intuition about how to solve an IQ type of test?

I don't know. It's so outside of my radar screen that it's not really relevant, I think, in the short term.

Well, another way to ask, perhaps closer to your work: how do you solve MNIST with very little example data?

Right. The answer to this is probably self-supervised learning: just learn to represent images, and then learning to recognize handwritten digits on top of this will only require a few samples. And we observe this in humans. You show a young child a picture book with a couple of pictures of an elephant, and that's it: the child knows what an elephant is. And we see this today with practical systems. We train image recognition systems with enormous amounts of images, either completely self-supervised or very weakly supervised. For example, you can train a neural net to predict whatever hashtags people type on Instagram. You can do this with billions of images, because there are billions showing up per day, so the amount of training data there is essentially unlimited. Then you take the output representation, a couple of layers down from the output, of what the system learned, and feed this as input to a classifier for any object in the world that you want, and it works pretty well. So that's transfer learning, or weakly-supervised transfer learning. People are making very fast progress using self-supervised learning for this kind of scenario as well, and my guess is that that's going to be the future.

How much cleaning do you think is needed for filtering malicious signal, or whatever the better term is? A lot of people use hashtags on Instagram to get good SEO that doesn't fully represent the contents of the image. Like, they'll put a picture of a cat and hashtag it with "science", "awesome", "fun", I don't know.

Why would you put "science"? That's not very good SEO.

The way my colleagues who worked on this project at Facebook, now Meta, dealt with this a few years ago is that they only selected something like 17,000 tags that correspond to kinds of physical things or situations, things that have some visual content. So you wouldn't have tags like #tbt.

So they keep a very select set of hashtags, is what you're saying?

Yeah, but it's still fairly large, on the order of 10,000 to 20,000.
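The pipeline described here, pretrain a representation on weakly labeled data and then train a small classifier on very few labeled samples, can be caricatured in a few lines. This is only a toy sketch: a fixed random projection stands in for the pretrained encoder, synthetic vectors stand in for images, and a nearest-centroid rule stands in for the classifier head; none of these are the actual Meta system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained encoder: a frozen random projection + ReLU.
# (In the scenario described, this would be a network pretrained on
# billions of weakly labeled images.)
W = rng.normal(size=(64, 16))

def encode(x):
    return np.maximum(x @ W, 0.0)  # frozen features

# Synthetic "images": two classes drawn around different prototypes.
proto = rng.normal(size=(2, 64))
def sample(cls, n):
    return proto[cls] + 0.1 * rng.normal(size=(n, 64))

# Few-shot supervised step: fit class centroids in feature space
# from just 5 labeled examples per class.
centroids = np.stack([encode(sample(c, 5)).mean(axis=0) for c in (0, 1)])

def classify(x):
    d = np.linalg.norm(encode(x)[:, None, :] - centroids[None], axis=-1)
    return d.argmin(axis=1)

# Evaluate on fresh samples: the frozen representation makes the
# few-shot problem easy.
xs = np.concatenate([sample(0, 100), sample(1, 100)])
ys = np.array([0] * 100 + [1] * 100)
acc = (classify(xs) == ys).mean()
print(acc)
```

The point being illustrated is only the division of labor: almost all the capacity lives in the (here fake) pretrained encoder, and the supervised part needs a handful of examples.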
Okay. Can you tell me about data augmentation? What is data augmentation, and how is it used, maybe in contrastive learning, for video? What are some cool ideas here?

Right. So first, data augmentation is the idea of artificially increasing the size of your training set by distorting the images you have, in ways that don't change the nature of the image. You can do data augmentation on MNIST, and people have done this since the 1990s: you take an MNIST digit and you shift it a little bit, or change the size, rotate it, skew it, add noise, et cetera. And it works: if you train a supervised classifier with augmented data, you're going to get better results. Now, it's become really interesting over the last couple of years, because a lot of self-supervised learning techniques for pre-training vision systems are based on data augmentation. The basic technique is originally inspired by techniques that I worked on in the early '90s, and that Geoff Hinton worked on, also in the early '90s; it was sort of parallel work. I used to call this Siamese networks. Basically, you take two identical copies of the same network, which share the same weights, and you show them two different views of the same object. Those two views may have been obtained by data augmentation, or maybe they're two different views of the same scene from a camera that you moved, or at different times, or something like that, or two pictures of the same person. And then you train those two identical copies of the neural net to produce an output representation, a vector, in such a way that the representations for those two images are as close to each other as possible, as identical as possible. Because you want the system to basically learn a function that is invariant, whose output will not change when you transform those inputs in those particular ways.
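The augmentation recipe described for MNIST-style digits can be sketched with numpy. The transforms and magnitudes below are illustrative choices, not the specific ones used in any of the work discussed; the shift uses a circular roll, which preserves the class as long as the digit stays away from the border:

```python
import numpy as np

rng = np.random.default_rng(0)

def shift(img, dx, dy):
    # Translate by (dx, dy) via a circular shift; for small shifts of a
    # digit away from the border this leaves the class unchanged.
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def add_noise(img, sigma=0.05):
    # Pixel noise, clipped back to valid intensities.
    return np.clip(img + sigma * rng.normal(size=img.shape), 0.0, 1.0)

def augment(img):
    # One random label-preserving distortion: small shift plus noise.
    dx, dy = rng.integers(-1, 2, size=2)
    return add_noise(shift(img, dx, dy))

digit = np.zeros((8, 8))
digit[2:6, 3:5] = 1.0  # a crude vertical stroke, standing in for a "1"

# Turn one labeled example into many distorted copies of the same class.
augmented = [augment(digit) for _ in range(10)]
print(len(augmented), augmented[0].shape)
```

In the Siamese setting described next, the same `augment` function would generate the two views fed to the two weight-sharing copies of the network.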
So that's easy to do. What's complicated is: how do you make sure that when you show two images that are different, the system will produce different things? Because if you don't have a specific provision for this, the system will just ignore the input when you train it; it will end up producing a constant vector that is the same for every input. That's called a collapse. Now, how do you avoid collapse? There are two ideas. One idea, which I proposed in the early '90s with my colleagues at Bell Labs, Jane Bromley and a couple of other people, is what we now call contrastive learning: having negative examples. So you have pairs of images that you know are different, you show them to the two copies of the network, and you push the two output vectors away from each other. This will eventually guarantee that things that are semantically similar produce similar representations, and things that are different produce different representations. We actually came up with this idea for a signature verification project. We would collect multiple signatures from the same person and train a neural net to produce the same representation for them, and then force the system to produce different representations for different signatures. The problem was actually proposed by people from what was a subsidiary of AT&T at the time, called NCR. They were interested in storing a representation of the signature in the 80 bytes of the magnetic strip of a credit card. So we came up with the idea of a neural net with 80 outputs that we would quantize into bytes, so we could encode the signature, and that encoding was then used to compare whether a signature matches or not.

That's right. You would sign, run it through the neural net, and then compare the output vector to whatever is stored on your card.
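The Siamese setup with a contrastive loss can be sketched as follows. This is a toy numpy version of the idea only: shared weights between the two branches, squared distance pulled to zero for positive pairs, and a hinge at a margin pushing negative pairs apart. The linear "network", sizes, and margin are placeholders, not the original Bell Labs model.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4)) * 0.1  # shared weights of the two copies

def f(x):
    # Both branches of the Siamese net use the same W.
    return x @ W

def contrastive_loss(x1, x2, same, margin=1.0):
    # Positive pair: pull representations together (squared distance).
    # Negative pair: push apart until at least `margin` apart.
    d = np.linalg.norm(f(x1) - f(x2))
    return d ** 2 if same else max(0.0, margin - d) ** 2

a = rng.normal(size=10)
a_aug = a + 0.01 * rng.normal(size=10)  # distorted view of the same input
b = rng.normal(size=10)                  # a genuinely different input

pos = contrastive_loss(a, a_aug, same=True)   # near zero: views already close
neg = contrastive_loss(a, b, same=False)      # nonzero if b is within the margin
print(pos, neg)
```

Training would backpropagate these losses into `W`; the negative term is exactly the "specific provision" against collapse, since a constant output would make every negative pair pay the full margin penalty.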
It actually worked. It worked, but they ended up not using it, because nobody cares, actually. I mean, the American financial payment system is incredibly far behind in that respect compared to Europe.

Oh, with the signatures. What's the purpose of signatures anyway? Nobody looks at them, nobody cares.

Yeah. So that's contrastive learning: you need positive and negative pairs. And the problem with that is that, even though I wrote the original paper on this, I'm actually not very positive about it, because it doesn't work in high dimension. If your representation is high-dimensional, there are just too many ways for two things to be different, and so you would need lots and lots and lots of negative pairs. There is a particular implementation of this which is relatively recent, from the Google Toronto group, where Geoff Hinton is the senior member; it's called SimCLR. It's a particular way of implementing this idea of contrastive learning, with a particular objective function. Now, what I'm much more enthusiastic about these days is non-contrastive methods: other ways to guarantee that the representations will be different for different inputs. And it's actually based on an idea that Geoff Hinton proposed in the early '90s with a student at the time, Sue Becker. It's based on the idea of maximizing the mutual information between the outputs of the two systems. You only show positive pairs, pairs of images that you know are somewhat similar, and you train the two networks to be informative, but also to be as informative of each other as possible. Basically, one representation has to be predictable from the other, essentially. He proposed that idea, had a couple of papers on it in the early '90s, and then nothing was done about it for decades. And I kind of revived the idea together with
my postdocs at FAIR, particularly a postdoc called Stéphane Deny.
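The non-contrastive idea just described, only positive pairs, with the two branches trained to be informative and mutually predictable, is the spirit of redundancy-reduction objectives like Barlow Twins. The loss below is an assumed sketch in that style, not a quote of any specific paper: the cross-correlation matrix between the two branches' batch-normalized embeddings is pushed toward the identity, so each dimension is predictable from its twin (diagonal near 1) while dimensions stay decorrelated (off-diagonal near 0), which rules out a collapsed constant output without any negative pairs.

```python
import numpy as np

def redundancy_reduction_loss(z1, z2, lam=0.005):
    # Normalize each embedding dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    n = z1.shape[0]
    c = z1.T @ z2 / n  # cross-correlation between the two branches
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()            # mutual predictability
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # decorrelation
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))  # a batch of 8-dim embeddings

perfect = redundancy_reduction_loss(z, z.copy())            # identical views
collapsed = redundancy_reduction_loss(z, np.ones((256, 8)))  # constant branch

# The collapsed branch carries no information about its twin,
# so the loss heavily penalizes it relative to the informative one.
print(perfect, collapsed)
```

No negative pairs appear anywhere: the penalty on the correlation structure alone does the job that the contrastive margin did in the earlier sketch.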