So, incredibly, you've contributed some of the biggest recent ideas in AI: in computer vision, language, natural language processing, reinforcement learning, sort of everything in between. Maybe not GANs. There may not be a topic you haven't touched, and of course the fundamental science of deep learning. What is the difference to you between vision, language, and, as in reinforcement learning, action, as learning problems? And what are the commonalities? Do you see them as all interconnected, or are they fundamentally different domains that require different approaches?

Okay, that's a good question. Machine learning is a field with a lot of unity, a huge amount of unity. What I mean by unity is overlap of ideas, overlap of principles. In fact, there are only one or two or three principles, which are very, very simple, and they apply in almost the same way to the different modalities, to the different problems. And that's why today, when someone writes a paper on improving optimization of deep learning in vision, it improves the different NLP applications and it improves the different reinforcement learning applications. So I would say that computer vision and NLP are very similar to each other today. They differ in that they have slightly different architectures: we use transformers in NLP and we use convolutional neural networks in vision. But it's also possible that one day this will change and everything will be unified with a single architecture. Because if you go back a few years in natural language processing, there were a huge number of architectures; every different tiny problem had its own architecture. Today there is just one transformer for all those different tasks. And if you go back in time even more, you had even more fragmentation, and every little problem in AI had its own little subspecialization and its own little collection of skills, people who
would know how to engineer the features. Now it's all been subsumed by deep learning, and we have this unification. And so I expect vision to become unified with natural language as well. Or rather, not "I expect"; I think it's possible. I don't want to be too sure, because I think the convolutional neural net is very computationally efficient. RL is different. RL does require slightly different techniques, because you really do need to take action, you really do need to do something about exploration, and your variance is much higher. But I think there is a lot of unity even there, and I would expect, for example, that at some point there will be some broader unification between RL and supervised learning, where somehow the RL will be making decisions to make the supervised learning go better. And it'll be, I imagine, one big black box, and you just shovel things into it and it just figures out what to do with whatever you shovel in.

I mean, reinforcement learning has some aspects of language and vision combined, almost. There are elements of a long-term memory that you should be utilizing, and there are elements of a really rich sensory space. So it seems like it's the union of the two, or something like that.

I'd say something slightly different. I'd say that reinforcement learning is neither, but it naturally interfaces and integrates with the two of them.

Do you think action is fundamentally different? So, what is interesting, what is unique about policy learning, about learning to act?

Well, one example, for instance, is that when you learn to act, you're fundamentally in a non-stationary world, because as your actions change, the things you see start changing. You experience the world in a different way. And this is not the case for the more traditional static problem, where you have some distribution and you just apply a model to that distribution.

Do you think it's a fundamentally different problem, or is it just more difficult? Is it a generalization of the problem of
understanding?

I mean, it's a question of definitions, almost. There is a huge amount of commonality, for sure. You take gradients, you try to approximate gradients, in both cases. In the case of reinforcement learning, you have some tools to reduce the variance of the gradients, and you do that. There's lots of commonality: you use the same neural net in both cases, you compute the gradient, you apply Adam in both cases. So there's lots in common, for sure, but there are some small differences which are not completely insignificant. It's really just a matter of your point of view, what frame of reference, how much you want to zoom in or out as you look at these problems.

Which problem do you think is harder? People like Noam Chomsky believe that language is fundamental to everything, that it underlies everything. Do you think language understanding is harder than visual scene understanding, or vice versa?

I think that asking if a problem is hard is slightly wrong. I think the question is a little bit wrong, and I want to explain why. So what does it mean for a problem to be hard?

Okay, the uninteresting, dumb answer to that is: there's a benchmark, and there's human-level performance on that benchmark, and how hard it is comes down to the effort required to reach human level on that benchmark. So, from the perspective of how much effort until we get to human level on a very good benchmark.

Yeah, I understand what you mean by that. So what I was going to say is that a lot of it depends on, you know, once you solve a problem, it stops being hard, and that's always true. And so whether something is hard or not depends on what our tools can do today. So, you know, I'd say that today, true human-level language understanding and visual perception are hard in the sense that there is no way of solving the problem completely in the next three months. Right? So I agree with that statement. Beyond that, I'm just...
my guess would be as good as yours. I don't know.

Oh, okay. So you don't have a fundamental intuition about how hard language understanding is?

I think, you know, I changed my mind. I'd say language is probably going to be harder. I mean, it depends on how you define it. If you mean absolute, top-notch, 100 percent language understanding, I'll go with language. But then, if I show you a piece of paper with letters on it, do you see what I mean? You have a vision system, you say it's the best human-level vision system. I open a book and I show you letters. Will it understand how these letters form into words and sentences and meaning? Is this part of the vision problem? Where does vision end and language begin?

Yeah, so Chomsky would say it starts at language. So vision is just a little example of the kind of structure and, you know, fundamental hierarchy of ideas that's already represented in our brain somehow, that's represented through language. But where does vision stop and language begin? That's a really interesting question.

So one possibility is that it's impossible to achieve really deep understanding in either images or language without basically using the same kind of system. So you're going to get the other for free.

I think it's pretty likely that, yes, if we can get one, our machine learning is probably good enough that we can get the other. But I'm not 100% sure. And also, I think a lot of it really does depend on your definitions.

Definitions of, like, perfect vision?

Because, you know, reading is vision, but should it count?

Yeah. To me, my definition is: a system looked at an image, and then a system looked at a piece of text, and then told me something about that, and I was really impressed.

That's relative. You'll be impressed for half an hour, and then you're going to say, "Well, I mean, all the systems do that, but here's the thing they don't do."

Yeah, but I don't have that with humans. Humans continue to impress me.
Is that true?

Well, the ones... okay, so I'm a fan of monogamy. I like the idea of marrying somebody and being with them for several decades. So I believe in the fact that, yes, it's possible to have somebody continuously giving you pleasurable, interesting, witty new ideas. Friends, too. Yeah, I think so. They continue to surprise you. The surprise, you know, that injection of randomness seems to be a nice source of continued inspiration. Like the wit, the humor. I think that would be a very subjective test, but I think if you have enough humans in the room...

Yeah, I understand what you mean. I feel like I misunderstood what you meant by impressing you. I thought you meant to impress you with its intelligence, with how well it understands an image. I thought you meant something like: I'm going to show it a really complicated image, and it's going to get it right, and you're going to say, "Wow, that's really cool." Systems of, you know, January 2020 have not been doing that.

Yeah, no, I think it all boils down to the reason people click "like" on stuff on the internet, which is that it makes them laugh. So it's, like, humor or wit.

Yeah, or insight.

I'm sure we'll get that as well.