The thing I would very much like to talk about today is the state of the art in deep learning. Here we stand in 2019, really at the height of some of the great progress that has happened, but also at a beginning: it's up to us to define where this incredible data-driven technology takes us. So I'd like to talk a little bit about the breakthroughs that happened in 2017 and 2018 that take us to this point. This lecture is not about state-of-the-art results on the main machine learning benchmarks — the various image classification and object detection benchmarks, or the NLP benchmarks, or the GAN benchmarks. It's not about the cutting-edge algorithm available on GitHub that performs best on a particular benchmark. This is about ideas: ideas and developments that are at the cutting edge of what defines this exciting field of deep learning. So I'd like to go through a bunch of different areas that I think are really exciting. Of course, this is also not a complete lecture — there are things I may be totally missing that happened in 2017-18 that are particularly exciting to people here and beyond. For example, medical applications of deep learning are something I totally don't touch on, and protein folding and all kinds of applications where there have been exciting developments from DeepMind and so on. So forgive me if your favorite developments are missing, but hopefully this encompasses some of the really fundamental things that have happened on the theory side, on the application side, and on the community side — all of us being able to work together on these kinds of technologies. I think 2018, in terms of deep learning, was the year of natural language processing. Many have described this year as the ImageNet moment — analogous to 2012 in computer vision, when AlexNet was the first neural network that really gave a big jump in performance and started to inspire people about what's
possible with deep learning, with purely learning-based methods. In the same way, a series of developments from 2016 and 2017 led up to 2018 and the development of BERT, which produced a total leap on benchmarks and in our ability to apply NLP to solve various natural language processing tasks. So let's tell the story of what takes us there. There are a few developments. I mentioned a little bit on Monday the encoder-decoder recurrent neural networks. The idea is that recurrent neural networks encode sequences of data and output either a single prediction or another sequence. When the input sequence and the output sequence are not necessarily the same size — as in machine translation, where we have to translate from one language to another — the encoder-decoder architecture takes the following approach. It takes in the sequence of words, or the sequence of samples, as the input, and uses recurrent units — whether LSTMs or GRUs or beyond — to encode that sentence into a single vector. So it forms an embedding of that sentence, a representation of that sentence. It then feeds that representation into a decoder recurrent neural network, which generates the sequence of words that form the sentence in the language being translated to. So first you encode, by taking the sequence and mapping it to a fixed-size vector representation; then you decode, by taking that fixed-size vector representation and unrolling it into a sentence that can be of a different length than the input sentence. That's the encoder-decoder structure for recurrent neural networks, and it has been very effective for machine translation — for dealing with arbitrary-length input sequences and arbitrary-length output sequences. Next step: attention. What is attention? It's the next step beyond — an improvement on — the encoder-decoder architecture. It provides a mechanism that allows the decoder to look back at the input sequence. So as opposed to the entire input sentence getting collapsed into a single vector representation, you're allowed to look back at particular samples from the input sequence as part of the decoding process, and you can also learn which aspects of the input sequence are important for which aspects of the decoding process. Visualized another way — and there are a few visualizations here that are quite incredible, done by Jay Alammar; I highly recommend you follow the links and look at the further details of these visualizations — if we look at neural machine translation, the encoder RNN takes a sequence of words and, after every step, forms a hidden state that captures a representation of the words seen so far. Those hidden representations, as opposed to being collapsed into a single fixed-size vector, are all pushed forward to the decoder, which then uses them to translate — but in a selective way. In the visualization, with the input language on the y-axis and the output language on the x-axis, the decoder weighs the different parts of the input sequence differently in order to determine how to best generate each word of the translation in the full output sentence. That's attention: expanding the encoder-decoder architecture to allow for selective attention over the input sequence, as opposed to collapsing everything down into a fixed representation. Next step: self-attention, in the encoding process — allowing the encoder, when forming its hidden representations, to also selectively look at other parts of the input sequence. For certain words, it allows you to determine which relevant aspects of the input sequence can help you encode that word best.
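To make the mechanism concrete, here is a minimal sketch of scaled dot-product attention — the core computation behind these attention mechanisms — in plain Python. The toy dimensions and the absence of learned projection matrices are simplifications; a real implementation works on batched tensors with learned query/key/value projections.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query.

    query: list[float] of size d
    keys, values: one list[float] per input position
    Returns (context, weights): the weighted sum of the values and the
    attention distribution over input positions.
    """
    d = len(query)
    # Similarity of the query to each input position, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Context vector: values blended by how much attention each received.
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights
```

In the decoder-attention setting, the query is the current decoder state and the keys/values are the encoder's hidden states; in self-attention, all three come from the same sequence.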
This improves the encoding process by allowing the encoder to look at the entirety of the context. That's self-attention. Building on that: the Transformer. It uses the self-attention mechanism in the encoder to form these sets of representations of the input sequence, and then, as part of the decoding process, follows the same idea in reverse, with attention that's able to look back at the encoded input. So it's self-attention on the encoder, attention on the decoder — and that's where the magic is: it's able to capture the rich context of the input sequence in order to generate, in a contextual way, the output sequence. Let's take a step back, then, and look at what is critical to natural language — to being able to reason about words, construct a language model, and reason about words in order to classify a sentence, translate a sentence, compare two sentences, and so on. Sentences are collections of words, or characters, and those characters and words have to have an efficient representation that's meaningful for that kind of understanding. That's what the process of embedding is — we talked a little bit about it on Monday. The traditional word2vec process of embedding uses some kind of unsupervised trick to map words into a compressed representation. Language modeling is the process of determining which words usually follow each other. One way you can do it, as in the skip-gram model, is to take a huge dataset of words — there's writing all over the place — and feed a neural network that, in a supervised way, looks at which words usually neighbor the input. So the input is a word, and the output is the words that are statistically likely to follow that word, and the same with the preceding word. Doing this kind of learning — which is what word2vec does — you throw away the output and the input and just take the hidden representation formed in the middle. That's how you form this compressed embedding: a meaningful representation where, when two words are related in a language-modeling sense, they're going to be close to each other, and when they're totally unrelated — have nothing to do with each other — they're far away. ELMo is the approach of using bi-directional LSTMs to learn that representation. Bi-directional means looking not just at the sequence leading up to the word, but in both directions — the sequence that follows and the sequence that came before — and that allows you to learn the rich, full context of the word. In learning the rich, full context of the word, you're forming representations that are much better able to represent the statistical language model behind the corpus of language you're looking at. This produced a big leap: for further algorithms built on that language model — doing things like sentence classification, sentence comparison, translation, and so on — that representation is much more effective for working with language. The OpenAI Transformer is the next step forward: taking the same Transformer I mentioned previously — the encoder with self-attention, the decoder with attention looking back at the input sequence — taking the language model learned by the decoder, and then chopping off layers and training on a specific language task, like sentence classification. Now, BERT is the thing that made the big leap in performance. In that Transformer formulation there's no bi-directional element — it's always moving forward through the encoding and decoding steps. BERT, by contrast, is richly bi-directional: it takes in the full sequence of the sentence and masks out some percentage of the tokens — 15% of the tokens from the sequence.
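That masking step can be sketched as follows — a toy illustration assuming a whitespace tokenizer and a literal `[MASK]` string. (Real BERT operates on WordPiece tokens, and sometimes keeps or randomizes the selected tokens instead of always masking them; this sketch only does straight masking.)

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK].

    Returns (masked_tokens, targets), where targets maps each masked
    position back to its original token -- the labels the model is
    trained to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets
```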
It then tasks the entire encoding self-attention mechanism to predict the words that are missing. Then you stack a ton of those together — self-attention, feed-forward network; self-attention, feed-forward network — and that allows you to learn the rich context of the language, to then, at the end, perform all kinds of tasks. You can, first of all — like ELMo and like word2vec — create rich contextual embeddings: take a set of words and represent them in a space that's very efficient to reason with. You can do language classification, sentence-pair classification, similarity of two sentences, multiple-choice question answering, general question answering, tagging of sentences. Okay, I lingered on that one a little too long, but it's also the one I'm really excited about — if there's been a breakthrough this year, it's thanks to BERT. The other thing I'm very excited about is totally jumping away from NeurIPS — the theory, those kinds of academic developments in deep learning — and into the world of applied deep learning. Tesla has a system called Autopilot, where hardware version 2 of that system is an implementation of the NVIDIA Drive PX 2 platform, which runs a ton of neural networks. There are 8 cameras on the car, and a variant of the Inception network takes in all the cameras at different resolutions as input and performs various tasks, like drivable-area segmentation, object detection, and some basic localization tasks. So you now have a huge fleet of vehicles where it's not engineers — I'm sure some are engineers — but really regular consumers, people who have purchased the car and in many cases have no understanding of what a neural network's limitations and capabilities are. And now a neural network's decisions — its perceptions, and the control decisions based on those perceptions — are controlling the life of a human being.
That, to me, is one of the great breakthroughs of 2017 and 2018 in terms of what AI can do in a practical sense, impacting the world. Over 1 billion miles have been driven on Autopilot. There are two types of systems currently operating in Teslas: hardware version 1 and hardware version 2. Hardware version 1 was the Mobileye monocular-camera perception system; as far as we know, that was not using a neural network, and it was a fixed system that wasn't learning — at least not online learning in the Teslas. The other is hardware version 2, and it's about half and half now in terms of the miles driven. Hardware version 2 has a neural network that's always learning: there are weekly updates, and it's always improving the model, shipping new weights, and so on. That's an exciting set of breakthroughs. Next, AutoML: the dream of automating as many aspects as possible of the machine learning process, where you can just drop in the dataset you're working on and the system automatically determines all the parameters — the details of the architecture, the size of the architecture, the different modules in the architecture, the hyperparameters used for training and for running inference. Everything is done for you; all you feed it is data. That's been the success of neural architecture search in 2016 and 2017, and there have been a few ideas since. Google AutoML is really trying to almost create an API where you just drop in your dataset, and it uses reinforcement learning and recurrent neural networks to — given a few modules — stitch them together in such a way that the objective function optimizes the performance of the overall system. Google and others have shown a lot of exciting results that outperform state-of-the-art systems, both in terms of efficiency and in terms of accuracy.
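Google's AutoML guides the search with reinforcement learning over an RNN controller; as a purely illustrative stand-in for that loop, here is the simplest possible version — random search over a tiny, hypothetical configuration space, scored by a stand-in evaluation function. Everything here (the search space, the evaluator) is invented for illustration.

```python
import random

# Hypothetical search space: a few architectural choices.
SEARCH_SPACE = {
    "num_layers": [2, 4, 8],
    "width": [64, 128, 256],
    "activation": ["relu", "swish"],
}

def sample_config(rng):
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}

def search(evaluate, n_trials=20, seed=0):
    """Random search: sample configs, keep the best by validation score.

    `evaluate` stands in for 'train this architecture and measure
    validation accuracy' -- the expensive step a real AutoML system
    spends its compute on, and guides with a learned controller
    instead of sampling blindly.
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```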
In 2018 there have been a few improvements in this direction, and one of them is AdaNet, which uses the same reinforcement-learning AutoML formulation to build ensembles of neural networks. In many cases, state-of-the-art performance can be achieved not by a single architecture, but by building up an ensemble — a collection of architectures. That's what it's doing here: given candidate architectures, it stitches them together to form an ensemble and gets state-of-the-art performance. Now, that state-of-the-art performance is not a breakthrough leap forward, but it's nevertheless a step forward, and it's a very exciting field that's going to be receiving more and more attention. There's also an area of machine learning that's heavily understudied and that I think is extremely exciting. If you look at 2012, with AlexNet achieving breakthrough performance and showing what deep neural networks are capable of — from that point to today there's been non-stop, extremely active development of different architectures that, even on ImageNet alone, on the image classification task, have improved performance over and over with totally new ideas. On the other side — the data side — there have been very few ideas about how to do data augmentation. Data augmentation is — you know, it's what kids always do when they learn about an object: you look at an object and you kind of twist it around — the process of taking the raw data and messing with it in such a way that it gives you a much richer representation of what this data can look like in other forms, in other contexts, in the real world. There have been very few developments here, I think, and AutoAugment is just a tiny step in that direction — one that I hope we as a community invest a lot of effort in. So what does AutoAugment do? It says: okay, there are these data augmentation methods, like translating the image, shearing the
image, and doing color manipulation like color inversion. Let's take those as basic actions, and then use reinforcement learning — an RNN construct again — to stitch those actions together in such a way that when you train on the augmented data, say on ImageNet, you get state-of-the-art performance. So: mess with the data, in a way that optimizes how you mess with the data. They've also shown that, given the set of data augmentation policies learned to optimize for, say, ImageNet with some architecture, you can take that learned set of policies and apply it to a totally different dataset — a process of transfer learning. What is transfer learning? We've talked about it: you have a neural network that learns to do a thousand-class classification problem on ImageNet, then you chop off a few layers and transfer it to the task of your own dataset of cat versus dog. What you're transferring is the weights learned on the ImageNet classification task, and you then fine-tune those weights on your specific personal cat-versus-dog dataset. You can do the same thing here: as part of the transfer learning process, you can take the data augmentation policies learned on ImageNet and transfer those — you can transfer both the weights and the policies. That's a really super exciting idea. I think it wasn't demonstrated extremely well here in terms of performance — it got an improvement in performance and so on — but it inspired an idea: we need to really think about how to augment data in interesting ways, such that given just a few samples of data, we can generate huge datasets from which meaningful, complex, rich representations can be formed. I think that's one of the ways you break open the problem of how we learn a lot from a little. Training deep neural networks with synthetic data is also a really exciting topic that a few groups — but especially NVIDIA — have invested a lot in, and here's work from CVPR 2018, probably my favorite work on this topic. They really went all out and said: okay, let's mess with synthetic data in every way we possibly can. On the left they show a set of backgrounds; there's also a set of artificial objects, and then there's a car, or whatever object you're trying to classify. So: take that car and mess with it in every way possible — apply every lighting variation possible, rotate everything; it's crazy. What NVIDIA is really good at is creating realistic scenes, and they said, okay, let's create realistic scenes, but let's also go above and beyond and not do realistic at all — do things that can't possibly happen in reality. They generate these huge datasets to train on, and achieve quite good performance on image classification. Of course, on these kinds of tasks you're not going to outperform networks that were trained on ImageNet.
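The freeze-and-fine-tune flavor of transfer learning described above can be sketched in plain Python on a toy model. Everything here — the tiny model, the single-example "dataset", the squared-error loss — is illustrative, not any particular library's API; the point is that the "pretrained" backbone stays frozen while only the new head is trained.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class TinyModel:
    """A toy two-stage model: frozen 'pretrained' features + linear head.

    Stands in for chopping the classifier off a pretrained network and
    fine-tuning only a fresh head on a small dataset.
    """
    def __init__(self, backbone_w, n_features, seed=0):
        rng = random.Random(seed)
        self.backbone_w = backbone_w  # frozen, "pretrained" weights
        self.head_w = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]

    def features(self, x):
        # Frozen feature extractor: one fixed linear map + ReLU.
        return [max(0.0, dot(row, x)) for row in self.backbone_w]

    def predict(self, x):
        return dot(self.head_w, self.features(x))

    def finetune_step(self, x, y, lr=0.1):
        # Gradient step on squared error, updating ONLY the head.
        f = self.features(x)
        err = self.predict(x) - y
        self.head_w = [w - lr * err * fi for w, fi in zip(self.head_w, f)]
```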
But they show that with just a small sample of those real images, they can fine-tune this network — trained on synthetic, totally fake images — to achieve state-of-the-art performance. That's another way to learn a lot from very little: by generating fake worlds synthetically. Next, the process of annotation, which for supervised learning is what you need in order to train a network: you need to be able to provide ground truth, to label whatever entity is being learned. The easiest framing is classification — saying what is going on in the image — and part of that was done for ImageNet by doing Google searches to create candidates; saying what's going on in an image is a pretty easy task. Then there's the object detection task of drawing the bounding box — a little more difficult, but still a couple of clicks. Then, probably one of the highest-complexity tasks of perception, of image understanding, is segmentation: actually drawing, at the pixel level or with polygons, the outline of a particular object. If you have to annotate that by hand, it's extremely costly. The work on Polygon-RNN uses recurrent neural networks to make suggestions for polygons, and it's really interesting — there are a few tricks to produce these high-resolution polygons. The idea is: you draw a bounding box around an object, convolutional networks drop the first point of the polygon, and then a recurrent neural network draws around the object. The performance is really good, and the tool is available online. It's a really interesting idea. Again: the dream with AutoML is to remove the human from the picture as much as possible on the architecture side; with data augmentation, to remove the human from the picture as much as possible on the data side — automate the boring stuff — and in this case, the act of drawing a polygon is what they've tried to automate as much as possible.
The other interesting dimension along which deep learning has recently been optimized is accessibility: how do we make deep learning fast, cheap, and accessible? The DAWNBench benchmark from Stanford formulated an interesting competition that got a lot of attention and drove a lot of progress. It says: if we want to achieve 93% accuracy on ImageNet, or 94% on CIFAR-10 — take that as the requirement — let's compete on how you can do it in the least amount of time and for the least amount of money. Do the training in the least time, and for the fewest dollars — literally the dollars you're allowed to spend. And fast.ai — you know, a renegade group of deep learning researchers — was able to train on ImageNet in three hours, for 25 bucks. So: training a network that achieves 93% accuracy for 25 bucks, and 94% accuracy on CIFAR-10 for 26 cents. The key idea they were playing with is quite simple, and really boils down to messing with the learning rate throughout the training process. The learning rate is how much you adjust the weights based on the loss — the error the neural network observes. They found that if they crank up the learning rate while decreasing the momentum — which is a parameter of the optimization process — and do those jointly, they're able to make the network learn really fast. That's really exciting, and the benchmark itself is also really exciting, because it's exactly for people sitting in this room: it opens the door to doing all kinds of fundamental deep learning work without the computational resources of Google DeepMind or OpenAI or Facebook. That's important for academia, and that's important for independent researchers.
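The crank-up-the-learning-rate-while-lowering-momentum idea can be sketched as a simple triangular schedule. The numbers here are illustrative placeholders, not the tuned values fast.ai actually used, and real "one-cycle" schedules add refinements (warmup shapes, a final annealing phase).

```python
def one_cycle(step, total_steps, lr_range=(0.01, 1.0), mom_range=(0.85, 0.95)):
    """Triangular one-cycle-style schedule.

    The learning rate ramps up to its peak at mid-training and back
    down, while momentum does the inverse: low momentum when the
    learning rate is high, and vice versa.
    """
    half = total_steps / 2
    # phase goes 0 -> 1 -> 0 over the course of training
    phase = step / half if step <= half else (total_steps - step) / half
    lr_lo, lr_hi = lr_range
    mom_lo, mom_hi = mom_range
    lr = lr_lo + (lr_hi - lr_lo) * phase            # rises with phase
    momentum = mom_hi - (mom_hi - mom_lo) * phase   # falls as lr rises
    return lr, momentum
```

At each training step you would feed `lr` and `momentum` into the optimizer before applying the gradient update.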
Next: GANs. There's been a lot of work on generative adversarial networks, and in some ways there have not been breakthrough ideas in GANs for quite a bit. But BigGAN, from Google DeepMind, brought the ability to generate incredibly high-resolution images — and it's the same GAN technique. So in terms of breakthroughs: there are innovations, but scaled. They increased the model capacity and increased the batch size — the number of images fed to the network — and it produces incredible images. I encourage you to go online and look at them; it's hard to believe they're generated. So 2018 for GANs was a year of scaling and parameter tuning, as opposed to breakthrough new ideas. Video-to-video synthesis — this work is from NVIDIA. There's been a lot of work on going from image to image — from a particular image, generating another image, whether it's colorizing an image or the traditionally defined GAN mappings. The idea with video-to-video synthesis, which a few people have been working on but where NVIDIA took a good step forward, is to make the temporal consistency — the temporal dynamics — part of the optimization process: to make it look not jumpy. If you look at the comparison here, the input is the labels in the top left, and the output of the NVIDIA approach is in the bottom right — it's very temporally consistent. If you look at the frame-by-frame image-to-image mapping — that's the state-of-the-art pix2pixHD — it's very jumpy, not temporally consistent at all; and there are some naive approaches for trying to maintain temporal consistency in the bottom left. You can apply this to all kinds of video-to-video mapping tasks. Here is mapping face edges — edge detection on faces — to faces: generating faces from just edges. And you can look at body pose to actual images: as input to the network, you take the pose of the person, and you generate the video of the person. Okay — semantic segmentation: the
problem of perception, in the form of basic image classification — where the input is an image and the output is a classification of what's going on in that image — sort of began with AlexNet on ImageNet, and there have been further and further developments since. The fundamental architecture can be reused for more complex tasks, like detection and segmentation — interpreting what's going on in the image. These large networks — VGG, GoogLeNet, ResNet, SENet, DenseNet — all form rich representations that can then be used for all kinds of tasks. One such task is object detection. Shown here are the region-based methods, where the convolutional layers make region proposals — a bunch of candidates to be considered — and then a second step determines what's in those different regions and forms bounding boxes around them, in a for-loop way. Then there are the single-shot methods, where in a single pass all of the bounding boxes and their classes are generated. There has been a tremendous amount of work in the space of object detection — some single-shot methods, some region-based methods — a lot of exciting work, but I would say no breakthrough ideas. Then we take it to the highest level of perception, which is semantic segmentation. There's been a lot of work there too; the state-of-the-art performance, at least among open-source systems, is DeepLabv3+ on the Pascal VOC challenge. Semantic segmentation, in a nutshell, started in 2014 with fully convolutional neural networks — chopping off the fully connected layers and outputting a heatmap that was very grainy, very low resolution. Improving on that was SegNet, which reuses the max-pooling indices for upsampling. A breakthrough idea that's reused in a lot of cases is the dilated convolution — atrous convolutions — having some spacing in the filter, which increases the field of view of the convolutional filter.
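Here's a minimal 1-D sketch of a dilated convolution, just to show how spacing the kernel taps widens the receptive field without adding parameters. (DeepLab applies the same trick in 2-D inside a full network; this toy uses the correlation form of convolution, with no padding.)

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """1-D convolution with a dilation ('atrous') rate.

    With dilation d, the kernel taps are spaced d apart, so a kernel
    of size k covers a receptive field of d*(k-1)+1 samples while
    still having only k parameters.
    """
    k = len(kernel)
    span = dilation * (k - 1) + 1  # receptive field of one output
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(kernel[j] * signal[start + j * dilation]
                       for j in range(k)))
    return out

def receptive_field(kernel_size, dilation):
    return dilation * (kernel_size - 1) + 1
```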
The key idea behind DeepLab v3+, the state of the art, is multi-scale processing without increasing the number of parameters. The multiple scales are achieved through the atrous rate: taking those atrous convolutions and increasing the spacing — and you can think of increasing that spacing as enlarging the model's field of view. So you can consider all these different scales of processing as you look at the layers of features, allowing you to grasp the greater context as part of the upsampling, deconvolutional step. That's what produces the state-of-the-art performance, and we have a notebook tutorial on GitHub showing this DeepLab architecture trained on Cityscapes — Cityscapes is a driving segmentation dataset that is one of the most commonly used for the task of driving-scene segmentation. Okay, the deep reinforcement learning front. This touches a bit on 2017, but I think the excitement really settled in in 2018 with the work from Google DeepMind and from OpenAI. It started with the DQN paper from Google DeepMind, where they beat a bunch of Atari games, achieving superhuman performance with deep reinforcement learning methods that take in just the raw pixels of the game. The same kind of architecture is able to learn how to beat each of these games — a super exciting idea that has echoes of what general intelligence is: taking in the raw information and being able to understand the game, the sort of physics of the game, sufficiently to beat it. Then in 2016, AlphaGo — with some supervision and some self-play; some supervised learning on expert world-champion games, and some play against itself — was able to beat the top-of-the-world champion at Go. And in 2017, AlphaGo Zero, a specialized version of AlphaZero, was able to beat the original AlphaGo with just a few days of training and zero supervision from expert games.
Through the process of self-play, this is again getting the human out of the picture more and more — which is why AlphaGo Zero was the cleanest demonstration of all this impressive progress in deep reinforcement learning. I think if we look at the history of AI — when you're sitting on a porch a hundred years from now, reminiscing — AlphaZero will be something people remember as a key moment in time. The AlphaZero paper was in 2017, and this year it played Stockfish in chess — one of the best chess-playing engines — and was able to beat it with just four hours of training. Of course, the "four hours" comes with a caveat, because four hours for Google DeepMind means highly distributed training — it's not four hours for an undergraduate sitting in their dorm room — but the point is that through self-play it was able to very quickly learn to beat the state-of-the-art chess engine, and likewise the state-of-the-art shogi engine, Elmo. The interesting thing here is this: with perfect-information games like chess, you have a tree of all the decisions you could possibly make, and presumably the farther you look down that tree, the better you do. That's how Deep Blue beat Kasparov in the '90s: you look as far down the tree as possible to determine which action is optimal. But if you look at the way human grandmasters think, it certainly doesn't feel like they're looking down a tree. There's something like creative intuition — you see patterns on the board, you do a few calculations, but it's on the order of hundreds of positions, not the millions or billions of the Stockfish, state-of-the-art chess engine approach. AlphaZero is moving closer and closer to the human grandmaster.
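To make the contrast concrete, here is a minimal depth-limited minimax sketch in plain Python. The `evaluate` and `children` functions are hypothetical stand-ins: in a Stockfish-style engine, `evaluate` is hand-crafted and `depth` is pushed very deep; AlphaZero's point is that a strong *learned* evaluator allows far shallower search (and it actually uses Monte Carlo tree search guided by the network, not plain minimax).

```python
def minimax(state, depth, evaluate, children, maximizing=True):
    """Depth-limited minimax over a game tree.

    `children(state)` yields successor states; `evaluate(state)` scores
    a position when the depth budget runs out (or at a terminal state).
    A stronger `evaluate` lets you get away with a smaller `depth`.
    """
    kids = children(state)
    if depth == 0 or not kids:
        return evaluate(state)
    scores = [minimax(k, depth - 1, evaluate, children, not maximizing)
              for k in kids]
    return max(scores) if maximizing else min(scores)
```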
Considering very few future moves, it's able — through a neural-network estimator that estimates the quality of the current board and the quality of the moves that follow — to do much, much less lookahead. The neural network learns the fundamental information, just like a grandmaster who looks at a board and can tell how good it is. So that's again interesting: a step towards, at least, echoes of what human intelligence is, in the very structured, formal, constrained worlds of chess and Go and shogi. Then there's the other side — worlds that are messy. Still games, still constrained in that way, but OpenAI has taken on the challenge of playing games that are much messier, that have some semblance of the real world: you have to do teamwork, you have to look at long time horizons, with huge amounts of imperfect information, hidden information, uncertainty. Within that world, they've taken on the challenge of the popular game Dota 2. On the human side, there's The International, a competition hosted every year — in 2018 the winning team got 11 million dollars — so it's a very popular, very active competition that's been going on for a few years. OpenAI achieved a lot of interesting milestones: in 2017, their 1v1 bot beat a top professional Dota 2 player. And the way you achieve great things is you keep trying, so in 2018 they went 5v5: the OpenAI Five team lost two games against top Dota 2 players at the 2018 International. Their ranking — the MMR ranking in Dota 2 — has been increasing over and over, but there are a lot of challenges that make it extremely difficult to beat the top human players. And, you know, in every story — Rocky, or whatever you think of — losing is an essential element of the story that leads to the movie, the book, and the greatness. So you better believe they're
coming back next year, and there are going to be a lot of exciting developments there. Currently there are really two games holding the public eye in terms of AI taking them on as benchmarks. We solved Go — an incredible accomplishment — but what's next? Last year, a best-paper award at NeurIPS went to the heads-up Texas No-Limit Hold'em AI that was able to beat top-level players; what remains currently — well, not completely, but currently — out of reach is not heads-up one-versus-one, but general team Texas No-Limit Hold'em. And on the gaming side, there's this dream of Dota 2 — now that's the benchmark everybody's targeting, and it's an incredibly difficult one; some people think it will be a long time before we can win. On the more practical side of things, 2018 — starting in 2017 — has been a year of the frameworks growing up: maturing and creating ecosystems around themselves. TensorFlow, with its history dating back a few years, has, with TensorFlow 1.0, come to be a mature framework. PyTorch 1.0 came out in 2018 and has matured as well. And now there are really exciting developments in TensorFlow — eager execution and beyond — coming in TensorFlow 2.0 in 2019. Those two players have made incredible leaps in standardizing deep learning: a lot of the ideas I talked about today and on Monday, and that we'll keep talking about, all have a GitHub repository with implementations in TensorFlow and PyTorch, making them extremely accessible — and that's really exciting. It's probably best to close with Geoff Hinton, the quote-unquote godfather of deep learning and one of the key people behind backpropagation, who said recently of backpropagation: "My view is throw it all away and start again." He believes backpropagation is totally broken — an idea that is ancient and needs to be completely
revolutionized. And the practical protocol for doing that, he said: "The future depends on some graduate student who is deeply suspicious of everything I have said." That's probably a good way to end this discussion of what the state of the art in deep learning holds, because everything we're doing is fundamentally based on ideas from the '60s and the '80s. In terms of new ideas, there have not been many — the state-of-the-art results I've mentioned are all fundamentally based on stochastic gradient descent and backpropagation — so the field is ripe for totally new ideas. It's up to us to define the real breakthroughs and the real state of the art in 2019 and beyond. With that, I'd like to thank you. The material is on the website: deeplearning.mit.edu. Thank you.