Today we'll talk about how to make machines see: computer vision. We will present a competition that, unlike DeepTraffic, which is designed to explore ideas and teach you the concepts of deep reinforcement learning, is at the very cutting edge: SegFuse, the deep dynamic driving scene segmentation competition that I'll present today. Whoever does well in this competition is likely to produce a publication, or ideas that would lead the world in the area of perception, perhaps together with the people running this class, perhaps on your own, and I encourage you to do so. Even more cats today. Computer vision today, as it stands, is deep learning: the majority of the successes in how we interpret, form representations of, and understand images and videos utilize neural networks to a significant degree, the very ideas we've been talking about. That applies for supervised, unsupervised, and reinforcement learning; the supervised case is the focus of today. The process is the same, and the data is essential. There's annotated data, where a human provides the labels that serve as the ground truth in the training process; then the neural network goes through that data, learning to map from the raw sensory input to the ground-truth labels, and then generalizes to the testing data set. And the raw sensor data we're dealing with is numbers. I'll say this again and again: we humans take this particular aspect of our visual ability for granted, taking in raw sensory information through our eyes and interpreting it, but to a machine it's just numbers. Whether you're an expert computer vision person or new to the field, something you have to always go back and meditate on is what the machine is given, what data it is tasked to work with in order to perform the task you're asking it to do. Perhaps the data it's given is highly insufficient for what you want it to do. That's the
question that will come up again and again: are images enough to understand the world around you? Given these numbers, sometimes with one channel, sometimes with three (RGB), where every single pixel has three different color values, the task is to classify or regress: produce a continuous variable, or one of a set of class labels. As before, we must be careful about our intuition of what is hard and what is easy in computer vision. Let's take a step back to the inspiration for neural networks, our own biological neural networks, because the human vision system and the computer vision system are somewhat similar in this regard. The visual cortex is arranged in layers, and as information passes from the eyes to the parts of the brain that make sense of the raw sensory information, higher and higher order representations are formed. This is the inspiration, the idea behind using deep neural networks for images: higher and higher order representations form through the layers, the early layers taking in the very raw sensory information, then extracting edges, connecting those edges, composing them into more complex features, and finally arriving at the higher-order semantic meaning we hope to get from these images. In computer vision, deep learning is hard. Illumination variability is the biggest challenge, or at least one of the biggest challenges, in driving for visible-light cameras. Pose variability: as I'll discuss with some of the advances of Geoff Hinton and capsule networks, neural networks as they're currently used in computer vision are not good at representing variable pose. Objects in images, in this 2D plane of color and texture, look very different numerically when the object is rotated or deformed: the deformable, truncated cat. Intraclass variability: for the classification task, which will be our running example today
to introduce some of the networks from the past decade that have seen success, and the intuition and insight that made those networks work. In classification there is a lot of variability inside the classes and very little variability between the classes: all of the images on top are cats, all of those on the bottom are dogs, and they look very different. The other, I would say the second biggest, problem in driving perception with visible-light cameras is occlusion. When part of an object is occluded, due to the three-dimensional nature of our world, some objects are in front of others and they occlude the objects behind them, and yet we're still tasked with identifying the object when only part of it is visible. Sometimes that part (I told you there'd be cats) is barely visible: here we're tasked with classifying a cat with just the ears visible, or just a leg. And on a philosophical level, as we'll talk about with the motivation for our competition: here's a cat dressed as a monkey, eating a banana. Most of us understand what's going on in this scene. In fact, a neural network today can successfully classify this video as a cat, but the context, the humor of the situation (in fact, you could argue it's a monkey) is missing. What else is missing is the dynamic information, the temporal dynamics of the scene. That's what's missing in a lot of the perception work done to date in the autonomous vehicle space with visible-light cameras, and we're looking to expand on that. That's what SegFuse is all about. The image classification pipeline: there's a bin for each category, each class: cat, dog, mug, hat. Inside those bins there are a lot of examples of each, and your task, when a new example comes along that you've never seen before, is to put that image in a bin. It's the same as the machine learning tasks we've seen before, and everything relies on data that's been ground-truthed, that's been labeled by human beings. MNIST is a toy data set of
handwritten digits, often used as an example, and COCO, CIFAR, ImageNet, Places, and a lot of other incredible, rich data sets of hundreds of thousands or millions of images are out there, representing scenes, people's faces, and different objects. Those all provide ground-truth data for testing algorithms and for competing architectures to be evaluated against each other. CIFAR-10, one of the simplest, almost toy, data sets, of tiny images in ten categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), is commonly used to explore some of the basic convolutional neural networks we'll discuss. So let's come up with a very trivial classifier to explain the concept. In fact, if you started to think about how to classify an image without knowing any of these techniques, this is perhaps the approach you would take: you would subtract images. In order to know that an image of a cat is different from an image of a dog, you have to compare them. Given those two images, how do you compare them? One way is to just subtract them and then sum all the pixel-wise differences: subtract the intensity pixel by pixel and sum it up. If that difference is really high, the images are very different. Using that metric we can build a classifier on CIFAR-10: based on this difference function, for a new image I'm going to pick the one of the 10 bins containing the image with the lowest difference; find the image in the data set that is most like the image I have, and put mine in the same bin that image is in. There are 10 classes, so if we just guess at random, the accuracy of our classifier will be 10%. Using our image-difference classifier we can actually do much better than random, much better than 10%: we can get 35 to 38 percent accuracy. That's our first classifier.
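This pixel-difference classifier can be sketched in a few lines of NumPy. This is a toy illustration, not CIFAR-10 itself: the tiny 2x2 "images" and the "dark"/"bright" labels are made up for the example.

```python
import numpy as np

def l1_distance(a, b):
    # Sum of absolute pixel-wise intensity differences between two images.
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

def nearest_neighbor_classify(image, train_images, train_labels):
    # Put the new image in the same bin as the most similar training image.
    distances = [l1_distance(image, t) for t in train_images]
    return train_labels[int(np.argmin(distances))]

# Toy "training set": two 2x2 grayscale images with known labels.
train_images = [np.array([[0, 0], [0, 0]]), np.array([[255, 255], [255, 255]])]
train_labels = ["dark", "bright"]
print(nearest_neighbor_classify(np.array([[10, 0], [5, 0]]), train_images, train_labels))  # dark
```

On real CIFAR-10 the same idea, just with 32x32x3 arrays and ten bins, is what gets the 35 to 38 percent accuracy mentioned above.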
K-nearest neighbors: let's take our classifier to a whole new level. Instead of trying to find the one image that's closest in our data set, we try to find the K closest and ask what class the majority of them belong to, and we can take that K and increase it from 1 to 2 to 3 to 4 to 5 and see how that changes the result. With seven nearest neighbors, which is optimal under this approach for CIFAR-10, we achieve about 30% accuracy. Human level is about 95% accuracy, and with convolutional neural networks we'll get very close to 100%; that's where neural networks shine at this very task of binning images. It all starts with the basic computational unit: signals come in, each of the signals is weighted, they are summed, a bias is added, and the result is put through a nonlinear activation function that produces an output. The nonlinear activation function is key. All of these put together, with more and more hidden layers, form a deep neural network, and that deep neural network is trained, as we've discussed, by taking a forward pass on examples with ground-truth labels, seeing how close the outputs are to the real ground truth, and then punishing the weights that resulted in incorrect decisions and rewarding the weights that resulted in correct decisions. For the MNIST case, the input is handwritten digits, and we want our network to classify what is in an image of a handwritten digit: is it 0, 1, 2, and so on through 9. The way it's often done is that there are ten outputs of the network, and each of the neurons on the output is responsible for getting really excited when its number is called, while everybody else is supposed to stay not excited. Therefore the number of classes is the number of outputs; that's how it's commonly done, and you assign a class to the input image based on the neuron which produces the highest output. But that's for a fully connected network, which we discussed on Monday. In deep learning there are a lot of tricks that make things work, that make training much more efficient.
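The ten-output scheme just described, one output neuron per digit, with the most excited neuron winning, can be sketched as follows. The weights here are random placeholders standing in for a trained network, and the 784-element input stands for a flattened 28x28 digit; none of this is an actual trained MNIST model.

```python
import numpy as np

def dense_layer(x, weights, bias):
    # The basic computational unit, vectorized: weighted sum plus bias.
    return x @ weights + bias

def predict_digit(x, weights, bias):
    # One output per class (0-9); the neuron with the highest output wins.
    outputs = dense_layer(x, weights, bias)
    return int(np.argmax(outputs))

rng = np.random.default_rng(0)
x = rng.random(784)              # a flattened 28x28 handwritten digit
weights = rng.random((784, 10))  # untrained placeholder weights
bias = np.zeros(10)
print(predict_digit(x, weights, bias))  # some class in 0..9
```

Training would adjust `weights` and `bias` so the right neuron fires; here we only illustrate the forward pass and the argmax class assignment.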
On large-class problems, where there are a lot of classes, and on large data sets, when the representation the neural network is tasked with learning is extremely complex, that's where convolutional neural networks step in. The trick they use is spatial invariance: the idea that a cat in the top left corner of an image is the same as a cat in the bottom right corner, so we can learn the same features across the image. That's where the convolution operation steps in. Instead of the fully connected networks, here there's a third dimension of depth: the blocks in this neural network take 3D volumes as input and produce 3D volumes as output. We take a slice of the image, a window, and slide it across, applying the same exact weights. We'll go through an example: the same kind of weights that, in the fully connected network, sit on the edges mapping input to output are here used to map this window of the image to the output. And you can make many such convolutional filters, many layers, many different options for what kind of features you look for in an image, what kind of window you slide across, in order to extract all kinds of things: all kinds of edges, all kinds of higher-order patterns. The very important thing is that the parameters of each of these filters, applied to these windows, are shared: if the feature that defines a cat is useful in the top left corner, it's useful in the top right corner, it's useful in every part of the image. This is the trick that lets convolutional neural networks save a lot of parameters, reduce the parameter count significantly: the reuse, the spatial sharing, of features across the space of the image. The depth of these 3D volumes is the number of filters; the stride is the skip of the filter, the step size, how many pixels you skip when you apply the filter to the input; and the padding is the zero padding on the outside of the input to a
convolutional layer. Let's go through an example (the slides are now available online, so you can follow along). On the left here is an input volume with three channels: the three squares in the left column are the three channels, with numbers inside them. Then we have filters in red, two of them, each with a slice per input channel and a bias, and each filter is three by three. What we do is take those three-by-three filters, which are to be learned (these are our variables, our weights), and slide them across the image to produce the output on the right, in green. By applying the filters in red, we go from the input volume on the left to the output volume in green on the right. You can pull up the slides yourself if you can't see the numbers on the screen, but these operations are performed on the input to produce the single value highlighted in green in the output, and we slide this convolutional filter along the image with a stride, in this case, of two, skipping along, summing into the two-channel output in green. That's it: the convolution operation, what's called a convolutional layer in neural networks. The parameters here, besides the biases, are the red values in the middle; that's what we're trying to learn. There are a lot of interesting tricks we'll discuss today on top of this, but this is the core: the spatially invariant sharing of parameters that makes convolutional neural networks able to efficiently learn and find patterns in images. To build your intuition a little bit more about convolution: here's an input image on the left, and the identity filter produces the output you see on the right.
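The sliding-window arithmetic in that example can be written directly. This is a minimal single-channel sketch (the slide's example uses three input channels and two filters, but the per-window operation is the same): slide the kernel, multiply element-wise, sum.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the kernel across the image, computing a weighted sum at each step.
    # The same weights are applied at every position: spatial sharing.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(16).reshape(4, 4).astype(float)
identity_kernel = np.array([[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
print(conv2d(image, identity_kernel))  # picks out the center pixel of each 3x3 window
```

Changing `stride` to 2 makes the window skip along, shrinking the output, exactly the stride-2 behavior in the slide's example; padding would be zero-padding `image` before the loop.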
Then there are different kinds of edges you can extract, with the resulting activation maps shown on the right: when applying those edge-detection filters to the image on the left, the parts shown in white are where the filters activate. You can use any kind of filter (that's what we're trying to learn): any kind of edge, any kind of pattern. You slide the window along the image in the way shown here and produce the output you see on the right, and depending on how many filters you have at every level, you have many such slices: the input on the left, the outputs on the right. If you have dozens of filters, you have dozens of images on the right, each with different results showing where each individual filter's pattern was found. We learn which patterns are useful to look for in order to perform the classification task; that's the task for the neural network, to learn these filters. And the filters form higher and higher orders of representation, going from the very basic edges to high-level semantic meaning that spans entire images. The ability to span images can be achieved in several ways, but traditionally it has been done successfully through max pooling: taking the output of the convolution operation and reducing its resolution by condensing that information, for example by taking the maximum values, the maximum activations, thereby reducing the spatial resolution. That has detrimental effects, as we'll discuss for scene segmentation, but it's beneficial for finding higher-order representations in the images, representations that bring features together to form the entity we're trying to identify and classify. Okay, so that forms a convolutional neural network: such convolutional layers stacked on top of each other are the only addition to a neural network that makes for a convolutional neural network.
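Max pooling as just described, keeping only the strongest activation in each window and thereby shrinking the spatial resolution, can be sketched as follows (toy activation values, non-overlapping 2x2 windows):

```python
import numpy as np

def max_pool(activations, size=2):
    # Keep the maximum activation in each non-overlapping size x size window,
    # halving the spatial resolution when size=2.
    h, w = activations.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            out[i, j] = activations[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [7., 2., 9., 8.],
                 [1., 0., 3., 4.]])
print(max_pool(fmap))  # [[6. 5.] [7. 9.]]
```

Note how the exact position of each strong activation inside its window is discarded; that is precisely the information loss that hurts pixel-level segmentation later.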
Then at the end, fully connected layers, or other architectures, allow us to apply it to particular domains. Let's take ImageNet as a case study: ImageNet the data set, and ImageNet the challenge, where the task is classification. As I mentioned in the first lecture, ImageNet is one of the largest image data sets in the world, with 14 million images, 21,000 categories, and a lot of depth in many of the categories: as I mentioned, 1,200 Granny Smith apples. These allow the neural networks to learn rich representations across pose, lighting variability, and intraclass variation for particular classes like Granny Smith apples. So let's look through the various networks, discuss them, and see the insights. It started with AlexNet, the first really big successful GPU-trained neural network on ImageNet, which achieved a significant boost over the previous year, and moved on to VGGNet, GoogLeNet, ResNet, CUImage, and SENet in 2017. The accuracy numbers I'll show are based on the top-five error rate: you get five guesses, and if one of the five is correct, you get credit for that particular image; otherwise it's an error. When a human tries to perform the same task as the machine, the error is 5.1 percent. The human annotation of the images is performed as binary classification (Granny Smith apple or not; cat or not), while the actual task that the machine, and the human competing with it, has to perform is: given an image, provide one of the many classes. Human error on that task is 5.1%, which was surpassed in 2015 by ResNet, achieving roughly 4 percent error. So let's start with AlexNet; I'll zoom in on the later networks, which have some interesting insights, but AlexNet and VGGNet both follow a very similar architecture, very uniform throughout their depth: VGGNet in
2014 is convolution, convolution, pooling; convolution, pooling; convolution, pooling; and fully connected layers at the end. There's a certain beautiful simplicity and uniformity to these architectures: you can just make them deeper and deeper, which makes them very amenable to implementation in a layer-stacked way in any of the deep learning frameworks; it's clean and easy to understand. In the case of VGGNet it was 16 or 19 layers with 138 million parameters, without many optimizations, and therefore the parameter count is much higher than in the networks that followed, despite the layer count not being that large. GoogLeNet introduced the inception module, starting to do some interesting things with small modules inside these networks, which allowed the training to be more efficient and effective. The idea behind the inception module, shown here with the previous layer on the bottom and the module's output produced on top, is that different convolution sizes provide different value for the network: smaller convolutions are able to capture and propagate forward features that are very local, high-resolution in texture, while larger convolutions are better able to represent and capture highly abstracted, higher-order features. So the inception module says: as opposed to choosing which convolution size to go with in a hyperparameter-tuning or architecture-design process, why not do all of them together? In the case of the GoogLeNet model there are the one-by-one, three-by-three, and five-by-five convolutions, with our old trusty friend max pooling still left in there as well (which has lost favor more and more over time for the image classification task). The result is that fewer parameters are required: if you place these inception modules correctly, the number of parameters required to achieve a given performance is much lower.
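The "do all filter sizes together" idea can be illustrated at the level of shapes. This sketch uses zero-filled placeholder feature maps rather than real branch computations, and the channel counts are made up; the point is only that each branch produces a map of the same height and width, so the branches can be concatenated along the channel axis for the next layer to consume.

```python
import numpy as np

def inception_concat(branches):
    # Each branch output has shape (height, width, channels); concatenating
    # along the channel axis lets later layers see all filter sizes at once.
    return np.concatenate(branches, axis=-1)

h, w = 8, 8
branch_1x1 = np.zeros((h, w, 16))   # 1x1 convolutions: local, high-resolution features
branch_3x3 = np.zeros((h, w, 32))   # 3x3 convolutions
branch_5x5 = np.zeros((h, w, 8))    # 5x5 convolutions: wider, more abstract context
branch_pool = np.zeros((h, w, 8))   # the max-pooling branch
out = inception_concat([branch_1x1, branch_3x3, branch_5x5, branch_pool])
print(out.shape)  # (8, 8, 64)
```

In the real GoogLeNet module, 1x1 convolutions also sit in front of the larger branches to shrink channel counts first, which is where much of the parameter saving comes from.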
ResNet, one of the most popular architectures still to date, which we'll discuss in scene segmentation as well, came along and used the idea of a residual block. The initial inspiring observation, which doesn't necessarily hold true as it turns out, was that network depth increases representational power; these residual blocks allow you to have much deeper networks, and I'll explain why in a second, but the thought was that they work so well because the networks are so much deeper. The key thing that makes these blocks so effective, an idea reminiscent of the recurrent neural networks I hope we'll get a chance to talk about, is that training them is much easier: they take a simple block, repeated over and over, and they pass the input along without transformation, alongside the ability to transform it, to learn the filters, the weights. So every layer is allowed not only to take on the processing of previous layers, but to take in the raw, untransformed data and learn something new. The ability to learn something new allows you to have much deeper networks, and the simplicity of this block allows for more effective training. The state of the art in 2017, the winner, is Squeeze-and-Excitation networks (SENet). Unlike the previous year's winner, CUImage, which simply took ensemble methods and combined a lot of successful approaches for a marginal improvement, SENet got a significant improvement, at least in percentage terms: I think it was roughly a 25% reduction in error, from around 4 percent to around 3 percent, using a very simple idea that I think is important to mention. It added a parameter to each channel in the convolutional block, so the network can now adjust the weighting on each channel, for each feature map, based on the content of the input to the network. This is a takeaway to think about for any of the networks we talk about, any of the architectures.
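That per-channel re-weighting can be sketched as follows. This is a minimal NumPy illustration of the squeeze-and-excitation idea, not the SENet code: the two small weight matrices are random placeholders standing in for the learned excitation parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(fmaps, w1, w2):
    # Squeeze: global average pooling reduces each channel to a single number.
    z = fmaps.mean(axis=(0, 1))                # shape (channels,)
    # Excite: a tiny two-layer network produces one weight in (0, 1) per channel.
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)  # shape (channels,)
    # Scale: re-weight every channel based on the content of the input.
    return fmaps * s

rng = np.random.default_rng(0)
fmaps = rng.random((8, 8, 4))   # height x width x channels
w1 = rng.random((4, 2))         # bottleneck down to 2 units...
w2 = rng.random((2, 4))         # ...and back up to one weight per channel
out = squeeze_excite(fmaps, w1, w2)
print(out.shape)  # (8, 8, 4)
```

Because the block only touches the channel weighting, it can be dropped into essentially any convolutional architecture, which is the point made above.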
A lot of the time, recurrent neural networks and convolutional neural networks have tricks that significantly reduce the number of parameters, the low-hanging fruit: they use spatial invariance or temporal invariance to reduce the number of parameters needed to represent the input data. But they also leave certain things unparameterized, things the network is not allowed to learn. Here, letting the network learn the weighting on each of the individual channels, each of the individual filters, alongside the filters themselves, gives a huge boost. The cool thing is that this kind of block, the squeeze-and-excitation block, is applicable to any architecture, because it simply parameterizes the ability to choose which filters to emphasize based on the content. It's a subtle but crucial thing, and I think it's pretty cool. For future research it inspires you to think about what else can be parameterized in your own networks, what else can be controlled as part of the learning process, including higher-order hyperparameters: which aspects of the training and the architecture of the network can themselves be part of the learning? That's what this network inspires. Another network has been in development since the '90s in the ideas of Geoff Hinton, but was only published and received significant attention in 2017: capsule networks. I won't go into detail here (we're going to release an online-only video about capsule networks; it's a little bit too technical), but they inspire a very important point that we should always think about with deep learning, whenever it's successful, as I mentioned with the cat eating a banana: on a philosophical and a mathematical level, you have to consider what assumptions these networks make, and what, through those assumptions, they throw away.
Convolutional neural networks, due to their spatial invariance, throw away information about the relationships, the hierarchies, between simple and complex objects. So the face on the left and the face on the right look the same to a convolutional neural network: the presence of eyes and a nose and a mouth is the central aspect of what makes the classification work, so the network will fire and say this is definitely a face, but the spatial relationship between those parts is lost, ignored. There are a lot of implications to this, but for things like pose variation, that information is thrown away completely, and we're hoping that the pooling operation performed in these networks is able to mesh everything together: to take the features firing for the different parts of the face and come up with the overall classification that it's a face, without really representing the relationships between those features at the low and high levels of the hierarchy, at the simple and the complex level. This is a super exciting area now, and hopefully it will spark developments in how we design neural networks that can learn rotational and orientational invariance as well. Okay. So, as I mentioned, you take these convolutional neural networks and chop off the final layer in order to apply them to a particular domain, and that is what we'll do with fully convolutional neural networks, the ones we task with segmenting the image at the pixel level. As a reminder, these networks, through the convolutional process, are really producing a heat map: different parts of the network get excited by different aspects of the image. So they can be used not just to classify the image but to localize the object, and they can do so at a pixel level. The convolutional layers are doing the encoding process: they're taking the rich raw sensory information in the image and
encoding it into an interpretable set of features, a representation that can then be used for classification. But we can also then decode, upsample, that information and produce a map like this. Fully convolutional neural networks: semantic scene segmentation, image segmentation. The goal is, as opposed to classifying the entire image, to classify every single pixel: pixel-level segmentation, where you color every single pixel with which object that pixel belongs to in the 2D space of the image, the 2D projection of a three-dimensional world. The thing is, there's been a lot of advancement in the last three years, but it's still an incredibly difficult problem: if you think about the amount of data used for training, and the task of assigning a single label to each of the millions of pixels in megapixel images, it's extremely hard. Why is this an interesting and important problem to try to solve, as opposed to putting bounding boxes around cats? Well, whenever precise boundaries of objects matter: certainly in medical applications, for example detecting tumors in medical imaging of different organs; and in driving and robotics, in a dynamic scene of vehicles, pedestrians, and cyclists, we need not just a loose estimate of where objects are but their exact boundaries. Then, potentially through data fusion, we can fuse this rich textural information about pedestrians, cyclists, and vehicles with lidar data that provides a three-dimensional map of the world, and have both the semantic meaning of the different objects and their exact three-dimensional locations. A lot of the work in semantic segmentation started with the "Fully Convolutional Networks for Semantic Segmentation" (FCN) paper from November 2014; that's where the name FCN comes from.
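Before going through the papers, it helps to see what the decoder's "upsampling" means in its most naive form: just repeating values. This toy sketch shows why the naive result is blocky and coarse, which is exactly the problem the tricks below (skip connections, learned upscaling filters) were introduced to fix.

```python
import numpy as np

def upsample_nearest(fmap, factor=2):
    # Repeat each value `factor` times in both spatial dimensions:
    # cheap, but produces the blocky, coarse maps described in the text.
    return np.repeat(np.repeat(fmap, factor, axis=0), factor, axis=1)

coarse = np.array([[0.1, 0.9],
                   [0.8, 0.2]])   # a tiny 2x2 "heat map" from the encoder
print(upsample_nearest(coarse))   # blocky 4x4 version of the same map
```

Going from a 1/8-resolution feature map back to full resolution this way reveals nothing new about object boundaries; the information was discarded during encoding, which is why skip connections and learned upscaling matter.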
I'll now go through a few papers to give you some intuition about where the field has gone and how that takes us to SegFuse, the segmentation competition. FCN repurposed the ImageNet-pretrained networks, the nets trained to classify what's in an entire image: it chopped off the fully connected layers and added decoder parts that upsample the features to produce a heat map, shown here with a tabby cat, of where the cat is in the image. It's a much coarser resolution than the input image, one-eighth at best. Skip connections to improve the coarseness of upsampling: there are a few tricks here, because with the most naive approach the upsampling is going to be extremely coarse. That's the whole point of the encoding part of the network: you throw away all the useless data, keeping only the most essential aspects that represent the image, so you're throwing away a lot of the information needed to form a high-resolution output. So one trick is to route connections from a few of the later pooling layers, in a way reminiscent of a residual block, toward the output, producing a higher and higher resolution heat map at the end. SegNet in 2015 applied this to the driving context, taking it to the KITTI data set, showed a lot of interesting results, and really explored the encoder-decoder formulation of the problem, solidifying the place of the encoder-decoder framework for the segmentation task. Dilated convolutions: I'm taking you through a few components which are critical to the state of the art here. The convolution operation, like the pooling operation, reduces resolution significantly, and the dilated convolution has a certain kind of gridding, as visualized there, that maintains the local high-resolution texture while still capturing the necessary spatial window. It's called a dilated convolutional layer, and in a 2015
paper it proved to be much better at upsampling to a high-resolution output. DeepLab (v1, v2, and now v3) added conditional random fields, the final piece of the state-of-the-art puzzle here. Many of the successful segmentation networks today (not all) post-process using CRFs, conditional random fields, which smooth the upsampled segmentation that comes out of the FCN by looking at the underlying image intensities. So those are the key aspects of the successful approaches today: the encoder-decoder framework of a fully convolutional neural network, which replaces the fully connected layers with convolutional and deconvolutional layers; and as the years progressed from 2014 to today, the underlying networks, from AlexNet to VGGNet and now to ResNet, have been one of the big reasons for the improvement in segmentation performance, naturally mirroring the ImageNet challenge results as those networks were adapted. So the state of the art uses ResNet or similar networks, conditional random fields for smoothing based on the input image intensities, and dilated convolutions, which maintain the computational cost but increase the resolution of the upsampling throughout the intermediate feature maps. And that takes us to the state-of-the-art network we used to produce the images for the competition. It uses, first, DUC, dense upsampling convolution: instead of bilinear upsampling, you make the upsampling learnable, you learn the upscaling filters. That's really the key part that made it work, and there should be a theme here: sometimes the biggest addition is to parameterize an aspect of the network that had been taken for granted and let the network learn it. The other addition (I'm not sure how important it is to the success, but it's a cool little addition) is hybrid dilated convolution.
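A dilated convolution reads the same number of input values as an ordinary 3x3 filter but spreads the sampling grid apart, widening the receptive window without losing resolution through pooling. A minimal sketch:

```python
import numpy as np

def dilated_conv2d(image, kernel, dilation=1):
    # A k x k kernel with dilation d covers a window of size (k-1)*d + 1,
    # sampling the input on a spread-out grid; dilation=1 is ordinary convolution.
    kh, kw = kernel.shape
    span_h, span_w = (kh - 1) * dilation + 1, (kw - 1) * dilation + 1
    out_h = image.shape[0] - span_h + 1
    out_w = image.shape[1] - span_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i+span_h:dilation, j:j+span_w:dilation]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(25).reshape(5, 5).astype(float)
kernel = np.ones((3, 3))
# With dilation=2 the 3x3 kernel spans the whole 5x5 input in one window.
print(dilated_conv2d(image, kernel, dilation=2))
```

The hybrid scheme mentioned above varies the dilation rate from layer to layer so that no input pixels are systematically favored by the fixed sampling grid.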
As I showed in that visualization, the dilated convolution's sampling grid is spread apart a little from the input to the output. When the step size of the dilated convolution filter is varied from layer to layer, it produces a smoother result, because when it's kept the same, certain input pixels get a lot more attention than others; removing that favoritism is what a variable dilation rate achieves. Those are the two tricks, but really the biggest one is the parameterization of the upscaling filters. Okay, so that's what we used to generate the data, and that's the code we provide you if you're interested in competing in SegFuse. The other aspect here is that in everything we've talked about, from classification to segmentation to making sense of images, the information about time, the temporal dynamics of the scene, is thrown away. For the driving context, for the robotics context, and for what we'd like to do with SegFuse, the dynamic scene segmentation competition, when you try to interpret what's going on in a scene over time, that temporal information is essential. The movement of pixels through time, understanding how objects move in 3D space through the 2D projection of an image, is fascinating, and there's a large set of open problems there. So optical flow is very helpful as a starting point for understanding how these pixels move. Dense optical flow is the computation of our best approximation of where each pixel in one image moved in the temporally following image. There are two images: at 30 frames a second, one image at time zero and another 33.3 milliseconds later, and dense optical flow is our best estimate of where each pixel in the first image moved to in the second. The optical flow for every pixel produces a direction of where we think that pixel moved and the magnitude of how far it moved. That allows us
to take information that we detected about the first frame and try to propagate it forward. This is the competition: to try to segment an image and propagate that information forward, instead of relying on manual annotation of every image. This kind of coloring-book annotation, where you color every single pixel, takes about 1.5 hours, 90 minutes, per image in Cityscapes, the state-of-the-art dataset for driving. That's an extremely long time, and it's why there doesn't exist today, and why in this class we're going to create, a dataset of segmentation of these images through time, through video: long videos where every single frame is fully segmented. That's still an open problem that we need to solve, and flow is a piece of it.

We also provide you the flow, computed with the state-of-the-art FlowNet 2.0. FlowNet 1.0, in May 2015, used neural networks to learn dense optical flow, and it did so with two kinds of architectures: FlowNetS, FlowNet Simple, and FlowNetC, FlowNet Correlation. So what's the task here? There are two images that follow each other in time, 33.3 milliseconds apart, and from those two images your task is to produce the dense optical flow as output. The simple architecture just stacks the two images together; each is RGB, so that produces a six-channel input to the network. There's a lot of convolution, and finally it's the same kind of process as in fully convolutional networks to produce the optical flow. Then there is the FlowNet correlation architecture, where you perform some convolution on each image separately before using a correlation layer to combine the feature maps. Both are effective on different datasets and in different applications. FlowNet 2.0, from December 2016, is one of the state-of-the-art frameworks, the codebase we used to generate the data I'll show. It combines the FlowNetS and FlowNetC architectures and improves over
the initial FlowNet, producing a smoother flow field, preserving the fine motion detail along the edges of objects, and running extremely efficiently: depending on the architecture variant, anywhere from 8 to 140 frames a second. The process there is one that's common across various applications of deep learning: stacking these networks together. A very interesting aspect here, one we're still exploring and again applicable to all of deep learning, is that there seemed to be a strong effect from training on multiple small, sparse datasets: the order in which those datasets were used in the training process mattered a lot. That's very interesting.

So, using FlowNet 2.0, here's the dataset we're making available for SegFuse, the competition, at cars.mit.edu/segfuse. First, the original video, us driving around Cambridge, in high-definition 1080p, plus 8K 360 video. For the training set, we're providing the ground truth: for every single frame, 30 frames a second, the segmentation, frame to frame to frame, segmented on Mechanical Turk. We're also providing the output of the state-of-the-art segmentation network I mentioned, which is pretty close to the ground truth, but still not there. And here's the interesting thing: our task is to take the output of this network and use other networks to help you propagate that information better. What this segmentation network does is operate frame by frame by frame; it's not using the temporal information at all. So the question is: can we figure out tricks to use temporal information to improve this segmentation, so it looks more like the ground-truth segmentation? We're also providing the optical flow from frame to frame to frame, the optical flow based on FlowNet
2.0, of how each of the pixels moved. Okay, and that forms the SegFuse competition: 10,000 images. The task is to submit code; we have starter code in Python on GitHub. Take in the original video; take in, for the training set, the ground truth, the segmentation from the state-of-the-art segmentation network, and the optical flow from the state-of-the-art optical flow network; and put that together to improve the segmentation on the bottom left, to try to achieve the ground truth on the top right.

Okay, with that, I'd like to thank you. Tomorrow at 1 p.m. is Waymo in Stata 32-123. The next lecture, next week, will be on deep learning for sensing the human, understanding the human, and we will release an online-only lecture on capsule networks and GANs, generative adversarial networks. Thank you very much.
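The six-channel FlowNetS input described above, two consecutive RGB frames stacked along the channel axis, can be sketched in a few lines. The frames here are random stand-ins, not real video, and this shows only the input construction, not the network itself.

```python
import numpy as np

# FlowNetS ("simple") concatenates the two consecutive RGB frames,
# captured 33.3 ms apart at 30 fps, along the channel axis. The result
# is a 6-channel input; the network then convolves down and, as in
# fully convolutional segmentation networks, upsamples back to a
# 2-channel per-pixel (dx, dy) flow field.
frame_t0 = np.random.rand(1080, 1920, 3).astype(np.float32)
frame_t1 = np.random.rand(1080, 1920, 3).astype(np.float32)  # next frame

stacked = np.concatenate([frame_t0, frame_t1], axis=-1)
print(stacked.shape)  # (1080, 1920, 6)
```

FlowNetC differs at exactly this point: instead of stacking raw frames, it convolves each frame separately and merges the resulting feature maps with a correlation layer.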
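The core idea of the competition, carrying a frame's segmentation forward along the flow, can be sketched as a toy forward warp. This is an illustrative sketch, not the provided starter code: a real pipeline has to handle occlusions, holes, and sub-pixel motion, all of which are ignored here.

```python
import numpy as np

def propagate_labels(labels, flow):
    """Move each labeled pixel along its (dx, dy) flow vector."""
    h, w = labels.shape
    out = np.zeros_like(labels)                      # 0 = unlabeled
    ys, xs = np.mgrid[0:h, 0:w]                      # pixel coordinates
    new_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    new_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    mask = labels > 0                                # splat only labeled pixels
    out[new_y[mask], new_x[mask]] = labels[mask]
    return out

labels = np.zeros((4, 4), dtype=int)
labels[1, 1] = 7                            # one labeled pixel, class 7
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[1, 1] = (2, 1)                         # that pixel moves 2 right, 1 down
print(propagate_labels(labels, flow)[2, 3]) # 7: the label rode the flow
```

The competition then asks how to fuse such a propagated estimate with the per-frame segmentation network's output, which ignores time entirely.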
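To measure how far a predicted segmentation is from the ground truth, a standard quantity is per-class intersection-over-union (IoU). The lecture does not restate the competition's actual scoring rule, so treat this as an illustrative metric only, on a tiny hypothetical label map.

```python
import numpy as np

def iou(pred, gt, cls):
    """Intersection-over-union for one class between two label maps."""
    p, g = pred == cls, gt == cls
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union else 1.0  # empty class counts as perfect

pred = np.array([[1, 1],
                 [2, 2]])
gt   = np.array([[1, 2],
                 [2, 2]])
print(iou(pred, gt, 1))  # 0.5: classes agree on 1 of the 2 pixels in the union
```

Averaging this over all classes gives mean IoU, the figure commonly reported for Cityscapes-style segmentation benchmarks.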