Foundations of Unsupervised Deep Learning (Ruslan Salakhutdinov, CMU)
rK6bchqeaN8 • 2016-09-27
Sound is good. Okay, great. So, I wanted to talk to you about unsupervised learning. That's an area where there's been a lot of research, but compared to the supervised learning you've heard about today, like convolutional networks, unsupervised learning is not there yet. I'm going to show you lots of areas. Parts of the talk are going to be a little more mathematical; I apologize for that, but I'll try to give you the gist of the foundations, the math behind these models, as well as highlight some of the application areas. Okay, what's the motivation? Well, the motivation is that the space of data we have today is just growing, right? If you look at the space of images, speech, social network data, scientific data, I would argue that most of the data we see today is unlabeled. So how can we develop statistical models that can discover interesting structure in an unsupervised or semi-supervised way? That's what I'm interested in, as well as how we can apply these models across multiple different domains. One particular framework for doing that is deep learning, where you're trying to learn hierarchical representations of data, and as I go through the talk I'm going to show you some examples I've tried. So here's one example. You can take a simple bag-of-words representation of a newspaper article, use something called an autoencoder with multiple levels, extract some latent code, and then you get some representation out of it. Right? And this is done in a completely unsupervised way. You don't provide any labels.
And if you look at the kind of structure that the model is discovering, it could be useful for visualization, for example, or to see what kind of structure you have in your data. This was done on the Reuters data set. I've tried to cluster together lots of different unsupervised learning techniques, and I'll touch on some of them. It's not a full set, but the way I typically think about these models is that there's a class of what I would call non-probabilistic models: models like sparse coding, autoencoders, clustering-based methods. These are all very powerful techniques, and I'll cover some of them in this talk. And then there's a space of probabilistic models. Within probabilistic models you have tractable models, things like fully observed belief networks. There's a beautiful class of models called neural autoregressive density estimators, and more recently we've seen some successes of so-called pixel recurrent neural network models; I'll show you some examples of that. There's also a class of so-called intractable models, where you're looking at models like Boltzmann machines and variational autoencoders, a space where there's been a lot of development in the deep learning community; Helmholtz machines, which I'll tell you a little bit about; and a whole bunch of others as well. One particular structure within these models is that when you're building these generative models of data, you typically have to specify the distributions you're looking at. So you have to specify the probability of the data, and you're generally doing some kind of approximate maximum likelihood estimation.
And then more recently, we've seen some very exciting models coming out: generative adversarial networks and moment matching networks. This is a slightly different class of models where you don't really have to specify what the density is; you just need to be able to sample from those models. And I'm going to show you some examples of that. Okay. So my talk is going to be structured like this. I'd like to introduce you to the basic building blocks, models like sparse coding, because I think these are very important classes of models, particularly for folks who are working in industry and looking for simpler models. Autoencoders are a beautiful class of models. In the second part of the talk I'll focus more on generative models. I'll give you an introduction to restricted Boltzmann machines and deep Boltzmann machines; these are statistical models that can model complicated data. I'll spend some time showing you some recent developments in our community, specifically variational autoencoders, which I view as a subclass of Helmholtz machines. And I'll finish off by giving you an intuition about a slightly different class of models, these generative adversarial networks. Okay, so let's jump into the first part, but before I do that, let me give you a little bit of motivation. I know Andre's done a great job, and Richard alluded to this as well.
But the idea is this: if I'm trying to classify a particular image, and I'm looking at a specific pixel representation, it might be difficult for me to classify what I'm seeing. On the other hand, if I can find the right representations for these images, the right features or the right structure from the data, then it might be easier for me to see what's going on with my data. So how do I find these representations? The traditional approach we've seen for a long time is that you have data, you create some features, and then you run your learning algorithm. For the longest time in object recognition or in audio classification, you would typically use some kind of hand-designed features and then start classifying. As Andre was saying, in the space of vision there's been a lot of work on designing features, on what the right structure in the data should be, and the same thing has been happening in audio: how can you find the right representations for your data? The idea behind representation learning, particularly in deep learning, is: can we actually learn these representations automatically? And more importantly, can we learn these representations in an unsupervised way, by just seeing lots and lots of unlabeled data? There's been a lot of work done in that space, but we're not there yet. So I want to lower your expectations a bit as I show you some of the results. Okay, sparse coding. This is one of the models that I think everybody should know.
It actually has its roots in 1996, and it was originally developed to explain early visual processing in the brain; I think of it as an edge detector. The objective here is the following: if I give you a set of data points x1 up to xN, you want to learn a dictionary of bases φ1 up to φK, so that every single data point can be written as a linear combination of the bases. That's fairly simple. There's one constraint: you want your coefficients to be sparse. You want them to be mostly zero, right? So every data point is represented as a sparse linear combination of bases. If you apply sparse coding to natural images, and a lot of this work was originally developed at Stanford in Andrew's group, you take little patches of images and learn these bases, these dictionaries, and this is what they look like. They look really nice in terms of finding edge-like structure. So given a new example, I can say: this new example can be written as a linear combination of a few of these bases. And it turns out that that particular sparse representation is quite useful as a feature representation of your data. In general, how do we fit these models? Well, I give you a whole bunch of image patches, though these don't necessarily have to be image patches; they could be little speech signals or any kind of data you're working with. You want to learn a dictionary of bases, so you have to solve this optimization problem. The first term you can think of as a reconstruction error, which says: I take a linear combination of my bases, and I want it to match my data.
And then there's a second term, which you can think of as a sparsity penalty, which essentially says: penalize my coefficients so that most of them are zero. That way every single data point can be written as a sparse linear combination of the bases. And it turns out there's an easy optimization for doing that. If you fix your dictionary of bases φ1 up to φK and solve for the activations, that becomes a standard lasso problem, and there are a lot of solvers for that particular problem; it's a lasso problem, which is fairly easy to optimize. And then if you fix the activations and optimize for the dictionary bases, it's a well-known quadratic programming problem. Each subproblem is convex, so you can alternate between finding coefficients and finding bases, and so forth; you can optimize this function. And there's been a lot of work in the last ten years on doing these things online, doing it more efficiently, and so forth.
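The alternating optimization just described can be sketched in a few lines of numpy. This is a toy sketch, not a production solver: the lasso step uses plain ISTA, the dictionary step is a regularized least squares with column renormalization, and the data, sizes, and sparsity weight are all arbitrary choices.

```python
import numpy as np

def ista_codes(X, Phi, lam=0.1, n_iter=100):
    """Coefficient step: solve the lasso min_A ||X - Phi A||^2 + lam*||A||_1 via ISTA."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    A = np.zeros((Phi.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ A - X)
        A = A - grad / L
        A = np.sign(A) * np.maximum(np.abs(A) - lam / L, 0.0)  # soft threshold
    return A

def update_dictionary(X, A):
    """Dictionary step: least squares in Phi, then renormalize the columns."""
    Phi = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-8 * np.eye(A.shape[0]))
    return Phi / np.maximum(np.linalg.norm(Phi, axis=0), 1e-8)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 200))           # 200 toy "patches", 16-dim each
Phi = rng.standard_normal((16, 32))
Phi /= np.linalg.norm(Phi, axis=0)           # overcomplete dictionary: 32 bases
for _ in range(10):                          # alternate the two convex subproblems
    A = ista_codes(X, Phi)
    Phi = update_dictionary(X, A)
A = ista_codes(X, Phi)                       # sparse codes for the final dictionary
```

At test time, the same `ista_codes` call is the lasso inference step mentioned next: the dictionary stays fixed and only the coefficients are solved for.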
Right, so at test time, given a new input or a new image patch, and given a set of learned bases, once you have your dictionary you can just solve a lasso problem to find the right coefficients. So in this case, given a test sample or test patch, you find that it's written as a linear combination of a subset of the bases. And it turns out, again, that that particular representation is very useful, particularly if you're interested in classifying what you see in images, and this is done in a completely unsupervised way: there are no class labels, no specific supervisory signal here. Back in 2006, there was work done, again at Stanford, that showed a very interesting result. If I give you an input like this, and these are my learned bases, remember these little edges: you just convolve these bases with the input and you get these different feature maps, much like the feature maps we've seen in convolutional neural networks, and then you take these feature maps and do classification. This was done on one of the older data sets, Caltech 101, a data set that predates ImageNet. And if you look at some of the competing algorithms, a simple logistic regression, versus PCA followed by logistic regression, versus finding these features using sparse coding, you can get substantial improvements. You see sparse coding popping up in a lot of different areas, not just in deep learning: folks in the medical imaging domain, in neuroscience, use these models a lot because they're easy to fit and easy to deal with. So what's the interpretation of sparse coding? Well, let's look at this equation again.
And we can think of sparse coding as finding an overcomplete representation of your data. Now, the encoding function, which says: I give you an input, find me the features, the sparse coefficients over the bases that make up my image, we can think of as an implicit and very nonlinear function of x. It's an implicit function; we don't really specify it. And the decoder, the reconstruction, is just a simple linear function, and it's very explicit: take your coefficients, multiply by the right bases, and get back the image, the data. That flows naturally into the ideas behind autoencoders. The autoencoder is a general framework where, if I give you input data, let's say an input image, you encode it to get some representation, some feature representation, and then you have a decoder: given that representation, you decode it back into the image. So you can think of the encoder as a feed-forward, bottom-up pass, much like in a convolutional neural network, where given the image you do a forward pass; and then there's also a feedback, generative, top-down pass: given features, you reconstruct the input image. The details of what goes inside the encoder and decoder matter a lot, and obviously you need some form of constraints to avoid learning the identity, because without constraints, what you could do is just take your input, copy it to your features, and then reconstruct it back, and that would be a trivial solution. So we need to introduce some additional constraints. If you want to extract binary features, for example, and I'm going to show you later why you'd want to do that:
You can pass your encoder through a sigmoid nonlinearity, much like in a neural network, and then have a linear decoder that reconstructs the input. The way we optimize these little building blocks is: we have an encoder, which takes your input, takes a linear combination, and passes it through some nonlinearity, the sigmoid, or rectified linear units, or tanh; and then there's a decoder, where you reconstruct your original input. So this is nothing more than a neural network with one hidden layer, and typically that hidden layer has a smaller dimensionality than the input, so we can think of it as a bottleneck layer. We can determine the network parameters, the parameters of the encoder and the decoder, by writing down the reconstruction error, and that's what it looks like: given the input, encode, decode, and make sure whatever you're decoding is as close as possible to the original input. And we can use the backpropagation algorithm to train it. There's an interesting relationship between autoencoders and principal component analysis. Many of you have probably heard about PCA. As a practitioner, if you're dealing with large data and you want to see what's going on, PCA is the first thing to use, much like logistic regression.
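The bottleneck autoencoder just described, a sigmoid encoder, a linear decoder, and a squared reconstruction error trained by backpropagation, fits in a short numpy sketch. Everything here is a toy assumption: the data is synthetic with a 3-dimensional latent structure so the 3-unit bottleneck has something real to find, and the layer sizes, learning rate, and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 8-dim observations generated from a 3-dim latent space plus noise.
Z = rng.standard_normal((200, 3))
M = rng.standard_normal((3, 8)) / np.sqrt(3)
X = Z @ M + 0.1 * rng.standard_normal((200, 8))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = 0.5 * rng.standard_normal((8, 3)); b1 = np.zeros(3)   # encoder
W2 = 0.5 * rng.standard_normal((3, 8)); b2 = np.zeros(8)   # linear decoder

def recon_error():
    return np.mean((X - (sigmoid(X @ W1 + b1) @ W2 + b2)) ** 2)

err_before = recon_error()
lr = 0.05
for _ in range(2000):
    H = sigmoid(X @ W1 + b1)                 # encode: bottom-up pass
    Xhat = H @ W2 + b2                       # decode: top-down pass
    err = (Xhat - X) / len(X)                # gradient of 0.5 * sum sq. error / N
    dW2 = H.T @ err; db2 = err.sum(0)
    dH = err @ W2.T * H * (1 - H)            # backprop through the sigmoid
    dW1 = X.T @ dH; db1 = dH.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

err_after = recon_error()
```

Note this is exactly a one-hidden-layer network trained on its own input; swapping the sigmoid for ReLU or tanh only changes the `dH` line.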
And the idea here is that if the parameters of the encoder and decoder are shared, and the hidden layer is a linear layer, so you don't introduce any nonlinearities, then it turns out that the latent space the model discovers is the same space discovered by PCA; it effectively collapses to principal component analysis. That's a nice connection, because it basically says you can think of autoencoders as nonlinear extensions of PCA: you can learn richer features if you use autoencoders. Okay, so here's another model. If you're dealing with binary input, as with MNIST for example, your encoder and decoder could both use sigmoid nonlinearities. So given an input, you extract some binary features; given binary features, you reconstruct the binary input. That actually relates to a model called the restricted Boltzmann machine, something I'm going to tell you about later in the talk. Okay, there are also other classes of models where you can introduce some sparsity, much like in sparse coding, to constrain your latent features, your latent space, to be sparse, and that allows you to learn quite reasonable, nice features. Here's one particular model, called predictive sparse decomposition, where, if you look at the first part of the equation, the decoder part looks pretty much like a sparse coding model; but in addition you have an encoding part that essentially says: train an encoder such that it approximates what my latent code should be. Right?
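Returning to the PCA connection for a moment: a tied-weight linear autoencoder converges to the principal subspace, and the PCA side of that statement is easy to check numerically. This sketch uses toy Gaussian data and an arbitrary choice of k=3 components; the "encode then decode with tied weights" step is literally projection onto the top singular vectors and back.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))
Xc = X - X.mean(0)                           # center, as PCA assumes

# PCA via SVD: the top-k right singular vectors span the principal subspace.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
P = Vt[:k].T                                 # 10 x k basis (decoder = encoder.T)
Xrec = Xc @ P @ P.T                          # encode then decode with tied weights

# Reconstruction error equals the energy in the discarded components,
# the best any rank-k linear bottleneck can do (Eckart-Young).
err = np.sum((Xc - Xrec) ** 2)
```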
So effectively you can think of predictive sparse decomposition as: there's an encoder and a decoder, but you put the sparsity constraint on your latent representation, and you can optimize for that model. And obviously the other thing we've been doing over the last seven, eight, ten years is stacking these things together. You can learn low-level features, then try to learn higher-level features, and so forth, just building up these blocks; and perhaps at the top level, if you're trying to solve a classification problem, you can do that. This is sometimes known as greedy layer-wise learning, and it's useful whenever you have lots and lots of unlabeled data and only a little labeled data, a small sample of labeled data. Typically these models help you find meaningful representations, such that you don't need a lot of labeled data to solve the particular task you're trying to solve. And again, you can remove the decoding part and end up with a standard or convolutional architecture; your encoder and decoder could be convolutional, depending on what problem you're tackling. Typically you can stack these things together and optimize for the particular task you're trying to solve. Okay. Here's an example, some early examples from 2006. This was a way of trying to build these nonlinear autoencoders. You can pre-train these models using restricted Boltzmann machines, or autoencoders generally, and then stitch them together into this deep autoencoder and backpropagate through the reconstruction loss. Right? One thing I want to point out, here's one particular example: in the top row I show you real faces.
In the second row you're seeing faces reconstructed from a 30-dimensional real-valued bottleneck. So you can think of it as just a compression mechanism: given high-dimensional data, you compress it down to a 30-dimensional code, and then from that 30-dimensional code you reconstruct the original data. So the first row is the data, the second row shows the reconstructed data, and the last row shows the PCA solution. One thing I want to point out is that the autoencoder solution gives you a much sharper representation, which means it's capturing a bit more structure in the data. It's also kind of interesting to see that sometimes these models tend to, how should I say it, regularize your data. For example, if you see this person with glasses: the model removes the glasses, and that generally has to do with the fact that there's only one person with glasses. So the model basically decided that's noise, get rid of it. Or it gets rid of mustaches: you see a face, and there's no mustache. Again, that has to do with how much capacity there is, so the model might think that's just noise. And if you're dealing with text-type data, this was done using the Reuters data set: you have about 800,000 stories, you take a bag-of-words representation, something very simple, you compress it down to a two-dimensional space, and then you see what that space looks like. And I always like to joke that the model basically discovers that European Community economic policies are right next to disasters and accidents. I think this data was collected back in 1996; today those two things would probably be even closer.
But again, typically an autoencoder is a way of doing compression, of doing dimensionality reduction, but we'll see later that it doesn't have to be. Okay, there's another class of algorithms called semantic hashing, which says: what if you take your data and compress it down to a binary representation? Wouldn't that be nice? Because if you have a binary representation, you can search the binary space very efficiently. In fact, if you can compress your data down to a 20-dimensional binary code, 2 to the 20 is only about a million buckets, so you can just store everything in memory and do memory lookups without actually doing any search at all. This sort of representation has been used successfully in computer vision, where you take your images and learn these binary representations, say 30-dimensional or 200-dimensional codes, and it turns out to be very efficient to search through large volumes of data using a binary representation. It takes a fraction of a millisecond to retrieve images from a set of millions and millions of images. And this is also an active area of research right now, because people are trying to figure out, given these large databases, how you can search through them efficiently, and learning a semantic hashing function that maps your data to a binary representation turns out to be quite useful. Okay, now let me step back a little and look at generative models, at probabilistic models, and how different they are. I'm going to show you some examples of where they're applicable. Here's one example of a simple model trying to learn a distribution over handwritten characters.
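Before moving on, the lookup mechanics of semantic hashing can be sketched as follows. One caveat: the talk learns the binary code with a deep autoencoder, while this sketch substitutes random hyperplanes as a stand-in hash function, so only the bucket-lookup part of the idea is being shown; all sizes and data are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 50))       # 1000 items, 50-dim features

# Stand-in hash: 20 random hyperplanes give a 20-bit code. (A learned
# semantic hash would replace this projection with an autoencoder's encoder.)
planes = rng.standard_normal((50, 20))

def code(x):
    bits = x @ planes > 0
    return int(''.join('1' if b else '0' for b in bits), 2)

# Index: 2**20 possible buckets (about a million), stored sparsely here.
buckets = {}
for i, d in enumerate(docs):
    buckets.setdefault(code(d), []).append(i)

# Retrieval is a single memory access, with no scan over the collection.
query = docs[123]
hits = buckets[code(query)]
```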
So in this handwritten characters example we have Sanskrit, we have Arabic, we have Cyrillic, and now we can build a model and ask: can you actually generate what Sanskrit should look like? The flickering you see at the top, those are neurons; you can think of them as neurons firing. And what you're seeing at the bottom is what the model generates, what it believes Sanskrit should look like. So in some sense, when you think about generative models, you think about models that can generate, that can sample from the distribution, that can sample the data. This is a fairly simple model: you have about 25,000 characters coming from 50 different alphabets around the world, and about two million parameters. It's one of the older models, but this is what the model believes Sanskrit should look like, and I've asked a couple of people whether that really looks like Sanskrit. Okay, great; but that can mean two things. It can mean that the model is actually generalizing, or that the model is overfitting, meaning it's just memorizing what the training data looks like and I'm just showing you examples from the training data. We'll come back to that point as we go through the talk. You can also do conditional simulation: given half of the image, can you complete the remaining half? And more recently, in the last couple of years, there have been a lot of advances in conditional generation, and it's pretty amazing what you can do in terms of in-painting: given half of the image, what should the other half look like? This is a simple example, but it does show the model trying to be consistent with what the different strokes look like. So why is this so difficult? In the space of so-called undirected graphical models, of Boltzmann machines, the difficulty really comes from the following fact.
If I show you this image, a 28 by 28 binary image, some pixels are on and some pixels are off, there are 2 to the power 28 times 28 possible images; in fact, there are 2 to the 784 possible configurations. That space is exponential. So how can you build models that figure out that, in the space of characters, there's only a tiny little subspace within that space? If you start generating, say, 200 by 200 images, that space is huge, and the space of real images within it is really, really tiny. So how do you find that space? How do you generalize to new images? That's a very difficult question to answer in general. One class of models is so-called fully observed models. There's been a stream of work on learning generative models that are tractable, and they have very nice properties: you can compute the probabilities, you can do maximum likelihood estimation. Here's one example: if I try to model the image, I can write it down as modeling the first pixel, then modeling the second pixel given the first pixel, and so on, writing the distribution as a product of conditional probabilities, and each conditional probability can take a very complicated form; it could be a complicated neural network. There have been a number of successful models. One of the early models, called the neural autoregressive density estimator, was actually developed by Hugo; there's a real-valued extension of these models as well. And more recently we've started seeing new flavors of these models. A couple of papers popped up this year from DeepMind where they make these conditionals sophisticated RNNs, LSTMs, or convolutional models, and they can generate remarkable images. This is just a PixelCNN generating, I guess, elephants. And it actually looks pretty interesting.
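The chain-rule factorization he's describing, p(x) = p(x1) p(x2 | x1) ... p(xD | x<D), can be sketched with toy logistic conditionals. The weights here are random and untrained, just to show the structure; the lower-triangular masking that makes pixel i depend only on earlier pixels is the same trick NADE- and PixelCNN-style models build into their architectures.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                        # a tiny "image" of 8 binary pixels
# Strictly lower-triangular weights: pixel i may only look at pixels 0..i-1.
W = np.tril(rng.standard_normal((D, D)), k=-1)
b = rng.standard_normal(D)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(x):
    """log p(x) = sum_i log p(x_i | x_<i); each conditional is a logistic unit."""
    p = sigmoid(W @ x + b)                   # row i uses only x_<i (triangular W)
    return float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

def sample():
    """Ancestral sampling: draw pixels one at a time, left to right."""
    x = np.zeros(D)
    for i in range(D):
        p_i = sigmoid(W[i] @ x + b[i])
        x[i] = rng.random() < p_i
    return x

x = sample()
```

Because every conditional is a proper distribution, the model is normalized by construction; no intractable partition function appears, which is exactly the "tractable" property claimed above.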
The drawback of these models is that we have yet to see how good the representations they learn are, so that we can use those representations for other tasks, like classifying images or finding similar images. Now let me jump into a class of models called restricted Boltzmann machines. This is a class of models where we're actually trying to learn some latent structure, some latent representation. These models belong to the class of so-called graphical models, and a graphical model is a very powerful framework for representing the dependency structure between random variables. Here's an example; think of this particular model as follows. You have some stochastic binary so-called visible variables; you can think of the pixels in your image. And you have stochastic binary hidden variables; you can think of them as feature detectors, detecting certain patterns you see in the data, much like sparse coding models. The model has a bipartite structure. You can write down the joint distribution over all of these variables; you have pairwise terms and unary terms, but it's not really important exactly what they look like. The important thing is that if I look at the conditional probability of the data given the features, I can write down explicitly what it looks like. What does that mean? It basically means that if you tell me what features you see in the image, I can generate the data for you, the corresponding input. In terms of learning features, what do these models learn? They learn something similar to what we've seen in sparse coding, and these classes of models are very similar to each other. So given a new image, I can say that this new image is made up of some combination of these learned weights, these learned bases.
And the numbers here are given by the probabilities that each particular edge is present in the data. One other point I should make here is that, given an input, I can quickly infer what features I'm seeing in the image. That operation is very easy to do, unlike in sparse coding models; it's a little closer to an autoencoder. Given the data, I can tell you what features are present in my input, which is very important for things like information retrieval or classifying images, because you need to do it fast. How do we learn these models? Let me give you an intuition, and maybe a little bit of the math behind it. If I give you a set of training examples and I want to learn the model parameters, I can maximize the log-likelihood objective, and you've probably seen that in these tutorials. The maximum likelihood objective is essentially nothing more than saying: I want to make sure the probability of observing these images is as high as possible. So you're finding the parameters such that the probability of observing what you're seeing is high, and that's why you maximize the likelihood, or the log of the likelihood, which just turns the product into a sum. You take the derivative, there's a little bit of algebra, and I promise it's not very difficult; it's second-year college algebra. You differentiate, and you get this learning rule, which is the difference between two terms. The first term you can think of as the so-called sufficient statistics driven by the data, and the second term is the sufficient statistics driven by the model. Maybe I can parse that out. What does that mean?
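The learning rule being described is the standard maximum-likelihood gradient for a Boltzmann machine: for each weight, a difference between a data-driven expectation and a model-driven expectation of the same pairwise statistic.

```latex
\frac{\partial \log p(\mathbf{v};\theta)}{\partial W_{ij}}
  = \mathbb{E}_{P_{\mathrm{data}}}\!\left[ v_i h_j \right]
  - \mathbb{E}_{P_{\mathrm{model}}}\!\left[ v_i h_j \right]
```

The first expectation clamps the visible units to training data; the second is taken under the model's own distribution, which is the intractable term discussed next.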
Intuitively, what that means is that you look at the correlations you see in the data, and then you look at the correlations the model is telling you there should be, and you try to match the two. That's what the learning is trying to do: it's trying to match the correlations that you see in the data, so the model is actually respecting the statistics you see in the data. But it turns out that the second term is very difficult to compute, and it's precisely because the space of all possible images is so high-dimensional that you need to use some kind of approximate learning algorithm. So you have the difference between these two terms. The first term turns out to be easy to compute, because of the particular structure of the model; we can do it explicitly. The second term is the difficult one, because it requires summing over all possible configurations, all possible images you could possibly see. So this term is intractable, and what a lot of different algorithms do, and we'll see this over and over again, is use so-called Monte Carlo sampling, or Markov chain Monte Carlo estimation. So let me give you an intuition for what this term is doing; this is a general trick for approximating exponential sums. There's a whole subfield of statistics basically dedicated to approximating exponential sums; in fact, if you could solve that problem, you could solve a lot of problems in machine learning. And the idea is very simple, actually: you replace the average by sampling. And there's something called Gibbs sampling, a Markov chain Monte Carlo method, which does something very simple.
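The "replace the average by sampling" trick can be sketched in a toy case where the exponential sum has a closed form, so the estimate can be checked. The assumption here is a factorized Bernoulli distribution over 30 bits, chosen precisely because it admits an exact answer; a real Boltzmann machine's model term has no such shortcut, which is why sampling is needed there.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 30                                       # 2**30 configurations: no exhaustive sum
q = rng.uniform(0.1, 0.9, size=D)            # independent Bernoulli toy distribution

# Exact expectation of f(x) = sum_i x_i is sum_i q_i (possible only because
# this toy distribution factorizes over the bits).
exact = q.sum()

# Monte Carlo: replace the 2**30-term sum with an average over samples.
samples = rng.random((100_000, D)) < q
estimate = samples.sum(axis=1).mean()
```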
It basically says: start with the data, sample the states of the latent variables, then sample the data, then sample the latent variables again, each from conditional distributions you can compute explicitly. That's a general trick, much like in sparse coding where we alternate between optimizing for the basis and optimizing for the coefficients; here you infer the coefficients, then infer what the data should look like, and so forth. Then you can just run a Markov chain and approximate this exponential sum. So you start with the data, you sample the states of the hidden variables, you resample the data, and so forth. The only problem with a lot of these methods is that you need to run them to infinity to guarantee you're getting the right thing, and obviously you don't have time to do that. So there's a very clever algorithm, the contrastive divergence algorithm, developed by Hinton back in 2002. It basically said: instead of running this thing to infinity, run it for one step. You start with a training vector. You update the hidden units. You update all the visible units again; that's your reconstruction. Much like in an autoencoder, you reconstruct your data.
You update the hidden units again, and then you just update the model parameters by looking empirically at the statistics between the data and the model, very similar to what the autoencoder is doing, with slight differences. The implementation basically takes about 10 lines of MATLAB code; I suspect it would be two lines in TensorFlow, although I don't think the TensorFlow folks have implemented Boltzmann machines yet. That would be my request. You can extend these models to deal with real-valued data, for example whenever you're working with images, and that's just a small change to the definition of the model: your conditional probabilities become a bunch of Gaussians, which basically means that given the features, the model can sample real-valued images. The structure of the model remains the same. If you train this model on these images, you tend to find edges, similar again to what you'd see in sparse coding, in ICA (independent component analysis), in autoencoders, and such. And again you can say every single image is made up of some linear combination of these basis functions. You can also extend these models to deal with count data, for example documents. In this case, again, a slight change to the model: K here denotes your vocabulary size, and D_k denotes the count of word k in your document.
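The one-step contrastive divergence update just described (the "10 lines of MATLAB") can be sketched in NumPy roughly as follows. This is a minimal illustration for a small binary RBM with made-up sizes and random data, not the exact model or code from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v0, lr=0.1):
    """One step of CD-1 for a binary RBM.

    W: (n_visible, n_hidden) weights; a, b: visible/hidden biases.
    v0: a batch of binary training vectors, shape (batch, n_visible).
    """
    # Positive phase: infer the hidden units from the data.
    h0_prob = sigmoid(v0 @ W + b)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: reconstruct the visibles, then re-infer the hiddens.
    v1_prob = sigmoid(h0 @ W.T + a)
    h1_prob = sigmoid(v1_prob @ W + b)
    # Update: data-driven statistics minus one-step model statistics.
    batch = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    a += lr * (v0 - v1_prob).mean(axis=0)
    b += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, a, b

# Tiny usage example on random binary data.
n_vis, n_hid = 6, 3
W = 0.01 * rng.standard_normal((n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)
data = (rng.random((20, n_vis)) < 0.5).astype(float)
for _ in range(10):
    W, a, b = cd1_update(W, a, b, data)
```

The key point matches the talk: the update is "correlations in the data minus correlations in the reconstruction", where the reconstruction plays the role of the (truncated) model statistics.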
So it's a bag-of-words representation, and the conditional here is given by a so-called softmax distribution, much like what you've seen in the previous classes: a distribution over possible words. The parameters here, the W's, you can think of as something similar to what a word embedding would do. If you apply this to some of these datasets, you tend to find reasonable features: features about Russia, about the US, about computers, and so forth. So much like every image was made up of some combination of little edges, in the case of documents or web pages it's the same thing: every single document is made up of some linear combination of these learned topics. You can also look at one-step reconstruction, to find similarity between words. If I show the model "chocolate cake", I infer the states of the hidden units and then reconstruct back the distribution over possible words. It tells me: chocolate cake, cake, chocolate, sweet, dessert, cupcake, food, sugar, and so forth. I particularly like the one with "flower high" and a Japanese sign: the model generates flower, Japan, sakura, blossom, Tokyo. So it picks up again on low-level correlations you see in your data. You can also apply these kinds of models to collaborative filtering, where every single observed variable can represent a user rating for a particular movie. Every single user rates a certain subset of movies, so you can represent that as the states of the visible units, and your hidden states can represent user preferences.
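The one-step reconstruction of words can be sketched as below. The vocabulary, sizes, and weights here are all made up for illustration (in the talk's model the weights would come from training a replicated-softmax-style RBM, not from a random draw):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical tiny vocabulary and weights (random here, trained in practice).
vocab = ["chocolate", "cake", "sweet", "russia", "computer"]
K, H = len(vocab), 4           # vocabulary size, number of hidden topic units
W = 0.5 * rng.standard_normal((K, H))

def one_step_reconstruction(counts):
    """Infer hidden topic units from word counts, then reconstruct
    a softmax distribution over the whole vocabulary."""
    h = sigmoid(counts @ W)     # bottom-up: infer topic activations
    return softmax(W @ h)       # top-down: distribution over words

query = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # counts for "chocolate cake"
probs = one_step_reconstruction(query)
```

With trained weights, the reconstructed distribution concentrates on words correlated with the query, which is exactly the "chocolate cake → dessert, cupcake, sugar" behavior described above.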
And on the Netflix dataset, if you look at the latent space the model is learning, some of these hidden variables capture specific movie genres. For example, there's actually one hidden unit dedicated to Michael Moore's movies; it's a very strong signal, people either like them or hate them, so there are a few hidden units specifically dedicated to that. But it also finds things like action movies and so forth; it finds that particular structure in the data. So you can model different kinds of modalities, real-valued data, count data, multinomials, and it's very easy to infer the states of the hidden variables: that's given by just a product of logistic functions, which is very important in a lot of applications. Given the input, I can quickly tell you what topics I see in the data. One thing I want to point out, and it's an important point, is that a lot of these models can be viewed as product models; sometimes people call them products of experts. Here's the intuition. If I write down the joint distribution over my hidden and observed variables, I can write it in a log-linear form, but if I sum out or integrate out the states of the hidden variables, I get a product of a whole bunch of functions. What does that mean? Let me show you an example. Suppose the model finds these specific topics, and suppose I tell you the document contains the topics government, corruption, and mafia. Then the words "Silvio Berlusconi" will have very high probability. Does everybody know who Silvio Berlusconi is? He was the head of the government, he's connected to the mafia, and he was very corrupt.
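Summing out the hidden variables in the log-linear form gives exactly this product structure. For a binary RBM with visible bias $a$, hidden biases $b_j$, and weight columns $W_{:,j}$ (standard notation, not taken from the slides), the marginal works out to:

```latex
p(v) \;=\; \frac{1}{Z}\, e^{a^{\top} v} \prod_{j} \left( 1 + e^{\,b_j + W_{:,j}^{\top} v} \right)
```

Each factor in the product is one "expert", contributed by one hidden unit; a word is probable only if it survives every expert, which is the intersection-of-distributions behavior described above.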
And I guess I should add "bunga bunga parties" here; then it would become completely clear what I'm talking about. But the point I want to make is that you can think of these models as a product: each hidden variable defines a distribution over possible words, over possible topics, and once you take the intersection of these distributions, you can be very precise about what it is you're modeling. That's unlike general topic models or latent Dirichlet allocation models, where you're using a mixture-like approach, and typically these product models perform far better than traditional mixture-based models. This comes to the point of local versus distributed representations. In a lot of algorithms, even unsupervised learning algorithms like clustering, you're typically partitioning the space and finding local prototypes: you have parameters for each region, and the number of regions typically grows linearly with the number of parameters. But in models like factor models, PCA, restricted Boltzmann machines, and deep models, you typically have distributed representations. What's the idea here? If I show you the inputs, each particular neuron can differentiate between two parts of the plane. Given a second one, I can partition the plane again. Given a third hidden variable, you can partition it again. So every single neuron affects lots of different regions. That's the idea behind distributed representations: every single parameter affects many, many regions, not just one local region. And so the number of regions grows roughly exponentially with the number of parameters.
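The region-counting argument above can be checked numerically. This sketch (my own illustration, with made-up sizes) counts how many distinct on/off activation patterns n random linear "neurons" induce over the 2-D plane; with n clusters you would get only n regions, while n hyperplanes give many more:

```python
import numpy as np

rng = np.random.default_rng(2)

def count_regions(n_units, n_points=100_000):
    """Approximate the number of plane regions distinguished by n linear units.

    Each unit splits the plane with a line; a 'region' is a distinct
    binary activation pattern. Dense random sampling approximates the count.
    """
    pts = rng.uniform(-1, 1, size=(n_points, 2))
    w = rng.standard_normal((2, n_units))
    bias = rng.uniform(-0.5, 0.5, n_units)
    codes = (pts @ w + bias > 0)          # (n_points, n_units) binary codes
    return len({tuple(c) for c in codes})

# Number of regions for increasing numbers of units: grows like O(n^2) in
# 2-D (and O(n^d) in d dimensions), versus n regions for n local prototypes.
regions = [count_regions(n) for n in (1, 2, 4, 8)]
```

The superlinear growth in `regions` is the "exponentially many regions from linearly many parameters" advantage of distributed representations.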
So that's the difference between these two classes of models; it's important to know about them. Now let me jump ahead and quickly tell you a little bit about the inspiration behind what we can build with these models. As we've seen with convolutional networks, the first layer typically learns some low-level features, like edges, or if you're working with words, some low-level structure, and the hope is that the higher-level features will start picking up higher-level structure as you keep building. These kinds of models can be built in a completely unsupervised way, because what you're trying to do is model the data, model the distribution of the data. You can write down the probability distribution for this model; it's known as a Deep Boltzmann Machine. You have dependencies between hidden variables, so now you're introducing some extra layers and dependencies between those layers. If you look at the equation, the first part is basically the same as what we had with the restricted Boltzmann machine, and the second and third parts essentially model the dependencies between the first and second hidden layers and the second and third hidden layers. There's also a very natural notion of bottom-up and top-down: if I want to know the probability of a particular unit taking value one, it really depends on what's coming from below and what's coming from above. So there has to be some consensus in the model: what I'm seeing in the image and what my model believes the overall structure should be have to be in agreement. And in this case, of course, the hidden variables become dependent even when you condition on the data.
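The bottom-up/top-down consensus for a first-layer hidden unit can be sketched as follows. This is a hypothetical two-hidden-layer Deep Boltzmann Machine with made-up sizes and random weights; the function names are mine, not from the talk:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden1_activation(v, h2, W1, W2, b1):
    """P(h1_j = 1 | v, h2) in a two-hidden-layer DBM.

    Unlike in an RBM, the conditional combines bottom-up input (v @ W1)
    with top-down input (h2 @ W2.T): the two signals have to agree
    for a unit to turn on with high probability.
    """
    return sigmoid(v @ W1 + h2 @ W2.T + b1)

rng = np.random.default_rng(3)
n_v, n_h1, n_h2 = 5, 4, 3
W1 = rng.standard_normal((n_v, n_h1))   # visible -> first hidden layer
W2 = rng.standard_normal((n_h1, n_h2))  # first -> second hidden layer
b1 = np.zeros(n_h1)
v = (rng.random(n_v) < 0.5).astype(float)
h2 = (rng.random(n_h2) < 0.5).astype(float)
p = hidden1_activation(v, h2, W1, W2, b1)
```

Because each layer's conditional depends on its neighbors above and below, the hidden units are no longer conditionally independent given the data, which is exactly why inference gets harder, as the talk notes next.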
With these kinds of models, we'll see this a lot: you introduce more flexibility and more structure, but then learning becomes much more difficult, because you have to figure out how to do inference in these models. Now let me give you an intuition of how we can learn these models, of what the maximum likelihood estimator is doing here. If I differentiate this model with respect to the parameters, I run into the same learning rule, and it's the same learning rule you see whenever you're working with undirected graphical models, factor graphs, or conditional random fields; you might have heard about those. It really is just trying to match the statistics driven by the data, the correlations you see in the data, with the correlations the model is telling you it's seeing. That's exactly what's happening in that particular equation. But the first term is no longer factorial, so you have to do some approximation with these models. Let me give you an intuition of what each term is doing. Suppose I have some data and I get to observe these characters. What I want to do is tell the model: these are real characters, so put some probability mass around them. And then there is some data point that looks like just a bunch of pixels randomly on and off, and I really want to tell my model to put almost zero probability on it: this is not real. And the first term is exactly trying to do that.
The first term is trying to say: put the probability mass where you see the data. The second term is effectively saying: look at this entire exponential space and say no, everything else is not real; the only real thing is what I see in my data. You can use advanced techniques for doing that: there's a class of algorithms called variational inference, and something called stochastic approximation, which is Monte Carlo based inference. I'm not going to go into these techniques, but in general you can train these models. So one question is: how good are they? Because there are a lot of approximations that go into these models. What I'm going to do, if you haven't seen this before, is show you two panels. On one panel you'll see the real data; on the other you'll see data simulated by the model, the fake data. And you have to tell me which one is which. These are handwritten characters coming from alphabets around the world. How many of you think this one is simulated and the other one is real? Honestly. Okay, what about the other way around? I get half and half, which is great. If you look at these images a little more carefully, you will see the difference: this one is simulated and this one is real. If you look at the real data, it's much crisper and there's more diversity. When you're simulating the data, there's a lot of structure in the simulated characters, but sometimes they look a little fuzzy and there isn't as much diversity. I learned that trick from my neuroscience friends: if I show it to you quickly enough, you won't see the difference.
And if you're using these models for classification, you can do a proper analysis: given a new character, you infer the states of the latent variables, the hidden variables, and classify based on those, and these models do much better than some of the existing techniques. Here's another example: trying to generate 3D objects. These are toy datasets; later on I'll show you some of the bigger advances that have happened in the last few years. This was done a few years ago. If you look at the space of generated samples, you can obviously see the difference. Look at this particular image: it looks like a car with wings, don't you think? So sometimes the model simulates things that are not necessarily realistic. And for some reason it just doesn't generate donkeys and elephants too often, but it generates people with guns more often, like here and here and here. That again has to do with the fact that you're exploring this exponential space of possible images, and it's sometimes very hard to assign the right probabilities to different parts of the space. Then obviously you can do things like pattern completion: given half of the image, can you complete the remaining half? The second column shows what the completions look like, and the last one is the ground truth. So you can do these things. Where else can we use these models? These are sort of toyish examples. Let me show you one example where these models can potentially succeed, which is trying to model multimodal data: the space of images and text, or generally, data that isn't a single source but a collection of different modalities.
So how can we take all of these modalities into account? This is really just the idea that, given images and text, can you actually find a concept that relates these two different sources of data? There are a few challenges, and that's why models like generative models, sometimes probabilistic models, could be useful. In general, one of the biggest challenges we've seen is that images and text are very different modalities. If you think about images in a pixel representation, they're very dense; text is typically very sparse. So it's very difficult to learn these cross-modal features from low-level representations. Perhaps a bigger challenge is that a lot of the time the data is very noisy, and sometimes it's just non-existent: given an image, there is no text, or if you look at the first image, a lot of the tags are about what kind of camera was