Foundations of Unsupervised Deep Learning (Ruslan Salakhutdinov, CMU)
rK6bchqeaN8 • 2016-09-27
Sound is good. Okay, great. So I wanted to talk to you about unsupervised learning. That's an area where there's been a lot of research, but compared to the supervised learning you've heard about today, like convolutional networks, unsupervised learning is not there yet. I'm going to show you lots of different areas. Parts of the talk are going to be a little bit more mathematical; I apologize for that, but I'll try to give you the gist of the foundations, the math behind these models, as well as highlight some of the application areas.
What's the motivation? The space of data that we have today is just growing: images, speech, social network data, scientific data. I would argue that most of the data we see today is unlabeled. So how can we develop statistical models that discover interesting structure in an unsupervised or semi-supervised way? That's what I'm interested in, as well as how we can apply these models across multiple different domains. One particular framework for doing that is deep learning, where you're trying to learn hierarchical representations of data, and as I go through the talk I'm going to show you some
examples. So here's one example. You can take a simple bag-of-words representation of an article in a newspaper, use something called an autoencoder with multiple levels, extract a latent code, and get a representation out of it. And this is done in a completely unsupervised way; you don't provide any labels. If you look at the kind of structure the model is discovering, it could be useful for visualization, for example, or for seeing what kind of structure you have in your data. This was done on the Reuters data
set. I've tried to cluster together lots of different unsupervised learning techniques, and I'll touch on some of them. It's not a full set, but the way I typically think about these models is that there's a class of what I would call non-probabilistic models: sparse coding, autoencoders, clustering-based methods. These are all very powerful techniques, and I'll cover some of them in this talk. And then there is a space of probabilistic models. Within probabilistic models you have tractable models, things like fully observed belief networks. There's a beautiful class of models called neural autoregressive density estimators, and more recently we've seen some successes of so-called pixel recurrent neural network models; I'll show you some examples of that. Then there is a class of so-called intractable models, where you are looking at models like Boltzmann machines, variational autoencoders (an area where there's been a lot of development in the deep learning community), and Helmholtz machines; I'll tell you a little bit about what these models are, and there's a whole bunch of others as well.
One particular structure within these models is that when you're building these generative models of data, you typically have to specify the distributions you're looking at. You have to specify the probability of the data, and generally you do some kind of approximate maximum likelihood estimation. Then, more recently, we've seen some very exciting models coming out: generative adversarial networks and moment matching networks. This is a slightly different class of models where you don't really have to specify what the density is; you just need to be able to sample from the model. I'm going to show you some examples of that. Okay. So
my talk is going to be structured as follows. I'd like to introduce you to the basic building blocks: models like sparse coding, because I think these are very important classes of models, particularly for folks who are working in industry and looking for simpler models, and autoencoders, which are a beautiful class of models. In the second part of the talk I'll focus more on generative models. I'll give you an introduction to restricted Boltzmann machines and deep Boltzmann machines; these are statistical models that can model complicated data. I'll spend some time showing you some recent developments in our community, specifically variational autoencoders, which I view as a subclass of Helmholtz machines. And I'll finish off by giving you an intuition about a slightly different class of models, these generative adversarial networks.
Okay, so let's jump into the first part. But before I do that, let me give you a little bit of motivation. I know Andre's done a great job, and Richard alluded to this as well. The idea is that if I'm trying to classify a particular image and I'm looking at a specific pixel representation, it might be difficult for me to classify what I'm seeing. On the other hand, if I can find the right representations for these images, the right features, the right structure in the data, then it might be easier for me to see what's going on with my data. So how do I find these representations? The traditional approach, which we've seen for a long time, is: you have data, you create some features, and then you run your learning algorithm. For the longest time in object recognition or in audio classification you would typically use some kind of hand-designed features and then start classifying. As Andre was saying, in the space of vision there have been a lot of different feature designs for what the right structure in the data should be, and the same thing has been happening in audio: how can you find the right representations for your data? The idea behind representation learning, in particular in deep learning, is: can we actually learn these representations automatically? And more importantly, can we learn these representations in an unsupervised way? By just seeing lots and lots of unlabeled data, can we achieve that? There's been a lot of work done in that space, but we're not there yet. So I wanted to lower your expectations as I show you some of the
results. Okay, sparse coding. This is one of the models that I think everybody should know. It has its roots in 1996, and it was originally developed to explain early visual processing in the brain; I think of it as an edge detector. The objective is the following: if I give you a set of data points x1 up to xn, you want to learn a dictionary of bases phi_1 up to phi_k, so that every single data point can be written as a linear combination of the bases. That's fairly simple, but there's one constraint: you want your coefficients to be sparse, mostly zero. So every data point is represented as a sparse linear combination of bases. If you apply sparse coding to natural images (a lot of this work was originally developed at Stanford, in Andrew's group), taking little patches of images and learning these bases, these dictionaries, this is what they look like, and they look really nice in terms of finding edge-like structure. So given a new example, I can say that this new example can be written as a linear combination of a few of these bases. And it turns out that that sparse representation is quite useful as a feature representation of your data.
So how do we fit these models? Suppose I give you a whole bunch of image patches (they don't have to be image patches; they could be little speech signals or any kind of data you're working with) and you want to learn a dictionary of bases. You have to solve the following optimization problem. The first term you can think of as a reconstruction error, which says: I take a linear combination of my bases, and I want it to match my data. The second term you can think of as a sparsity penalty, which essentially says: penalize my coefficients so that most of them are zero. That way, every single data point can be written as a sparse linear combination of the bases. And it turns out there is an easy optimization for doing this. If you fix your dictionary of bases, phi_1 up to phi_k, and solve for the activations, that becomes a standard lasso problem, and there are a lot of solvers for that particular problem; it's fairly easy to optimize. Then, if you fix the activations and optimize for the dictionary of bases, that's a well-known quadratic programming problem. Each subproblem is convex, so you can alternate between finding coefficients and finding bases, and optimize this function that way. There's also been a lot of work in the last ten years on doing these things online and doing it more efficiently.
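To make the alternating optimization concrete, here is a minimal numpy sketch of sparse coding; it is an illustration, not the solvers used in the original work. The lasso step for the coefficients is approximated with a few iterations of ISTA (iterative soft-thresholding), and the dictionary step is a least-squares update followed by renormalizing the bases to unit norm. The data, the dictionary size K, and the penalty lam are all made-up toy values.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of the L1 penalty
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # 200 "patches", 64 pixels each (e.g. 8x8)
K, lam = 32, 0.1                        # dictionary size and sparsity penalty
D = rng.normal(size=(64, K))
D /= np.linalg.norm(D, axis=0)          # unit-norm bases phi_1 ... phi_K

for it in range(20):
    # --- fix D, solve the lasso for the coefficients A with ISTA ---
    A = np.zeros((K, X.shape[0]))
    step = 1.0 / np.linalg.norm(D.T @ D, 2)   # 1/L, L = largest eigenvalue
    for _ in range(50):
        grad = D.T @ (D @ A - X.T)            # gradient of reconstruction error
        A = soft_threshold(A - step * grad, step * lam)
    # --- fix A, update D by least squares, then renormalize ---
    D = np.linalg.lstsq(A.T, X, rcond=None)[0].T
    D /= np.linalg.norm(D, axis=0) + 1e-12

recon_err = np.mean((X.T - D @ A) ** 2)
sparsity = np.mean(A == 0)              # fraction of exactly-zero coefficients
```

The alternation works because each subproblem is convex on its own, exactly as described above.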
At test time, given a new input or a new image patch, and a set of learned bases, you can just solve a lasso problem to find the right coefficients. So in this case, given a test sample or test patch, you find that it is written as a linear combination of a subset of the bases. And it turns out, again, that this particular representation is very useful, particularly if you're interested in classifying what you see in images. This is done in a completely unsupervised way: there are no class labels, there is no specific supervisory signal here.
Back in 2006 there was work done, again at Stanford, that showed a very interesting result. If I give you an input like this, and these are my learned bases (remember these little edges), you just convolve these bases with the input and get different feature maps, much like the feature maps we've seen in convolutional neural networks. Then you take these feature maps and do classification. This was done on one of the older data sets, Caltech 101, a data set that predates ImageNet. If you look at some of the competing algorithms (simple logistic regression, or PCA followed by logistic regression) versus finding these features using sparse coding, you can get substantial improvements. And you see sparse coding popping up in a lot of different areas, not just in deep learning: folks looking at the medical imaging domain, or in neuroscience. These are very popular models because they're easy to fit and easy to deal
with. So what's the interpretation of sparse coding? Well, let's look at this equation again. We can think of sparse coding as finding an overcomplete representation of your data. Now, the encoding function says: I give you an input; find me the features, the sparse coefficients over the bases that make up my image. We can think of encoding as an implicit and very nonlinear function of x; it's implicit because we don't really specify it. The decoder, the reconstruction, is just a simple linear function, and it's very explicit: take your coefficients, multiply them by the corresponding bases, and get back the image, the data. And that flows naturally into the idea of autoencoders. The autoencoder is a general framework where, given some input data, say an input image, you encode it to get some feature representation, and then a decoder takes that representation and decodes it back into the image.
So you can think of the encoder as a feed-forward, bottom-up pass, much like in a convolutional neural network: given the image, you do a forward pass. And then there is also a feedback, generative, top-down pass: given features, you reconstruct the input image. The details of what goes inside the encoder and decoder matter a lot, and obviously you need some form of constraints to avoid learning the identity. Without such constraints, the model could just take the input, copy it to the features, and reconstruct it back, which would be a trivial solution. So we need to introduce some additional
constraints. If you want to extract binary features, for example (I'm going to show you later why you'd want to do that), you can pass your encoder output through a sigmoid nonlinearity, much like in a neural network, and then have a linear decoder that reconstructs the input. The way we optimize these little building blocks is: there's an encoder, which takes your input, takes a linear combination, and passes it through some nonlinearity (the sigmoid, or rectified linear units, or tanh), and then there's a decoder where you reconstruct your original input. So this is nothing more than a neural network with one hidden layer, and typically that hidden layer has a smaller dimensionality than the input, so we can think of it as a bottleneck layer. We can determine the network parameters, the parameters of the encoder and the decoder, by writing down the reconstruction error: given the input, encode, decode, and make sure whatever you're decoding is as close as possible to the original input. And we can use the backpropagation algorithm to train it. There is an
interesting relationship between autoencoders and principal component analysis. Many of you have probably heard of PCA. As a practitioner, if you're dealing with large data and want to see what's going on, PCA is the first thing to use, much like logistic regression. The idea here is that if the parameters of the encoder and decoder are shared, and the hidden layer is linear, with no nonlinearities, then it turns out that the latent space the model discovers is the same space discovered by PCA; it effectively collapses to principal component analysis. That's a nice connection, because it basically says you can think of autoencoders as nonlinear extensions of PCA.
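Here is a minimal numpy sketch of the one-hidden-layer autoencoder described above: a sigmoid encoder, a linear decoder, and plain gradient descent (backpropagation) on the squared reconstruction error. The data, layer sizes, and learning rate are arbitrary toy choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 16))            # toy data in [0, 1]

n_hid, lr = 4, 0.1                         # bottleneck of 4 units
W1 = 0.1 * rng.normal(size=(16, n_hid)); b1 = np.zeros(n_hid)
W2 = 0.1 * rng.normal(size=(n_hid, 16)); b2 = np.zeros(16)

def loss():
    h = sigmoid(X @ W1 + b1)               # encode
    return np.mean((h @ W2 + b2 - X) ** 2) # squared reconstruction error

first = loss()
for step in range(500):
    h = sigmoid(X @ W1 + b1)               # encoder: sigmoid nonlinearity
    Xhat = h @ W2 + b2                     # decoder: linear
    err = 2 * (Xhat - X) / X.size          # d(MSE)/d(Xhat)
    dh = err @ W2.T * h * (1 - h)          # backprop through the sigmoid
    W2 -= lr * h.T @ err; b2 -= lr * err.sum(axis=0)
    W1 -= lr * X.T @ dh;  b1 -= lr * dh.sum(axis=0)

final = loss()                             # reconstruction error after training
```

If you instead make the hidden layer linear and tie the encoder and decoder weights, the subspace this converges to is the PCA subspace, which is exactly the connection just mentioned.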
So you can learn richer features if you are using
autoencoders. Okay, here's another model. If you're dealing with binary input (sometimes we're dealing with something like MNIST, for example), your encoder and decoder can both use sigmoid nonlinearities. Given an input, you extract some binary features; given binary features, you reconstruct the binary input. That actually relates to a model called the restricted Boltzmann machine, something I'm going to tell you about later in the talk. Okay, there are also other
classes of models where you can also introduce some sparsity, much like in sparse coding: constrain the latent features, the latent space, to be sparse, and that actually allows you to learn quite reasonable, nice features. Here's one particular model called predictive sparse decomposition. If you look at the first part of the equation here, the decoder part, it pretty much looks like a sparse coding model. But in addition, you have an encoding part that essentially says: train an encoder such that it approximates what my latent code should be. So effectively you can think of this model as having an encoder and a decoder, but with a sparsity constraint on the latent representation, and you can optimize for that
model. And the other thing we've been doing over the last seven to ten years is stacking these things together: you learn low-level features, then try to learn higher-level features, and so forth.
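The greedy stacking recipe can be sketched as follows, assuming a small helper that trains one autoencoder layer (a toy version, with made-up sizes and data): train the first layer on the data, freeze it, then train the second layer on the first layer's codes, and so on.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hid, lr=0.1, steps=300, seed=0):
    """Train one layer: sigmoid encoder, linear decoder, squared error."""
    rng = np.random.default_rng(seed)
    W1 = 0.1 * rng.normal(size=(X.shape[1], n_hid)); b1 = np.zeros(n_hid)
    W2 = 0.1 * rng.normal(size=(n_hid, X.shape[1])); b2 = np.zeros(X.shape[1])
    for _ in range(steps):
        h = sigmoid(X @ W1 + b1)
        err = 2 * (h @ W2 + b2 - X) / X.size
        dh = err @ W2.T * h * (1 - h)
        W2 -= lr * h.T @ err; b2 -= lr * err.sum(axis=0)
        W1 -= lr * X.T @ dh;  b1 -= lr * dh.sum(axis=0)
    return W1, b1                           # keep only the encoder

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 32))             # toy "data"

# greedy layer-wise stacking: train layer 1, freeze it, then train
# layer 2 on layer 1's codes, and so on
W1, b1 = train_autoencoder(X, 16)
H1 = sigmoid(X @ W1 + b1)                   # low-level features
W2, b2 = train_autoencoder(H1, 8)
H2 = sigmoid(H1 @ W2 + b2)                  # higher-level features
```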
You just keep building these blocks, and perhaps at the top level, if you're trying to solve a classification problem, you can do that. This is sometimes known as greedy layer-wise learning, and it's useful whenever you have lots and lots of unlabeled data and only a little labeled data, a small sample of labeled data. Typically these models help you find meaningful representations, so that you don't need a lot of labeled data to solve the particular task you're trying to solve. And again, you can remove the decoding part and end up with a standard or convolutional architecture; your encoder and decoder could be convolutional, depending on what problem you're tackling. Typically you can stack these things together and optimize for the particular task you're trying to
solve. Okay. Here's an example; I just wanted to show you some early examples, from back in 2006. This was a way of trying to build these nonlinear autoencoders. You can pre-train these models using restricted Boltzmann machines, or autoencoders generally, then stitch them together into a deep autoencoder and backpropagate through the reconstruction loss. One thing I want to point out: here's one particular example. In the top row I show you real faces. In the second row you're seeing faces reconstructed from a 30-dimensional real-valued bottleneck. You can think of it as a compression mechanism: given high-dimensional data, you compress it down to a 30-dimensional code, and then from that 30-dimensional code you reconstruct the original data.
So if you look at the first row, this is the data; the second row shows the reconstructed data; and the last row shows the PCA solution. One thing I want to point out is that the autoencoder gives a much sharper representation, which means it's capturing a little more structure in the data. It's also interesting to see that sometimes these models tend to, how should I put it, regularize your data. For example, for this person with glasses, the model removes the glasses, and that generally has to do with the fact that there's only one person with glasses, so the model basically decided that's noise and got rid of it. Or it gets rid of mustaches: you see a face, and there's no mustache. Again, that has to do with the model's capacity; the model might treat that as just
noise. And if you're dealing with text data (this was done using a Reuters data set of about 800,000 stories), you take a bag-of-words representation, something very simple, compress it down to a two-dimensional space, and then look at what that space looks like. I always like to joke that the model basically discovers that European community economic policies sit right next to disasters and accidents. The data was collected back in 1996; today those two things would probably be even closer. But again, this is just one use: typically the autoencoder is a way of doing compression, or dimensionality reduction, but we'll see later that it doesn't have to be. Okay,
there's another class of algorithms called semantic hashing, which asks: what if you take your data and compress it down to a binary representation? Wouldn't that be nice?
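A rough sketch of why binary codes make retrieval cheap. In semantic hashing the codes would come from a trained autoencoder; here a random projection plus thresholding stands in for the learned encoder, and all the sizes are made up.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
docs = rng.normal(size=(100_000, 50))     # stand-in for bag-of-words vectors

# stand-in for a *learned* encoder: here just a random projection + threshold
W = rng.normal(size=(50, 20))
bits = docs @ W > 0                       # 20-bit binary code per document
keys = bits @ (1 << np.arange(20))        # pack each code into one integer

# the "hash table": code -> list of document ids (2**20 possible buckets)
table = defaultdict(list)
for i, k in enumerate(keys):
    table[int(k)].append(i)

# retrieval is a single memory lookup, no search over the collection
query = docs[42]
qkey = int((query @ W > 0) @ (1 << np.arange(20)))
neighbours = table[qkey]                  # documents sharing the query's code
```

Retrieval is a dictionary lookup on the packed 20-bit key, with no scan over the collection; with 2 to the 20 possible codes, the whole table fits comfortably in memory.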
The point is that with a binary representation, you can search the binary space very efficiently. In fact, if you can compress your data down to a 20-dimensional binary code, 2 to the 20 is about a million possible codes, so you can just store everything in memory and do direct memory lookups without actually doing any search at all. This sort of representation has sometimes been used successfully in computer vision: you take your images and learn binary representations, say 30-dimensional or 200-dimensional codes, and it turns out to be very efficient to search through large volumes of data that way. It takes a fraction of a millisecond to retrieve images from a set of millions and millions of images. And this is also an active area of research right now, because people are trying to figure out, given these large databases, how to search through them efficiently, and learning a semantic hashing function that maps your data to a binary representation turns out to be quite
useful. Okay, now let me step back a little and look at generative models, at probabilistic models, and how different they are; I'm going to show you some examples of where they are applicable. Here's one example of a simple model trying to learn a distribution over handwritten characters. We have Sanskrit, we have Arabic, we have Cyrillic, and now we can build a model and ask: can you generate what Sanskrit should look like? The flickering you see at the top, you can think of as neurons firing. What you're seeing at the bottom is what the model generates, what it believes Sanskrit should look like. So in some sense, when you think about generative models, you think about models that can generate, that can sample from the distribution of the data. This is a fairly simple model: about 25,000 characters from 50 different alphabets around the world, and about two million parameters. It's one of the older models, but this is what it believes Sanskrit should look like, and I've asked a couple of people whether that really looks like Sanskrit.
Okay, great. That can mean two things: it can mean that the model is actually generalizing, or that the model is overfitting, meaning it's just memorizing what the training data looks like and I'm just showing you examples from the training data. We'll come back to that point as we go through the talk. You can also do conditional simulation: given half of the image, can you complete the remaining half? More recently, in the last couple of years, there have been a lot of advances in conditional generation, and it's pretty amazing what you can do in terms of inpainting: given half of the image, predicting what the other half should look like. This is a simple example, but it does show that the model is trying to be consistent with what the different strokes look like. So why is it
so difficult? In the space of so-called undirected graphical models, of Boltzmann machines, the difficulty really comes from the following fact. If I show you this image, a 28 by 28 binary image, some pixels are on and some are off, then there are 2 to the 28-by-28, that is 2 to the 784, possible configurations. That space is exponential. So how can you build models that figure out that, within that space, the characters occupy only a tiny little subspace? If you start generating, say, 200 by 200 images, that space is huge, and the space of real images is really, really tiny. So how do you find that subspace? How do you generalize to new images? That's a very difficult question in general to answer. One class of
models uh is so-called fully observed
models right there sort of been a stream
of uh learning generative models that
are tractable and they have very nice
properties like you can compute the
probabilities you can do can do maximum
likelihood estimation here is one
example where I can if I try to model
the image I can write it down as you
know taking the first pixel modeling the
first pixel then modeling the second
pixel given the first pixel and just
just writing it down in terms of uh uh
conditional product of the conditional
probabilities and each conditional
probability can take a very complicated
form, right? It could be a complicated
neural network. Um, and oh,
So there have been a number of successful models. One of the early ones is the neural autoregressive density estimator, actually developed by Hugo, and there's a real-valued extension of these models. More recently we've started seeing new flavors of these models: a couple of papers popped up this year from DeepMind where they make these conditionals sophisticated RNNs, LSTMs, or convolutional models, and they can generate remarkable images. This is a PixelCNN generating, I guess, elephants, and it actually looks pretty interesting. The drawback of these models is that we have yet to see how good the representations they learn are, so that we could use those representations for other tasks, like classifying images or finding similar images. Now let me
jump into a class of models called restricted Boltzmann machines. This is a class of models where we're actually trying to learn some latent structure, some latent representation. These models belong to the class of so-called graphical models, and a graphical model is a very powerful framework for representing the dependency structure between random variables. In this particular model you have stochastic binary so-called visible variables (think of the pixels in your image) and stochastic binary hidden variables (think of them as feature detectors, detecting certain patterns you see in the data, much like sparse coding models). The model has a bipartite structure. You can write down the joint distribution over all of these variables; there are pairwise terms and unary terms, but exactly what they look like is not that important. The important thing is that the conditional probability of the data given the features can be written down explicitly. What does that mean? It basically means that if you tell me what features you see in the image, I can generate the data for you, the corresponding input. In terms of learning features, these models learn something similar to what we've seen in sparse coding, and these classes of models are very similar to each other. So given a new image, I can say that this new image is made up of some combination of these learned weights, these learned bases, and the numbers here are given by the probability that each particular edge is present in the data.
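For a binary RBM, the explicit conditionals just mentioned are both sigmoids, so inferring features from data, and generating data back from features, is one matrix multiply each way. A sketch with random placeholder parameters and a random input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_vis, n_hid = 784, 64        # e.g. a 28x28 binary image, 64 feature detectors
W = 0.01 * rng.normal(size=(n_vis, n_hid))   # pairwise weights (placeholder)
b_vis = np.zeros(n_vis)                      # unary terms (placeholder)
b_hid = np.zeros(n_hid)

v = (rng.uniform(size=n_vis) < 0.5).astype(float)   # a random binary "image"

# p(h_j = 1 | v): which feature detectors are on, given the data
p_h = sigmoid(v @ W + b_hid)
h = (rng.uniform(size=n_hid) < p_h).astype(float)

# p(v_i = 1 | h): generate the data back, given the features
p_v = sigmoid(h @ W.T + b_vis)
v_new = (rng.uniform(size=n_vis) < p_v).astype(float)
```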
Before we get to how we learn these models, another point I should make is that given an input, I can very quickly infer what features I'm seeing in the image. That operation is very easy to do, unlike in sparse coding models; it's a little closer to an autoencoder. Given the data, I can tell you what features are present in my input, which is very important for things like information retrieval or classifying images, because you need to do it fast. How do
we learn these models? Let me give you an intuition, and a little bit of the math. If I give you a set of training examples and I want to learn the model parameters, I can maximize the log-likelihood objective; you've probably seen that in these tutorials. The maximum likelihood objective essentially says: find the parameters so that the probability of observing these images is as high as possible. Taking the log of the likelihood just turns the product into a sum. You take the derivative, and after a little bit of algebra (I promise it's not very difficult, second-year college algebra), you get a learning rule which is the difference between two terms. The first term you can think of as the sufficient statistics driven by the data, and the second term as the sufficient statistics driven by the model. Let me parse that out. Intuitively, it means you look at the correlations you see in the data, you look at the correlations the model is telling you there should be, and you try to match the two.
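This difference of two correlation terms can be written down directly. Below is a toy sketch (biases omitted, random toy data, made-up sizes) where the intractable model-driven term is approximated with a single Gibbs step away from the data; that substitution is the idea behind the contrastive divergence algorithm that comes up at the end of this talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 36, 16, 0.05
W = 0.01 * rng.normal(size=(n_vis, n_hid))

V = (rng.uniform(size=(100, n_vis)) < 0.3).astype(float)   # toy binary "images"

for epoch in range(50):
    # positive phase: correlations driven by the data, <v h>_data
    ph_data = sigmoid(V @ W)
    pos = V.T @ ph_data
    # negative phase: correlations driven by the model, approximated
    # by one Gibbs step from the data (CD-1) instead of an infinite chain
    h = (rng.uniform(size=ph_data.shape) < ph_data).astype(float)
    pv = sigmoid(h @ W.T)
    ph_model = sigmoid(pv @ W)
    neg = pv.T @ ph_model
    # learning rule: match the two sets of correlations
    W += lr * (pos - neg) / len(V)
```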
That's what the learning is trying to
do, right? It's trying to match the
correlations that you see in the data,
right? So the model is actually
respecting the statistics that you see
in the data. uh but it turns out that
the second term is very difficult to
compute and it's precisely because the
space of all possible images is so
highdimensional that you need to figure
out or use some kind of approximate uh
learning algorithms to do that right so
you have these difference between these
two terms the first term is easy to
compute it turns out because of a
particular structure of the model uh
right and we can actually uh do it uh do
it explicitly the second term is the
difficult difficult one to compute right
so it sort of requires you know summing
over all possible configurations, all
possible images that that that that you
could possibly uh see. So it's this term
is intractable. And what a lot of
different algorithms are doing, and
we'll see this over and over again, is
using so-called Monte Carlo sampling, or
Markov chain Monte Carlo sampling, or
Monte Carlo estimation. So let me give
you an intuition for what this term is
doing; it's a general trick for
approximating exponential sums. There's
a whole subfield of statistics dedicated
to how we approximate exponential sums.
In fact, if you could solve that
problem, you could solve a lot of
problems in machine learning. And
the idea is very simple actually. The
idea is to say well you're going to be
replacing the average uh by sampling. Um
and there's something that's called GIP
sampling mark of chain Monte Carlo which
is essentially does something very
simple. It basically says well start
with the data sample the states of the
latent variables you know sample the
data sample the states of the lat sample
the data from these conditional
distributions something that you can
compute explicitly right uh and that's a
general trick you know much like in
sparse coding we you know we're
optimizing for the basis when we're
optimizing for the coefficients here
you're inferring the coefficients then
you you know inferring what the data
should look like and so forth uh and
then you can just run a markup chain and
sort of approximate
approximate uh you know this exponential
sum. So you start with the data, you
sample the states of the hidden
variables, you resample the data and so
forth. And the only problem with a lot
of these methods is that you need to
run them to infinity to guarantee that
you get the right answer, and obviously
you don't have time to do that. So
there's a very clever algorithm, the
contrastive divergence algorithm,
developed by Hinton back in 2002. It
basically said: instead of running this
thing to infinity, run it for one step.
So you're just
running it for one step. You start with
a training vector, you update the hidden
units, then you update all the visible
units again; that's your reconstruction.
Much like in an autoencoder, you
reconstruct your data. Then you update
the hidden units again, and you update
the model parameters, which is just
looking empirically at the statistics
between the data and the model. It's
very similar to what the autoencoder is
doing, with slight differences, and the
implementation takes about 10 lines of
MATLAB code. I suspect it would be two
lines in TensorFlow, although I don't
think the TensorFlow folks have
implemented Boltzmann machines yet; that
would be my request.
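As a rough sketch of those steps, here
is a minimal NumPy version of a CD-1
update for a tiny binary RBM (toy sizes
of my choosing, biases omitted for
brevity; this is an illustration, not
the speaker's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, lr=0.1):
    # Up: sample hidden units given the training vector
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down: resample the visible units -- the "reconstruction"
    pv1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # Up again: hidden probabilities for the reconstruction
    ph1 = sigmoid(v1 @ W)
    # Match data statistics against one-step model statistics
    return W + lr * (np.outer(v0, ph0) - np.outer(v1, ph1))

# Toy 6-pixel "image" with the left half on
W = 0.01 * rng.standard_normal((6, 3))
v = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
for _ in range(200):
    W = cd1_step(v, W)
```

After a few hundred updates the model
reconstructs the training pattern: the
pixels that are on in the data come back
with high probability, the ones that are
off with low probability.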
But you can extend these models to deal
with real-valued data, for example
whenever you're dealing with images.
That's just a little change to the
definition of the model: your
conditional probabilities are going to
be a bunch of Gaussians, which basically
means that given the features, the model
can sample real-valued images. The
structure of the model remains the same.
If you train this model on images, you
tend to find edges, similar again to
what you'd see in sparse coding, in ICA
(the independent component analysis
model), in autoencoders, and such. And
again you can say every single image is
made up of some linear combination of
these basis functions. You can also
extend
these models to deal with count data, if
you're dealing with documents. In this
case, again, a slight change to the
model: K here denotes your vocabulary
size and D denotes the number of words
that you're seeing in your document. So
it's a bag-of-words representation, and
the conditional here is given by the
so-called softmax distribution, much
like what you've seen in the previous
classes, for the distribution over
possible words. The parameters here, the
W's, you can think of as something
similar to what a word2vec embedding
would do. And if you apply it to some
data sets, you tend to find reasonable
features.
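In symbols, the softmax conditional he
mentions looks roughly like this (K is
the vocabulary size as on the slide; the
bias b_k is my notation):

```latex
p(v^{k} = 1 \mid \mathbf{h})
  = \frac{\exp\!\left( b_k + \sum_j W_{jk} h_j \right)}
         {\sum_{k'=1}^{K} \exp\!\left( b_{k'} + \sum_j W_{jk'} h_j \right)}
```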
So you tend to find features about
Russia, about the US, about computers,
and so forth. Much like you found those
little-edge representations, where every
image is made up of some combination of
edges, in the case of documents or web
pages it's the same thing: every single
document is made up of some linear
combination of these learned topics.
You can also
look at one-step reconstruction. So you
can ask, how can I find similarity
between words? If I show it "chocolate
cake", I infer the states of the hidden
units and then reconstruct back the
distribution over possible words. It
tells me: chocolate cake, cake,
chocolate, sweet, dessert, cupcake,
food, sugar, and so forth. I
particularly like the one with the
flower and a Japanese sign: the model
generates flower, Japan, sakura,
blossom, Tokyo. So it picks up again on
low-level correlations that you see in
your data. You can also apply these
kinds of models to collaborative
filtering where every single observed
variable you can model, you know, can
represent um a user rating for a
particular movie, right? So every single
user would rate a certain subset of
movies and so you can represent it as as
the state of visibility and your hidden
states can represent user preferences,
what they are. uh and on the Netflix
data set if you look at the latent space
uh that the model is learning you know
some of these hidden variables are
capturing specific movie genre uh right
so for example there is there's actually
one hidden union dedicated to Michael
Michael Moore's movies uh right so it's
sort of like very strong I think it's
sort of you know either people like it
or hate it so there are a few hidden
units specifically dedicated to that but
it also finds interesting things like
you know action movies and so forth
right so it finds that particular
structure ing the data. So you can model
different kinds of modality: real-valued
data, count data, multinomials. And it's
very easy to infer the states of the
hidden variables; that's given by just a
product of logistic functions, which is
very important in a lot of different
applications. Given the input, I can
quickly tell you what topics I see in
the data. One thing that I
want to point out, and it's an important
point, is that a lot of these models can
be viewed as product models; sometimes
people call them products of experts.
And this is because of the following
intuition. If I write down the joint
distribution of my hidden and observed
variables, I can write it in this
log-linear form, but if I sum out or
integrate out the states of the hidden
variables, I have a product of a whole
bunch of functions. So what does that
mean? What's the intuition here? So let
me show you an example. Suppose the
model finds these specific topics, and
suppose I tell you that the document
contains the topics government,
corruption, and mafia. Then the word
"Silvio Berlusconi" will have very high
probability. Does everybody know who
Silvio Berlusconi is? He was head of the
government, he's connected to the mafia,
he was very corrupt. And I guess I
should add the bunga bunga parties here;
then it will become completely clear
what I'm talking about. But the point I
want to make here is that you can think
of these models as a product: each
hidden variable defines a distribution
over possible words for a topic, and
once you take the intersection of these
distributions, you can be very precise
about what it is you're modeling.
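A toy numeric illustration of that
intersection effect (the topic
distributions here are made up by me,
not from the talk): multiplying the
experts' word distributions and
renormalizing concentrates the mass on
the words all experts agree on, while
averaging them, mixture-style, stays
diffuse.

```python
import numpy as np

vocab = ["government", "corruption", "mafia",
         "berlusconi", "soccer", "weather"]

# Hypothetical per-topic word distributions (rows sum to 1):
# each "expert" spreads its mass over the words it cares about.
experts = np.array([
    [0.30, 0.20, 0.05, 0.25, 0.15, 0.05],  # "government" topic
    [0.10, 0.35, 0.15, 0.25, 0.05, 0.10],  # "corruption" topic
    [0.05, 0.15, 0.35, 0.30, 0.05, 0.10],  # "mafia" topic
])

# Mixture: average the experts (an LDA-style mixture).
mixture = experts.mean(axis=0)

# Product of experts: multiply and renormalize (the intersection).
product = experts.prod(axis=0)
product /= product.sum()

# The product concentrates on the word all three experts agree on.
print(vocab[int(product.argmax())])
```

The renormalized product puts over half
its mass on the single word every expert
rates highly, far sharper than the
mixture's peak.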
So that's unlike general topic models,
or latent Dirichlet allocation models,
where you're actually using a
mixture-like approach, and typically
these product models perform far better
than traditional mixture-based models.
And this comes to the point of local
versus distributed representations. A
lot of algorithms, even unsupervised
learning algorithms like clustering,
partition the space and find local
prototypes; you basically have
parameters for each region, and the
number of regions typically grows
linearly with the number of parameters.
But in
models like factor models, PCA,
restricted Boltzmann machines, and deep
models, you typically have distributed
representations. What's the idea here?
The idea is that given two-dimensional
inputs, each particular neuron can
differentiate between two parts of the
plane. Given a second neuron, I can
partition it again; given a third hidden
variable, you can partition it again. So
you can see that every single neuron
affects lots of different regions. And
that's the idea behind distributed
representations: every single parameter
affects many regions, not just a local
region, and so the number of regions
grows roughly exponentially with the
number of parameters. So that's the
difference between these two classes of
models.
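You can see the superlinear growth
numerically with a small sketch
(hypothetical random hyperplanes of my
own choosing, not from the talk): n
prototypes give n regions, while n
hyperplane units carve the plane into
many more distinct on/off patterns.

```python
import numpy as np

rng = np.random.default_rng(1)

def count_regions(n_units, n_points=200_000):
    # n_units random hyperplanes (neurons) over 2-D inputs;
    # each distinct on/off code corresponds to one region.
    W = rng.standard_normal((n_units, 2))
    b = rng.standard_normal(n_units)
    x = rng.uniform(-5, 5, size=(n_points, 2))
    codes = x @ W.T + b > 0
    return len(np.unique(codes, axis=0))

# A clustering with n prototypes gives n regions; n lines in
# general position give up to 1 + n + n*(n-1)/2 regions.
for n in (1, 2, 4, 8):
    print(n, count_regions(n))
```

With 8 prototypes you would get 8
regions; 8 random hyperplanes typically
realize several dozen distinct codes.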
It's important to know about them. Now
let me jump ahead and quickly give you a
little bit of the inspiration behind
what we can build with these models. As
we've seen with convolutional networks,
the first layer typically learns some
low-level features like edges, or if
you're working with words, some
low-level word structure, and the hope
is that the higher-level features will
start picking up higher-level structure
as you build upward. These kinds of
models can be built in a completely
unsupervised way, because what you're
trying to do is model the data, the
distribution of the data. You can write
down the probability distribution for
this model; it's known as a deep
Boltzmann machine model. You have
dependencies
machine model. Um you have dependencies
between hidden variables. So now
introducing some extra uh um you know
some extra uh layers and dependencies
between those layers. And if we look at
the equation, the first part of the
equation is basically the same as what
we had with restricted bolts machine.
And then the second and third part of
the equation essentially modeling
dependencies between you know the first
and the second hidden layer and the
second hidden layer and the third hidden
layer right there is also a very natural
notion of bottom-up and top-down: if I
want to know the probability of a
particular unit taking value one, it
really depends on what's coming from
below and what's coming from above. So
there has to be some consensus in the
model: what I'm seeing in the image and
what my model believes the overall
structure should be have to be in
agreement. Of course, in this case the
hidden variables become dependent even
when you condition on the data. We'll
see this a lot with these kinds of
models: you're introducing more
flexibility and more structure, but then
learning becomes much more difficult;
you have to figure out how to do
inference in these models. Now let me
give
you an intuition of how we can learn
these models, of what the maximum
likelihood estimator is doing here. If I
differentiate this model with respect to
the parameters, I basically run into the
same learning rule, and it's the same
learning rule you see whenever you're
working with undirected graphical
models, factor graphs, or conditional
random fields; you might have heard
about those. It really is just looking
at the statistics driven by the data,
the correlations you see in the data,
and the correlations the model is
telling you it's seeing, and trying to
match the two. That's exactly what's
happening in that equation. But now the
first term is no longer factorial, so
you have to do some approximation in
these models as well. But let
me give you an idea of what each term is
doing. Suppose I have some data and I
get to observe these characters. What I
really want is to tell the model: this
is real, these are real characters, so
put some probability mass around them.
And then there is some data point that
looks like just a bunch of pixels turned
on and off, and I really want to tell my
model: put almost zero probability on
this, this is not real. The first term
is exactly trying to do that: put the
probability mass where you see the data.
And the second term is effectively
saying: look at this entire exponential
space and say no, everything else is not
real; the real thing is what I'm seeing
in my data. And you can use advanced
techniques for doing that: there's a
class of algorithms called variational
inference, and something called
stochastic approximation, which is Monte
Carlo based inference. I'm not going to
go into these techniques, but in general
you can train these models. So one
question is: how good are
they? That's a fair question, because
there are a lot of approximations that
go into these models. So what I'm going
to do, if you haven't seen it, is show
you two panels. On one panel, you will
see the real data; on the other, you'll
see data simulated by the model, the
fake data. And you have to tell me which
one is which.
Okay. So again, these are handwritten
characters coming from alphabets around
the world. How many of you think this
one is simulated and the other one is
real? Honestly. Okay, and what about the
other way around? I get half and half,
which is great.
If you look at these images a little bit
more carefully, you will see the
difference: this one is simulated and
this one is real. If you look at the
real data, it's much crisper and there's
more diversity. When you're simulating
the data, there's a lot of structure in
the simulated characters, but sometimes
they look a little bit fuzzy and there
isn't as much diversity. And I learned
that trick from my neuroscience friends:
if I show it to you quickly enough, you
won't see the difference. And if you're
using these models for classification,
you can do a proper analysis: given a
new character, you infer the states of
the latent variables, the hidden
variables, and classify based on that.
How good are they? They're much better
than some of the existing techniques.
This is another
example, trying to generate 3D objects.
These are sort of toy data sets, and
later on I'll show you some bigger
advances that have been happening in the
last few years; this was done a few
years ago. If you look at the space of
generated samples, obviously you can see
the difference. Look at this particular
image: it looks like a car with wings,
don't you think? So sometimes it can
simulate things that are not necessarily
realistic. And for some reason it just
doesn't generate donkeys and elephants
too often, but it generates people with
guns more often, like if you look here
and here and here. That again has to do
with the fact that you're exploring this
exponential space of possible images,
and it's sometimes very hard to assign
the right probabilities to different
parts of the space. And then obviously
you can do things like pattern
completion: given half of the image, can
you complete the remaining half? The
second one shows what the completions
look like, and the last one is the
truth. So you can do these things. So
where
else can we use these models? These are
sort of toy examples. But where else?
Let me show you one example where these
models can potentially succeed, which is
trying to model the multimodal space,
the space of images and text. Generally,
if you look at the data, it's not just a
single source; it's a collection of
different modalities. So how can we take
all of these modalities into account?
This is really just the idea that, given
images and text, can you actually find a
concept that relates these two different
sources of data? There are a few
challenges, and that's why models like
generative models, probabilistic models,
could be useful. In general, one of the
biggest challenges we've seen is that
images and text are very different
modalities. If you think about images in
a pixel representation, they're very
dense; if you're looking at text, it's
typically very sparse. So it's very
difficult to learn cross-modal features
from the low-level representations.
Perhaps a bigger challenge is that a lot
of times we see data that's very noisy,
and sometimes it's just non-existent:
given an image, there is no text, or if
you look at the first image, a lot of
the tags are about what kind of camera
was