Transcript
Ow25mjFjSmg • Complete Statistical Theory of Learning (Vladimir Vapnik) | MIT Deep Learning Series
- Today, we're happy and honored to have Vladimir Vapnik with us: co-inventor of support vector machines, support vector clustering, and the VC theory of statistical learning, and author of "Statistical Learning Theory". He's one of the greatest and most impactful statisticians and computer scientists of our time. Plus, he grew up in the Soviet Union, eventually heading the Computer Science Department at the Institute of Control Sciences in Moscow. So he will give today's lecture in Russian and I will translate. Just kidding. (laughing) (audience laughs) It's an honor and a pleasure to have Vladimir with us today, so please give him a warm welcome. (audience applauds) - Thank you. About 50 years ago, Professor Chervonenkis and I started statistical learning theory. The problem was to answer the question: when, if you do well on a small amount of training data, will you also do well on the test data? That is, when will you minimize the expectation of error? The theory solves this problem. This theory appears in most books, maybe written in different languages, but mostly they follow the same line. The line is that the law of large numbers alone is not enough; you need a uniform law of large numbers, uniform convergence, and so on. But because we started this discussion from the empirical error, how well you do on the training data, and the bounds show that the better you do on the training data, the better you will do on the test data, people decided that the only way is to take training data and do something with it to minimize the number of errors, and all algorithms were constructed on this principle. About five years ago I found that there exists another principle, even more interesting than this one, because the first principle is a brute force principle: give me more data, and I will give you a better answer. The second principle is an intelligent principle, and I will talk about it today.
So I will start with statistical learning theory and then introduce this new theory. Because there are only two ways to generalization, one is data, and the other I will show, I call this the complete statistical theory of learning: there is no third way to do something, and there is no shortcut to generalization. You should use both of them, and that is the complete theory. But it is not so bad, because you will see that learning theory moves in a different direction, in the direction of intelligence, toward understanding what intelligence is. It is not the same as what Turing explained. Turing said you should imitate an intelligent person, but now the question is what intelligence is, and we will discuss this. So let me start. The first part is the theory of generalization, and the question is: when, in a given set of functions, can you minimize a functional? This is a pretty general functional, with loss function L(y, f(x)). Instead of a general loss I can consider the difference between y and the function; that is what they're doing in regression and pattern recognition and all this stuff, but this is a more general setting. What is more important is that when we consider minimization of a functional, we should say over which set of functions. You have to minimize the functional over the given set of functions when the probability measure is unknown but we are given i.i.d. data pairs. That is the exact setting of both the pattern recognition and the regression estimation problems: for pattern recognition the set of functions is indicator functions, for regression it is continuous functions, but the statement is the same, and the answer is very simple. We can minimize this functional using data if and only if the VC-dimension h of the set of functions is finite. Everything depends on the set of functions over which you minimize this functional, and you cannot avoid that. So we call it capacity, though maybe it is better called diversity of the set of functions, and the measure of this capacity, this diversity, is the VC-dimension.
And what is VC-dimension? I will give you the definition. First, for a set of indicator functions: theta is the step function; you consider a continuous function f, and in front of it you put theta to obtain an indicator. If the function is positive you say one, if not positive you say zero; that is the indicator. The VC-dimension of a set of indicator functions equals h if h is the maximal number of vectors that can be shattered, separated into all possible 2^h dichotomies, using indicator functions from this set. So you should find h vectors which you can shatter in all possible ways, but you cannot shatter any h plus one vectors. Then the VC-dimension of this set of functions is h. This is a purely combinatorial definition. And if you can shatter vectors for any number l, then we say the VC-dimension is infinite. Then I will give you two theorems which we will use; the first is probably the main theorem of this theory. If a set of functions has VC-dimension h, then with probability one minus eta, for all functions in the set, the bound holds true. And because this bound holds true for all functions, and you would like to have the minimal estimate, the minimal right-hand side, you choose the function which minimizes the empirical loss. The second theorem: if you have a set of indicator functions built from linear functions, and it so happens that your vectors x lie inside a sphere of radius one while w lies inside a ball of radius C, then the VC-dimension is bounded by the minimum of C squared times the squared radius and n, plus one, where n is the dimensionality of the space. That means exactly that the VC-dimension can be smaller than the dimensionality of the space, and you can control this part of the VC-dimension; I will show you afterwards how to do it. It is a very important theorem. But what is the general way VC theory suggests for searching for functions? You have a set of functions; maybe this set of functions has infinite VC-dimension. Then you build a structure on this set of functions.
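As a toy illustration of the shattering definition (a brute-force sketch; the threshold family and the point set are invented for the example, not taken from the lecture):

```python
from itertools import combinations

def shatters(points, classifiers):
    """True if `classifiers` realizes every 0/1 labeling of `points`."""
    realized = {tuple(c(x) for x in points) for c in classifiers}
    return len(realized) == 2 ** len(points)

def vc_dimension(domain, classifiers):
    """Largest h such that some h points from `domain` are shattered."""
    h = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(s, classifiers) for s in combinations(domain, k)):
            h = k
    return h

# Threshold indicators f(x) = [x >= t] on the real line.
thresholds = [lambda x, t=t: int(x >= t) for t in (-0.5, 0.5, 1.5, 2.5)]
points = [0.0, 1.0, 2.0]
print(vc_dimension(points, thresholds))  # 1
```

Threshold indicators on the line shatter any single point but can never produce the labeling (1, 0) on an ordered pair, so their VC-dimension is one.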
You choose a subset of functions, a small subset, with VC-dimension h one, then another subset which includes the small one, with VC-dimension h two, and so on, so you have nested subsets with increasing VC-dimension. And then, when you minimize the functional, you're doing two things. You choose an appropriate subset, and then in this subset you pick the function which minimizes the empirical loss. But you can see that the epsilon in the bound depends on the VC-dimension of the subset you choose, on the ratio of the VC-dimension to l, and also on the empirical loss which you achieve. So as soon as you make a structure, you can do whatever you want, even if the initial set of functions has infinite VC-dimension. This is more or less all that the main results of VC theory contain. But VC theory does not answer four very important questions. How to choose the loss function L(y, f)? How to select an admissible set of functions f(x)? I told you "a given set of functions", but what if it is a stupid set of functions; how do you construct a good set of functions? How to construct a structure on the admissible set of functions? And how to minimize the functional over this structure? In this talk I will try to answer all these questions. Target functional for minimization: this is an important slide, God plays dice. What is the setting of the pattern recognition problem? I will consider in this talk just the pattern recognition problem for two-class classification, but the generalization is straightforward. Given a generator, nature, which generates randomly and independently situations x, and an object which holds the conditional probability of y given x: the probability that y equals one given x, and that y equals zero given x. This object knows this conditional probability and plays dice. It has the conditional probability function, it gets x on the input, it plays dice, and it says y.
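The structure idea, picking a subset by trading empirical loss against capacity, can be sketched as model selection over nested polynomial classes. The penalty expression below mimics the epsilon term of the bound, and the data, noise level, and VC-dimension values for polynomials are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
l = 400
x = np.linspace(-1, 1, l)
y = x**2 + 0.1 * rng.standard_normal(l)      # true target: a quadratic

def vc_penalty(h, l, eta=0.05):
    # Illustrative capacity term of the epsilon form in the VC bound.
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

best = None
for degree in range(0, 8):                    # nested structure S1 ⊂ S2 ⊂ ...
    coeffs = np.polyfit(x, y, degree)
    emp = np.mean((np.polyval(coeffs, x) - y) ** 2)   # empirical loss
    bound = emp + vc_penalty(degree + 1, l)           # loss + capacity
    if best is None or bound < best[0]:
        best = (bound, degree)
print("selected degree:", best[1])
```

Low-degree subsets pay in empirical loss, high-degree subsets pay in capacity; the penalized bound is smallest for the quadratic here.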
That is the most general setting of the learning problem; the deterministic case is a particular case of this setting. And what does the learning machine have? The learning machine has a set of functions, and it can choose any function from this set. The problem is, observing l observations (x1, y1), ..., (xl, yl), to pick up the function for classification. That means, given observations generated by the joint probability P(x, y) = P(y|x) P(x), to find the rule that minimizes the functional. In my case, when y is zero or one and f is also an indicator taking zero or one, the loss just counts how many errors I make, and since I have a probability measure, the functional is the expectation of error. I would like to find the function which guarantees the smallest expectation of error. But this is not very convenient. Why not? Because this loss, as a function of the model, is zero everywhere except at some points where it jumps, so I cannot use a gradient in this case; I should do something smarter. What people do is replace the indicator loss with just the square of y minus f(x). It is not such a bad choice, because it so happens that the minimum of this functional gives the conditional probability function, the probability that y equals one given x. And once we can find this probability, we can easily construct our decision rule: if the conditional probability exceeds .5, say first class; if it is less than .5, second class; and this is the optimal solution. But something is wrong with this replacement. Let us rewrite the first line. Inside the bracket I subtract and add the regression, the conditional probability, so I have two brackets instead of one, and then I take the square.
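A minimal numeric sketch of why the square loss recovers the conditional probability: with an unrestricted f, minimizing the sum of squares at each x gives the empirical mean of y there, an estimate of P(y=1|x), and thresholding at .5 gives the decision rule. The tiny dataset is invented:

```python
from collections import defaultdict

# Labeled sample: y in {0,1} drawn from P(y=1|x); x takes discrete values.
data = [(0, 1), (0, 1), (0, 0), (1, 0), (1, 0), (1, 1), (1, 0)]

# Minimizing sum (y - f(x))^2 over unrestricted f gives the empirical
# mean of y at each x, i.e. an estimate of P(y=1|x).
sums, counts = defaultdict(float), defaultdict(int)
for x, y in data:
    sums[x] += y
    counts[x] += 1
cond_prob = {x: sums[x] / counts[x] for x in counts}

# Bayes decision rule: class 1 where P(y=1|x) exceeds 0.5.
rule = {x: int(p > 0.5) for x, p in cond_prob.items()}
print(cond_prob, rule)
```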
After expanding, the first integral does not depend on the function I am looking for, so I have to minimize, over the set of functions, just the sum of the two last terms. It is just the usual binomial expansion for two terms: square of the first, square of the second, and twice their product. But our goal is to minimize the first of the remaining integrals, to find the function which is close to the conditional probability, not the sum of two integrals. We can show that the second integral eventually goes to zero, but it slows down the rate of convergence. To have a good rate of convergence, we need to find a way to minimize the first integral alone, not the sum of these two. And that means not this square loss, but something else. What can we do? First of all, when y is zero or one, the probability that y equals one given x is some real-valued function between zero and one, and from the Bayes formula we know that the conditional probability of y equals one given x, times p(x), is the joint density of y equals one and x. That is always true. Now, if I multiply both sides by some function G(x - x*) belonging to L2 space and integrate, I will have this equation, and I can say that the conditional probability is a solution of this equation. It was constructed like that: if you put in the conditional probability, the equation holds. But I would like to solve this equation to find the function when I don't know the probability measure and am only given data, observations generated according to P(x, y). Solving an equation is an ill-posed problem; OK, let's deal with that. But before I do, I would like to mention that in classical statistics there is a way to replace the unknown probability measure with the empirical measure. And that is the most important part, the main inductive step of statistics.
In statistics, we're given data and would like to recover a function, and however much data we have, it is not equivalent to the function. So in classical statistics, people suggested approximating the cumulative distribution function by the empirical cumulative distribution function. For thirty years, mathematicians tried to prove that this is a good idea, that we can do it, and in 1933 Kolmogorov found the exact bound, which is on the last line, and then people proved bounds of the same form. Now, if we can replace the unknown measure with the empirical measure, we can make our problem constructive. Let us replace, in the functional which we would like to minimize, the real measure with the empirical measure. Then you have the empirical least squares functional which you have to minimize to find the conditional probability, or whatever you want. But let me consider a new constructive setting, where we also replace the unknown probability measure with the empirical probability measure obtained from the training data. And you will see that to find the conditional probability, we have to solve this equation in the set of functions f(x); the right-hand side is known because our function G is known, and on the left-hand side f(x) is unknown; it is our goal to find this function. In classical statistics there was one algorithm, the Nadaraya-Watson estimator, which shows how to estimate the conditional probability or the regression function. They just somehow defined this estimator: it has a numerator and a denominator, where G is a special kernel, say a Gaussian.
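The empirical CDF and the Kolmogorov statistic can be checked numerically; a sketch with uniform draws, where the true CDF is F(t) = t:

```python
import random
from bisect import bisect_right

random.seed(0)
l = 1000
sample = sorted(random.random() for _ in range(l))   # draws from U(0,1)

def ecdf(t):
    """Empirical CDF: fraction of the sample <= t."""
    return bisect_right(sample, t) / l

# Kolmogorov statistic sup_t |F_l(t) - F(t)|; for sorted data the
# supremum is attained at the sample points.
D = max(max((i + 1) / l - x, x - i / l) for i, x in enumerate(sample))
print(D)
```

By the Kolmogorov bound, D shrinks like one over the square root of l; for a thousand points it is already a few percent.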
This is the estimate of regression, a very general way of estimating regression, conditional probability, and all this business; if the kernel is Gaussian, the question is how to find the best value of the variance to approximate the conditional probability well. They spent a lot of time on this subject. But you can see, in the line in the middle, that the Nadaraya-Watson estimator comes from a corrupted equation, not from the equation which we derived. Here, in the middle, it is f(x), not f(xi) as in the kernel sum, so you can pull f(x) out of the sum and you get exactly this estimator. So actually the classical Nadaraya-Watson estimator is the solution of a corrupted version of the equation we obtained. But what does it mean to solve an equation? To solve the equation means I take the difference between the left-hand side and the right-hand side, define the area where I would like my function to operate, take the square, integrate over some probability measure, and minimize this functional. So let us do that; the algebra is very simple, and you can check it: you will obtain a functional R(f) which involves (y_i - f(x_i))(y_j - f(x_j)) multiplied by some coefficients depending on x_i and x_j. These coefficients are the integrals of G(x_i - x)G(x_j - x) over the measure, and they form a matrix; this is the element of the matrix. So what is the V-matrix? If we replace this integral with the empirical integral, we have one estimate of the matrix. If we use, say, the line between minus one and one, with x uniformly distributed over this line, and a Gaussian G, we will have this V-matrix in closed form. So the V-matrix is easy to find. Now I would like to use vector notation. I will call Y the vector of the training labels y1, ..., yl; I am given l pairs, so I create the vector Y of all the y's.
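A minimal Nadaraya-Watson estimator with a Gaussian kernel, as described above; the sample and bandwidth are made up for the sketch:

```python
import math

def nadaraya_watson(x_train, y_train, x, bandwidth=0.3):
    """Kernel estimate of P(y=1|x): a weighted average of the labels,
    with Gaussian weights G centered at the training points."""
    weights = [math.exp(-((x - xi) / bandwidth) ** 2 / 2) for xi in x_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)

# Class 1 concentrated on the right, class 0 on the left.
xs = [-1.0, -0.8, -0.6, 0.6, 0.8, 1.0]
ys = [0, 0, 0, 1, 1, 1]
print(nadaraya_watson(xs, ys, -0.9), nadaraya_watson(xs, ys, 0.9))
```

Near the left cluster the estimate is close to zero, near the right cluster close to one; thresholding at .5 gives the classifier.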
I will create F(f), which is also an l-dimensional vector: I pick a function f, and the first element is the value of this function at the point x1, and the last is the value at the point xl. So this is F, and I have the V-matrix. I can rewrite the functional in matrix form as (Y - F(f)) transposed, times V, times (Y - F(f)). If I write the least squares method in this notation, I have (Y - F(f)) transposed times (Y - F(f)), with the identity matrix instead of V. So I got some improvement over the least squares method, and I hope it will give me a better rate of convergence; let's see. But this is not the major point. OK, I improved the rate of convergence, but is the square loss itself good? Now, the most important part: selection of the admissible set of functions. What does it mean? When you construct a neural network, you say you are building a smart structure; that means you are constructing a smart admissible set of functions. You know something, you are smart guys, you construct the set, and then you minimize over it. But let me consider from a theoretical perspective what this is. If you consider Hilbert space, or also Euclidean space, there are two modes of convergence. Strong convergence is convergence of functions; that is the first line: my sequence of functions f_l converges to f_0 if the integral of the squared difference goes to zero as l goes to infinity. But there also exists weak convergence. We say that my sequence of functions f_l converges weakly to f_0 if the inner product of f_l with phi converges to the inner product of f_0 with phi for every function phi from the Hilbert space. You can see that each inner product describes a property of the function; if in the limit all properties are the same, that is weak convergence.
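The matrix form can be sketched for a model linear in its parameters. The particular V below, a Gram matrix of step functions under the uniform measure on [-1, 1], is an illustrative assumption for the sketch, not the exact matrix from the slides; setting V to the identity recovers ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
l = 30
x = np.sort(rng.uniform(-1, 1, l))
y = (x > 0).astype(float)                 # labels of a threshold rule
X = np.column_stack([np.ones(l), x])      # model f(x) = a0 + a1*x

# Illustrative V-matrix: V_ij = mu(t >= max(x_i, x_j)) under the
# uniform measure on [-1, 1] (an assumption for this sketch).
V = (1.0 - np.maximum.outer(x, x)) / 2.0

def weighted_ls(W):
    """argmin_a of (Y - X a)^T W (Y - X a), in closed form."""
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

a_ls = weighted_ls(np.eye(l))    # V = I: ordinary least squares
a_v = weighted_ls(V)             # V-matrix variant of the same problem
print(a_ls, a_v)
```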
It is easy to show that if you have strong convergence, you also have weak convergence; that follows from the Cauchy-Schwarz inequality. But one can also prove that if you have weak convergence and your set of functions belongs to a compact set, you also have strong convergence. So in some sense the two modes of convergence are equivalent. In our machine learning considerations, you see strong convergence everywhere, 100 percent. But what about weak convergence? Let's explore this opportunity. Consider the pattern recognition case. For the pattern recognition case, in the first line I just write the definition of weak convergence, and using the Bayes equation the integral of phi with respect to the conditional probability appears; it must converge for all functions phi. If it converges for all functions phi, I will have the one function which is the desired one. But that is not realistic. So let us do the following. Let us select from the Hilbert space m functions phi; we will talk a lot about how to select these functions. Then we consider the equality not for all functions, but just for these m functions. And we will call the admissible set of functions the set of functions which satisfies these equalities. We know that our desired function must satisfy this equality for any phi, because of weak convergence, but we select just the ones we want. Now, if you use, instead of the cumulative distribution function, its empirical estimate, you will have, instead of the integral property, the property written as a sum. Again, let me use the same notation: I use the vector Y; I use F(f), which is the vector (f(x1), ..., f(xl)), so for any function f I have a vector; and I also introduce a new vector, the vector of the values of the predicate at the points x1, ..., xl. Because phi is a function, I can consider its values at those points. Then I can write my equation in this form.
I would like my admissible set of functions to satisfy, in vector form, these m equations. Now let me explain what we are talking about. There is the duck test logic: if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. What does this mean here? We have these statistical invariants in vector form, like this line. What they do is collect the admissible functions, the functions which identify an animal as a duck if it looks, swims, and quacks like a duck. So if you choose predicates which explain what "looks", "swims", and "quacks" mean, then your admissible functions will be exactly those functions which classify as ducks the animals that look, swim, and quack like a duck. The concept of a predicate is very different from a feature. Why so? With an increasing number of predicates, the VC-dimension of the admissible set of functions decreases. Why does it decrease? Because we have a set of functions, and from this set we select only the functions which satisfy the new predicate; not all of them will satisfy it, and we consider only the set of functions which satisfy all these predicates. But with an increasing number of features, the VC-dimension increases, because your decision rules become more and more diverse. So what is the exact setting of the complete learning problem? Minimize the functional with the V-matrix, a slightly improved version of the square functional, subject to these constraints. And these constraints express what you would like to see in the set of admissible functions. Existing classical methods of pattern recognition just minimize the least squares functional; here we minimize the functional subject to the constraints, over our admissible set. That is the exact setting, which in mathematics is called conditional optimization, optimization of a functional under constraints. But there is also an approximation, unconditional optimization. I would like to minimize this functional subject to the constraints, but I will do the following.
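A sketch of checking the statistical invariants for chosen predicates; the predicates (the constant one and x, the "center of mass" test) and the synthetic data are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
l = 200
x = rng.uniform(-1, 1, l)
p_true = 1 / (1 + np.exp(-4 * x))              # P(y=1|x)
y = (rng.random(l) < p_true).astype(float)

# Predicates phi_k: phi = 1 (class proportion) and phi = x (center of mass).
Phi = np.column_stack([np.ones(l), x])

# f is admissible if (1/l) sum_i phi_k(x_i) f(x_i) equals
# (1/l) sum_i phi_k(x_i) y_i for every chosen predicate phi_k.
def invariant_gap(f_vals):
    return Phi.T @ (y - f_vals) / l

print(invariant_gap(p_true))           # true conditional probability: near 0
print(invariant_gap(np.full(l, 0.5)))  # constant f = 0.5 violates phi = x
```

The true conditional probability satisfies both invariants up to sampling noise; the constant function passes the class-proportion test here but fails the center-of-mass test, so the predicate x excludes it from the admissible set.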
I take the sum of this functional and the squares of the differences between the left- and right-hand sides of the constraints, with weights; the weights say how important it is for me to minimize the functional and how important to satisfy the constraints. And if I do that, I can rewrite the whole thing as the minimization of one functional, where P is just the covariance-type matrix of your predicates. Everything is simple to compute. So that is the concept of what we have to do. We have to solve our problem using both weak and strong convergence; that means using invariants and minimizing functionals, and we can do it in the exact way or in the approximation. Everything so far was written for an arbitrary set of functions; I did not say how I will minimize it. Just as the least squares functional can be minimized over any set of functions, the same is true here. But now let me show how to find the solution, and first of all I will do it for a reproducing kernel Hilbert space. This is the definition of a reproducing kernel Hilbert space: you have some kernel, a Mercer kernel; you take the inner product of the kernel with a function f from the space, and you get the same function back. It is important that you get the same function; this is the reproducing property of the kernel. It is known that a Mercer kernel has an expansion with values lambda, which are non-negative, and orthonormal functions psi. The set of functions, with this inner product and this norm, forms a reproducing kernel Hilbert space; it is very easy to check. It means that if you use orthonormal functions psi, consider expansions with coefficients c, and introduce a special inner product of this type, then the definition is satisfied and you have a reproducing kernel Hilbert space.
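The unconditional, penalized form can be written in closed form for a model linear in its parameters. The predicates, the weight tau, and the identity in place of the V-matrix are assumptions for the illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
l = 100
x = rng.uniform(-1, 1, l)
y = (rng.random(l) < (x + 1) / 2).astype(float)   # P(y=1|x) = (x+1)/2

X = np.column_stack([np.ones(l), x])              # model f(x) = a0 + a1*x
Phi = np.column_stack([np.ones(l), x, x**2])      # three chosen predicates
P = Phi @ Phi.T / l**2                            # covariance-type predicate matrix
V = np.eye(l)                                     # identity instead of the V-matrix
tau = 50.0                                        # weight on the invariant term

# Unconditional form: minimize (Y - F)^T (V + tau*P) (Y - F) over F = X a.
M = V + tau * P
a = np.linalg.solve(X.T @ M @ X, X.T @ M @ y)
print(a)    # near (0.5, 0.5), the true conditional probability line
```

The invariant term only reweights the residuals, so the closed form is the same weighted least squares solve as before, with V replaced by V plus tau times P.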
It is a pretty general space. And in a reproducing kernel Hilbert space you have a great theorem called the representer theorem. The representer theorem says that if you would like to minimize this functional over the subset of functions whose norm in the reproducing kernel Hilbert space is bounded, then your solution has a linear representation over the kernel, with a finite number of parameters. So let's introduce the vector of functions (K(x1, x), ..., K(xl, x)); this is the vector of the expansion. Then the square of the norm of your function is A transposed K A, where K is the matrix of values K(xi, xj); this is how your function looks: the function you are looking for is a linear expansion over this vector, A transposed K A is the norm of your function in the reproducing kernel Hilbert space, and F(f) is just linear in A. The subset of functions with bounded norm has bounded VC-dimension: the smaller the bound C, the smaller the VC-dimension, according to the second theorem which I showed you before. To control the VC-dimension, you should just control C, the norm of the function. So the conditional minimization in the reproducing kernel Hilbert space has a closed-form solution: minimize this functional subject to the constraints, with the bound on the norm, and your solution is a linear expansion over this vector of kernel functions, where the coefficients are given in closed form through these matrix products, with gamma of C depending on the bound C, and the identity matrix. To find the remaining vector, you have to solve a linear equation. So you solve a linear equation, and then you have a closed-form solution. The complete problem in the reproducing kernel Hilbert space has a closed-form solution. But what about unconditional minimization, the approximate minimization with the constraints as penalties?
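A sketch of the closed-form solution via the representer theorem; here the norm bound is handled as a penalty with weight gamma, a kernel-ridge-style approximation of the constrained form in the lecture, and the data and kernel width are invented:

```python
import numpy as np

rng = np.random.default_rng(4)
l = 50
x = rng.uniform(-1, 1, l)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(l)

def K(a, b, width=0.3):
    """Gaussian (Mercer) kernel matrix between two point sets."""
    return np.exp(-((a[:, None] - b[None, :]) / width) ** 2)

# Representer theorem: the minimizer is f(x) = sum_i A_i K(x_i, x).
# Penalizing the RKHS norm A^T K A with weight gamma gives this closed form.
gamma = 1e-2
A = np.linalg.solve(K(x, x) + gamma * np.eye(l), y)

f_half = (K(np.array([0.5]), x) @ A)[0]
print(f_half)    # close to sin(pi/2) = 1
```

The solution is a one-layer expansion over the kernel evaluated at the training points, exactly as the representer theorem states; only the coefficient vector A is learned.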
It also has a closed-form solution, and this is how it looks. Your solution is the coefficients of this expansion, and you have this equation to find A, so everything is computable. But a very special role is played by the explanation of the support vector machine. What is a support vector machine? Given data, I would like, in a reproducing kernel Hilbert space with bounded norm, to minimize this functional. When you minimize this functional, you get exactly the support vector machine: if you look at the support vector machine, it is just minimization of this functional over this set of functions. But here you do something else. I keep the data term, like in the support vector machine: y_i minus the expansion, because I would like to approximate the data. But I would also like my invariants to be good enough: the left-hand side and right-hand side of each invariant should be close. So I would like to minimize this functional under the same constraint. And this is the solution: A is a function of Phi(t), and Phi(t), as you remember, is the vector of predicate values. If I make the invariants strict and do not insist on approximating the data, if I use just weak convergence, then my A is defined by the invariants, by the predicate functions. But how do I choose the predicates? I can choose any functions I want. That means that if I can choose one function which gives me the optimal solution, then there exists a smart predicate such that I will not need a lot of data. Why do I need data at all? I need data for the expansion over the kernel. But for estimating the coefficients, I don't need data; I use my predicates. And what does a predicate mean? It is a property, an explanation of what I want. I will talk about this later.
So this example of the support vector machine shows that the philosophy of this learning is very different. According to the representer theorem, the solution of the learning problem in a reproducing kernel Hilbert space has a property: it defines a linear parametric function in the form of an expansion over kernel functions. That means that the optimal expansion belongs to a one-layer network, not a multi-layer network, and a reproducing kernel Hilbert space is richer than that. Maybe it is not the best idea to use a deep network; OK, we will discuss deep networks. The observation vectors and the kernel define the basis of the expansion for the optimal l-parametric solution, but the parameters are defined by the invariants. What does it mean? If you write it in this form, you have K, which is a matrix depending on your training data, Phi, which is the predicate matrix, the vector Y, and the vector F, so you have a different formulation of the learning problem in terms of these vectors and matrices. Since any function from the Hilbert space can be used, because when we talked about weak convergence it converged for all functions of the Hilbert space, we have a chance to pick several smart functions, maybe even one, and it will be enough for our training. Then we will talk about what "smart" means; this is intelligent learning, not just brute force learning. So let me give you an illustration, to get a feeling. This is the least squares method and what it does. This is the V-matrix method; it does a little bit better. Here, if I introduce invariants, I do better still. And here, I use both invariants and the V-matrix; so you can see the difference for 48 points. And to make sure it is a difficult problem, we took 16 points from one class and 32 from the other, so the classes are unbalanced.
This is 98 points, the same story; this is 192 points, the same story; and this is a multidimensional case, so we checked it and it worked. And here is a very interesting case. We introduced some invariants and got a 22.73 percent error rate for the diabetes dataset. And we asked: can we find an invariant to improve this performance? What we did is look for the area where our invariants do not hold, where they are violated. Then we took a predicate which deals with this area: it is one when a point belongs to this area and zero when it does not. And using this invariant we improved the error rate from 22.73 to 22.07. So if you have a smart way of looking for invariants, you have a chance to improve your performance, and this is exactly the same philosophy which physicists use. Find the situation, the box in the figure, where the existing solution, the approximation we obtained, contradicts the evidence, does not keep the invariants inside the box. Then modify the solution, obtain a new approximation which resolves this contradiction. So you invent invariants and look where you have a contradiction; that is exactly the principle physicists use to discover laws of nature. To discover a law of nature, physicists first try to find a situation where the existing theory contradicts observations: the invariant fails, theoretical predictions are not supported by experiments. Then they try to reconstruct the theory to remove the contradiction; they construct a new approximation of the theory which does not contain the observed contradiction. And the most important and most difficult part of scientific discovery, here as in physics, is how to find the contradictory situation. Let me show something about neural nets. I am not fond of neural nets, but we can use our theory for neural nets as well.
What is a neural net? In a neural net you minimize the least squares error. I will instead minimize this approximation with invariants, which contains the (V plus P) matrix, the matrix P built from the invariants, and then do the same back-propagation. It happens that you can easily do back-propagation not just for the identity matrix, but for this matrix as well. If you do back-propagation, you need only one correction: instead of propagating the error, y minus what you have on the last layer, over all l observations, you multiply your vector of residuals by this matrix; you correct the propagated error. That is all the change to back-propagation: the same steps, but in the backward step you do one correction at the very last level, and then update as usual. So I came to NEC and asked the guys who have a neural network to incorporate just this small improvement. I took just one invariant, a very trivial invariant, phi of x equal to one; I will show you that it is not such a simple invariant, and we will discuss what this invariant does. And we set V equal to I; we did not use the V-matrix, just the identity matrix. We used 1,000 examples, 100 per class, with batch size six. This line is what a good deep neural network does, and this is what the improved network does: instead of 3.1 percent error, they got 2.9. OK, let's add just one more, not very important, cosine coefficient. I have a picture of my digit; I take the cosine expansion, so I have Fourier coefficients, and I use a predicate with this one coefficient. And again, you see that just one invariant with a stupid cosine makes an improvement, OK? Then we decided, let's do more: let's take 16 Fourier coefficients.
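The one-line back-propagation correction, multiplying the residual by the matrix before the usual backward pass, can be verified on a single linear layer; the sizes, the trivial predicate, and the weight on P are invented for the check:

```python
import numpy as np

rng = np.random.default_rng(5)
l, d = 20, 3
Xb = rng.standard_normal((l, d))
Y = rng.random(l)
W = rng.standard_normal(d)                  # one linear layer, for the sketch

Phi = np.column_stack([np.ones(l)])         # the trivial predicate phi = 1
M = np.eye(l) + 0.5 * (Phi @ Phi.T) / l**2  # V + tau*P with V = I

def loss(w):
    r = Xb @ w - Y
    return 0.5 * r @ M @ r                  # (1/2)(F - Y)^T M (F - Y)

# The only change to back-propagation: multiply the residual by M
# before propagating it back through the layers.
residual = M @ (Xb @ W - Y)
grad = Xb.T @ residual

# Finite-difference check that this is the true gradient.
eps = 1e-6
num = np.array([(loss(W + eps * e) - loss(W - eps * e)) / (2 * eps)
                for e in np.eye(d)])
print(np.max(np.abs(grad - num)))
```

With M equal to the identity this reduces to the ordinary least squares backward pass, which is why the change touches only the last layer.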
Four frequencies from x one and four from x two. And we got .6 more. But I can do whatever I want; I could use 1,600 invariants, many games can be played. Note that for the neural net we use an approximation of the exact solution, but it works. The statistical part of learning theory is complete. Why is it complete? Because there exist only two modes of convergence in Hilbert space: convergence in functions and convergence in functionals. There is no third mode of convergence. So, from a conceptual point of view, you can play one of these two games, or both together. That is why I call it complete: you cannot imagine anything else. If you would like to improve learning, you should ask yourself how to choose invariants, and that is what I will talk about. Invariants are about intelligence. When people talk about the duck test, they use "looks like a duck, swims like a duck, quacks like a duck", but I can say "plays chess like a duck". I can say whatever I want, and it will be an invariant, an equality; or say "sings like a duck", OK? But among all these stupid predicates there exist smart predicates, and the subject of learning, the subject of intelligent learning, is how to pick up the smart invariants. Let us discuss what a predicate is. I don't know exactly; I think this is a theory many hundreds of years old. I will show you that it is a continuation of a major line of philosophy, from Plato to Hegel to Wigner and so on. But the essence is in the predicate. My claim is that when you talk not about imitation but about the essence of intelligence, the essence is in the predicate. A predicate is something extra; I will show you later. A predicate says that some invariant holds true, mathematically. So let us take psi of x equals one. What does this predicate do? The expected number of elements of class y equals one, computed using this predicate, is equal to the number of training examples of the first class.
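The two modes of convergence, and the empirical invariant that a predicate psi induces, can be written as follows. This is a standard formulation assumed from context, with P(y=1|x) the desired conditional probability and P_l its estimate from l examples:

```latex
% Strong convergence (convergence in functions):
\lim_{\ell \to \infty} \int \bigl( P_\ell(y{=}1 \mid x) - P(y{=}1 \mid x) \bigr)^2 \, d\mu(x) = 0
% Weak convergence (convergence in functionals): for every predicate \psi,
\lim_{\ell \to \infty} \int \psi(x) \bigl( P_\ell(y{=}1 \mid x) - P(y{=}1 \mid x) \bigr) \, d\mu(x) = 0
% Empirical (statistical) invariant enforced on the training data:
\frac{1}{\ell} \sum_{i=1}^{\ell} \psi(x_i)\, P_\ell(y{=}1 \mid x_i)
  \;=\; \frac{1}{\ell} \sum_{i=1}^{\ell} \psi(x_i)\, y_i
```

With psi identically one, the last line says exactly what the talk states: the expected number of class-one elements under the estimated probability must equal the observed count of class-one training examples.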
When you use this predicate, you pick up a function which, on this training data, gives you the number of representatives of the first class exactly the same as in your training data. That is this predicate. And you saw how strong this predicate is for the neural net in digit recognition, because it captures something. Let us take another stupid predicate, just psi of x equals x. It "looks like a duck": it is the center of mass. I want the expected center of mass, the expectation with respect to the conditional probability, to be equal to the average, the center of mass which I see on my training data. That "looks like a duck". But you can do a smart "looks like a duck". I can consider x x transpose, which makes a matrix, the correlation, the covariance matrix, and I can see roughly n squared over two predicates of this type: the covariance which I will get using the obtained solution must be the same as the covariance which I observed on my training data. That is this predicate. But I sense that we should not go for fully general predicates; I am not so smart as to construct a very general predicate, so let us build predicates for 2D images. Like psi of x one, x two, a function of two variables; it is an image of a digit, say, in our case. And let us consider a predicate where I take the image and consider its Fourier coefficients; that is my predicate. I want the expected value with respect to my predicate to be the same as I see on my training data. I can also consider a convolution, because a convolutional neural network is one predicate; I will show you which predicate it is. It is the convolution of the image at different points. You can use whatever you want, because whatever comes from an inner product, it is you who decides what to use.
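The constant, center-of-mass, and covariance predicates above each induce an empirical invariant that a candidate solution can be checked against. A minimal numpy sketch (the data and the perfectly fitting model are made up for illustration; a model that reproduces the labels satisfies every such invariant trivially):

```python
import numpy as np

def invariant_holds(lhs, rhs, tol=1e-9):
    """Check the empirical invariant: model-side mean equals data-side mean."""
    return np.allclose(lhs, rhs, atol=tol)

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 4))              # 100 points, 4 features
y = rng.integers(0, 2, 100).astype(float)  # binary labels
p = y.copy()                               # model reproducing the labels exactly

# psi(x) = 1: expected count of class 1 matches the training count.
print(invariant_holds(np.mean(p), np.mean(y)))
# psi(x) = x: expected center of mass matches the empirical center of mass.
print(invariant_holds(x.T @ p / len(y), x.T @ y / len(y)))
# psi(x) = x x^T: expected (uncentered) covariance matches the empirical one,
# giving roughly n^2/2 distinct predicates, one per entry of the symmetric matrix.
print(invariant_holds((x.T * p) @ x / len(y), (x.T * y) @ x / len(y)))
```

In practice the interesting case is a model p that does not reproduce the labels, where each failed check points to a predicate worth enforcing.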
And understanding image recognition means understanding which predicates are involved. Note also the difference between a predicate and an invariant: a predicate is an abstract concept, but an invariant is built from your training data; it is what makes your abstract concept specific. It is still general, but it is specific. And now I want to show you instruments for special predicates. Let us consider digit recognition; keep digit recognition in mind. Suppose x is your pixel space, and you can make a small linear transformation of your pixel space. If you make a small linear transformation of pixel space, you transform your picture. But you can also transform the picture using Lie derivatives; I will show you what that is. To see what it is, I took this picture from a paper by Simard et al. This is the transformation, the first line: in pixel space you make your transformation. And here you make the same transformation via the Lie derivative, this black one; it is just the computed Lie derivative, and I will show you how to compute it. And alpha is a coefficient; using different coefficients, alpha equals minus two gives you this pair, alpha equals minus one gives you this, this, this, and so on. So you can create clones of the digit two, transformed with respect to the Lie derivative. You don't need a lot of data. For digit recognition you need the predicate, and from any data you can get this predicate. And I will show you the invariant with this predicate. This is what the Lie derivatives are: horizontal translation, x one plus a, you just move it in the direction of x one; then vertical translation; this is rotation, standard from geometry, and for a small rotation you have this. And this is d by dx one, this is d by dx two, and so on. You have all of this here.
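The Lie-derivative clones can be approximated to first order: a small horizontal translation of an image is the image plus alpha times its horizontal derivative. A minimal numpy sketch, following the spirit of the Simard et al. construction (the toy image is made up):

```python
import numpy as np

def lie_clones(img, alphas):
    """First-order 'clones' of an image under horizontal translation:
    T_a(img) ~ img + a * d(img)/dx1, where d/dx1 is the Lie derivative
    of the horizontal-translation group."""
    d_dx1 = np.gradient(img, axis=1)   # finite-difference horizontal derivative
    return [img + a * d_dx1 for a in alphas]

# Toy 8x8 "digit": a vertical bar.
img = np.zeros((8, 8))
img[:, 3] = 1.0

clones = lie_clones(img, alphas=[-2, -1, 1, 2])
print(len(clones), clones[0].shape)
```

Vertical translation and small rotation work the same way with d/dx2 and the rotation generator x1*d/dx2 - x2*d/dx1, so a handful of generators and coefficients yields the whole family of clones.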
And this is a big illustration from Patrick Simard: clones. You have this three and you create all these clones. You just choose, say, five Lie derivatives and two different coefficients, and then you have all of this. But here is the smart idea: why not take as a predicate just the Lie derivative itself? We take an invariant with respect to the Lie derivative: using a statistical invariant, I try to estimate a conditional probability which keeps the invariant with respect to all the derivatives. But there is even more, again building on what was done. Suppose I have this set of clones of my digit, and this is the set of clones of another digit. I call the tangent distance the distance between the closest elements of these two sets of clones. What does that mean? I have two different digits, say a two and a three. Then I massage them with linear transformations to make them as close as possible, and measure their closeness. That is called the tangent distance. And now, I believe this gives a general concept: the predicate of symmetry. For any picture, you can say what its measure of symmetry is. For example, you have a three. What I can do, for horizontal symmetry, is take the first line here, the second line here, the last line here, and then make another image: I make the last line the first line, this line the second line. So what am I doing? I take the three and flip it, treated as a vector. Now I have two different images. And I say: let me massage them with this tangent distance, these two different pictures, and see how close they are. If they are very close, I can say this is the coefficient of symmetry of this three. And I can have horizontal symmetry, vertical symmetry, antisymmetry. What does antisymmetry mean? The digit S has vertical antisymmetry. I can have vertical symmetry, everything.
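A symmetry coefficient along these lines can be sketched with a deliberately crude stand-in for the tangent distance: instead of minimizing over all linear transformations, this toy version minimizes only over first-order horizontal-translation clones. The images, the single-generator restriction, and the grid of coefficients are all simplifying assumptions.

```python
import numpy as np

def h_deriv(img):
    """Horizontal Lie derivative via finite differences."""
    return np.gradient(img, axis=1)

def tangent_distance(a, b, alphas=np.linspace(-1, 1, 9)):
    """Crude one-parameter tangent distance: minimum Euclidean distance
    between first-order horizontal-translation clones of a and b."""
    da, db = h_deriv(a), h_deriv(b)
    return min(np.linalg.norm((a + s * da) - (b + t * db))
               for s in alphas for t in alphas)

def vertical_symmetry_coefficient(img):
    """Tangent distance between an image and its left-right mirror:
    a small value means the image is nearly symmetric."""
    return tangent_distance(img, img[:, ::-1])

bar = np.zeros((8, 8)); bar[:, 3] = 1.0       # off-center bar: asymmetric
box = np.zeros((8, 8)); box[2:6, 2:6] = 1.0   # centered box: symmetric
print(vertical_symmetry_coefficient(box), "<", vertical_symmetry_coefficient(bar))
```

The centered box scores near zero (it equals its own mirror), while the off-center bar cannot be massaged onto its mirror and scores high, which is the talk's notion of a symmetry coefficient in miniature.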
So you can play many games with symmetry, and this is just one predicate. Now some concluding remarks. What we did: we can minimize this functional, which is slightly better than, say, least squares, subject to this constraint, which is the serious part, because this constraint defines the admissible set of functions, the set we construct by being smart. So I can consider, say, all continuous functions, and then from these continuous functions select, by smart predicates, the admissible set of functions. And I can do it because, according to weak convergence, the invariant takes place with any predicate. Also, this provides a unique solution in reproducing kernel Hilbert space, and an approximate solution for neural networks. But further progress goes beyond statistical reasoning. It goes in the direction of searching for the predicates which form the basis for understanding the problems existing in the world. And what does understanding mean? It means that in, say, 2D image recognition there exists the concept of symmetry, there exists the concept of structure, and if you know these concepts, and I believe there are not many of them, I will show you why I say it is not a lot, you will understand this problem. And I think this is a very old line. It starts from Plato. What does Plato say? Plato says that there is a world of ideas and a world of things, and the world of ideas makes the world of things. You see that I have ideas, which are predicates, abstract, applicable to different situations, and I have the world of things. Then, 300 years ago, classical German philosophy was about exactly that: what ideas mean, what things mean. And Hegel said: whatever is reasonable exists. That is exactly what we said about predicates. And whatever exists is reasonable. So there are two connections. But recently, 60 years ago, Wigner wrote an article about the unreasonable effectiveness of mathematics.
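The functional-plus-constraints formulation from these concluding remarks can be written schematically as follows. The notation is assumed from context: V is a positive matrix, Y the label vector, F the vector of function values on the training points, and each Psi_k the vector of predicate values:

```latex
\min_{f \in \mathcal{H}} \; (Y - F)^{\top} V \, (Y - F),
\qquad F = \bigl( f(x_1), \dots, f(x_\ell) \bigr)^{\top},
```
subject to the invariant constraints
```latex
\Psi_k^{\top} F = \Psi_k^{\top} Y,
\qquad \Psi_k = \bigl( \psi_k(x_1), \dots, \psi_k(x_\ell) \bigr)^{\top},
\qquad k = 1, \dots, m .
```

With V = I this reduces to least squares; the constraints carve the admissible set of functions out of the large initial class, which is the "serious part" of the construction.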
It says that mathematics knows something about reality: if you would like to understand reality, you should look in the equations and you will see how it works. So a predicate is an abstract idea, while the invariants built using it form elements of the solution. These two concepts reflect the essence of intelligence, not just its imitation, which is what artificial intelligence does. But that is the subject we should attack. I also have two more slides. One slide is a challenge. I know that people with deep networks get a 0.5% error rate using 60,000 observations. The challenge is: use 1% of this data and get the same result. With smart predicates and all these clones that exist, I think it is doable. And the very last slide. In 1928, Vladimir Propp published the book "Morphology of the Folktale", where he describes 31 predicates that allow one to synthesize all Russian folk tales. Later his morphology, the 31 predicates, was applied to literature, to theater, to film, to television, to television series, to games, et cetera, and these 31 were enough. I read this from the Wikipedia article on the book; you can check it. Propp found 31 predicates which describe different actions of people in the real world. Probably there exists a small number of predicates that describe 2D images. And that is intelligence: how to find them. That is what I believe learning should be about. Thank you. (audience applauds) - [Host] I think we have time for a few questions. - [Man in Audience] Hello, thank you, I have two questions. First, do you know of any predicates that you would recommend for language classification tasks specifically? And second, do you have any strategies for hedging against overfitting? If you specify too many predicates, then you might be sort of-- - Sorry, I did not hear you well, but the second question is about overfitting? - [Man in Audience] Overfitting, yes. - Yes, let me answer this. - [Man in Audience] Sure.
- Why does overfitting happen? Because your set of functions is big and you have a small amount of data for selecting a function, so you can select whatever you want. But if you increase the number of predicates, you decrease the set of functions. So the more predicates, the less overfitting happens. And mathematical theory says that with an infinite number of predicates you are left with one function, if you want. - [Host] He also asked about natural language: recommendations for predicates for natural language processing, the Turing test, any good predicates. - You know, natural language is a very complicated story; I don't know. - [Host] Questions? - You know, whatever I am talking about is very trivial, simple. Everyone is familiar with 2D images, and we can think, like this guy Vladimir Propp: what are the predicates for these images? If you are a smart guy, can you formulate a couple of dozen, or maybe one dozen, predicates? It should be enough. - [Host] But language is harder than images. - Oh yeah, absolutely. (audience laughs) Yeah, but don't do the hard problem immediately. Try to-- - [Host] Try? - Yeah, try something very simple, just step by step. It so happens that this is the main line of philosophy, from Plato to this guy who says that there are not too many ideas existing in the world of ideas. It could be like that. - [Host] Vladimir, thank you so much for coming today, and please give him a big hand. (audience applauding)