Transcript
Ow25mjFjSmg • Complete Statistical Theory of Learning (Vladimir Vapnik) | MIT Deep Learning Series
- Today, we're happy and honored to have Vladimir Vapnik with us: co-inventor of support vector machines, support vector clustering, and the VC theory of statistical learning, and author of "Statistical Learning Theory". He's one of the greatest and most impactful statisticians and computer scientists of our time. Plus, he grew up in the Soviet Union, eventually heading the Computer Science Department at the Institute of Control Sciences in Moscow. So he will give today's lecture in Russian and I will translate. Just kidding. (laughing) (audience laughs) It's an honor and a pleasure to have Vladimir with us today, so please give him a warm welcome. (audience applauds) - Thank you. About 50 years ago, Professor Chervonenkis and I started statistical learning theory. The problem was to answer the question: when, if you do well on a small amount of training data, will you also do well on the test data? That is, when will you minimize the expectation of error? The theory solves this problem. This theory appears in most books, maybe written in different languages, but mostly they follow the same line. The line is that the law of large numbers alone is not enough; you need a uniform law of large numbers, uniform convergence, and so on. But because we started this discussion from the empirical error, how well you do on the training data, and the bounds show that the better you do on the training data, the better you will do on the test data, people decided that the only way is to take training data and do something with it to minimize the number of errors, and all algorithms were constructed on this principle. About five years ago I found that there exists another principle, even more interesting than this one, because the first principle is a brute force principle: give me more data, and I will give you a better answer. The second principle is an intelligent principle, and I will talk about it today.
So I will start with statistical learning theory and then introduce this new theory. Because there are only two ways to generalization, one is data, and the other I will show, I call this the complete statistical theory of learning: there is no third way to do something, and there is no shortcut to generalization. You should use both of them, and that is the complete theory. But it is not so bad, because you will see that learning theory moves in a different direction, in the direction of intelligence, toward understanding what intelligence is. It is not the same as what Turing explained. Turing said you should imitate an intelligent person, but now the question is what intelligence is, and we will discuss this. So let me start. The first part is the theory of generalization, and the question is: when, in a given set of functions, can you minimize a functional? This is a pretty general functional, with loss function L(y, f(x)). Instead of a general loss I can consider the difference between y and the function; that is what they're doing in regression and pattern recognition and all this stuff, but this is a more general setting. What is more important is that when we consider minimization of a functional, we should say over which set of functions. You have to minimize the functional over the given set of functions when the probability measure is unknown but we are given i.i.d. data pairs. That is the exact setting of both the pattern recognition and the regression estimation problems: for pattern recognition the set of functions is indicator functions, for regression it is continuous functions, but the statement is the same, and the answer is very simple. We can minimize this functional using data if and only if the VC-dimension h of the set of functions is finite. Everything depends on the set of functions over which you minimize this functional, and you cannot avoid that. So we call it capacity, though maybe it is better called diversity of the set of functions, and the measure of this capacity, this diversity, is the VC-dimension.
And what is VC-dimension? I will give you the definition. First, for a set of indicator functions: theta is the step function; you consider a continuous function f, and in front of it you put theta to obtain an indicator. If the function is positive you say one, if not positive you say zero; that is the indicator. The VC-dimension of a set of indicator functions equals h if h is the maximal number of vectors that can be shattered, separated into all possible 2^h dichotomies, using indicator functions from this set. So you should find h vectors which you can shatter in all possible ways, but you cannot shatter any h plus one vectors. Then the VC-dimension of this set of functions is h. This is a purely combinatorial definition. And if you can shatter vectors for any number l, then we say the VC-dimension is infinite. Then I will give you two theorems which we will use; the first is probably the main theorem of this theory. If a set of functions has VC-dimension h, then with probability one minus eta, for all functions in the set, the bound holds true. And because this bound holds true for all functions, and you would like to have the minimal estimate, the minimal right-hand side, you choose the function which minimizes the empirical loss. The second theorem: if you have a set of indicator functions built from linear functions, and it so happens that your vectors x lie inside a sphere of radius one while w lies inside a ball of radius C, then the VC-dimension is bounded by the minimum of C squared times the squared radius and n, plus one, where n is the dimensionality of the space. That means exactly that the VC-dimension can be smaller than the dimensionality of the space, and you can control this part of the VC-dimension; I will show you afterwards how to do it. It is a very important theorem. But what is the general way VC theory suggests for searching for functions? You have a set of functions; maybe this set of functions has infinite VC-dimension. Then you build a structure on this set of functions.
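As a toy illustration of the shattering definition (a brute-force sketch; the threshold family and the point set are invented for the example, not taken from the lecture):

```python
from itertools import combinations

def shatters(points, classifiers):
    """True if `classifiers` realizes every 0/1 labeling of `points`."""
    realized = {tuple(c(x) for x in points) for c in classifiers}
    return len(realized) == 2 ** len(points)

def vc_dimension(domain, classifiers):
    """Largest h such that some h points from `domain` are shattered."""
    h = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(s, classifiers) for s in combinations(domain, k)):
            h = k
    return h

# Threshold indicators f(x) = [x >= t] on the real line.
thresholds = [lambda x, t=t: int(x >= t) for t in (-0.5, 0.5, 1.5, 2.5)]
points = [0.0, 1.0, 2.0]
print(vc_dimension(points, thresholds))  # 1
```

Threshold indicators on the line shatter any single point but can never produce the labeling (1, 0) on an ordered pair, so their VC-dimension is one.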
You choose a subset of functions, a small subset, with VC-dimension h one, then another subset which includes the small one, with VC-dimension h two, and so on, so you have nested subsets with increasing VC-dimension. And then, when you minimize the functional, you're doing two things. You choose an appropriate subset, and then in this subset you pick the function which minimizes the empirical loss. But you can see that the epsilon in the bound depends on the VC-dimension of the subset you choose, on the ratio of the VC-dimension to l, and also on the empirical loss which you achieve. So as soon as you make a structure, you can do whatever you want, even if the initial set of functions has infinite VC-dimension. This is more or less all that the main results of VC theory contain. But VC theory does not answer four very important questions. How to choose the loss function L(y, f)? How to select an admissible set of functions f(x)? I told you "a given set of functions", but what if it is a stupid set of functions; how do you construct a good set of functions? How to construct a structure on the admissible set of functions? And how to minimize the functional over this structure? In this talk I will try to answer all these questions. Target functional for minimization: this is an important slide, God plays dice. What is the setting of the pattern recognition problem? I will consider in this talk just the pattern recognition problem for two-class classification, but the generalization is straightforward. Given a generator, nature, which generates randomly and independently situations x, and an object which holds the conditional probability of y given x: the probability that y equals one given x, and that y equals zero given x. This object knows this conditional probability and plays dice. It has the conditional probability function, it gets x on the input, it plays dice, and it says y.
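The structure idea, picking a subset by trading empirical loss against capacity, can be sketched as model selection over nested polynomial classes. The penalty expression below mimics the epsilon term of the bound, and the data, noise level, and VC-dimension values for polynomials are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
l = 400
x = np.linspace(-1, 1, l)
y = x**2 + 0.1 * rng.standard_normal(l)      # true target: a quadratic

def vc_penalty(h, l, eta=0.05):
    # Illustrative capacity term of the epsilon form in the VC bound.
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

best = None
for degree in range(0, 8):                    # nested structure S1 ⊂ S2 ⊂ ...
    coeffs = np.polyfit(x, y, degree)
    emp = np.mean((np.polyval(coeffs, x) - y) ** 2)   # empirical loss
    bound = emp + vc_penalty(degree + 1, l)           # loss + capacity
    if best is None or bound < best[0]:
        best = (bound, degree)
print("selected degree:", best[1])
```

Low-degree subsets pay in empirical loss, high-degree subsets pay in capacity; the penalized bound is smallest for the quadratic here.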
That is the most general setting of the learning problem; the deterministic case is a particular case of this setting. And what does the learning machine have? The learning machine has a set of functions, and it can choose any function from this set. The problem is, observing l observations (x1, y1), ..., (xl, yl), to pick up the function for classification. That means, given observations generated by the joint probability P(x, y) = P(y|x) P(x), to find the rule that minimizes the functional. In my case, when y is zero or one and f is also an indicator taking zero or one, the loss just counts how many errors I make, and since I have a probability measure, the functional is the expectation of error. I would like to find the function which guarantees the smallest expectation of error. But this is not very convenient. Why not? Because this loss, as a function of the model, is zero everywhere except at some points where it jumps, so I cannot use a gradient in this case; I should do something smarter. What people do is replace the indicator loss with just the square of y minus f(x). It is not such a bad choice, because it so happens that the minimum of this functional gives the conditional probability function, the probability that y equals one given x. And once we can find this probability, we can easily construct our decision rule: if the conditional probability exceeds .5, say first class; if it is less than .5, second class; and this is the optimal solution. But something is wrong with this replacement. Let us rewrite the first line. Inside the bracket I subtract and add the regression, the conditional probability, so I have two brackets instead of one, and then I take the square.
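A minimal numeric sketch of why the square loss recovers the conditional probability: with an unrestricted f, minimizing the sum of squares at each x gives the empirical mean of y there, an estimate of P(y=1|x), and thresholding at .5 gives the decision rule. The tiny dataset is invented:

```python
from collections import defaultdict

# Labeled sample: y in {0,1} drawn from P(y=1|x); x takes discrete values.
data = [(0, 1), (0, 1), (0, 0), (1, 0), (1, 0), (1, 1), (1, 0)]

# Minimizing sum (y - f(x))^2 over unrestricted f gives the empirical
# mean of y at each x, i.e. an estimate of P(y=1|x).
sums, counts = defaultdict(float), defaultdict(int)
for x, y in data:
    sums[x] += y
    counts[x] += 1
cond_prob = {x: sums[x] / counts[x] for x in counts}

# Bayes decision rule: class 1 where P(y=1|x) exceeds 0.5.
rule = {x: int(p > 0.5) for x, p in cond_prob.items()}
print(cond_prob, rule)
```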
After expanding, the first integral does not depend on the function I am looking for, so I have to minimize, over the set of functions, just the sum of the two last terms. It is just the usual binomial expansion for two terms: square of the first, square of the second, and twice their product. But our goal is to minimize the first of the remaining integrals, to find the function which is close to the conditional probability, not the sum of two integrals. We can show that the second integral eventually goes to zero, but it slows down the rate of convergence. To have a good rate of convergence, we need to find a way to minimize the first integral alone, not the sum of these two. And that means not this square loss, but something else. What can we do? First of all, when y is zero or one, the probability that y equals one given x is some real-valued function between zero and one, and from the Bayes formula we know that the conditional probability of y equals one given x, times p(x), is the joint density of y equals one and x. That is always true. Now, if I multiply both sides by some function G(x - x*) belonging to L2 space and integrate, I will have this equation, and I can say that the conditional probability is a solution of this equation. It was constructed like that: if you put in the conditional probability, the equation holds. But I would like to solve this equation to find the function when I don't know the probability measure and am only given data, observations generated according to P(x, y). Solving an equation is an ill-posed problem; OK, let's deal with that. But before I do, I would like to mention that in classical statistics there is a way to replace the unknown probability measure with the empirical measure. And that is the most important part, the main inductive step of statistics.
In statistics, we're given data and would like to recover a function, and however much data we have, it is not equivalent to the function. So in classical statistics, people suggested approximating the cumulative distribution function by the empirical cumulative distribution function. For thirty years, mathematicians tried to prove that this is a good idea, that we can do it, and in 1933 Kolmogorov found the exact bound, which is on the last line, and then people proved bounds of the same form. Now, if we can replace the unknown measure with the empirical measure, we can make our problem constructive. Let us replace, in the functional which we would like to minimize, the real measure with the empirical measure. Then you have the empirical least squares functional which you have to minimize to find the conditional probability, or whatever you want. But let me consider a new constructive setting, where we also replace the unknown probability measure with the empirical probability measure obtained from the training data. And you will see that to find the conditional probability, we have to solve this equation in the set of functions f(x); the right-hand side is known because our function G is known, and on the left-hand side f(x) is unknown; it is our goal to find this function. In classical statistics there was one algorithm, the Nadaraya-Watson estimator, which shows how to estimate the conditional probability or the regression function. They just somehow defined this estimator: it has a numerator and a denominator, where G is a special kernel, say a Gaussian.
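The empirical CDF and the Kolmogorov statistic can be checked numerically; a sketch with uniform draws, where the true CDF is F(t) = t:

```python
import random
from bisect import bisect_right

random.seed(0)
l = 1000
sample = sorted(random.random() for _ in range(l))   # draws from U(0,1)

def ecdf(t):
    """Empirical CDF: fraction of the sample <= t."""
    return bisect_right(sample, t) / l

# Kolmogorov statistic sup_t |F_l(t) - F(t)|; for sorted data the
# supremum is attained at the sample points.
D = max(max((i + 1) / l - x, x - i / l) for i, x in enumerate(sample))
print(D)
```

By the Kolmogorov bound, D shrinks like one over the square root of l; for a thousand points it is already a few percent.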
This is the estimate of regression, a very general way of estimating regression, conditional probability, and all this business; if the kernel is Gaussian, the question is how to find the best value of the variance to approximate the conditional probability well. They spent a lot of time on this subject. But you can see, in the line in the middle, that the Nadaraya-Watson estimator comes from a corrupted equation, not from the equation which we derived. Here, in the middle, it is f(x), not f(xi) as in the kernel sum, so you can pull f(x) out of the sum and you get exactly this estimator. So actually the classical Nadaraya-Watson estimator is the solution of a corrupted version of the equation we obtained. But what does it mean to solve an equation? To solve the equation means I take the difference between the left-hand side and the right-hand side, define the area where I would like my function to operate, take the square, integrate over some probability measure, and minimize this functional. So let us do that; the algebra is very simple, and you can check it: you will obtain a functional R(f) which involves (y_i - f(x_i))(y_j - f(x_j)) multiplied by some coefficients depending on x_i and x_j. These coefficients are the integrals of G(x_i - x)G(x_j - x) over the measure, and they form a matrix; this is the element of the matrix. So what is the V-matrix? If we replace this integral with the empirical integral, we have one estimate of the matrix. If we use, say, the line between minus one and one, with x uniformly distributed over this line, and a Gaussian G, we will have this V-matrix in closed form. So the V-matrix is easy to find. Now I would like to use vector notation. I will call Y the vector of the training labels y1, ..., yl; I am given l pairs, so I create the vector Y of all the y's.
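A minimal Nadaraya-Watson estimator with a Gaussian kernel, as described above; the sample and bandwidth are made up for the sketch:

```python
import math

def nadaraya_watson(x_train, y_train, x, bandwidth=0.3):
    """Kernel estimate of P(y=1|x): a weighted average of the labels,
    with Gaussian weights G centered at the training points."""
    weights = [math.exp(-((x - xi) / bandwidth) ** 2 / 2) for xi in x_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)

# Class 1 concentrated on the right, class 0 on the left.
xs = [-1.0, -0.8, -0.6, 0.6, 0.8, 1.0]
ys = [0, 0, 0, 1, 1, 1]
print(nadaraya_watson(xs, ys, -0.9), nadaraya_watson(xs, ys, 0.9))
```

Near the left cluster the estimate is close to zero, near the right cluster close to one; thresholding at .5 gives the classifier.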
I will create F(f), which is also an l-dimensional vector: I pick a function f, and the first element is the value of this function at the point x1, and the last is the value at the point xl. So this is F, and I have the V-matrix. I can rewrite the functional in matrix form as (Y - F(f)) transposed, times V, times (Y - F(f)). If I write the least squares method in this notation, I have (Y - F(f)) transposed times (Y - F(f)), with the identity matrix instead of V. So I got some improvement over the least squares method, and I hope it will give me a better rate of convergence; let's see. But this is not the major point. OK, I improved the rate of convergence, but is the square loss itself good? Now, the most important part: selection of the admissible set of functions. What does it mean? When you construct a neural network, you say you are building a smart structure; that means you are constructing a smart admissible set of functions. You know something, you are smart guys, you construct the set, and then you minimize over it. But let me consider from a theoretical perspective what this is. If you consider Hilbert space, or also Euclidean space, there are two modes of convergence. Strong convergence is convergence of functions; that is the first line: my sequence of functions f_l converges to f_0 if the integral of the squared difference goes to zero as l goes to infinity. But there also exists weak convergence. We say that my sequence of functions f_l converges weakly to f_0 if the inner product of f_l with phi converges to the inner product of f_0 with phi for every function phi from the Hilbert space. You can see that each inner product describes a property of the function; if in the limit all properties are the same, that is weak convergence.
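The matrix form can be sketched for a model linear in its parameters. The particular V below, a Gram matrix of step functions under the uniform measure on [-1, 1], is an illustrative assumption for the sketch, not the exact matrix from the slides; setting V to the identity recovers ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
l = 30
x = np.sort(rng.uniform(-1, 1, l))
y = (x > 0).astype(float)                 # labels of a threshold rule
X = np.column_stack([np.ones(l), x])      # model f(x) = a0 + a1*x

# Illustrative V-matrix: V_ij = mu(t >= max(x_i, x_j)) under the
# uniform measure on [-1, 1] (an assumption for this sketch).
V = (1.0 - np.maximum.outer(x, x)) / 2.0

def weighted_ls(W):
    """argmin_a of (Y - X a)^T W (Y - X a), in closed form."""
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

a_ls = weighted_ls(np.eye(l))    # V = I: ordinary least squares
a_v = weighted_ls(V)             # V-matrix variant of the same problem
print(a_ls, a_v)
```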
It is easy to show that if you have strong convergence, you also have weak convergence; that follows from the Cauchy-Schwarz inequality. But one can also prove that if you have weak convergence and your set of functions belongs to a compact set, you also have strong convergence. So in some sense the two modes of convergence are equivalent. In our machine learning considerations, you see strong convergence everywhere, 100 percent. But what about weak convergence? Let's explore this opportunity. Consider the pattern recognition case. For the pattern recognition case, in the first line I just write the definition of weak convergence, and using the Bayes equation the integral of phi with respect to the conditional probability appears; it must converge for all functions phi. If it converges for all functions phi, I will have the one function which is the desired one. But that is not realistic. So let us do the following. Let us select from the Hilbert space m functions phi; we will talk a lot about how to select these functions. Then we consider the equality not for all functions, but just for these m functions. And we will call the admissible set of functions the set of functions which satisfies these equalities. We know that our desired function must satisfy this equality for any phi, because of weak convergence, but we select just the ones we want. Now, if you use, instead of the cumulative distribution function, its empirical estimate, you will have, instead of the integral property, the property written as a sum. Again, let me use the same notation: I use the vector Y; I use F(f), which is the vector (f(x1), ..., f(xl)), so for any function f I have a vector; and I also introduce a new vector, the vector of the values of the predicate at the points x1, ..., xl. Because phi is a function, I can consider its values at those points. Then I can write my equation in this form.
I would like my admissible set of functions to satisfy, in vector form, these m equations. Now let me explain what we are talking about. There is the duck test logic: if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. What does this mean here? We have these statistical invariants in vector form, like this line. What they do is collect the admissible functions, the functions which identify an animal as a duck if it looks, swims, and quacks like a duck. So if you choose predicates which explain what "looks", "swims", and "quacks" mean, then your admissible functions will be exactly those functions which classify as ducks the animals that look, swim, and quack like a duck. The concept of a predicate is very different from a feature. Why so? With an increasing number of predicates, the VC-dimension of the admissible set of functions decreases. Why does it decrease? Because we have a set of functions, and from this set we select only the functions which satisfy the new predicate; not all of them will satisfy it, and we consider only the set of functions which satisfy all these predicates. But with an increasing number of features, the VC-dimension increases, because your decision rules become more and more diverse. So what is the exact setting of the complete learning problem? Minimize the functional with the V-matrix, a slightly improved version of the square functional, subject to these constraints. And these constraints express what you would like to see in the set of admissible functions. Existing classical methods of pattern recognition just minimize the least squares functional; here we minimize the functional subject to the constraints, over our admissible set. That is the exact setting, which in mathematics is called conditional optimization, optimization of a functional under constraints. But there is also an approximation, unconditional optimization. I would like to minimize this functional subject to the constraints, but I will do the following.
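A sketch of checking the statistical invariants for chosen predicates; the predicates (the constant one and x, the "center of mass" test) and the synthetic data are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
l = 200
x = rng.uniform(-1, 1, l)
p_true = 1 / (1 + np.exp(-4 * x))              # P(y=1|x)
y = (rng.random(l) < p_true).astype(float)

# Predicates phi_k: phi = 1 (class proportion) and phi = x (center of mass).
Phi = np.column_stack([np.ones(l), x])

# f is admissible if (1/l) sum_i phi_k(x_i) f(x_i) equals
# (1/l) sum_i phi_k(x_i) y_i for every chosen predicate phi_k.
def invariant_gap(f_vals):
    return Phi.T @ (y - f_vals) / l

print(invariant_gap(p_true))           # true conditional probability: near 0
print(invariant_gap(np.full(l, 0.5)))  # constant f = 0.5 violates phi = x
```

The true conditional probability satisfies both invariants up to sampling noise; the constant function passes the class-proportion test here but fails the center-of-mass test, so the predicate x excludes it from the admissible set.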
I take the sum of this functional and the squares of the differences between the left- and right-hand sides of the constraints, with weights; the weights say how important it is for me to minimize the functional and how important to satisfy the constraints. And if I do that, I can rewrite the whole thing as the minimization of one functional, where P is just the covariance-type matrix of your predicates. Everything is simple to compute. So that is the concept of what we have to do. We have to solve our problem using both weak and strong convergence; that means using invariants and minimizing functionals, and we can do it in the exact way or in the approximation. Everything so far was written for an arbitrary set of functions; I did not say how I will minimize it. Just as the least squares functional can be minimized over any set of functions, the same is true here. But now let me show how to find the solution, and first of all I will do it for a reproducing kernel Hilbert space. This is the definition of a reproducing kernel Hilbert space: you have some kernel, a Mercer kernel; you take the inner product of the kernel with a function f from the space, and you get the same function back. It is important that you get the same function; this is the reproducing property of the kernel. It is known that a Mercer kernel has an expansion with values lambda, which are non-negative, and orthonormal functions psi. The set of functions, with this inner product and this norm, forms a reproducing kernel Hilbert space; it is very easy to check. It means that if you use orthonormal functions psi, consider expansions with coefficients c, and introduce a special inner product of this type, then the definition is satisfied and you have a reproducing kernel Hilbert space.
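The unconditional, penalized form can be written in closed form for a model linear in its parameters. The predicates, the weight tau, and the identity in place of the V-matrix are assumptions for the illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
l = 100
x = rng.uniform(-1, 1, l)
y = (rng.random(l) < (x + 1) / 2).astype(float)   # P(y=1|x) = (x+1)/2

X = np.column_stack([np.ones(l), x])              # model f(x) = a0 + a1*x
Phi = np.column_stack([np.ones(l), x, x**2])      # three chosen predicates
P = Phi @ Phi.T / l**2                            # covariance-type predicate matrix
V = np.eye(l)                                     # identity instead of the V-matrix
tau = 50.0                                        # weight on the invariant term

# Unconditional form: minimize (Y - F)^T (V + tau*P) (Y - F) over F = X a.
M = V + tau * P
a = np.linalg.solve(X.T @ M @ X, X.T @ M @ y)
print(a)    # near (0.5, 0.5), the true conditional probability line
```

The invariant term only reweights the residuals, so the closed form is the same weighted least squares solve as before, with V replaced by V plus tau times P.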
It is a pretty general space. And in a reproducing kernel Hilbert space you have a great theorem called the representer theorem. The representer theorem says that if you would like to minimize this functional over the subset of functions whose norm in the reproducing kernel Hilbert space is bounded, then your solution has a linear representation over the kernel, with a finite number of parameters. So let's introduce the vector of functions (K(x1, x), ..., K(xl, x)); this is the vector of the expansion. Then the square of the norm of your function is A transposed K A, where K is the matrix of values K(xi, xj); this is how your function looks: the function you are looking for is a linear expansion over this vector, A transposed K A is the norm of your function in the reproducing kernel Hilbert space, and F(f) is just linear in A. The subset of functions with bounded norm has bounded VC-dimension: the smaller the bound C, the smaller the VC-dimension, according to the second theorem which I showed you before. To control the VC-dimension, you should just control C, the norm of the function. So the conditional minimization in the reproducing kernel Hilbert space has a closed-form solution: minimize this functional subject to the constraints, with the bound on the norm, and your solution is a linear expansion over this vector of kernel functions, where the coefficients are given in closed form through these matrix products, with gamma of C depending on the bound C, and the identity matrix. To find the remaining vector, you have to solve a linear equation. So you solve a linear equation, and then you have a closed-form solution. The complete problem in the reproducing kernel Hilbert space has a closed-form solution. But what about unconditional minimization, the approximate minimization with the constraints as penalties?
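A sketch of the closed-form solution via the representer theorem; here the norm bound is handled as a penalty with weight gamma, a kernel-ridge-style approximation of the constrained form in the lecture, and the data and kernel width are invented:

```python
import numpy as np

rng = np.random.default_rng(4)
l = 50
x = rng.uniform(-1, 1, l)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(l)

def K(a, b, width=0.3):
    """Gaussian (Mercer) kernel matrix between two point sets."""
    return np.exp(-((a[:, None] - b[None, :]) / width) ** 2)

# Representer theorem: the minimizer is f(x) = sum_i A_i K(x_i, x).
# Penalizing the RKHS norm A^T K A with weight gamma gives this closed form.
gamma = 1e-2
A = np.linalg.solve(K(x, x) + gamma * np.eye(l), y)

f_half = (K(np.array([0.5]), x) @ A)[0]
print(f_half)    # close to sin(pi/2) = 1
```

The solution is a one-layer expansion over the kernel evaluated at the training points, exactly as the representer theorem states; only the coefficient vector A is learned.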
It also has a closed-form solution, and this is how it looks. Your solution is the coefficients of this expansion, and you have this equation to find A, so everything is computable. But a very special role is played by the explanation of the support vector machine. What is a support vector machine? Given data, I would like, in a reproducing kernel Hilbert space with bounded norm, to minimize this functional. When you minimize this functional, you get exactly the support vector machine: if you look at the support vector machine, it is just minimization of this functional over this set of functions. But here you do something else. I keep the data term, like in the support vector machine: y_i minus the expansion, because I would like to approximate the data. But I would also like my invariants to be good enough: the left-hand side and right-hand side of each invariant should be close. So I would like to minimize this functional under the same constraint. And this is the solution: A is a function of Phi(t), and Phi(t), as you remember, is the vector of predicate values. If I make the invariants strict and do not insist on approximating the data, if I use just weak convergence, then my A is defined by the invariants, by the predicate functions. But how do I choose the predicates? I can choose any functions I want. That means that if I can choose one function which gives me the optimal solution, then there exists a smart predicate such that I will not need a lot of data. Why do I need data at all? I need data for the expansion over the kernel. But for estimating the coefficients, I don't need data; I use my predicates. And what does a predicate mean? It is a property, an explanation of what I want. I will talk about this later.
So this example of the support vector machine shows that the philosophy of this learning is very different. According to the representer theorem, the solution of the learning problem in a reproducing kernel Hilbert space has a property: it defines a linear parametric function in the form of an expansion over kernel functions. That means that the optimal expansion belongs to a one-layer network, not a multi-layer network, and a reproducing kernel Hilbert space is richer than that. Maybe it is not the best idea to use a deep network; OK, we will discuss deep networks. The observation vectors and the kernel define the basis of the expansion for the optimal l-parametric solution, but the parameters are defined by the invariants. What does it mean? If you write it in this form, you have K, which is a matrix depending on your training data, Phi, which is the predicate matrix, the vector Y, and the vector F, so you have a different formulation of the learning problem in terms of these vectors and matrices. Since any function from the Hilbert space can be used, because when we talked about weak convergence it converged for all functions of the Hilbert space, we have a chance to pick several smart functions, maybe even one, and it will be enough for our training. Then we will talk about what "smart" means; this is intelligent learning, not just brute force learning. So let me give you an illustration, to get a feeling. This is the least squares method and what it does. This is the V-matrix method; it does a little bit better. Here, if I introduce invariants, I do better still. And here, I use both invariants and the V-matrix; so you can see the difference for 48 points. And to make sure it is a difficult problem, we took 16 points from one class and 32 from the other, so the classes are unbalanced.
This is 98 points, the same story; this is 192 points, the same story; and this is a multidimensional case, so we checked it and it worked. And here is a very interesting case. We introduced some invariants and got a 22.73 percent error rate for the diabetes dataset. And we asked: can we find an invariant to improve this performance? What we did is look for the area where our invariants do not hold, where they are violated. Then we took a predicate which deals with this area: it is one when a point belongs to this area and zero when it does not. And using this invariant we improved the error rate from 22.73 to 22.07. So if you have a smart way of looking for invariants, you have a chance to improve your performance, and this is exactly the same philosophy which physicists use. Find the situation, the box in the figure, where the existing solution, the approximation we obtained, contradicts the evidence, does not keep the invariants inside the box. Then modify the solution, obtain a new approximation which resolves this contradiction. So you invent invariants and look where you have a contradiction; that is exactly the principle physicists use to discover laws of nature. To discover a law of nature, physicists first try to find a situation where the existing theory contradicts observations: the invariant fails, theoretical predictions are not supported by experiments. Then they try to reconstruct the theory to remove the contradiction; they construct a new approximation of the theory which does not contain the observed contradiction. And the most important and most difficult part of scientific discovery, here as in physics, is how to find the contradictory situation. Let me show something about neural nets. I am not fond of neural nets, but we can use our theory for neural nets as well.
What is a neural net? In a neural net you minimize the least squares error. I will instead minimize this approximation with invariants, which contains the (V plus P) matrix, the matrix P built from the invariants, and then do the same back-propagation. It happens that you can easily do back-propagation not just for the identity matrix, but for this matrix as well. If you do back-propagation, you need only one correction: instead of propagating the error, y minus what you have on the last layer, over all l observations, you multiply your vector of residuals by this matrix; you correct the propagated error. That is all the change to back-propagation: the same steps, but in the backward step you do one correction at the very last level, and then update as usual. So I came to NEC and asked the guys who have a neural network to incorporate just this small improvement. I took just one invariant, a very trivial invariant, phi of x equal to one; I will show you that it is not such a simple invariant, and we will discuss what this invariant does. And we set V equal to I; we did not use the V-matrix, just the identity matrix. We used 1,000 examples, 100 per class, with batch size six. This line is what a good deep neural network does, and this is what the improved network does: instead of 3.1 percent error, they got 2.9. OK, let's add just one more, not very important, cosine coefficient. I have a picture of my digit; I take the cosine expansion, so I have Fourier coefficients, and I use a predicate with this one coefficient. And again, you see that just one invariant with a stupid cosine makes an improvement, OK? Then we decided, let's do more: let's take 16 Fourier coefficients.
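The one-line back-propagation correction, multiplying the residual by the matrix before the usual backward pass, can be verified on a single linear layer; the sizes, the trivial predicate, and the weight on P are invented for the check:

```python
import numpy as np

rng = np.random.default_rng(5)
l, d = 20, 3
Xb = rng.standard_normal((l, d))
Y = rng.random(l)
W = rng.standard_normal(d)                  # one linear layer, for the sketch

Phi = np.column_stack([np.ones(l)])         # the trivial predicate phi = 1
M = np.eye(l) + 0.5 * (Phi @ Phi.T) / l**2  # V + tau*P with V = I

def loss(w):
    r = Xb @ w - Y
    return 0.5 * r @ M @ r                  # (1/2)(F - Y)^T M (F - Y)

# The only change to back-propagation: multiply the residual by M
# before propagating it back through the layers.
residual = M @ (Xb @ W - Y)
grad = Xb.T @ residual

# Finite-difference check that this is the true gradient.
eps = 1e-6
num = np.array([(loss(W + eps * e) - loss(W - eps * e)) / (2 * eps)
                for e in np.eye(d)])
print(np.max(np.abs(grad - num)))
```

With M equal to the identity this reduces to the ordinary least squares backward pass, which is why the change touches only the last layer.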
Four frequencies from x one and four from x two. And we got .6 more. But I can do whatever I want; I could use 1,600 invariants, many games can be played. Note that for the neural net we use an approximation of the exact solution, but it works. The statistical part of learning theory is complete. Why is it complete? Because there exist only two modes of convergence in Hilbert space: convergence in functions and convergence in functionals. There is no third mode of convergence. So, from a conceptual point of view, you can play one of these two games, or both together. That is why I call it complete: you cannot imagine anything else. If you would like to improve learning, you should ask yourself how to choose invariants, and that is what I will talk about. Invariants are about intelligence. When people talk about the duck test, they use "looks like a duck, swims like a duck, quacks like a duck", but I can say "plays chess like a duck". I can say whatever I want, and it will be an invariant, an equality; or say "sings like a duck", OK? But among all these stupid predicates there exist smart predicates, and the subject of learning, the subject of intelligent learning, is how to pick up the smart invariants. Let us discuss what a predicate is. I don't know exactly; I think this is a theory many hundreds of years old. I will show you that it is a continuation of a major line of philosophy, from Plato to Hegel to Wigner and so on. But the essence is in the predicate. My claim is that when you talk not about imitation but about the essence of intelligence, the essence is in the predicate. A predicate is something extra; I will show you later. A predicate says that some invariant holds true, mathematically. So let us take psi of x equals one. What does this predicate do? The expected number of elements of class y equals one, computed using this predicate, is equal to the number of training examples of the first class.
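The two modes of convergence, and the empirical invariant that a predicate psi induces, can be written as follows. This is a standard formulation assumed from context, with P(y=1|x) the desired conditional probability and P_l its estimate from l examples:

```latex
% Strong convergence (convergence in functions):
\lim_{\ell \to \infty} \int \bigl( P_\ell(y{=}1 \mid x) - P(y{=}1 \mid x) \bigr)^2 \, d\mu(x) = 0
% Weak convergence (convergence in functionals): for every predicate \psi,
\lim_{\ell \to \infty} \int \psi(x) \bigl( P_\ell(y{=}1 \mid x) - P(y{=}1 \mid x) \bigr) \, d\mu(x) = 0
% Empirical (statistical) invariant enforced on the training data:
\frac{1}{\ell} \sum_{i=1}^{\ell} \psi(x_i)\, P_\ell(y{=}1 \mid x_i)
  \;=\; \frac{1}{\ell} \sum_{i=1}^{\ell} \psi(x_i)\, y_i
```

With psi identically one, the last line says exactly what the talk states: the expected number of class-one elements under the estimated probability must equal the observed count of class-one training examples.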
When you use this predicate, you pick up a function which, on this training data, gives you the number of representatives of the first class exactly the same as in your training data. That is this predicate. And you saw how strong this predicate is for the neural net in digit recognition, because it captures something. Let us take another stupid predicate, just psi of x equals x. It "looks like a duck": it is the center of mass. I want the expected center of mass, the expectation with respect to the conditional probability, to be equal to the average, the center of mass which I see on my training data. That "looks like a duck". But you can do a smart "looks like a duck". I can consider x x transpose, which makes a matrix, the correlation, the covariance matrix, and I can see roughly n squared over two predicates of this type: the covariance which I will get using the obtained solution must be the same as the covariance which I observed on my training data. That is this predicate. But I sense that we should not go for fully general predicates; I am not so smart as to construct a very general predicate, so let us build predicates for 2D images. Like psi of x one, x two, a function of two variables; it is an image of a digit, say, in our case. And let us consider a predicate where I take the image and consider its Fourier coefficients; that is my predicate. I want the expected value with respect to my predicate to be the same as I see on my training data. I can also consider a convolution, because a convolutional neural network is one predicate; I will show you which predicate it is. It is the convolution of the image at different points. You can use whatever you want, because whatever comes from an inner product, it is you who decides what to use.
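The constant, center-of-mass, and covariance predicates above each induce an empirical invariant that a candidate solution can be checked against. A minimal numpy sketch (the data and the perfectly fitting model are made up for illustration; a model that reproduces the labels satisfies every such invariant trivially):

```python
import numpy as np

def invariant_holds(lhs, rhs, tol=1e-9):
    """Check the empirical invariant: model-side mean equals data-side mean."""
    return np.allclose(lhs, rhs, atol=tol)

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 4))              # 100 points, 4 features
y = rng.integers(0, 2, 100).astype(float)  # binary labels
p = y.copy()                               # model reproducing the labels exactly

# psi(x) = 1: expected count of class 1 matches the training count.
print(invariant_holds(np.mean(p), np.mean(y)))
# psi(x) = x: expected center of mass matches the empirical center of mass.
print(invariant_holds(x.T @ p / len(y), x.T @ y / len(y)))
# psi(x) = x x^T: expected (uncentered) covariance matches the empirical one,
# giving roughly n^2/2 distinct predicates, one per entry of the symmetric matrix.
print(invariant_holds((x.T * p) @ x / len(y), (x.T * y) @ x / len(y)))
```

In practice the interesting case is a model p that does not reproduce the labels, where each failed check points to a predicate worth enforcing.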
And understanding image recognition means understanding which predicates are involved. Note also the difference between a predicate and an invariant: a predicate is an abstract concept, but an invariant is built from your training data; it is what makes your abstract concept specific. It is still general, but it is specific. And now I want to show you instruments for special predicates. Let us consider digit recognition; keep digit recognition in mind. Suppose x is your pixel space, and you can make a small linear transformation of your pixel space. If you make a small linear transformation of pixel space, you transform your picture. But you can also transform the picture using Lie derivatives; I will show you what that is. To see what it is, I took this picture from a paper by Simard et al. This is the transformation, the first line: in pixel space you make your transformation. And here you make the same transformation via the Lie derivative, this black one; it is just the computed Lie derivative, and I will show you how to compute it. And alpha is a coefficient; using different coefficients, alpha equals minus two gives you this pair, alpha equals minus one gives you this, this, this, and so on. So you can create clones of the digit two, transformed with respect to the Lie derivative. You don't need a lot of data. For digit recognition you need the predicate, and from any data you can get this predicate. And I will show you the invariant with this predicate. This is what the Lie derivatives are: horizontal translation, x one plus a, you just move it in the direction of x one; then vertical translation; this is rotation, standard from geometry, and for a small rotation you have this. And this is d by dx one, this is d by dx two, and so on. You have all of this here.
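The Lie-derivative clones can be approximated to first order: a small horizontal translation of an image is the image plus alpha times its horizontal derivative. A minimal numpy sketch, following the spirit of the Simard et al. construction (the toy image is made up):

```python
import numpy as np

def lie_clones(img, alphas):
    """First-order 'clones' of an image under horizontal translation:
    T_a(img) ~ img + a * d(img)/dx1, where d/dx1 is the Lie derivative
    of the horizontal-translation group."""
    d_dx1 = np.gradient(img, axis=1)   # finite-difference horizontal derivative
    return [img + a * d_dx1 for a in alphas]

# Toy 8x8 "digit": a vertical bar.
img = np.zeros((8, 8))
img[:, 3] = 1.0

clones = lie_clones(img, alphas=[-2, -1, 1, 2])
print(len(clones), clones[0].shape)
```

Vertical translation and small rotation work the same way with d/dx2 and the rotation generator x1*d/dx2 - x2*d/dx1, so a handful of generators and coefficients yields the whole family of clones.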
And this is a big illustration from Patrick Simard: clones. You have this three and you create all these clones. You just choose, say, five Lie derivatives and two different coefficients, and then you have all of this. But here is the smart idea: why not take as a predicate just the Lie derivative itself? We take an invariant with respect to the Lie derivative: using a statistical invariant, I try to estimate a conditional probability which keeps the invariant with respect to all the derivatives. But there is even more, again building on what was done. Suppose I have this set of clones of my digit, and this is the set of clones of another digit. I call the tangent distance the distance between the closest elements of these two sets of clones. What does that mean? I have two different digits, say a two and a three. Then I massage them with linear transformations to make them as close as possible, and measure their closeness. That is called the tangent distance. And now, I believe this gives a general concept: the predicate of symmetry. For any picture, you can say what its measure of symmetry is. For example, you have a three. What I can do, for horizontal symmetry, is take the first line here, the second line here, the last line here, and then make another image: I make the last line the first line, this line the second line. So what am I doing? I take the three and flip it, treated as a vector. Now I have two different images. And I say: let me massage them with this tangent distance, these two different pictures, and see how close they are. If they are very close, I can say this is the coefficient of symmetry of this three. And I can have horizontal symmetry, vertical symmetry, antisymmetry. What does antisymmetry mean? The digit S has vertical antisymmetry. I can have vertical symmetry, everything.
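A symmetry coefficient along these lines can be sketched with a deliberately crude stand-in for the tangent distance: instead of minimizing over all linear transformations, this toy version minimizes only over first-order horizontal-translation clones. The images, the single-generator restriction, and the grid of coefficients are all simplifying assumptions.

```python
import numpy as np

def h_deriv(img):
    """Horizontal Lie derivative via finite differences."""
    return np.gradient(img, axis=1)

def tangent_distance(a, b, alphas=np.linspace(-1, 1, 9)):
    """Crude one-parameter tangent distance: minimum Euclidean distance
    between first-order horizontal-translation clones of a and b."""
    da, db = h_deriv(a), h_deriv(b)
    return min(np.linalg.norm((a + s * da) - (b + t * db))
               for s in alphas for t in alphas)

def vertical_symmetry_coefficient(img):
    """Tangent distance between an image and its left-right mirror:
    a small value means the image is nearly symmetric."""
    return tangent_distance(img, img[:, ::-1])

bar = np.zeros((8, 8)); bar[:, 3] = 1.0       # off-center bar: asymmetric
box = np.zeros((8, 8)); box[2:6, 2:6] = 1.0   # centered box: symmetric
print(vertical_symmetry_coefficient(box), "<", vertical_symmetry_coefficient(bar))
```

The centered box scores near zero (it equals its own mirror), while the off-center bar cannot be massaged onto its mirror and scores high, which is the talk's notion of a symmetry coefficient in miniature.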
So you can play many games with symmetry, and this is just one predicate. Now some concluding remarks. What we did: we can minimize this functional, which is slightly better than, say, least squares, subject to this constraint, which is the serious part, because this constraint defines the admissible set of functions, the set we construct by being smart. So I can consider, say, all continuous functions, and then from these continuous functions select, by smart predicates, the admissible set of functions. And I can do it because, according to weak convergence, the invariant takes place with any predicate. Also, this provides a unique solution in reproducing kernel Hilbert space, and an approximate solution for neural networks. But further progress goes beyond statistical reasoning. It goes in the direction of searching for the predicates which form the basis for understanding the problems existing in the world. And what does understanding mean? It means that in, say, 2D image recognition there exists the concept of symmetry, there exists the concept of structure, and if you know these concepts, and I believe there are not many of them, I will show you why I say it is not a lot, you will understand this problem. And I think this is a very old line. It starts from Plato. What does Plato say? Plato says that there is a world of ideas and a world of things, and the world of ideas makes the world of things. You see that I have ideas, which are predicates, abstract, applicable to different situations, and I have the world of things. Then, 300 years ago, classical German philosophy was about exactly that: what ideas mean, what things mean. And Hegel said: whatever is reasonable exists. That is exactly what we said about predicates. And whatever exists is reasonable. So there are two connections. But recently, 60 years ago, Wigner wrote an article about the unreasonable effectiveness of mathematics.
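The functional-plus-constraints formulation from these concluding remarks can be written schematically as follows. The notation is assumed from context: V is a positive matrix, Y the label vector, F the vector of function values on the training points, and each Psi_k the vector of predicate values:

```latex
\min_{f \in \mathcal{H}} \; (Y - F)^{\top} V \, (Y - F),
\qquad F = \bigl( f(x_1), \dots, f(x_\ell) \bigr)^{\top},
```
subject to the invariant constraints
```latex
\Psi_k^{\top} F = \Psi_k^{\top} Y,
\qquad \Psi_k = \bigl( \psi_k(x_1), \dots, \psi_k(x_\ell) \bigr)^{\top},
\qquad k = 1, \dots, m .
```

With V = I this reduces to least squares; the constraints carve the admissible set of functions out of the large initial class, which is the "serious part" of the construction.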
It says that mathematics knows something about reality: if you would like to understand reality, you should look in the equations and you will see how it works. So a predicate is an abstract idea, while the invariants built using it form elements of the solution. These two concepts reflect the essence of intelligence, not just its imitation, which is what artificial intelligence does. But that is the subject we should attack. I also have two more slides. One slide is a challenge. I know that people with deep networks get a 0.5% error rate using 60,000 observations. The challenge is: use 1% of this data and get the same result. With smart predicates and all these clones that exist, I think it is doable. And the very last slide. In 1928, Vladimir Propp published the book "Morphology of the Folktale", where he describes 31 predicates that allow one to synthesize all Russian folk tales. Later his morphology, the 31 predicates, was applied to literature, to theater, to film, to television, to television series, to games, et cetera, and these 31 were enough. I read this from the Wikipedia article on the book; you can check it. Propp found 31 predicates which describe different actions of people in the real world. Probably there exists a small number of predicates that describe 2D images. And that is intelligence: how to find them. That is what I believe learning should be about. Thank you. (audience applauds) - [Host] I think we have time for a few questions. - [Man in Audience] Hello, thank you, I have two questions. First, do you know of any predicates that you would recommend for language classification tasks specifically? And second, do you have any strategies for hedging against overfitting? If you specify too many predicates, then you might be sort of-- - Sorry, I did not hear you well, but the second question is about overfitting? - [Man in Audience] Overfitting, yes. - Yes, let me answer this. - [Man in Audience] Sure.
- Why does overfitting happen? Because your set of functions is big and you have a small amount of data for selecting a function, so you can select whatever you want. But if you increase the number of predicates, you decrease the set of functions. So the more predicates, the less overfitting happens. And mathematical theory says that with an infinite number of predicates you are left with one function, if you want. - [Host] He also asked about natural language: recommendations for predicates for natural language processing, the Turing test, any good predicates. - You know, natural language is a very complicated story; I don't know. - [Host] Questions? - You know, whatever I am talking about is very trivial, simple. Everyone is familiar with 2D images, and we can think, like this guy Vladimir Propp: what are the predicates for these images? If you are a smart guy, can you formulate a couple of dozen, or maybe one dozen, predicates? It should be enough. - [Host] But language is harder than images. - Oh yeah, absolutely. (audience laughs) Yeah, but don't do the hard problem immediately. Try to-- - [Host] Try? - Yeah, try something very simple, just step by step. It so happens that this is the main line of philosophy, from Plato to this guy who says that there are not too many ideas existing in the world of ideas. It could be like that. - [Host] Vladimir, thank you so much for coming today, and please give him a big hand. (audience applauding)