Marcus Hutter: Universal Artificial Intelligence, AIXI, and AGI | Lex Fridman Podcast #75
E1AxVXt2Gv4 • 2020-02-26
The following is a conversation with Marcus Hutter, senior research scientist at Google DeepMind. Throughout his career of research, including with Jürgen Schmidhuber and Shane Legg, he has proposed a lot of interesting ideas in and around the field of artificial general intelligence, including the development of the AIXI (spelled A-I-X-I) model, which is a mathematical approach to AGI that incorporates ideas of Kolmogorov complexity, Solomonoff induction, and reinforcement learning. In 2006, Marcus launched the 50,000 euro Hutter Prize for lossless compression of human knowledge. The idea behind this prize is that the ability to compress well is closely related to intelligence. This, to me, is a profound idea. Specifically, if you can compress the first 100 megabytes or 1 gigabyte of Wikipedia better than your predecessors, your compressor likely has to also be smarter. The intention of this prize is to encourage the development of intelligent compressors as a path to AGI. In conjunction with this podcast release just a few days ago, Marcus announced a 10x increase in several aspects of this prize, including the money, to 500,000 euros. The better your compressor works relative to the previous winners, the higher the fraction of that prize money that is awarded to you. You can learn more about it if you simply Google "Hutter Prize". I'm a big fan of benchmarks for developing AI systems, and the Hutter Prize may indeed be one that will spark some good ideas for approaches that will make progress on the path of developing AGI systems.

This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, give it five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at lexfridman, spelled F-R-I-D-M-A-N. As usual, I'll do one or two minutes of ads now and never any ads in the middle that can break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience.

This show is presented by Cash App, the number
one finance app in the App Store. When you get it, use code LexPodcast. Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with as little as one dollar. Brokerage services are provided by Cash App Investing, a subsidiary of Square and member SIPC. Since Cash App allows you to send and receive money digitally, peer-to-peer, and security in all digital transactions is very important, let me mention the PCI data security standard that Cash App is compliant with. I'm a big fan of standards for safety and security. PCI DSS is a good example of that, where a bunch of competitors got together and agreed that there needs to be a global standard around the security of transactions. Now we just need to do the same for autonomous vehicles and AI systems in general. So again, if you get Cash App from the App Store or Google Play and use the code LexPodcast, you'll get ten dollars, and Cash App will also donate ten dollars to FIRST, one of my favorite organizations, which is helping to advance robotics and STEM education for young people around the world. And now, here's my conversation with Marcus Hutter.

Do you think of the universe as a computer, or maybe an information processing system? Let's go with a big question first.

Okay, with a big question first. Yeah, I think it's a very interesting hypothesis or idea, and I have a background in physics, so I know a little bit about physical theories, the standard model of particle physics and general relativity theory, and they are amazing and describe virtually everything in the universe. And they're all, in a sense, computable theories. I mean, they're very hard to compute, and, you know, they're very elegant, simple theories which describe virtually everything in the universe. So there's a strong indication that somehow the universe is computable, but it's a plausible hypothesis.

So what do you think, just like you said, general relativity, quantum field theory, why do you think that the laws of physics are so nice and beautiful and simple and compressible? Do you think our
universe was designed, is naturally this way? Are we just focusing on the parts that are especially compressible? Do our human minds just enjoy something about that simplicity, and in fact there are other things that are not so compressible?

No, I strongly believe, and I'm pretty convinced, that the universe is inherently beautiful, elegant, and simple, and described by these equations, and we're not just picking that. I mean, if there were some phenomena which cannot be neatly described, scientists would try that, right? And, you know, there's biology, which is more messy, but we understand that it's an emergent phenomenon, and, you know, it's complex systems, but they still follow the same rules, right, of quantum electrodynamics. All of chemistry follows that, and we know that. I mean, we cannot compute everything because we have limited computational resources. No, I think it's not a bias of the humans, but it's objectively simple. I mean, of course, you never know, you know, maybe there are some corners very far out in the universe, or super, super tiny below the nucleus of atoms, or, well, parallel universes, which are not nice and simple, but there's no evidence for that, and you should apply Occam's razor and, you know, choose the simplest theory consistent with it. But also, it's a little bit self-referential.

So maybe a quick pause. What is Occam's razor?

So Occam's razor says that you should not multiply entities beyond necessity, which, sort of, if you translate it to proper English, and, you know, in a scientific context, means that if you have two theories or hypotheses or models which equally well describe the phenomenon you study or the data, you should choose the more simple one.

So that's just a principle, sort of. That's not like a provable law, perhaps. Perhaps we'll kind of discuss it and think about it. But what's the intuition of why the simpler answer is the one that is likely to be a more correct descriptor of whatever we're talking about?

I believe that Occam's razor is probably the most important
principle in science. I mean, of course we do logical deduction and we do experimental design, but science is about understanding the world, finding models of the world, and we can come up with crazy complex models which, you know, explain everything but predict nothing. But the simple models seem to have predictive power, and it's a valid question why. And there are two answers to that. You can just accept it: that is the principle of science, and we use this principle and it seems to be successful. We don't know why, but it just happens to be. Or you can try, you know, to find another principle which explains Occam's razor. And if we start with the assumption that the world is governed by simple rules, then there's a bias toward simplicity, and applying Occam's razor is the mechanism for finding these rules. And actually, in a more quantitative sense, and we come back to that later in terms of Solomonoff induction, you can rigorously prove that if you assume that the world is simple, then Occam's razor is the best you can do in a certain sense.

So, I apologize for the romanticized question, but why do you think, outside of its effectiveness, why do you think we find simplicity so appealing as human beings? Why does E equals mc squared seem so beautiful to us humans?

I guess, mostly, in general, many things can be explained by an evolutionary argument. And, you know, there are some artifacts in humans which, you know, are just artifacts and not evolutionarily necessary. But with this beauty and simplicity, I believe at least the core is about, like science, finding regularities in the world, understanding the world, which is necessary for survival, right? You know, if I look at a bush, right, and I just see noise, and there is a tiger, right, and it eats me, then I'm dead. But if I try to find a pattern... and we know that humans are prone to find more patterns in data than there are, you know, like the Mars face and all these things. But this bias towards finding patterns, even if they
are not there, but, I mean, it's best, of course, if they are, yeah, helps us for survival.

Yeah, that's fascinating. I haven't thought really about this. I thought I just loved science, but indeed, in terms of just survival purposes, there is an evolutionary argument for why we find the work of Einstein so beautiful. Maybe a quick small tangent: could you describe what Solomonoff induction is?

Yeah, so that's a theory which I claim, and Ray Solomonoff sort of claimed, you know, a long time ago, solves the big philosophical problem of induction, and I believe the claim is essentially true. And what it does is the following. So, okay, for the picky listener, induction can be interpreted narrowly and widely. Narrow means inferring models from data, and widely means also then using these models for doing predictions, so prediction is also part of the induction. So I'm a little sloppy, sort of, with the terminology, and maybe that comes from Ray Solomonoff, you know, being sloppy. Maybe I shouldn't say that; he can't complain anymore. So let me explain this theory a little bit in simple terms. So assume you have a data sequence, make it very simple, the simplest one, say 1, 1, 1, 1, 1, and you see a hundred ones. What do you think comes next? The natural answer, I'm going to speed up a little bit, the natural answer is, of course, you know, 1. Okay, and the question is why. Okay, well, we see a pattern there. Okay, there's a 1 and we repeat it, and why should it suddenly, after a hundred ones, be different? So what we're looking for is simple explanations or models for the data we have. And now the question is, a model has to be presented in a certain language. Which language do we use? In science we want formal languages, and we can use mathematics, or we can use programs on a computer, so abstractly on a Turing machine, for instance, or it can be a general-purpose computer. And there are, of course, lots of models. You can say maybe it's a hundred ones and then a hundred zeros and a hundred ones; that's a model, right? But there are simpler models.
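
[Editorial sketch] The preference for the simpler of two models that fit the same data can be illustrated in a few lines of Python. This is only a toy under stated assumptions, not Solomonoff induction itself: true program length (Kolmogorov complexity) is uncomputable, so the character length of a small Python expression is used as a stand-in, and the candidate expressions as well as the names `candidates`, `run`, `consistent`, and `predict_next` are hypothetical choices made for illustration.

```python
# Observed data: one hundred ones, as in the example above.
observed = [1] * 100

# Hypothetical candidate "programs": Python expressions (as source text) that
# generate the first n symbols of a sequence. The source length is an assumed
# stand-in for program length; real Kolmogorov complexity is uncomputable.
candidates = {
    "all ones": "[1 for i in range(n)]",
    "100 ones, 100 zeros, repeating": "[1 if (i // 100) % 2 == 0 else 0 for i in range(n)]",
    "all zeros": "[0 for i in range(n)]",
}


def run(source, n):
    """Evaluate a candidate expression to obtain its first n symbols."""
    return eval(source, {"n": n})


def consistent(source, data):
    """A candidate is consistent if it reproduces the observed data exactly."""
    return run(source, len(data)) == data


def predict_next(data):
    """Pick the shortest consistent candidate and let it run one step further."""
    fitting = [
        (len(source), name, source)
        for name, source in candidates.items()
        if consistent(source, data)
    ]
    if not fitting:
        raise ValueError("no candidate reproduces the data")
    _, name, source = min(fitting)
    return name, run(source, len(data) + 1)[-1]


name, prediction = predict_next(observed)
print(name, prediction)  # prints: all ones 1
```

Both "all ones" and "100 ones, 100 zeros, repeating" reproduce the hundred observed ones, but they disagree about symbol 101; choosing the shorter description is what makes the prediction 1.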
There's a model "print 1, loop", and it also explains the data. And if you push that to the extreme, you are looking for the shortest program which, if you run this program, reproduces the data you have. It will not stop; it will continue naturally, and this you take for your prediction. And on the sequence of ones it's very plausible, right, that "print 1, loop" is the shortest program. We can give some more complex examples, like 1, 2, 3, 4, 5. What comes next? The short program is, again, you know, a counter. And so that is, roughly speaking, how Solomonoff induction works. The extra twist is that it can also deal with noisy data. So if you have, for instance, a coin flip, say a biased coin which comes up heads with 60% probability, then it will predict... it will learn and figure this out, and after a while it predicts, oh, the next coin flip will be heads with probability 60%. So it's the stochastic version of that.

But the goal, the dream, is always the search for the short program?

Yes, yeah. Well, in Solomonoff induction, precisely what you do is, so, you combine. So looking for the shortest program is like applying Occam's razor, like looking for the simplest theory. There's also Epicurus' principle, which says if you have multiple hypotheses which equally well describe your data, don't discard any of them, keep all of them around, you never know. And you can put that together and say, okay, I have a bias towards simplicity, but I don't rule out the larger models. And technically what we do is we weigh the shorter models higher and the longer models lower, and you use Bayesian techniques. You have a prior, which is precisely 2 to the minus the complexity of the program, and you weigh all these hypotheses and take this mixture, and then you get also this stochasticity in.

Yeah, like many of your ideas, that's just a beautiful idea of weighing based on the simplicity of the program. I love that. That seems to me maybe a very human-centric concept, it seems to be a very appealing way of discovering good programs in this world. You've
used the term compression quite a bit. I think it's a beautiful idea. Sort of, we just talked about simplicity, and maybe science, or just all of our intellectual pursuits, is basically the attempt to compress the complexity all around us into something simple. So what does this word mean to you, compression?

I essentially have already explained it. So compression means, for me, finding short programs for the data or the phenomenon at hand. You could interpret it more widely as, you know, finding simple theories, which can be mathematical theories, or maybe even informal, you know, like just in words. Compression means finding short descriptions, explanations, programs for the data.

Do you see science as a kind of our human attempt at compression? So we're speaking more generally, because when you say programs, you're kind of zooming in on a particular, sort of almost like a computer science, artificial intelligence focus. But do you see all of human endeavor as a kind of compression?

Well, at least all of science I see as an endeavor of compression, not all of humanity, maybe. And, well, there are also other aspects of science, like experimental design, right? I mean, we create experiments specifically to get extra knowledge, and that is then part of the decision-making process. But once we have the data, to understand the data is essentially compression. So I don't see any difference between compression, understanding, and prediction.

So we're jumping around topics a little bit, but returning back to simplicity, a fascinating concept of Kolmogorov complexity. So, in your sense, do most objects in our mathematical universe have high Kolmogorov complexity? And maybe, first of all, what is Kolmogorov complexity?

Okay, Kolmogorov complexity is a notion of simplicity or complexity, and it takes the compression view to the extreme. So I explained before that if you have some data sequence, just think about a file on a computer, and that's sort of, you know, just a string of bits. And if you, and we have data compressors,
like we compress big files into, say, zip files with certain compressors, and you can also produce self-extracting archives. That means, as an executable, if you run it, it reproduces the original file without needing an extra decompressor; it's just the decompressor plus the archive together in one. And now there are better and worse compressors, and you can ask, what is the ultimate compressor? So what is the shortest possible self-extracting archive you could produce for a certain data set, yeah, which reproduces the data set? And the length of this is called the Kolmogorov complexity. And arguably that is the information content in the data set. I mean, if the data set is very redundant or very boring, you can compress it very well, so the information content should be low, and, you know, it is low according to this definition.

So it's the length of the shortest program that summarizes the data?

Yes, yeah.

And what's your sense of our sort of universe, when we think about the different objects in our universe, concepts or whatever, at every level, do they have high or low Kolmogorov complexity? So what's the hope? Do we have a lot of hope in being able to summarize much of our world?

That's a tricky and difficult question. So, as I said before, I believe that the whole universe, based on the evidence we have, is very simple, so it has a very short description.

Sorry, to linger on that, the whole universe, what does that mean? Do you mean at the very basic fundamental level, in order to create the universe?

Yes, yeah. So you need a very short program, and you run it...

To get the thing going.

You get the thing going, and then it will reproduce our universe. And there's a problem with noise; we can come back to that later, possibly.

Is noise a problem or a feature, is it a bug or a feature?

I would say it makes our life as a scientist really, really much harder. I mean, think about it: without noise we wouldn't need all of the statistics.

But then maybe we wouldn't feel like there's
a free will. Maybe we need that for the...

Yes, this is an illusion, that noise can give you free will.

At least in that way, it's a feature.

But also, if you don't have noise, you have chaotic phenomena, which are effectively like noise, so we can't, you know, get away from statistics even then. I mean, think about rolling a die, and, you know, forget about quantum mechanics, and you know exactly how you throw it. But, I mean, it's still so hard to compute the trajectory that effectively it is best to model it, you know, as coming out with a number, with probability 1 over 6. But from this sort of philosophical Kolmogorov complexity perspective, if we didn't have noise, then arguably you could describe the whole universe as, well, standard model plus general relativity. I mean, we don't have a theory of everything yet, but sort of assuming we are close to it or have it, yeah, plus the initial conditions, which may hopefully be simple, and then you just run it and then you would reproduce the universe. But that's spoiled by noise, or by chaotic systems, or by initial conditions, which, you know, may be complex. So now, if we don't take the whole universe but just a subset, you know, just take planet Earth. Planet Earth cannot be compressed, you know, into a couple of equations. This is a hugely complex system.

So, interesting. So when you look at the window, like the whole thing might be simple, but when you just take a small window...

Then it may become complex, and that may be counterintuitive, but there's a very nice analogy, the library of all books. So imagine you have a normal library with interesting books, and you go there, great, lots of information, and quite complex, yeah. So now I create a library which contains all possible books, say of 500 pages. So the first book just has a, a, a, a over all the pages. The next book a, a, a, and ends with b, and so on. I create this library of all books. I can write a super short program which creates this library. So this library, which has all books, has zero information content, and you
take a subset of this library and suddenly you have a lot of information in there.

So that's fascinating. I think one of the most beautiful mathematical objects, that at least today seems to be understudied or under-talked-about, is cellular automata. What lessons do you draw from, sort of, the Game of Life, from cellular automata, where you start with simple rules, just like you're describing with the universe, and somehow complexity emerges? Do you feel like you have an intuitive grasp on the fascinating behavior of such systems, where, like you said, some chaotic behavior could happen, some complexity could emerge, some could die out, and some could form very rigid structures? Do you have a sense about cellular automata that somehow transfers, maybe, to the bigger questions of our universe?

Cellular automata, and especially Conway's Game of Life, are really great because the rules are so simple. You can explain them to every child, and even by hand you can simulate a little bit, and you see these beautiful patterns emerge. And people have proven, you know, that it is even Turing complete. You cannot just use a computer to simulate the Game of Life, but you can also use the Game of Life to simulate any computer. That is truly amazing, and it's the prime example, probably, to demonstrate that very simple rules can lead to very rich phenomena. And people, you know, sometimes ask, you know, how can chemistry and biology be so rich? I mean, this can't be based on simple rules. But now we know quantum electrodynamics describes all of chemistry. And, we come back to that later, I claim intelligence can be explained or described in one single equation, this very rich phenomenon. You asked also about whether, you know, I understand this phenomenon, and it's probably not. And there's this saying, you never really understand things, you just get used to them. And I think I'm pretty used to cellular automata, so you believe that you understand now why this phenomenon happens. But I give you a different example. I
didn't play too much with Conway's Game of Life, but a little bit more with fractals and with the Mandelbrot set, and it's beautiful, you know, patterns, just look at the Mandelbrot set. And, well, when the computers were really slow and I just had a black and white monitor, I programmed my own programs in assembler too.

Wow. Wow.

To get these fractals on the screen, and I was mesmerized. And much later, so I returned to this, you know, every couple of years, and then I tried to understand what is going on, and you can understand a little bit. So I tried to derive the locations, you know, there are these circles and the apple shape, and then you have smaller Mandelbrot sets recursively in this set. And there's a way, mathematically, by solving high-order polynomials, to figure out where these centers are and what size they are approximately. And by sort of mathematically approaching this problem, you slowly get a feeling of why things are like they are, and that sort of is, you know, a first step to understanding this rich phenomenon.

Do you think it's possible, what's your intuition, do you think it's possible to reverse engineer and find the short program that generated these fractals, sort of by looking at the fractals?

Well, in principle, yes. So, I mean, in principle what you can do is you take, you know, any data set, you know, you take these fractals, or you take whatever your data set, whatever you have, say a picture of Conway's Game of Life, and you run through all programs. You take programs 1, 2, 3, 4, and all these programs, you run them all in parallel in so-called dovetailing fashion, give them computational resources, the first one 50%, the second one half of that, and so on, and let them run. Wait until they halt and give an output, compare it to your data, and if some of these programs produce the correct data, then you stop, and then you have already some program. It may be a long program because it's faster. And then you continue, and you get shorter and shorter programs until you eventually find
the shortest program. The interesting thing is, you can never know whether it's the shortest program, because there could be an even shorter program which is just even slower, and you just have to wait, yeah. But asymptotically, and actually after finite time, you have the shortest program. So this is a theoretical but completely impractical way of finding the underlying structure in every data set, and that is what Solomonoff induction does, and Kolmogorov complexity. In practice, of course, we have to approach the problem more intelligently. And then, if you take resource limitations into account, there's, for instance, the field of pseudo-random numbers, and these are random numbers... so these are deterministic sequences, but no algorithm which is fast, fast means runs in polynomial time, can detect that it's actually deterministic. So we can produce interesting, I mean, random numbers are maybe not that interesting, but just an example. We can produce complex-looking data, and we can then prove that no fast algorithm can detect the underlying pattern.

Which is, unfortunately... that's a big challenge for our search for simple programs in the space of artificial intelligence, perhaps.

Yes, it definitely is for artificial intelligence, and it's quite surprising that it's, I can't say easy, I mean, people worked really hard to find these theories, but apparently it was possible for human minds to find these simple rules in the universe. It could have been different, right?

It could have been different. It's inspiring. So let me ask another absurdly big question. What is intelligence, in your view?

So I have, of course, a definition.

I wasn't sure what you're going to say, because you could have just as easily said, I have no clue.

Which many people would say. I'm not modest in this question. So the informal version, which I worked out together with Shane Legg, who co-founded DeepMind, is that intelligence measures an agent's ability to perform well in a wide range of environments. So that doesn't sound very impressive, but
these words have been very carefully chosen, and there is a mathematical theory behind it, and we come back to that later. And if you look at this definition by itself, it seems like, yeah, okay, but it seems a lot of things are missing. But if you think it through, then you realize that most, and I claim all, of the other traits, at least of rational intelligence, which we usually associate with intelligence, are emergent phenomena from this definition. Creativity, memorization, planning, knowledge, you all need that in order to perform well in a wide range of environments, so you don't have to explicitly mention that in a definition.

Interesting. So yeah, so consciousness, abstract reasoning, all these kinds of things are just emergent phenomena that help you in... Can you say the definition again? So, multiple environments. Did you mention the word goals?

No, but we have an alternative definition. Instead of performing well, you can just replace it by goals. So intelligence measures an agent's ability to achieve goals in a wide range of environments. That's more or less equal.

But in there, there's an injection of the word goals, so you want to specify that there should be a goal.

Yeah, but perform well is sort of, what does it mean? It's the same problem.

Yeah, there's a little bit of a gray area, but it's much closer to something that could be formalized. In your view, where do humans fit into that definition? Are they general intelligence systems that are able to perform in, like, how good are they at fulfilling that definition, at performing well in multiple environments?

Yeah, that's a big question. I mean, the humans are performing best among all species that we know of.

It depends. You could say that trees and plants are doing a better job. They'll probably outlast us.

Yeah, but they're in a much more narrow environment, right? I mean, you just, you know, have a little bit of air pollution and these trees die, and we can adapt, right? We build houses with filters, we do geoengineering.

So
multiple environment part...

Yes, that is very important, yes. So that distinguishes narrow intelligence from wide intelligence, also in AI research.

So let me ask the Alan Turing question: can machines think? Can machines be intelligent? So, in your view, I have to kind of ask, the answer is probably yes, but I want to kind of hear your thoughts on it. Can machines be made to fulfill this definition of intelligence, to achieve intelligence?

Well, we are sort of getting there, and, you know, on a small scale, we are already there. The wide range of environments is missing, but we have self-driving cars, we have programs which play Go and chess, we have speech recognition. So it's pretty amazing, but, you know, these are narrow environments. But if you look at AlphaZero, that was also developed by DeepMind, I mean, famously there was AlphaGo, and then came AlphaZero a year later, that was truly amazing. So a reinforcement learning algorithm which is able, just by self-play, to play chess and then also Go. And, I mean, yes, they're both games, but they're quite different games, and, you know, you don't feed them the rules of the game. And the most remarkable thing, which is still a mystery to me, is that usually for any decent chess program, I don't know much about Go, you need opening books and endgame tables and so on too, and nothing in there, nothing was put in there.

With AlphaZero, the self-play mechanism, starting from scratch, being able to learn actually new strategies is...

Yeah, it rediscovered, you know, all these famous openings within four hours by itself. What I was really happy about, I'm a terrible chess player, but I like the Queen's Gambit, and AlphaZero figured out that this is the best opening. Correct. So yes, to answer your question, yes, I believe that general intelligence is possible. And it also depends how you define it. Do you say AGI, with general intelligence, artificial general intelligence, only refers to if you achieve human level, or a subhuman level but quite broad, is
it also general intelligence? So we have to distinguish, or is it only superhuman intelligence, general artificial intelligence?

Is there a test in your mind, like the Turing test for natural language, or some other test that would impress the heck out of you, that would kind of cross the line of your sense of intelligence within the framework that you said?

Well, the Turing test, well, has been criticized a lot, but I think it's not as bad as some people think. Some people think it's too strong, so it tests not just for a system to be intelligent, but it also has to fake human deception, right, which is, you know, much harder. And on the other hand, they say it's too weak, because it just maybe fakes, you know, emotions or intelligent behavior; it's not real. But I don't think that's the problem, or a big problem. So if you would pass the Turing test, so a conversation over a terminal with a bot for an hour, or maybe a day or so, and you can fool a human into, you know, not knowing whether this is a human or not, that's the Turing test, I would be truly impressed. And we have these annual competitions, the Loebner Prize, and, I mean, it started with ELIZA, that was the first conversational program, and, what is it called, the Japanese Mitsuku or so, that's the winner of the last, you know, couple of years, and, well...

It's quite impressive.

Yeah, it's quite impressive. And then Google has developed Meena, right, just recently. That's an open-domain conversational bot, just a couple of weeks ago, I think.

Yeah, I kind of like the metric that, sort of, the Alexa Prize has proposed. I mean, maybe it's obvious to you, it wasn't to me, of setting sort of a length of a conversation. Like, you want the bot to be sufficiently interesting that you'd want to keep talking to it for, like, 20 minutes. And that's a surprisingly effective metric in aggregate, because, really, like, nobody has the patience to be able to talk to a bot that's not interesting and intelligent and witty, and is able to go on different tangents, jump domains, be
able to, you know, say something interesting to maintain your attention.

Maybe many humans would also fail this test.

Unfortunately, we set, just like with autonomous vehicles, with chatbots, we also set a bar that's way too high to reach.

I said, you know, the Turing test is not as bad as some people believe, but what is really not useful about the Turing test is that it gives us no guidance on how to develop these systems in the first place. Of course, you know, we can develop them by trial and error and, you know, do whatever, and then run the test and see whether it works or not. But a mathematical definition of intelligence gives us, you know, an objective which we can then analyze by, you know, theoretical tools or computational, and, you know, maybe even prove how close we are. And we will come back to that later with the AIXI model. So, I mentioned the compression, right? So in natural language processing they have achieved amazing results, and one way to test this, of course, you know, is to take the system, you train it, and then you, you know, see how well it performs on the task. But a lot of performance measurement is done by so-called perplexity, which is essentially the same as complexity or compression length. So the NLP community develops new systems, and then they measure the compression length, and then they have rankings and leagues, because there's a strong correlation between compressing well and then the system performing well at the task at hand. It's not perfect, but it's good enough for them as an intermediate aim.

So you mean a measure, so this is kind of almost returning to the Kolmogorov complexity. So you're saying good compression usually means good intelligence?

Yes.

So you mentioned you're one of the only people who dared boldly to try to formalize the idea of artificial general intelligence, to have a mathematical framework for intelligence, just like, as we mentioned, termed AIXI, A-I-X-I. So let me ask the basic question: what is AIXI?

Okay, so let me first say what it stands
for because what it stands for actually that's probably the more basic question but the first question is usually how it's pronounced but finally I put it on the website how it's pronounced and you figured it out yeah the name comes from AI artificial intelligence and the XI is the Greek letter xi which I used for Solomonoff's distribution for quite stupid reasons which I'm not willing to repeat here in front of camera so it just happened to be more or less arbitrary I chose the xi but it also has nice other interpretations so there are actions and perceptions in this model right an agent has actions and perceptions over time so this is A index I X index I so there's the action at time I and then followed by perception at time I we'll go with that I'll edit out the first part yes I'm just kidding I have some more interpretations so at some point maybe five years ago or ten years ago I discovered in Barcelona it was on a big church there was you know stone engraved some text and the word aixi appeared there I was very surprised and happy about it and I looked it up so it is Catalan language and it means with some interpretation that's it that's the right thing to do yeah eureka oh so it's almost like destined somehow it came yeah yeah came to you in a dream so also there's a Chinese word aixi also written like AIXI if you transcribe that to pinyin and the final one is that it is AI crossed with induction because that is and that's going more to the content now so good old-fashioned AI is more about you know planning in a known deterministic world and induction is more about often you know IID data and inferring models and essentially what this AIXI model does is combining these two and I actually also recently I think heard that in Japanese AI means love so if you can combine AIXI somehow with that I think there might be some interesting ideas there so let's then take the next step can you maybe talk at the big level of what is this
mathematical framework yeah so it consists essentially of two parts one is the learning and induction and prediction part and the other one is the planning part so let's come first to the learning induction prediction part which essentially I explained already before so what we need for any agent to act well is that it can somehow predict what happens I mean if you have no idea what your actions do how can you decide which actions are good or not so you need to have some model of what your actions affect so what you do is you have some experience you build models like scientists you know of your experience then you hope these models are roughly correct and then you use these models for prediction and the model is sorry to interrupt the model is based on your perception of the world how your actions will affect that world that's not so much the important part it is technically important but at this stage we can just think about predicting say stock market data weather data or IQ sequences one two three four five what comes next yeah so of course our actions affect what we're doing but I'll come back to that in a second so and I'll keep just interrupting so just to draw a line between prediction and planning what do you mean by prediction in this sense where it's trying to predict the environment without your long-term action in the environment what is prediction okay if you want to put the actions in now okay then let's put them in now yes so the question okay so the simplest form of prediction is that you just have data which you passively observe yes and you want to predict what happens without you know interfering as I said weather forecasting stock market IQ sequences or just anything okay and Solomonoff's theory of induction is based on compression so you look for the shortest program which describes your data sequence and then you take this program run it which reproduces your data sequence by definition and then you let it continue running and then it will
produce some predictions and you can rigorously prove that for any prediction task this is essentially the best possible predictor of course if there's a prediction task which is unpredictable like you know fair coin flips yeah I cannot predict the next fair coin flip what Solomonoff does is say okay the next head is probably 50% it's the best you can do so if something is unpredictable Solomonoff will also not magically predict it but if there is some pattern and predictability then Solomonoff induction will figure that out eventually and not just eventually but rather quickly and you can prove convergence rates whatever your data is so that's pure magic in a sense what's the catch well the catch is that it is not computable and we'll come back to that later you cannot just implement it even with Google resources here and run it and you know predict the stock market and become rich I mean Ray Solomonoff already tried it at the time but the basic task is you know you're in the environment and you're interacting with an environment to try to learn a model of the environment and the model is in the space of all these programs and your goal is to get a bunch of programs that are simple and so let's go to the actions now but actually good that you asked usually I skip this part although there is also a minor contribution which I did so the action part but I usually sort of just jump to the decision part so let me explain the action part now thanks for asking so you have to modify it a little bit by now not just predicting a sequence which just comes to you but you have an observation then you act somehow and then you want to predict the next observation based on the past observation and your action then you take the next action you don't care about predicting it because you're doing it and then you get the next observation and before you get it you want to predict it again based on your past action and observation sequence it's just
conditioned extra on your actions there's an interesting alternative that you also try to predict your own actions if you want oh in the past or the future your future actions wait let me wrap I think my brain just broke we should maybe discuss that later after I've explained the AIXI model that's an interesting variation but that is a really interesting variation and a quick comment I don't know if you want to insert that in here but you're looking at in terms of observations you're looking at the entire the big history a long history of the observations exactly it's very important the whole history from birth sort of of the agent and we can come back to that also why this is important here often you know in RL you have MDPs Markov decision processes which are much more limiting okay so now we can predict conditioned on actions so even if we influence the environment but prediction is not all we want to do right we also want to act really in the world and the question is how to choose the actions and we don't want to greedily choose the actions you know just you know what is best in the next time step and first I should say you know how do we measure performance so we measure performance by giving the agent reward that's the so called reinforcement learning framework so every time step you can give it a positive reward or negative reward or maybe no reward it could be very scarce right like if you play chess just at the end of the game you give +1 for winning or -1 for losing so in the AIXI framework that's completely sufficient so occasionally you give a reward signal and you ask the agent to maximize reward but not greedily sort of you know the next one next one because that's very bad in the long run if you're greedy so but over the lifetime of the agent so let's assume the agent lives for M time steps let's say it dies in sort of hundred years sharp that's just you know the simplest model to explain so it looks at the future reward sum
and asks what is my action sequence or actually more precisely my policy which leads in expectation because I don't know the world to the maximum reward sum let me give you an analogy in chess for instance we know how to play optimally in theory it's just a minimax strategy I play the move which seems best to me under the assumption that the opponent plays the move which is best for him so worst for me under the assumption that I play again the best move and then you have this expectimax tree to the end of the game and then you back propagate and then you get the best possible move so that is the optimal strategy which von Neumann already figured out a long time ago for playing adversarial games luckily or maybe unluckily for the theory it becomes harder the world is not always adversarial so it can be if there are other humans even cooperative or nature is usually I mean the dead nature is stochastic you know things just happen randomly or don't care about you so what you have to take into account is the noise and not necessarily adversariality so you replace the minimum on the opponent's side by an expectation which is general enough to include also adversarial cases so now instead of a minimax strategy you have an expectimax strategy so far so good so that is well known it's called sequential decision theory but the question is on which probability distribution do you base that if I have the true probability distribution like say I play backgammon right there's dice and there's certain randomness involved you know I can calculate probabilities and feed it in the expectimax or the sequential decision tree and come up with the optimal decision if I have enough compute but for the real world we don't know that you know what is the probability the driver in front of me brakes and I don't know you know it depends on all kinds of things and especially new situations I don't know so this is this unknown thing about prediction and that's where
Solomonoff comes in so what you do is in sequential decision theory you just replace the true distribution which we don't know by this universal distribution I didn't explicitly talk about it but this is used for universal prediction and plug it into the sequential decision tree mechanism and then you get the best of both worlds you have a long-term planning agent but it doesn't need to know anything about the world because the Solomonoff induction part learns can you explicitly try to describe the universal distribution and how Solomonoff induction plays a role here yeah I'm trying to understand so what it does is so in the simplest case I said take the shortest program describing your data run it have a prediction which would be deterministic yes okay but you should not just take the shortest program but also consider the longer ones but give them lower a priori probability so in the Bayesian framework you say a priori any distribution which is a model or stochastic program has a certain a priori probability which is 2 to the minus and why 2 to the minus length you know I could explain the length of this program so longer programs are punished yes a priori and then you multiply it with the so-called likelihood function yeah which is as the name suggests is how likely is this model given the data at hand so if you have a very wrong model it's very unlikely that this model is true so it is a very small number so even if the model is simple it gets penalized by that and what you do is then you take just the sum or this is the average over it and this gives you a probability distribution it's called the universal distribution or Solomonoff distribution so it's weighted by the simplicity of the program and likelihood yes it's kind of a nice idea yeah so okay and then you said you're planning N or M I forgot the letter steps into the future so how difficult is that problem what's involved there okay so here it's an optimization problem what do we do yes so you have a planning
problem up to horizon M and that's exponential time in the horizon M which is I mean it's computable but in fact intractable I mean even for chess it's already intractable to do that exactly and you know it could be also discounted kind of framework or yes so having a hard horizon you know at a hundred years it's just for simplicity of discussing the model and also sometimes the math is simple but there are lots of variations it's actually quite an interesting parameter there's nothing really problematic about it but it's very interesting so for instance you think no let's let the parameter M tend to infinity right you want an agent which lives forever right if you do it naively you have two problems first the mathematics breaks down because you have an infinite reward sum which may give infinity and getting reward 0.1 every time step is infinity and getting reward 1 every time step is infinity so equally good not really what we want the other problem is that if you have an infinite life you can be lazy for as long as you want for ten years yeah and then catch up with the same expected reward and you know think about yourself or you know or maybe you know some friends or so if they knew they lived forever you know why work hard now you know just enjoy your life you know and then catch up later so that's another problem with infinite horizon and you mentioned yes we can go to discounting but then the standard discounting is so called geometric discounting so $1 today is about worth as much as you know one dollar and five cents tomorrow so if you do this so called geometric discounting you have introduced an effective horizon so the agent is now motivated to look ahead a certain amount of time effectively it's like a moving horizon and for any fixed effective horizon there is a problem to solve which requires a larger horizon so if I look ahead you know five time steps I'm a terrible chess player right and I'll need to look ahead longer if I play Go I
probably have to look ahead even longer so for every problem for every horizon there is a problem which this horizon cannot solve yes but I introduced the so-called near harmonic horizon which goes down with 1 over t rather than exponential in t which produces an agent which effectively looks into the future proportional to its age so if it's five years old it plans for five years if it's a hundred years old it then plans for a hundred years interesting and a little bit similar to humans too right I mean children don't plan ahead very long but when we get adult we plan ahead more longer maybe when we get very old I mean we know that we don't live forever and maybe then our horizon shrinks again so just adjusting the horizon is there some mathematical benefit of that or is it just a nice I mean intuitively empirically probably a good idea to sort of push the horizon back to extend the horizon as you experience more of the world but is there some mathematical conclusions here that are beneficial with
Solomonoff induction so just the prediction part we have extremely strong finite time you know finite data results so you have so and so much data then you lose so and so much so the theory is really great with the AIXI model with the planning part many results are only asymptotic which well this is what asymptotic means you can prove for instance that in the long run if the agent you know acts long enough then you know it performs optimally or some nice things happen so but you don't know how fast it converges yeah so it may converge fast but we're just not able to prove it because it's difficult or it is really that slow yeah so that is what asymptotic means sort of eventually but we don't know how fast and if I give the agent a fixed horizon M yeah then I cannot prove asymptotic results right so I mean sort of if it dies in a hundred years then in a hundred years it's over I cannot say eventually so this is the advantage of the discounting that I can prove asymptotic results so just to clarify so okay I've built up a model well now in a moment I have this way of looking several steps ahead how do I pick what action I will take it's like with playing chess right you do this minimax in this case here you do expectimax based on the Solomonoff distribution you propagate back and then well an action falls out the action which maximizes the future expected reward under the Solomonoff distribution and then you just take this action and then you get a new observation and you feed it in this action and observation and then you repeat and the reward and so on yeah so your reward too yeah and then maybe you can even predict your own action I love the idea but okay this big framework what is it this is I mean it's kind of a beautiful mathematical framework to think about artificial general intelligence what does it help you intuit about how to build such systems or maybe from another perspective what does it help us in understanding AGI so when I started in the
field I was always interested in two things one was you know AGI the name didn't exist then it was called general AI or strong AI and physics theory of everything so I switched back and forth between computer science and physics quite often you said the theory of everything the theory of everything just like those are basically the two biggest problems before all of humanity yeah I can explain if you want at some later time you know why I'm interested in these two questions can I ask you on a small tangent if one were to be solved which one would you if an apple fell on your head and there was a brilliant insight and you could arrive at the solution to one would it be AGI or the theory of everything definitely AGI because once the AGI problem is solved I can ask the AGI to solve the other problem for me yeah brilliantly put okay so as you were saying about it okay so and the reason why I didn't settle I mean this thought about you know once we have solved AGI it solves all kinds of other
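Editorial sketch (not part of the conversation): the weighting described earlier, where each model gets a priori probability 2 to the minus its program length, is multiplied by how well it explains the data, and the weighted models are then summed, can be illustrated on a toy scale. This is only a minimal sketch under strong assumptions: the three models, their names, and their "lengths in bits" are hypothetical stand-ins chosen by hand, whereas the real universal distribution sums over all programs and is not computable.

```python
from fractions import Fraction

# Hypothetical toy model class: (name, assumed description length in bits, predictor).
# Each predictor maps the history (a list of 0/1 bits) to P(next bit = 1).
MODELS = [
    ("always_one", 2, lambda h: Fraction(99, 100)),
    ("fair_coin", 3, lambda h: Fraction(1, 2)),
    ("alternate", 5,
     lambda h: Fraction(99, 100) if (not h or h[-1] == 0) else Fraction(1, 100)),
]


def posterior_weights(history):
    """Prior 2^-length times the probability each model gave to the observed bits."""
    weights = []
    for _name, length, predict in MODELS:
        w = Fraction(1, 2 ** length)          # longer programs are punished a priori
        for t, bit in enumerate(history):
            p_one = predict(history[:t])      # model's forecast before seeing bit t
            w *= p_one if bit == 1 else 1 - p_one
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]       # normalise to a probability distribution


def predict_next(history):
    """Mixture probability that the next bit is 1."""
    post = posterior_weights(history)
    return sum(w * predict(history) for w, (_n, _l, predict) in zip(post, MODELS))


if __name__ == "__main__":
    data = [1, 0, 1, 0, 1, 0, 1, 0]
    for (name, _l, _p), w in zip(MODELS, posterior_weights(data)):
        print(name, float(w))
    print("P(next = 1):", float(predict_next(data)))
```

On an alternating sequence the longer "alternate" model quickly outweighs the shorter ones because it assigns the data far higher probability, while on fair coin flips the mixture would stay near 50%, which matches the point that unpredictable data is not magically predicted.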