Jim Keller: Abstraction Layers from the Atom to the Data Center | AI Podcast Clips
7bLeQFhPwzk • 2020-02-16
So let's get into the basics before we zoom back out. How do you build a computer from scratch? What is a microprocessor? What is a microarchitecture? What's an instruction set architecture? Maybe even as far back as, what is a transistor?

So the special charm of computer engineering is that there's a relatively good understanding of abstraction layers. Down at the bottom you have atoms, and atoms get put together in materials like silicon or doped silicon or metal, and we build transistors. On top of that we build logic gates, and then functional units, like an adder or a subtractor or an instruction parsing unit, and we assemble those into processing elements. Modern computers are built out of probably 10 to 20 locally organic processing elements, or coherent processing elements, and then that runs computer programs. So there's abstraction layers, and then software: there's an instruction set you run, and then there's assembly language, C, C++, Java, JavaScript. There's abstraction layers essentially from the atom to the data center.

So when you build a computer, first there's a target: what's it for, how fast does it have to be, and today there's a whole bunch of metrics about what that is. And then in an organization of a thousand people who build a computer, there's lots of different disciplines that you have to operate on. Does that make sense?

So there's a bunch of levels of abstraction. In an organization like Intel, and in your own vision, there's a lot of brilliance that comes in at every one of those layers. Some of it is science, some of it is engineering, some of it is art. If you could pick favorites, what's the most important, your favorite layer among these layers of abstraction? Where does the magic enter this hierarchy?

I don't really care. That's the fun, you know? I'm somewhat agnostic to that. So I would say, for relatively long periods of
time, instruction sets are stable. So the x86 instruction set, the ARM instruction set.

What's an instruction set?

It says how you encode the basic operations: load, store, multiply, add, subtract, conditional branch. There aren't that many interesting instructions. If you look at a program as it runs, 90% of the execution is on 25 opcodes, 25 instructions, and those are stable.

What does it mean, stable?

Intel architecture has been around for 25 years. It works.

It works.

And that's because the basics were defined a long time ago. Now, the way an old computer ran is you fetched instructions and you executed them in order: do the load, do the add, do the compare. The way a modern computer works is you fetch large numbers of instructions, say 500, and then you find the dependency graph between the instructions, and then you execute those little micro-graphs in independent units. People like to say computers should be simple and clean, but it turns out the market for simple, clean, slow computers is zero. We don't sell any simple, clean computers. Now, how you build it can be clean, but the computer people want to buy, say in a phone or a data center, fetches a large number of instructions, computes the dependency graph, and then executes it in a way that gets the right answers.

And optimizes that graph somehow?

Yeah, they run deeply out of order, and then there's semantics around how memory ordering works and how other things work. So the computer sort of has a bunch of bookkeeping tables that say what order these operations should finish in, or appear to finish in. But to go fast you have to fetch a lot of instructions and find all the parallelism. Now there's a second kind of computer, which we call GPUs today, and here is what I call the difference. There's found parallelism: you have a program with a lot of dependent instructions, you fetch a bunch, and then you go figure out the dependency graph and you issue
instructions out of order. That's because you have one serial narrative to execute, which in fact can be done out of order.

Did you call it a narrative?

Yeah. So humans think in serial narrative. Read a book, right? There's sentence after sentence after sentence, and there's paragraphs. Now you could diagram that. Imagine you diagrammed it properly and you said which sentences could be read in any order without changing the meaning.

That's a fascinating question to ask of a book.

Yeah, you could do that. Some paragraphs could be reordered, some sentences can be reordered. You could say "he is tall and smart and X," and it doesn't matter the order of tall and smart. But if you say "the tall man is wearing the red shirt," what color... you know, you can create dependencies, right?

Right.

And so GPUs, on the other hand, run simple programs on pixels, but you're given a million of them, and to first order the screen you're looking at doesn't care which order you do it in. So I call that given parallelism: simple narratives around large numbers of things, where you can just say it's parallel because you told me it was.

So found parallelism is where the narrative is sequential, but you discover little pockets of parallelism?

It turns out, large pockets of parallelism.

Large. So how hard is it to discover?

Well, how hard is it? That's just transistor count, right? So once you crack the problem, you say: here's how you fetch ten instructions at a time, here's how you calculate the dependencies between them, here's how you describe the dependencies. These are pieces, right?

So once you describe the dependencies, then it's just a graph. Sort of, it's an algorithm that finds... what is that? I'm sure there's a graph-theoretical answer here that's solved. But in general, for modern programs that human beings write, how much found parallelism is there?

About 10x.

What does 10x mean?

Well, if you execute it in order
versus... yeah. You would get what's called cycles per instruction, and it would be about three cycles per instruction because of the latency of the operations and stuff. And in a modern computer, it executes at something like 0.2 to 0.25 cycles per instruction. So today we find about 10x. And there's two things. One is the found parallelism in the narrative, and the other is the predictability of the narrative. So certain operations do a bunch of calculations, and then: if greater than one, do this, else do that. That decision is predicted in modern computers to high-90% accuracy. Branches happen a lot. So imagine you have a decision to make every six instructions, which is about the average, but you want to fetch five hundred instructions, figure out the graph, and execute them all in parallel. Let's say you fetch six hundred instructions and there's a branch every six: you have to predict ninety-nine out of a hundred branches correctly for that window to be effective.

Okay, so parallelism: you can't parallelize branches, or you can?

You can predict them.

What does predicting a branch mean? What does it take?

So imagine you do a computation over and over. You're in a loop: while n is greater than one, do. And you go through that loop a million times. So every time you look at the branch, you say it's probably still greater than one.

And you're saying you can do that accurately?

Very accurately. Modern computers...

My mind is blown. How the heck do you do that? Wait a minute.

Well, do you want to know? This is really sad. Twenty years ago, you simply recorded which way the branch went last time and predicted the same thing.

Right. Okay. What's the accuracy of that?

85 percent. So then somebody said, hey, let's keep a couple of bits and have a little counter. So when it predicts one way, we count up, and then it pins. Say you have a three-bit counter, so you count up and then count down, and you can use
the top bit as the sign bit, so you have a signed two-bit number. If it's greater than one you predict taken, and less than one you predict not taken, or less than zero, or whatever the thing is. And that got us to 92 percent.

Oh. Okay.

Now it gets better. This branch depends on how you got there. So if you came down the code one way, you're talking about Bob and Jane, and then it says, does Bob like Jane? It went one way. But if you're talking about Bob and Jill, does Bob like Jane? You go a different way. So that's called history. So you take the history and a counter. That's cool, but that's not how anything works today. They use something that looks a little like a neural network. You take all the execution flows, and then you do basically deep pattern recognition of how the program is executing, and you do that multiple different ways, and you have something that chooses what the best result is. There's a little supercomputer inside the computer.

That's trying to predict branching.

That calculates which way branches go, so the effective window in which it's worth finding graphs gets bigger.

Why was that going to make me sad? That's amazing.

It's amazingly complicated.

Oh, well...

Here's the funny thing. To get to 85% took a thousand bits. To get to 99% takes tens of megabits. So this is one of those: to get from a window of, say, 50 instructions to 500, it took three or four orders of magnitude more bits.

Now, if you get the prediction of a branch wrong, what happens then?

You flush the pipe.

You flush the pipe, so it's just a performance cost.

But it gets even better.

Yeah?

So we're starting to look at stuff that says: you executed down this path, and then you had two ways to go, but far, far away there's something for which it doesn't matter which path you went. So you took the wrong path, you executed a bunch of stuff, then you had the mispredict and you backed it up, but you remembered all the results you already calculated. Some of those are just fine. Look,
if you read a book and you misunderstand a paragraph, your understanding of the next paragraph sometimes is invariant to that misunderstanding. Sometimes it depends on it.

And you can kind of anticipate that invariance?

Yeah, well, you can keep track of whether the data changed, and so when you come back to a piece of code, should you calculate it again or do the same thing?

Okay, how much of this is art and how much of it is science? Because it sounds pretty complicated.

Well, how do you describe a situation? So imagine you come to a point in the road where you have to make a decision, and you have a bunch of knowledge about which way to go. Maybe you have a map. So do you want to go the shortest way, or the fastest way, or do you want to take the nicest road? There's some set of data. So imagine you're doing something complicated like building a computer, and there's hundreds of decision points, all with hundreds of possible ways to go, and the ways you pick interact in a complicated way. And then you have to pick the right spot.

Right, so is that art or science?

I don't know.

You avoided the question. You just described the Robert Frost problem of the road less taken.

I described the Robert Frost problem, which we do as computer designers. It's all poetry.

Okay, great.

Yeah, I don't know how to describe that, because some people are very good at making those intuitive leaps, it seems like, the combinations of things. Some people are less good at it, but they're really good at evaluating the alternatives. And everybody has a different way to do it. Some people can't make those leaps, but they're really good at analyzing it. So computers are designed by teams of people with very different skill sets, and a good team has lots of different kinds of people. I suspect you would describe some of them as artistic, but not very many.

Unfortunately, or fortunately, or something.

Well, you know, computer science is hard. It's 99% perspiration, and the 1%
inspiration is really important.

But you still need the 99.

Yeah, you've got to do a lot of work, and then there are interesting things to do at every level of that stack.

So at the end of the day, if you run the same program multiple times, does it always produce the same result? Is there some room for fuzziness there?

That's a math problem. So if you run a correct C program, the definition is that every time you run it, you get the same answer.

Yeah, well, that's a math statement.

That's a language definitional statement. So yes. For years, when we first did 3D acceleration of graphics, you could run the same scene multiple times and get different answers.

Right.

And then some people thought that was okay and some people thought it was a bad idea. And then when the HPC world used GPUs for calculations, they thought it was a really bad idea.

Okay.

Now, in modern AI stuff, people are looking at networks where the precision of the data is low enough that the data is somewhat noisy. And the observation is that the input data is unbelievably noisy, so why should the calculation not be noisy? And people have experimented with algorithms that say you can get faster answers by being noisy. As a network starts to converge, if you look at the computation graph, it starts out really wide and it gets narrower, and you can ask: is that last little bit that important, or should I start the graph on the next rev before we whittle all the way down to the answer? So you can create algorithms that are noisy. Now, if you're developing something and every time you run it you get a different answer, it's really annoying. And so most people think, even today, that every time you run the program you get the same answer.

No, I know, but the question is, that's the formal definition of a programming language.

There are definitions of languages that don't get the same answer, but people who use those always want something, because you get a bad answer and then you're wondering, is it because of
something in your algorithm or because of this? And so everybody wants a little switch that says, no matter what, do it deterministically. And it's really weird, because almost everything going into modern calculations is noisy, so why do the answers have to be so clear?

Right. So where do you stand?

I design computers for people who run programs. So if somebody says, I want a deterministic answer, and most people want that...

Can you deliver a deterministic answer, I guess, is the question.

Yeah, hopefully, sure. What people don't realize is that you get a deterministic answer even though the execution flow is very non-deterministic. If you run this program a hundred times, it never runs the same way twice, ever.

And the answer, it arrives at the same answer?

It gets the same answer every time.

It's just amazing.
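
As a rough illustration of the "found parallelism" idea from the conversation (fetch a window of instructions, work out the dependency graph, and run independent pieces at the same time), here is a minimal Python sketch. The `Instr` type, the five-instruction program, and the latencies are all hypothetical. The model also assumes unlimited execution units, perfect register renaming, and a non-pipelined in-order baseline, so its cycle counts are only illustrative and are not the figures quoted in the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One toy instruction: writes `dest`, reads `srcs`, takes `latency` cycles."""
    dest: str
    srcs: tuple
    latency: int


def in_order_cycles(program):
    """Simplified old-style execution: each instruction waits for the previous one."""
    return sum(instr.latency for instr in program)


def dataflow_cycles(program):
    """Idealized out-of-order execution: start as soon as all inputs are ready.

    Assumes unlimited execution units and perfect register renaming, so only
    true (read-after-write) dependencies can delay an instruction.
    """
    ready_at = {}  # register name -> cycle at which its latest value is available
    finish = 0
    for instr in program:
        # Inputs never written inside the window are treated as ready at cycle 0.
        start = max((ready_at.get(src, 0) for src in instr.srcs), default=0)
        done = start + instr.latency
        ready_at[instr.dest] = done
        finish = max(finish, done)
    return finish


# Hypothetical program: two independent load/multiply chains joined by one add.
program = [
    Instr("r1", ("a",), 3),        # load r1 <- [a]
    Instr("r2", ("b",), 3),        # load r2 <- [b]
    Instr("r3", ("r1", "r1"), 2),  # mul  r3 <- r1 * r1
    Instr("r4", ("r2", "r2"), 2),  # mul  r4 <- r2 * r2
    Instr("r5", ("r3", "r4"), 1),  # add  r5 <- r3 + r4
]

serial = in_order_cycles(program)
parallel = dataflow_cycles(program)
print(f"in order: {serial} cycles, {serial / len(program):.1f} cycles per instruction")
print(f"dataflow: {parallel} cycles, {parallel / len(program):.1f} cycles per instruction")
```

On this toy program the in-order model takes 11 cycles and the dataflow model takes 6, because the two load/multiply chains overlap and only the final add has to wait for both.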
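
The two early branch-prediction schemes described in the conversation (repeat the last outcome, then a small counter that counts up and down and pins at its limits, with the top bit giving the prediction) can be sketched in Python as below. The class names and the loop trace are made up for illustration. The accuracies this toy trace produces are not the 85 and 92 percent figures mentioned in the conversation, which refer to real workloads.

```python
class LastOutcomePredictor:
    """Predicts that a branch goes the same way it went last time."""

    def __init__(self):
        self.last_taken = False

    def predict(self):
        return self.last_taken

    def update(self, taken):
        self.last_taken = taken


class SaturatingCounterPredictor:
    """n-bit counter: up on taken, down on not taken, pinned at both ends."""

    def __init__(self, bits=2):
        self.max_value = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)  # top bit set means "predict taken"
        self.counter = 0

    def predict(self):
        return self.counter >= self.threshold

    def update(self, taken):
        if taken:
            self.counter = min(self.counter + 1, self.max_value)
        else:
            self.counter = max(self.counter - 1, 0)


def loop_branch_trace(iterations, repeats):
    """Loop-closing branch: taken `iterations - 1` times, then not taken once."""
    trace = []
    for _ in range(repeats):
        trace.extend([True] * (iterations - 1) + [False])
    return trace


def accuracy(predictor, trace):
    """Fraction of branches predicted correctly, updating after each outcome."""
    correct = 0
    for taken in trace:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(trace)


# Hypothetical workload: a 10-iteration loop that is entered 1000 times.
trace = loop_branch_trace(iterations=10, repeats=1000)
last_outcome_accuracy = accuracy(LastOutcomePredictor(), trace)
counter_accuracy = accuracy(SaturatingCounterPredictor(bits=2), trace)
print(f"last outcome:  {last_outcome_accuracy:.4f}")
print(f"2-bit counter: {counter_accuracy:.4f}")
```

On this trace the last-outcome predictor is wrong twice per pass through the loop (at the exit and again on re-entry), while after warm-up the counter is wrong only at the exit. That gap is the motivation for keeping a couple of bits rather than one.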