Efficient Computing for Deep Learning, Robotics, and AI (Vivienne Sze) | MIT Deep Learning Series
WbLQqPw_n88 • 2020-01-23
We have Vivienne Sze here with us. She's a professor at MIT working in the very important and exciting space of developing energy-efficient and high-performance systems for machine learning, computer vision, and other multimedia applications. This involves the joint design of algorithms, architectures, circuits, and systems to enable optimal trade-offs between power, speed, and quality of results. One of the important differences between the human brain and AI systems is the energy efficiency of the brain, and Vivienne is a world-class researcher at the forefront of discovering how we can close that gap. So please give her a warm welcome.

I'm really happy to be here to share some of the research in, and an overview of, this area of efficient computing. What I'll be talking about today is a little broader than just deep learning: we'll start with deep learning, but we'll also move to how we might apply this to robotics and other AI tasks, and why efficient computing is so important for enabling these exciting applications. I also want to mention that much of the work I'm presenting today was done not by myself alone but in collaboration with many folks at MIT, and the slides are available on our website.

Given that this is the deep learning lecture series, I want to start by talking a little bit about deep neural nets. We know that deep neural nets have generated a lot of interest and have many compelling applications, but one thing that has come to light in recent years is their increasing need for compute. OpenAI showed that the amount of compute required to train deep learning models has grown exponentially over the past few years, by over 300,000x, in order to drive increases in accuracy on the tasks we're trying to achieve. At the same time, the environmental implications of all this processing can be quite severe: if you compare the carbon footprint of training a neural net with the carbon footprint of flying across North America from New York to San Francisco, or even with that of an average human life, neural networks can be orders of magnitude greater. So the carbon-footprint implications of computing for deep neural nets can be quite severe as well.

Now, a lot of that concerns compute in the cloud. Another important direction is moving compute from the cloud to the edge, into the device where the data is being collected. Why would we want to do that? There are a few reasons. The first is communication: in many places around the world, you may not have a strong communication infrastructure, and you don't want to rely on a communication network to run these applications, so removing the tether to the cloud is important. Another reason is that we often apply deep learning to applications where the data is very sensitive.
You can think about things like health care, where you're collecting very sensitive data, so privacy and security are critical; rather than sending the data to the cloud, you'd like to bring the compute to the data itself. Finally, another compelling reason for bringing the compute into the device or the robot is latency. This is particularly true for interactive applications: think of autonomous navigation, robotics, or self-driving vehicles, where you need to interact with the real world. If you're driving quickly down the highway and you detect an obstacle, you might not have enough time to send the data to the cloud, wait for it to be processed, and receive the instruction back. So again, you want to move the compute into the robot or the vehicle itself.

Hopefully this establishes why we want to move compute to the edge. But one of the big challenges of processing in the robot or device is power consumption. Take the self-driving car as an example: it's been reported that it consumes over 2,000 watts just for the computation, just to process all the sensor data it collects. This generates a lot of heat and takes up a lot of space; in this prototype you can see that all the compute has been placed in the trunk, and it often needs water cooling. So this poses significant cost and logistical challenges for self-driving vehicles.

You can imagine this becomes much harder as we shrink the form factor down to something portable, like a smaller robot, or your smartphone or cell phone. Portable devices have very limited energy capacity, because the battery is limited in size, weight, and cost; you simply can't carry a large amount of energy on these devices. Furthermore, the embedded platforms currently used for these applications tend to consume over 10 watts, which is an order of magnitude higher than what handheld devices typically allow: they're usually limited to under a watt due to heat dissipation; you don't want your cell phone to get super hot.

In past decades, the way we addressed this challenge was to wait for transistors to become smaller, faster, and more efficient. That has become a problem in recent years: transistors are not getting more efficient. Moore's law, which typically made transistors smaller and faster, has been slowing down, and Dennard scaling, which made transistors more power-efficient, has also slowed down or ended; you can see that over the past ten years these trends have really flattened out. This is a particular challenge because we want more and more compute to drive deep neural network applications, but the transistors are not becoming more efficient. So what we have to turn to is specialized hardware, to achieve the significant gains in speed and energy efficiency that we require for our particular applications.
When we talk about designing specialized hardware, this is really about rethinking how we design hardware from the ground up, specifically targeting the AI, deep learning, and robotics tasks we're excited about. This notion is not new; in fact, it has become extremely popular over the past few years, with a large number of startups and companies focused on building specialized hardware for deep learning. The New York Times reported, I believe two years ago, that a record number of startups were building specialized hardware for AI and deep learning. So we'll talk a little about what specialized hardware looks like for these applications.

If you really care about energy and power efficiency, the first question you should ask is: where is the power actually going in these applications? As it turns out, power is dominated by data movement. It's not the computations themselves that are expensive, but moving the data to and from the compute engine. For example, shown in blue is the range of energy consumed by various types of computation: multiplications and additions at various precisions, from floating point to fixed point and integer. As you'd expect, as you scale down the precision, the energy consumption of each operation decreases. What's really surprising is the energy consumption of data movement, that is, delivering the input data to the multiplier and then moving the output of the multiplication into memory. A 32-bit read from an SRAM, a small 8-kilobyte memory that sits on the processor chip itself, already consumes 5 picojoules of energy, equivalent to or even more than a 32-bit floating-point multiply, and that's for a very small memory. If you need to read the data from off-chip, outside the processor in DRAM, it's even more expensive: around 640 picojoules. And notice that the horizontal axis is logarithmic, so we're talking about orders of magnitude more energy for data movement than for the compute itself. This is the key takeaway: if we really want to address the energy consumption of this kind of processing, we want to reduce data movement.
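To put these numbers in one place, here is a tiny sketch using only the figures quoted in the talk (5 pJ for a 32-bit read from a small 8 KB on-chip SRAM, about 640 pJ for a 32-bit DRAM read, and a 32-bit floating-point multiply costing no more than the SRAM read); the rest is just arithmetic.

```python
# Energy per operation in picojoules, as quoted in the talk.
SRAM_READ_32B = 5.0      # 32-bit read from a small 8 KB on-chip SRAM
DRAM_READ_32B = 640.0    # 32-bit read from off-chip DRAM
FP32_MULT_MAX = 5.0      # fp32 multiply costs no more than the SRAM read

print(f"DRAM vs on-chip SRAM:  {DRAM_READ_32B / SRAM_READ_32B:.0f}x")  # 128x
print(f"DRAM vs fp32 multiply: >= {DRAM_READ_32B / FP32_MULT_MAX:.0f}x")
# Two orders of magnitude: data movement, not arithmetic, dominates energy.
```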
But what's the challenge here? If we take a popular AI and robotics application like autonomous navigation, the real issue is that these applications use a lot of data. For example, one thing you need to do in autonomous navigation is semantic understanding: you need to identify which pixel belongs to what, so in a given scene you need to know that this pixel is the ground, this pixel is the sky, and this pixel is a person. If you're traveling quickly, you want to do this at a high frame rate, and you may need large resolution: for HD images, you're talking about 2 million pixels per frame. And if you also want to detect objects at different scales, or see objects that are far away, you need to do what we call data expansion, for example building an image pyramid, which can increase the amount of pixels or data you need to process by one or two orders of magnitude. So that's a huge amount of data to process right off the bat. Another type of understanding you want for autonomous navigation is geometric understanding: as you navigate, you want to build a 3D map of the world around you, and the longer you travel, the larger the map you build, which again means more data to process and compute on. So the amount of data is a significant challenge for autonomous navigation.

Other aspects of autonomous navigation, and of other applications like AR/VR, involve understanding your environment. A typical task is depth estimation: given an image, can you estimate how far away each pixel is? Another is semantic segmentation, which we just talked about. These are important ways of understanding your environment when you're trying to navigate, and it should be no surprise that the state-of-the-art approaches for these tasks use deep neural nets. The challenge is that these deep neural nets often require several hundred million operations and weights to do the computation; compared to something you already have on your phone, like video compression, that's a two-to-three-orders-of-magnitude increase in computational complexity. This is a significant challenge, because if we'd like deep neural networks to be as ubiquitous as something like video compression, we really have to figure out how to address this computational complexity. And deep neural networks are not just used for understanding the environment in autonomous navigation; they've become the cornerstone of many AI applications, from computer vision to speech recognition, game play, and even medical applications, many of which I'm sure have been covered in this course.

Briefly, I'll give a quick overview of some key components of deep neural nets, not because you don't already understand them, but because this area is so popular that the terminology varies from discipline to discipline, so let's align ourselves on the terminology. What are deep neural nets? You can view them as a way of, for example, understanding the environment through a chain of layers of processing. For an input image, the earlier layers of the network learn low-level features such as edges, and as you get deeper into the network, chaining more of these computational layers together, you start detecting higher and higher level features, until you can recognize, say, a vehicle. The difference between this approach and more traditional computer vision is that how we extract these features is learned from the data itself, as opposed to having an expert come in and say, 'look for the edges, look for the wheels,' and so on; the fact that the features are learned is the essence of the approach.
OK, so what is it doing at each of these layers? It's actually a very simple computation; looking at the inference side of things, effectively what it's doing is a weighted sum. You have the input values, which we'll color-code blue here (and try to stay consistent with that throughout the talk); we apply certain weights to them, which are learned from the training data; and they generate an output, which is typically red here. It's basically a weighted sum, and we then pass this weighted sum through some form of non-linearity: traditionally sigmoids, more recently ReLUs, which simply set negative values to zero. The key takeaway is that in this computational kernel, the core operation of these neural networks is the multiply and accumulate (MAC) that computes the weighted sum, and it accounts for over 90% of the computation. So if we really want to accelerate neural nets or make them more efficient, we want to focus on minimizing the cost of this multiply and accumulate.
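To make the kernel concrete, here is a minimal NumPy sketch of one layer's weighted sum followed by a ReLU; the function name and the toy shapes are illustrative, not from the talk.

```python
import numpy as np

def fully_connected_layer(x, W, b):
    """One layer of inference: weighted sums (MACs) followed by a ReLU.

    x: input activations, shape (C_in,)
    W: learned weights, shape (C_out, C_in)
    b: biases, shape (C_out,)
    """
    # Each output activation is a weighted sum of all the inputs; this
    # multiply-and-accumulate dominates (>90% of) the computation.
    weighted_sum = W @ x + b
    # Non-linearity: a ReLU sets negative values to zero.
    return np.maximum(weighted_sum, 0.0)

# Toy usage with random values.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # blue: input activations
W = rng.standard_normal((3, 4))     # learned filter weights
b = np.zeros(3)
print(fully_connected_layer(x, W, b))  # red: output activations
```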
There are also various popular types of layers used in deep neural networks, which often vary in how the layers are connected. You can have feed-forward layers, where the inputs always flow toward the outputs, or feedback layers, where the outputs are connected back to the inputs; you can have fully connected layers, where all outputs are connected to all inputs, or sparsely connected ones. You may be familiar with some of these. Fully connected layers, just as described, have all inputs connected to all outputs and tend to be feed-forward; stacked together, they're typically referred to as a multi-layer perceptron. Convolutional layers are also feed-forward, but sparsely connected, with weight sharing; stacked together, they're referred to as convolutional neural networks and are typically used for image-based processing. Recurrent layers have a feedback connection, with the output fed back to the input; combining recurrent layers gives recurrent neural nets, typically used to process sequential data, so speech- or language-based processing. And most recently, attention layers, or attention-based mechanisms, have become really popular; they often involve matrix multiplies (again, multiply and accumulate), and when you combine them, the result is often referred to as a transformer.

So let's first get an idea of why deep learning, and convolutional networks in particular, are so much more computationally demanding than other types of processing. We'll focus on convolutional neural nets as an example, although many of these principles apply to other types of neural nets. The first thing to look at is the computational kernel: how does it actually perform convolution? Say you have a 2D input; at the input of the neural net this is an image composed of pixels, and deeper in the neural net it's an input feature map composed of activations. We convolve it with a 2D filter composed of weights. In a typical convolution, you do an element-wise multiplication of the filter weights with the input feature map activations and sum them all together, generating one output value, which we refer to as an output activation. Then, because it's a convolution, we slide the filter across the input feature map to generate all the other output activations. This kind of 2D convolution is pretty standard in image processing; we've been doing it for decades.

What makes convolutional neural nets much more challenging is the increase in dimensionality. First, rather than a single 2D convolution, we stack multiple channels: there's a third dimension, channels, and we do a 2D convolution on each channel and then add the results together. For an input image, the channels would be, for example, the red, green, and blue components, and as you get deeper into the network the number of channels can increase; in AlexNet, a popular neural net, the number of channels ranges from 3 to 192. So that already adds one dimension to the processing. Another dimension comes from applying multiple filters to the same input feature map: if you apply M filters, you generate an output feature map with M channels. On the previous slide, convolving one 3D filter generated one output channel; applying M filters generates M output channels in the output feature map. To give you a sense of scale, AlexNet uses between 96 and 384 filters per layer, and this increases into the thousands for more modern neural nets. Finally, you often want to process more than one image at a time: N input images (or input feature maps) become N output feature maps. We typically call N the batch size, the number of images processed at the same time, and it can range from 1 to 256.

These are the various dimensions of a neural net, and when someone defines what we call the network architecture, what they're really doing is selecting the shape of the network for each layer: defining all these dimensions, which can vary across the layers. To give you an idea, if you look at MobileNet, a very popular neural net, the filter sizes (the height and width of the filters), the number of filters, and the number of channels all vary across the different blocks or layers.
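As a worked example of how these dimensions multiply together, here is a sketch that counts the MACs in a single convolutional layer. The formula (batch × filters × output size × channels × filter size) is standard; the example shape is an AlexNet-like first layer, used purely for illustration.

```python
def conv_layer_macs(N, M, C, H, W, R, S, stride=1):
    """Count multiply-and-accumulates in one convolutional layer.

    N: batch size          M: number of filters (output channels)
    C: input channels      H, W: input feature map height / width
    R, S: filter height / width
    """
    E = (H - R) // stride + 1   # output feature map height
    F = (W - S) // stride + 1   # output feature map width
    # Each of the N*M*E*F output activations needs C*R*S MACs.
    return N * M * E * F * C * R * S

# AlexNet-like first layer: 3-channel 227x227 input, 96 filters
# of size 11x11, stride 4, batch size 1.
print(conv_layer_macs(N=1, M=96, C=3, H=227, W=227, R=11, S=11, stride=4))
# -> 105415200, roughly 1e8 MACs for this single layer
```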
The other thing I want to mention is that when we look at popular DNN models developed over the years, we can also see important trends. One is that the networks are becoming deeper: you can see the number of convolutional layers growing, and the number of weights and the number of MACs are increasing as well. DNN models are getting larger and deeper, which makes them much more computationally demanding, and we need more sophisticated hardware to process them.

All right, that was a quick overview of the deep neural network space; I hope we're all aligned. The first thing I'm going to talk about is how we can build hardware that processes these neural networks more efficiently and faster, which we often refer to as hardware acceleration. We know these networks are very large and involve a lot of compute, but are there properties we can leverage to make processing them more efficient? The first, very friendly property is that they exhibit a lot of parallelism: all these multiplies and accumulates can be done in parallel, which means high throughput and high speed are possible. What's difficult, and this should not be a surprise by now, is that the memory accesses are the bottleneck: delivering the data to the multiply-and-accumulate engines is what's really challenging.

I'll give you some insight into why this is the case. Take a multiply-and-accumulate engine, which we call a MAC. It takes in three inputs per operation: the filter weight; the input image pixel (or, deeper in the network, the input feature map activation); and the partial sum, the partially accumulated value from the previous multiply. It then generates an updated partial sum as output. So for every MAC you perform, you need four memory accesses: a four-to-one ratio of memory accesses to compute. The other challenge, as we mentioned, is that moving data is very expensive. In the absolute worst case, which you would always try to avoid, you read all the data from DRAM, the off-chip memory, and every DRAM access is two orders of magnitude more expensive than performing the MAC itself. That's really, really bad: AlexNet, which has around 700 million MACs, would need about 3 billion DRAM accesses to do that computation.
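A back-of-the-envelope sketch of the four-accesses-per-MAC arithmetic above, using the AlexNet figure quoted in the talk and assuming the worst case where every access goes to DRAM:

```python
MACS_ALEXNET = 700e6      # ~700 million MACs, as cited in the talk

# Each MAC needs 4 memory accesses: read weight, read activation,
# read partial sum, write updated partial sum.
ACCESSES_PER_MAC = 4

worst_case_dram_accesses = MACS_ALEXNET * ACCESSES_PER_MAC
print(f"{worst_case_dram_accesses:.1e}")   # 2.8e+09, i.e. ~3 billion
```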
But all is not lost; there are things we can exploit to help with this problem. One is what we call input data reuse: much of the data we read for these multiplies and accumulates is actually used by many MACs, so if we read the data once, we can reuse it for many operations. Let me show you some examples. The first is convolutional reuse: remember, we take a filter and slide it across the input image, so the activations from the feature map and the weights from the filter get reused, in different combinations, to compute the different MACs. Another example: recall that we apply multiple filters to the same input feature map, which means each activation in that input feature map can be reused across the different filters. Finally, if we process many images (or feature maps) at the same time, a given weight in the filter can be reused across those input feature maps; that's what we call filter reuse. So there are a lot of these great reuse opportunities in the neural network itself.

What can we do to exploit them? We can build what we call a memory hierarchy, which contains very low-cost memories that reduce the overall cost of moving this data. What do we mean? When I build a multiply-and-accumulate engine, I put a very small memory right beside it, and by small I mean on the order of under a kilobyte. Why do I want that? Because accessing that very small memory is very cheap: if performing a MAC in the ALU costs 1x, reading from this small memory beside the engine costs about the same, 1x. I can also allow the processing elements (a processing element being the MAC engine plus its small local memory) to share data with each other; reading from a neighboring processing element costs about 2x the energy. Then you can have a larger shared memory, called the global buffer, shared across all the processing elements; it tends to be between 100 and 500 kilobytes, and it's more expensive, at about 6x the energy. And of course going off-chip to DRAM is the most expensive, at about 200x the energy.

Ideally, you would access all of your data from the very small local memory, but that memory is only about a kilobyte, while we're talking about neural networks with millions of weights. As an analogy: getting something from your backpack is much cheaper than getting it from your neighbor, which is cheaper than going back to your office somewhere on campus, which is cheaper than going all the way home. Ideally you'd get everything from your backpack, but if you have a lot of work to do, it might not all fit. So the question is: how can I break my large piece of work into smaller chunks so that I can serve them all from this small memory? That's the big challenge, and there's been a lot of research into the best way to break up the data and decide what to store in the very small local memory.
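Here is a minimal sketch of this cost model, using the relative energies quoted above (1x local memory, 2x neighboring PE, 6x global buffer, 200x DRAM); the access counts below are made-up placeholders, just to show how serving data locally changes the total.

```python
# Relative energy per access, normalized to one MAC operation (from the talk).
ENERGY = {"local": 1, "neighbor": 2, "global_buffer": 6, "dram": 200}

def data_movement_energy(access_counts):
    """Total data-movement energy given per-level access counts."""
    return sum(ENERGY[level] * n for level, n in access_counts.items())

# Hypothetical workload: the more accesses we can serve from the small
# local memories, the cheaper the same computation becomes.
mostly_local = {"local": 900, "neighbor": 50, "global_buffer": 40, "dram": 10}
mostly_dram  = {"local": 100, "neighbor": 0,  "global_buffer": 0,  "dram": 900}
print(data_movement_energy(mostly_local))  # 3240
print(data_movement_energy(mostly_dram))   # 180100
```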
One approach is what we call weight stationary: store the weights of the neural net in that small local memory. As a result you really minimize the weight energy, but the other data types in the system, the input activations shown in blue and the partial sums shown in red, still have to move through the rest of the system, through the on-chip network from the global buffer. Popular designs that use this kind of data flow, called weight stationary because the weights remain stationary, include the TPU from Google and the NVDLA accelerator from NVIDIA.

Another approach observes that while a weight only has to be read, a partial sum has to be both read and written: you read it, accumulate into it, and write it back, so there are two memory accesses for that data type. So maybe the partial sum is what should live in the small local memory. This is called output stationary, because the accumulation of the output stays local within one processing element and doesn't move; the trade-off, of course, is that the activations and weights now have to move through the network. Various designs take this approach, for example work from KU Leuven and from the Chinese Academy of Sciences.

Yet another line of work says: forget the outputs and the weights; keep the inputs stationary in the small memory. That's called input stationary, and some research from NVIDIA has examined it. But all of these approaches focus on not moving one particular type of data: they minimize weight energy, or partial-sum energy, or input energy. What's important to consider is that maybe you want to reduce the data movement of all the data types. So another approach, which we developed within our own group, is what we call the row stationary data flow: within each processing element you perform one row of the convolution, and that row involves a mixture of all the data types, the filter weights, the input feature map activations, and the partial sums. So you're really trying to balance the data movement of all the data types, not just one. That's just one row, and as we discussed, the neural network is much more than a 1D convolution, so you can imagine expanding this to higher dimensions: the 1D row convolutions map onto the array to form a 2D convolution, and the higher-dimensional structure is handled on the architecture as well. I won't go through the details, but the key takeaway is that you may not want to focus on one particular data type; you want to optimize for all of the data you're moving around in your system.
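To illustrate what "stationary" means, here is a toy 1D-convolution sketch contrasting two loop orders: holding a weight fixed while it is reused, versus holding an output accumulator fixed. Real dataflows tile these loops across a 2D array of processing elements, so this single-PE view is only schematic, with made-up sizes.

```python
# Toy 1D convolution: out[i] = sum_j w[j] * x[i + j]
W_SIZE, OUT_SIZE = 3, 6
w = [1.0] * W_SIZE
x = [float(i) for i in range(OUT_SIZE + W_SIZE - 1)]

# Weight stationary: each weight w[j] is fetched once and held locally
# while it is reused across every output position; the partial sums
# (out[i]) keep moving in and out instead.
out = [0.0] * OUT_SIZE
for j in range(W_SIZE):          # weight stays fixed in the inner work
    for i in range(OUT_SIZE):
        out[i] += w[j] * x[i + j]

# Output stationary: each partial sum is accumulated locally until it
# is complete; weights and activations stream past instead.
out2 = [0.0] * OUT_SIZE
for i in range(OUT_SIZE):        # output accumulator stays fixed
    acc = 0.0
    for j in range(W_SIZE):
        acc += w[j] * x[i + j]
    out2[i] = acc

assert out == out2               # same result, different data movement
```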
Here are some results showing how these different data flows behave. In the weight stationary case, as expected, the energy required to move the weights, shown in green, is the lowest, but the red portion, the energy of the partial sums, and the blue portion, the input feature map or input pixels, are very high. Output stationary, as we discussed, tries to reduce the data movement of the partial sums, shown in red, so the red part is really minimized, but you can see that the green part (weight movement) and the blue part (input movement) increase. There's another option called no local reuse, which we don't have time to cover. But you can see that row stationary really aims to balance the data movement of all the different data types. So the big takeaway: when you're optimizing a given piece of hardware, don't optimize for just one particular type of data; optimize for all the data movement in the hardware overall.

Another thing you can exploit to save a bit of power is the fact that some of the data will be zero, and anything multiplied by zero is zero. If you know one of the inputs to your multiply-and-accumulate is zero, you might as well skip that multiplication; in fact, you might as well also skip reading the other input. Doing this can reduce power consumption by almost 50 percent. And if you have a lot of zeros, you can also compress the data, for example with run-length encoding, where a run of zeros is represented as "a run of five zeros" rather than "zero, zero, zero, zero, zero"; this can reduce data movement by up to 2x. Neural nets offer plenty of opportunities to generate zeros: remember that the ReLU sets negative values to zero, so it naturally produces zeros, and there are techniques such as pruning that set some of the weights of the neural net to zero. The hardware can exploit all of this.
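A minimal sketch of the zero run-length idea (the actual encoding used on a chip like Eyeriss differs in its details; this just shows the principle):

```python
def rle_zeros(values):
    """Encode a sequence as (zero_run_length, nonzero_value) pairs."""
    encoded, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            encoded.append((run, v))
            run = 0
    if run:
        encoded.append((run, None))   # trailing zeros, no value
    return encoded

# ReLU outputs are often sparse, so runs of zeros compress well.
acts = [0, 0, 0, 0, 0, 7, 0, 0, 3]
print(rle_zeros(acts))   # [(5, 7), (2, 3)] instead of 9 raw values
```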
So what is the impact of all these techniques? We built a customized chip, called Eyeriss, to demonstrate these approaches, in particular the row stationary data flow and exploiting sparsity in the activation data. The Eyeriss chip has a 14-by-12 array, so 168 processing elements; a shared buffer of about 100 kilobytes; and compression and decompression for data that goes to off-chip DRAM, because, again, accessing DRAM is the most expensive. Shown on the right-hand side is a die photo of the fabricated chip, which is 4 millimeters by 4 millimeters. Using the row stationary data flow, it exploits a lot of data reuse: it reduces the number of accesses to the global buffer by 100x, and the number of accesses to off-chip memory by over 1000x. This is all because each processing element has a local memory that serves most of the data it reads, and shares data with the other processing elements. Overall, compared to a mobile GPU, we're talking about an order of magnitude reduction in energy consumption. If you'd like to learn a little more about that, I invite you to visit the Eyeriss project website.

OK, so this is great: we can build custom hardware. But what does this actually mean in terms of building a system that can efficiently compute neural nets? Let's take a step back and say we don't care about the hardware internals; we're a systems provider building an overall system, and what we really care about is the trade-off between energy and accuracy. Shown here is a plot for an object detection task. Accuracy is on the x-axis, measured as average precision, a metric we use for object detection, on a linear scale, higher is better. On the vertical axis is energy consumption per pixel (so a higher-resolution image consumes more energy), on an exponential scale.

Let's start with the accuracy axis. Before neural nets had their resurgence around 2011-2012, state-of-the-art approaches used features called histogram of oriented gradients, referred to as HOG, a very popular and quite accurate approach to object detection. The reason neural nets really took off is that they greatly improved accuracy: AlexNet here almost doubled the accuracy, and VGG increased it further, which is super exciting. But we also want to look at the vertical axis, energy consumption. The energy numbers shown for each approach were measured on specialized hardware designed for that particular task: a chip built in a 65-nanometer CMOS process (so the same transistors, at around the same size) that does object detection using HOG features, and the Eyeriss chip we just talked about. I should note that both chips were built in my group; the students started designing them at the same time and taped out at the same time, so it's somewhat of a controlled experiment in terms of optimization effort.

What does this tell us? On the energy axis, HOG features are actually very efficient; in fact, compared to something like video compression, again something you all have in your phone, HOG features are more efficient, meaning that for the same energy you'd spend compressing a pixel, you could understand that pixel. That's pretty impressive. But with AlexNet or VGG, the energy increases by two to three orders of magnitude, which is quite significant. To give you an example: if I told you that on your cell phone I'm going to double the accuracy of its recognition, but your phone will die three hundred times faster, who here would be interested in that technology? Exactly: nobody. Battery life is that critical to how we actually use these technologies. So we shouldn't just look at the accuracy axis.
We should also consider the energy consumption, and we really don't want the energy to be so high. We can see that even with specialized hardware, we're still quite far from making neural nets as efficient as something like video compression, which you all have on your phones. So we really have to think about how we can push the energy consumption down further, without sacrificing accuracy, of course.

There's been a huge amount of research in this space, because we know neural nets are popular and have a wide range of applications, but energy is a big challenge. People have looked at designing more efficient hardware and more efficient algorithms to enable energy-efficient processing of DNNs. Within our own research group, we've spent quite a bit of time surveying the area and understanding the various developments, so if you're interested in this topic, we've produced tutorials on this material as well as an overview paper of about 30 pages, which we're currently expanding into a book; I'd encourage you to visit these resources. The main thing we learned from this survey is that there are various limitations in how the research is approaching this problem.

First, let's look at the algorithm side. There's a wide range of approaches for making DNN algorithms or models more efficient. For example, we mentioned the idea of pruning: set some of the weights to zero, and since anything times zero is zero, you can skip those operations; there's a wide range of research there. There are also efficient network architectures: rather than making the network's convolutions large and high-dimensional, can I decompose them into smaller filters? For example, rather than a 3D filter, can I use a 2D filter in the plane plus a 1-by-1 convolution through the depth (into the page)? Another very popular approach is reduced precision: rather than the default 32-bit float, reduce the number of bits down to 8 bits, or even binary. We saw before that as we reduce the precision of these operations we save energy, and we also reduce data movement, because there's less data to move.
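As a minimal sketch of the reduced-precision idea, here is simple symmetric linear quantization of a tensor to 8 bits; real DNN quantization schemes are considerably more involved.

```python
import numpy as np

def quantize_int8(x):
    """Map float32 values onto 8-bit integers with a single scale factor."""
    scale = np.max(np.abs(x)) / 127.0     # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(8).astype(np.float32)
q, s = quantize_int8(w)
print(w)
print(dequantize(q, s))   # close to w, but stored in 1/4 of the bits
```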
A lot of this work focuses on reducing the number of MACs and the number of weights, primarily because those are easy to count. But the question we should be asking, if we care about the system, is: does this actually translate into energy savings and reduced latency? Because from a systems point of view, those are the things we care about. When you're thinking about something running on your phone, you don't care about the number of MACs and weights; you care about how much energy it consumes, because that affects battery life, and how quickly it reacts, which is basically a measure of latency. And hopefully you haven't forgotten: data movement is expensive, so the cost really depends on how you move the data through the system. The key takeaway here is to remember where the energy comes from: it's the data movement. It's not how many weights or MACs you have, but where each weight comes from. If it comes from a small memory, a register file nearby, it's super cheap; if it comes from off-chip DRAM, it's very expensive. So all weights are not created equal, and all MACs are not created equal; it really depends on the memory hierarchy and the data flow of the hardware. You can't just look at the number of weights and MACs and estimate how much energy will be consumed.

This makes things quite difficult, so within our group we've developed tools to estimate the energy consumption of a neural network itself. In one such tool, which is available on our website, we take in the DNN weights and the input data, including their sparsity, along with the shapes of the different layers of the network; we run an optimization that figures out the memory accesses, the energy consumed by the data movement, and the energy consumed by the multiply-and-accumulate computations; and the output is a breakdown of the energy across the different layers. Once you have this, you can figure out where the energy is going, and target your design to minimize that energy consumption. When we do this, a key observation, which should be no surprise, is that the weights alone are not a good metric for energy consumption: for GoogLeNet running on the Eyeriss architecture, for example, the weights account for only 22% of the overall energy; a lot of the energy goes into moving the input and output feature maps, as well as computation. In general, this is the same message as before: we shouldn't just look at the data movement of one particular data type; we should look at the energy consumption of all the data types to get an overall view of where the energy is actually going.
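Here is a heavily simplified sketch in the spirit of such a tool: the real tool runs an optimization over the memory hierarchy, while this just combines per-layer access counts with per-access costs, and every number below is a placeholder.

```python
# Placeholder relative energy costs (normalized to one MAC).
E_MAC, E_BUFFER, E_DRAM = 1.0, 6.0, 200.0

def layer_energy(macs, buffer_accesses, dram_accesses):
    """Energy of one layer = computation + data movement."""
    return macs * E_MAC + buffer_accesses * E_BUFFER + dram_accesses * E_DRAM

# Hypothetical 3-layer network: per-layer (MACs, buffer, DRAM accesses).
layers = [(1e8, 2e7, 1e6), (3e8, 5e7, 2e6), (5e7, 1e7, 4e6)]
breakdown = [layer_energy(*layer) for layer in layers]
total = sum(breakdown)
for i, e in enumerate(breakdown):
    print(f"layer {i}: {e:.3e} ({100 * e / total:.0f}% of total)")
```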
So once we know where the energy is going, how can we factor that into the design of the neural network to make it more efficient? We talked about the concept of pruning: setting some of the weights of the neural net to zero, or, you can think of it as removing some of the weights. Now that we know where the energy is going, why not incorporate energy into the design of the algorithm, for example to guide where we should actually remove the weights from? Here, on AlexNet, at the same accuracy across the different approaches: traditionally, people remove the weights that are small, what we call magnitude-based pruning, and that gets you about a 2x reduction in energy consumption. But we know that the value of a weight has nothing to do with its energy consumption. Ideally, you'd remove the weights that consume the most energy; in particular, since the more weights we remove, the more the accuracy drops, to get the biggest bang for your buck you want to remove the most energy-hungry weights first. One way to do this: take your neural network, figure out the energy consumption of each layer, sort the layers from high-energy to low-energy, and prune the high-energy layers first. This is what we call energy-aware pruning, and by doing this you get a 3.7x reduction in energy consumption, compared to 2x, at the same accuracy; again, that's because we factored energy consumption into the design of the neural network itself. The pruned models are all available on the Eyeriss website.
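A minimal sketch of that energy-aware ordering; the actual method also retrains between pruning steps and uses the energy-estimation tool above, whereas here the per-layer energies are placeholders and the inner step falls back to simple magnitude pruning.

```python
import numpy as np

def prune_smallest(weights, fraction):
    """Zero out the smallest-magnitude weights in one layer."""
    k = int(fraction * weights.size)
    if k == 0:
        return weights
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

def energy_aware_prune(layers, layer_energies, fraction=0.3):
    """Prune layers in order of decreasing estimated energy."""
    order = np.argsort(layer_energies)[::-1]   # highest energy first
    for idx in order:
        layers[idx] = prune_smallest(layers[idx], fraction)
        # In the real method: retrain, check accuracy, and stop once
        # the accuracy drop exceeds a tolerance.
    return layers

rng = np.random.default_rng(0)
layers = [rng.standard_normal((4, 4)) for _ in range(3)]
energies = [5.0, 1.0, 3.0]                     # placeholder per-layer energy
pruned = energy_aware_prune(layers, energies)
```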
Another important thing we care about from a performance point of view is latency: how long it takes, when I give the system an image, to get the result back. People are very sensitive to latency, but the challenge is that latency, again, is not directly correlated with quantities like the number of multiplies and accumulates. Here is some data released by Google's Mobile Vision team: on the x-axis is the number of multiplies and accumulates, and on the y-axis is the measured latency, the delay before you get a result. What it shows is that the number of MACs is not a good approximation of latency: neural networks with the same number of MACs can have a 2x range in latency, and, looked at the other way, networks with the same latency can have a 3x swing in the number of MACs. So the key takeaway is that you can't just count the MACs and say, "this is how quickly it's going to run"; it's much more challenging than that.

So we asked: is there a way to use latency itself, rather than the number of MACs, to design the neural net? Together with Google's Mobile Vision team, we developed an approach called NetAdapt, which tailors a given neural network to a particular mobile platform under a latency or energy budget; it automatically adapts the network to that platform. What drives the design is empirical measurements: measurements of how the network actually performs on that platform, for things like latency and energy. The reason to use empirical measurements is that you often can't build models of all the different types of hardware out there; in Google's case, when they have a new phone, they want to automatically tune the network for that particular phone without having to model the phone as well.

Here's how it works; I'll walk you through it. You start off with a pretrained network, say one trained in the cloud for very high accuracy, but which tends to be very large. You feed it into the NetAdapt algorithm along with a budget that says, "I can only afford this much latency, or this much energy." NetAdapt generates a set of proposals, different options for how it might modify the network's dimensions, and measures those proposals on the target platform that you care about. Based on these empirical measurements, NetAdapt then generates a new set of proposals, and it iterates until the network meets the budget. All of this is available on the NetAdapt project website.
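A schematic sketch of that loop, with placeholder callables for proposal generation, on-device measurement, and accuracy evaluation; the actual NetAdapt algorithm (including short-term fine-tuning of each proposal) is described in the paper.

```python
def netadapt(network, budget, measure_latency, propose, accuracy):
    """Iteratively adapt a network until it meets a latency budget.

    measure_latency: empirical measurement on the target platform
    propose:         generates candidate networks with reduced dimensions
    accuracy:        evaluates a candidate's accuracy
    """
    while measure_latency(network) > budget:
        candidates = propose(network)
        # Keep only proposals that actually reduce measured latency...
        feasible = [c for c in candidates
                    if measure_latency(c) < measure_latency(network)]
        if not feasible:
            break   # no proposal makes progress toward the budget
        # ...and among those, keep the one that best preserves accuracy.
        network = max(feasible, key=accuracy)
    return network
```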