We have Vivienne Sze here with us. She's a professor here at MIT working in the very important and exciting space of developing energy-efficient and high-performance systems for machine learning, computer vision, and other multimedia applications. This involves the joint design of algorithms, architectures, circuits, and systems to enable optimal trade-offs between power, speed, and quality of results. One of the important differences between the human brain and AI systems is the energy efficiency of the brain, and Vivienne is a world-class researcher at the forefront of discovering how we can close that gap. So please give her a warm welcome.

I'm really happy to be here to share some of our research and an overview of this area of efficient computing. What I'm going to talk about today is going to be a little bit broader than just deep learning. We'll start with deep learning, but we'll also move on to how we might apply this to robotics and other AI tasks, and why it's really important to have efficient computing to enable a lot of these exciting applications. I also want to mention that a lot of the work I'm going to present today was not done by myself but in collaboration with a lot of folks at MIT, and if you want the slides, they're available on our website.

Given that this is the deep learning lecture series, I want to first start by talking a little bit about deep neural nets. We know that deep neural nets have generated a lot of interest and have many very compelling applications, but one of the things that has come to light over the past few years is an increasing need for compute. OpenAI actually showed that there's been a significant increase in the amount of compute required to perform deep learning applications and to do the training for deep learning. It's actually grown exponentially over the past few years; it has
grown, in fact, by over 300,000 times in terms of the amount of compute we need to drive increases in accuracy on a lot of the tasks we're trying to achieve. At the same time, if we look at the environmental implications, all of this processing can be quite severe. If we look at the carbon footprint of training neural nets, and compare it with the carbon footprint of flying across North America from New York to San Francisco, or the carbon footprint of an average human life, you can see that neural networks can be orders of magnitude greater. So the carbon footprint implications of computing for deep neural nets can be quite severe as well.

Now, this mostly has to do with compute in the cloud. Another important direction is moving the compute from the cloud to the edge itself, into the device where a lot of the data is being collected. Why would we want to do that? There are a couple of reasons. First of all, communication: in a lot of places around the world you might not have a very strong communication infrastructure, so you don't necessarily want to rely on a communication network for these applications; removing the tether to the cloud is important. Another reason is that we often apply deep learning to applications where the data is very sensitive. You can think about things like health care, where you're collecting very sensitive data, so privacy and security are really critical; rather than sending the data to the cloud, you'd like to bring the compute to the data itself. Finally, another compelling reason for bringing the compute into the device or into the robot is latency. This is particularly true for interactive applications. You can think of things
like autonomous navigation, robotics, or self-driving vehicles, where you need to interact with the real world. You can imagine that if you're driving very quickly down the highway and you detect an obstacle, you might not have enough time to send the data to the cloud, wait for it to be processed, and get the instruction back. So again, you want to move the compute into the robot or into the vehicle itself.

OK, so hopefully this establishes why we want to move compute to the edge. But one of the big challenges of doing processing in the robot or in the device has to do with power consumption. If we take the self-driving car as an example, it's been reported that it consumes over 2,000 watts of power just for the computation itself, just to process all the sensor data it's collecting. This generates a lot of heat and takes up a lot of space; you can see in this prototype that all the compute is placed in the trunk, and it often needs water cooling. So this can be a big cost and logistical challenge for self-driving vehicles.

You can imagine this becomes much more challenging if we shrink down the form factor of the device to something that's portable, in your hands; think about smaller robots, or something like your smartphone. For these portable devices you have very limited energy capacity, because the battery itself is limited in terms of its size, weight, and cost, so you can't have a very large amount of energy on the device. Secondly, the embedded platforms currently used for these applications tend to consume over 10 watts, which is an order of magnitude higher than the power consumption you would typically allow for
these handheld devices. In handheld devices you're typically limited to under a watt due to heat dissipation; you don't want your cell phone to get super hot.

OK, so in past decades, what we would do to address this challenge is wait for transistors to become smaller, faster, and more efficient. However, this has become a challenge over the past few years: transistors are not getting more efficient. Moore's law, which typically makes transistors smaller and faster, has been slowing down, and Dennard scaling, which made transistors more efficient, has also slowed down or ended. You can see here that over the past 10 years this trend has really flattened out. This is a particular challenge, because we want more and more compute to drive deep neural network applications, but the transistors are not becoming more efficient. So what we have to turn to in order to address this is specialized hardware, to achieve the significant speed and energy efficiency that we require for a particular application. When we talk about designing specialized hardware, this is really about thinking about how we can redesign the hardware from the ground up, targeted at these AI, deep learning, and robotics tasks that we're really excited about.

This notion is not new; in fact, it's become extremely popular over the past few years. There have been a large number of startups and companies focused on building specialized hardware for deep learning. The New York Times reported, I guess two years ago, that there's a record number of startups looking at building specialized hardware for AI and deep learning. So we'll talk a little bit about what specialized hardware looks like for these applications.

Now, if you really care about energy and power efficiency, the first question you should ask is: where is the power actually going
for these applications? As it turns out, power is dominated by data movement. It's actually not the computations themselves that are expensive, but moving the data to the compute engine. For example, shown here in blue is a range of energy consumption for a variety of types of computations, for example multiplications and additions at various different precisions: floating point, fixed point, and integer. As you would expect, as you scale down the precision, the energy consumption of each of these operations is reduced. But what's really surprising is if you look below at the energy consumption of data movement; again, this is delivering the input data to the multiplication and then moving the output of the multiplication somewhere into memory. It can be very expensive. For example, a 32-bit read from an 8-kilobyte SRAM, a very small memory that you would have on the processor or on the chip itself, already consumes 5 picojoules of energy, equivalent to or even more than a 32-bit floating-point multiply, and that's from a very small memory. If you need to read the data from off-chip, outside the processor, for example from DRAM, it's going to be even more expensive; in this particular case we're showing 640 picojoules. Notice that the horizontal axis here is exponential, so we're talking about orders of magnitude increase in energy for data movement compared to the compute itself. That's the key takeaway: if we really want to address the energy consumption of this type of processing, we really want to reduce data movement.
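(Editor's note: a minimal sketch of the energy argument above. The per-operation numbers are illustrative estimates in the spirit of the figures cited in the talk, not exact measurements; the dictionary keys and the `mac_energy` helper are hypothetical names, and real costs vary with process technology.)

```python
# Illustrative per-operation energy costs in picojoules (roughly in line
# with commonly cited 45nm estimates); these exact values are assumptions.
ENERGY_PJ = {
    "int8_add": 0.03,
    "int8_mult": 0.2,
    "fp32_mult": 3.7,
    "sram_32b_read": 5.0,    # small (~8 KB) on-chip SRAM read
    "dram_32b_read": 640.0,  # off-chip DRAM read
}

def mac_energy(data_source: str) -> float:
    """Energy for one 8-bit multiply-accumulate, counting the three data
    reads (weight, activation, partial sum) from the given memory level."""
    compute = ENERGY_PJ["int8_mult"] + ENERGY_PJ["int8_add"]
    data = 3 * ENERGY_PJ[f"{data_source}_32b_read"]
    return compute + data

print(f"MAC fed from SRAM: {mac_energy('sram'):7.2f} pJ")
print(f"MAC fed from DRAM: {mac_energy('dram'):7.2f} pJ")
print(f"DRAM/SRAM ratio:   {mac_energy('dram') / mac_energy('sram'):.0f}x")
```

Even with cheap on-chip SRAM, data delivery already dwarfs the arithmetic; with DRAM it is two orders of magnitude worse, which is the point of the slide.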
OK, but what's the challenge here? If we take a popular AI or robotics application like autonomous navigation, the real challenge is that these applications use a lot of data. For example, one of the things you need to do in autonomous navigation is what we call semantic understanding: you need to be able to identify which pixel belongs to what. In this scene, you need to know that this pixel represents the ground, this pixel represents the sky, and this pixel represents a person. For this important type of processing, if you're traveling quickly you want to run at a very high frame rate, and you might need a large resolution; for HD images you're talking about 2 million pixels per frame. And if you also want to detect objects at different scales, or see objects that are far away, you need to do what we call data expansion, for example building an image pyramid, and this increases the amount of data you need to process by one or two orders of magnitude. So that's a huge amount of data you have to process right off the bat.

Another type of understanding you want for autonomous navigation is what we call geometric understanding: as you're navigating, you want to build a 3D map of the world around you. You can imagine that the longer you travel, the larger the map you're going to build, and again that's more data you're going to have to process and compute on. So this is a significant challenge for autonomous navigation in terms of the amount of data.

Other aspects of autonomous navigation, and also other applications like AR/VR, involve understanding your environment. A typical thing you might need to do is depth estimation: given an image, can you estimate how far away a given pixel is? And also semantic segmentation, which we just talked about. These are important ways to
understand your environment when you're trying to navigate. It should be no surprise that in order to do these types of processing, the state-of-the-art approaches utilize deep neural nets. The challenge is that these deep neural nets often require several hundred million operations and weights to do the computation. If you compare this to something you already have on your phone, for example video compression, you're talking about a two to three orders of magnitude increase in computational complexity. This is a significant challenge, because if we'd like deep neural networks to be as ubiquitous as something like video compression, we really have to figure out how to address this computational complexity. We also know that deep neural networks are not just used for understanding the environment or autonomous navigation; they've really become the cornerstone of many AI applications, from computer vision to speech recognition, game play, and even medical applications, and I'm sure a lot of these have been covered in this course.

So briefly, I'm going to give a quick overview of some of the key components of deep neural nets, not because I doubt you understand them, but because this area is so popular that the terminology can vary from discipline to discipline; a brief overview will align us on the terminology. So what are deep neural nets? You can view them, for example in the context of understanding the environment, as a chain of different layers of processing. For an input image, at the earlier parts of the neural net you're trying to learn different low-level features, such as the edges of an image, and as you get deeper into the network, as you chain more of these computational layers together, you start being able to detect higher and higher level features, until you can recognize, for example, a vehicle. And, you
know, the difference between this approach and more traditional ways of doing computer vision is that how we extract these features is learned from the data itself, as opposed to having an expert come in and say, hey, look for the edges, look for the wheels, and so on. The fact that these features are learned is key to the approach.

OK, what is it doing at each of these layers? It's actually a very simple computation; here we're looking at the inference side of things. Effectively, what it's doing is a weighted sum. You have the input values, which we'll color-code blue here and try to stay consistent with throughout the talk; we apply certain weights to them, and these weights are learned from the training data; and they generate an output, which is typically red here. It's basically a weighted sum, as we can see. We then pass this weighted sum through some form of non-linearity; traditionally these were sigmoids, and more recently we use things like ReLUs, which basically set negative values to zero. The key takeaway is that if you look at this computational kernel, the key operation in a lot of these neural networks is this multiply-and-accumulate that computes the weighted sum, and it accounts for over 90% of the computation. So if we really want to focus on accelerating neural nets or making them more efficient, we want to focus on minimizing the cost of this multiply-and-accumulate.

There are also various popular types of layers used for deep neural networks. They often vary in terms of how you connect up the different layers. For example, you can have feed-forward layers, where the inputs are always connected toward the outputs; you can have feedback, where the outputs are connected back to the inputs; you can have fully connected layers, where all the outputs are connected to all the inputs; or sparsely
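(Editor's note: the weighted-sum-plus-non-linearity kernel described above can be sketched in a few lines. The function names are illustrative, not from the lecture.)

```python
def relu(x):
    """The non-linearity: negative values are set to zero."""
    return max(0.0, x)

def neuron_output(inputs, weights, bias=0.0):
    """One neuron: a weighted sum followed by a non-linearity.
    The loop body is the multiply-and-accumulate (MAC) that the talk says
    accounts for over 90% of the computation."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w  # one MAC
    return relu(acc)

# ReLU(1*0.5 + 2*(-0.25) + 3*0.1) is approximately 0.3
print(neuron_output([1.0, 2.0, 3.0], [0.5, -0.25, 0.1]))
```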
connected. You might be familiar with some of these layers. Fully connected layers, just like we talked about, have all inputs and all outputs connected; they tend to be feed-forward, and when you put them together they're typically referred to as a multi-layer perceptron. You have convolutional layers, which are also feed-forward, but with sparsely connected, weight-sharing connections; when you put them together they're referred to as convolutional neural networks, and they're typically used for image-based processing. You have recurrent layers, with a feedback connection where the output is fed back to the input; when we combine recurrent layers they're referred to as recurrent neural nets, and these are typically used to process sequential data, so speech or language processing. And then most recently, what's become really popular is attention layers, or attention-based mechanisms; these often involve matrix multiplies, which are again multiply-and-accumulates, and when you combine them they're often referred to as transformers.

OK, so let's first get an idea of why convolutional neural nets, or deep learning, are so much more computationally complex than other types of processing. We'll focus on convolutional neural nets as an example, although many of these principles apply to other types of neural nets. The first thing to look at is the computational kernel: how does it actually perform convolution? Let's say you have a 2D input; at the input of the neural net it would be an image, and deeper in the neural net it would be an input feature map, composed of activations, or for an image, composed of pixels. We convolve it with, let's say, a 2D filter, which is composed of weights. In a typical convolution, you do an element-wise multiplication
of the filter weights with the input feature map activations, and you sum them all together to generate one output value, which we refer to as an output activation. Then, because it's a convolution, we slide the filter across the input feature map and generate all the other output feature map activations. This kind of 2D convolution is pretty standard in image processing; we've been doing it for decades. What makes convolutional neural nets much more challenging is the increase in dimensionality. First of all, rather than doing just a 2D convolution, we often stack multiple channels; there's a third dimension called channels, and we need to do a 2D convolution on each of the channels and then add them all together. For an input image, these channels would be the red, green, and blue components, for example, and as you get deeper into the network the number of channels can increase. If you look at AlexNet, which is a popular neural net, the number of channels ranges from 3 to 192. So that already adds one dimension to the neural net in terms of processing.

Another dimension we add is that we apply multiple filters to the same input feature map. For example, you might apply M filters to the same input feature map, and then you would generate an output feature map with M channels. In the previous slide we showed that convolving one 3D filter generates one output channel in the output feature map; if we apply M filters, we're going to generate M output channels in the output feature map. Again, to give you an idea of the scale: in AlexNet we're talking about between 96 and 384 filters, and it increases to thousands for more modern neural nets.
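(Editor's note: a minimal sketch of the multi-channel convolution just described: each input channel is convolved with its own 2D filter plane and the results are summed into one output channel. Stride 1, no padding; the function name is illustrative.)

```python
def conv2d_multichannel(ifmap, filt):
    """Sum of per-channel 2D convolutions producing ONE output channel.
    ifmap: C x H x W nested lists; filt: C x R x S nested lists."""
    C, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    R, S = len(filt[0]), len(filt[0][0])
    out = [[0.0] * (W - S + 1) for _ in range(H - R + 1)]
    for y in range(H - R + 1):          # slide the window vertically
        for x in range(W - S + 1):      # ...and horizontally
            acc = 0.0
            for c in range(C):          # sum across input channels
                for i in range(R):
                    for j in range(S):
                        acc += ifmap[c][y + i][x + j] * filt[c][i][j]
            out[y][x] = acc             # one output activation
    return out

# Two channels of a 3x3 input, all ones, convolved with a 2x2 all-ones
# filter per channel: every output value is 2 channels * 4 taps = 8.
ifmap = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
filt = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
print(conv2d_multichannel(ifmap, filt))
```

Applying M such filters to the same input, as described above, would simply repeat this for M different `filt` tensors, stacking the results into M output channels.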
Then finally, often you want to process more than one image at a given time, so we can extend this further: N input feature maps become N output feature maps. We typically refer to this as the batch size, the number of images you're processing at the same time, and this can range from 1 to 256.

So these are the various dimensions of the neural net, and what someone really does when they define what we call the network architecture is select, or define, the shape of the neural network for each of the different layers: all of these different dimensions. These shapes can vary across the layers. To give you an idea, if you look at MobileNet as an example, which is a very popular neural net, you can see that the filter sizes, meaning the height and width of the filters, and the number of filters and number of channels, vary across the different blocks or layers.

The other thing I want to mention is that when we look at popular DNN models we can see important trends. Shown here are various models developed over the years that are quite popular. One interesting trend is that the networks tend to become deeper; you can see the number of convolutional layers getting deeper and deeper. Also, the number of weights they're using and the number of MACs are increasing as well. So this is an important trend: DNN models are getting larger and deeper, they're becoming much more computationally demanding, and we need more sophisticated hardware to be able to process them.

All right, so that was a quick overview of the deep neural network space; I hope we're all aligned.
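(Editor's note: the layer dimensions just enumerated, batch N, filters M, channels C, input H x W, filter R x S, determine the output shape, weight count, and MAC count. A sketch, assuming stride 1 and no padding; the function name and the example layer are hypothetical, and in particular AlexNet's real first layer uses stride 4, which this simplification ignores.)

```python
def conv_layer_stats(N, M, C, H, W, R, S):
    """Shapes and work for one conv layer (stride 1, no padding).
    N: batch, M: # filters (output channels), C: input channels,
    H x W: input feature map size, R x S: filter size."""
    P, Q = H - R + 1, W - S + 1           # output feature map size
    macs = N * M * P * Q * C * R * S      # each output needs C*R*S MACs
    weights = M * C * R * S
    return {"output_shape": (N, M, P, Q), "macs": macs, "weights": weights}

# Hypothetical AlexNet-like layer: batch 1, 96 filters, 3 input channels,
# 227x227 input, 11x11 filters (ignoring the real layer's stride).
stats = conv_layer_stats(N=1, M=96, C=3, H=227, W=227, R=11, S=11)
print(stats["output_shape"], f'{stats["macs"]:,} MACs, {stats["weights"]:,} weights')
```

Even this one layer runs to over a billion MACs at stride 1, which is why the MAC kernel dominates the computation.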
So the first thing I'm going to talk about is how we can actually build hardware to make the processing of these neural networks more efficient and run faster; we often refer to this as hardware acceleration. We know these neural networks are very large and there's a lot of compute, but are there properties we can leverage to make processing these networks more efficient? The first thing that's really friendly is that they exhibit a lot of parallelism: all these multiplies and accumulates can actually be done in parallel. That's great, because it means high throughput, or high speed, is possible. What is difficult, and what should not be a surprise now, is that the memory access is the bottleneck: delivering the data to the multiply-and-accumulate engine is what's really challenging.

I'll give you some insight into why this is the case. Take this multiply-and-accumulate engine, what we call a MAC. It takes in three inputs for every MAC: the filter weight; the input image pixel, or if you're deeper in the network, the input feature map activation; and the partial sum, which is the partially accumulated value from the previous multiply it did. It then generates an updated partial sum. So for every MAC you perform, you need four memory accesses; it's a four-to-one ratio of memory accesses to compute. The other challenge, as we mentioned, is that moving data is very expensive. In the absolute worst case, which you would always try to avoid, you read the data from DRAM, the off-chip memory; every time you access data from DRAM, it's two orders of magnitude more expensive than the computation of performing the MAC itself. That's really, really bad. So if you can
imagine, looking again at AlexNet, which has around 700 million MACs, we're talking about three billion DRAM accesses to do that computation. OK, but all is not lost; there are things we can exploit to help with this problem. One is what we call input data reuse opportunities, which means that a lot of the data we're reading for these multiplies and accumulates is actually used by many multiplies and accumulates; if we read the data once, we can reuse it multiple times across many operations.

I'll show you some examples. First is what we call convolutional reuse: if you remember, we're taking a filter and sliding it across the input image, so the activations from the feature map and the weights from the filter are going to be reused in different combinations to compute the different MACs. So there's a lot of this convolutional reuse. Another example: recall that we apply multiple filters to the same input feature map, which means each activation in that input feature map can be reused multiple times across the different filters. Finally, if we process many images, or many feature maps, at the same time, a given weight in the filter can be reused multiple times across those input feature maps; that's what we call filter reuse.

So there are a lot of these great reuse opportunities in the neural network itself. What can we do to exploit them? What we can do is build what we call a memory hierarchy, which contains very low-cost memories that allow us to reduce the overall cost of moving this data. What do we mean here? If I build a multiply-and-accumulate engine, I'm going to place a very small memory right beside the multiply-and-accumulate engine.
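(Editor's note: the reuse opportunities just listed can be counted directly from the layer dimensions. A sketch, assuming stride 1 and no padding; the function and field names are illustrative, not from the lecture.)

```python
def reuse_counts(N, M, C, H, W, R, S):
    """How often each piece of data can be reused in a conv layer
    (stride 1, no padding). These counts are what a memory hierarchy
    tries to exploit: read once, use many times.
    N: batch, M: filters, C: channels, H x W: input, R x S: filter."""
    P, Q = H - R + 1, W - S + 1
    return {
        # each filter weight participates in every output position of
        # every image in the batch (filter + convolutional reuse)
        "uses_per_weight": N * P * Q,
        # each input activation is reused across all M filters (and also,
        # within one filter, by the sliding windows that overlap it)
        "uses_per_activation_across_filters": M,
        # each output value accumulates C*R*S products into one partial sum
        "accumulations_per_output": C * R * S,
    }

print(reuse_counts(N=4, M=96, C=3, H=227, W=227, R=11, S=11))
```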
By small, I mean something on the order of under a kilobyte of memory, locally, beside that multiply-and-accumulate engine. Why do I want that? Because accessing that very small memory is very cheap. For example, if performing a multiply-and-accumulate with the ALU costs 1x, reading from this very small memory beside the multiply-and-accumulate engine costs roughly that same amount of energy. I can also allow these processing elements, where a processing element is the multiply-and-accumulate plus the small memory, to share data with each other; reading from a neighboring processing element is going to cost about 2x the energy. Then you can have a shared, larger memory called a global buffer, which is shared across all the different processing elements; this tends to be larger, between 100 and 500 kilobytes, and more expensive, at about 6x the energy. And of course, if you go off-chip to DRAM, that's going to be the most expensive, at 200x the energy.

So the way you can think about this is: what you would ideally like to do is access all of the data from this very small local memory. The challenge is that this very small local memory is only 1 kilobyte, but we're talking about neural networks with millions of weights. So how do we go about doing that? As an analogy, you can imagine that accessing something from, let's say, your backpack is going to be much cheaper than getting it from your neighbor, or going back to your office somewhere on campus for the data, versus going all the way back home. Ideally you'd like to access all of your data from your backpack, but if you have a lot of work to do, you might not be able to fit it all in your backpack.
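(Editor's note: the 1x / 2x / 6x / 200x multipliers above can be turned into a toy cost model that shows why serving most reads locally matters. The level names, the access-mix example, and the helper function are hypothetical; only the multipliers come from the talk.)

```python
# Relative energy cost of delivering one value to the MAC
# (1x = the MAC computation itself), per the multipliers in the talk.
RELATIVE_COST = {
    "local_pe_memory": 1,   # <1 KB memory inside the processing element
    "neighbor_pe": 2,       # fetched from an adjacent processing element
    "global_buffer": 6,     # shared 100-500 KB on-chip buffer
    "dram": 200,            # off-chip memory: avoid whenever possible
}

def total_access_energy(access_counts):
    """Sum relative energy given how many reads hit each level."""
    return sum(RELATIVE_COST[level] * n for level, n in access_counts.items())

# Hypothetical mix: one million reads served mostly from the local
# memory, versus the same million reads served entirely from DRAM.
mostly_local = total_access_energy(
    {"local_pe_memory": 900_000, "global_buffer": 99_000, "dram": 1_000})
all_dram = total_access_energy({"dram": 1_000_000})
print(f"energy ratio: {all_dram / mostly_local:.0f}x")
```

The point of the backpack analogy: the same work, with the same number of reads, can differ by roughly two orders of magnitude in energy depending on which level serves the data.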
So the question is: how can I break up my large piece of work into smaller chunks so that I can access them all from this small memory? That's the big challenge, and there's been a lot of research in this area on the best way to break up the data and on what to store in this very small local memory.

One approach is what we call weight stationary. The idea is: I'm going to store the weight information of the neural net in this small local memory, and as a result I really minimize the weight energy. The challenge is that the other types of data in the system, the input activations shown in blue and the partial sums shown in red, still have to move through the rest of the system, through the network from the global buffer. Typical well-known designs that use this kind of data flow, called weight stationary because the weights remain stationary, are things like the TPU from Google and the NVDLA accelerator from NVIDIA.

Another approach people take is to say: well, the weight I only have to read, but the partial sum I have to both read and write, because I read the partial sum, accumulate into it, and write it back, so there are two memory accesses for that data type; maybe I should put the partial sum locally in that small memory instead. This is what we call output stationary, because the accumulation of the output stays local within one processing element and doesn't move. The trade-off, of course, is that the activations and weights now have to move through the network. There are various works, for example from KU Leuven and from the Chinese Academy of Sciences, that use this approach. Another line of work says, well,
forget about the outputs and the weights; let's keep the inputs stationary in this small memory. This is called input stationary, and some research work from NVIDIA has examined it. But all of these approaches really focus on not moving one particular type of data: they minimize either weight energy, or partial-sum energy, or input energy. What's important to think about is that maybe you want to reduce the data movement of all the different data types.

So another approach, something we've developed within our own group, is what we call the row stationary data flow. Within each of the processing elements you do one row of the convolution, and that row involves a mixture of all the different data types: you have filter information, so the weights of the filter; you have the activations of your input feature map; and you have your partial-sum information. So you're really trying to balance the data movement of all the different data types, not just one particular data type. This is just performing one row, but we just talked about the fact that a neural network is much more than a 1D convolution, so you can imagine expanding this to higher dimensions; this shows how you might expand the 1D convolution into a 2D convolution, and there are higher dimensions you can map onto the architecture as well. I won't go through the details, but the key takeaway is that you might not want to focus on one particular data type; you want to optimize for all the different types of data you're moving around in your system.

These results show how the different data flows compare. For example, in the weight stationary case, as expected, the weight energy, the energy required to move the weights, shown
in green, is the lowest, but the red portion, which is the energy of the partial sums, and the blue part, which is the input feature map or input pixels, are very high. Output stationary is another approach; as we discussed, you're trying to reduce the data movement of the partial sums, shown here in red, so the red part is really minimized, but you can see that the green part, the weight movement, increases, and the blue part, the inputs, increases too. There's another option called no local reuse, which we don't have time to talk about, but you can see that row stationary really aims to balance the data movement of all the different data types. The big takeaway is that when you're trying to optimize a given piece of hardware, you don't want to optimize for just one particular type of data; you want to optimize overall, for all the movement in the hardware itself.

Another thing you can exploit to save a bit of power is the fact that some of the data can be zero. We know that anything multiplied by zero is going to be zero, so if you know that one of the inputs to your multiply-and-accumulate is zero, you might as well skip that multiplication; in fact, you might as well skip reading the other input to that multiply-and-accumulate. By doing that, you can actually reduce the power consumption by almost 50 percent. Another thing you can do, if you have a bunch of zeros, is compress the data. For example, you can use things like run-length encoding, where a run of zeros is represented, rather than as "zero, zero, zero, zero, zero", as simply "a run of five zeros", and this can reduce the amount of data movement by up to 2x in your system.
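(Editor's note: a minimal sketch of the zero-aware run-length encoding just described. Real accelerator compression schemes use packed bit-level formats; this list-of-pairs version, and the function name, are simplifications for illustration.)

```python
def rle_compress(values):
    """Run-length encode the zeros: a run of zeros becomes the pair
    (0, run_length); nonzero values pass through as (value, 1)."""
    out = []
    run = 0
    for v in values:
        if v == 0:
            run += 1                 # extend the current run of zeros
        else:
            if run:
                out.append((0, run)) # flush the run before the nonzero
                run = 0
            out.append((v, 1))
    if run:
        out.append((0, run))         # flush a trailing run
    return out

data = [7, 0, 0, 0, 0, 0, 3, 0, 0, 1]
print(rle_compress(data))  # [(7, 1), (0, 5), (3, 1), (0, 2), (1, 1)]
```

Ten values shrink to five pairs here; the more zeros the ReLU or pruning produces, the better the compression, and hence the less data movement.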
In fact, in neural nets there are many opportunities to generate zeros. First of all, if you remember the ReLU, it sets negative values to zero, so it naturally generates zeros, and there are other techniques, for example what we call pruning, which sets some of the weights of the neural net to zero, so the hardware can exploit all of that. OK, so what is the impact of all of these techniques? We actually looked at building hardware, in particular a customized chip that we called Eyeriss, to demonstrate these approaches, in particular the row stationary dataflow and exploiting sparsity in the activation data. The Eyeriss chip has 14 by 12, so 168, processing elements. You can see there's a shared buffer that's 100 kilobytes, and it has compression and decompression before data goes to off-chip DRAM, and again that's because accessing DRAM is the most expensive. Shown here on the right-hand side is a die photo of the fabricated chip itself; it's 4 millimeters by 4 millimeters in size. Using the row stationary dataflow, it exploits a lot of data reuse: it reduces the number of times we access the global buffer by 100x, and it reduces the number of times we access off-chip memory by over 1000x. This is all because each of the processing elements has a local memory that it tries to read most of its data from, and it's also sharing data with other processing elements. Overall, when you compare it to a mobile GPU, you're talking about an order of magnitude reduction in energy consumption. If you'd like to learn a little bit more about that, I invite you to visit the Eyeriss project website. OK, so this is great, we can build custom hardware, but what does this actually mean in terms of building a system that can efficiently compute neural nets? Let's take a step back and say we don't care about the hardware at all; we're a systems provider and we want to build
an overall system, and what we really care about is the trade-off between energy and accuracy; that's the key thing. Shown here is a plot, let's say for an object detection task. Accuracy is on the x-axis, listed in terms of average precision, which is a metric we use for object detection; it's on a linear scale, and higher is better. Vertically we have energy consumption; this is the energy consumed per pixel, so you average it, and you can imagine a higher resolution image consumes more energy; it's on an exponential scale. Let's first start on the accuracy axis. Before neural nets had their resurgence, around 2011 or 2012, the state-of-the-art approaches actually used features called histograms of oriented gradients, which we refer to as HOG; this was a very popular approach that was very efficient for object detection. The reason neural nets really took off is because they really improved the accuracy: you can see AlexNet here almost doubled the accuracy, and then VGG further increased it, so that's super exciting. But we also want to look at the vertical axis, the energy consumption, and I should mention that for each of these dots, the energy numbers are measured on specialized hardware that's been designed for that particular task. So we have a chip here, built in a 65 nanometer CMOS process, so the same transistors of around the same size, that does object detection using the HOG features, and then here's the Eyeriss chip that we just talked about. I should also note that both of these chips were built in my group; the students who built them started designing the chips at the same time and taped out at the same time, so it's somewhat of a controlled experiment in terms of optimization. OK, so what does this tell us? When we look on the energy axis, we can see that histogram of oriented gradients, or HOG, features are actually very efficient from an energy point of view. In fact, if we compare it to something like video compression, again something you all have in your phone, HOG features are actually more efficient than video compression, meaning for the same energy you would spend compressing a pixel, you could actually understand that pixel. That's pretty impressive. But if we start looking at AlexNet or VGG, we can see that the energy increases by two to three orders of magnitude, which is quite significant. I'll give you an example: if I told you that on your cell phone I'm going to double the accuracy of its recognition, but your phone will die three hundred times faster, who here would be interested in that technology? Right, exactly, nobody, in the sense that battery life is so critical to how we actually use these types of technologies. So we should not just look at the accuracy, the x-axis; we should also consider the energy consumption, and we really don't want the energy to be so high. And we can see that even with specialized hardware, we're still quite far from making neural nets as efficient as something like the video compression you all have on your phones. So we really have to think about how we can push the energy consumption further down, without sacrificing accuracy, of course. OK, so there's actually been a huge amount of research in this space, because we know neural nets are popular and have a wide range of applications, but energy is really a big challenge. People have looked at how we can design new hardware that is more efficient, and how we can design algorithms that are more efficient, to enable energy-efficient processing of DNNs. In fact, within our own research group we've spent
quite a bit of time surveying the area and understanding the various developments people have been looking at. So if you're interested in this topic, we've generated various tutorials on this material as well as overview papers; there's an overview paper that's about 30 pages, which we're currently expanding into a book, so I'd encourage you to visit those resources. But the main thing we learned while doing this survey is that we identified various limitations in how the research is approaching this problem. So first let's look at the algorithm side. There's a wide range of approaches people are using to try to make DNN algorithms or models more efficient. For example, we mentioned the idea of pruning: the idea here is that you set some of the weights to zero, and again, anything times zero is zero, so you can skip those operations; there's a wide range of research there. There's also work on efficient network architectures, meaning rather than making the neural network's high-dimensional convolutions very large, can I decompose them into smaller filters? So rather than this 3D filter, can I make it a 2D filter plus a 1-by-1 filter that goes into the screen, along the channel dimension? Another very popular approach is reduced precision: rather than using the default 32-bit float, can I reduce the number of bits down to eight bits, or even binary? And we saw before that as we reduce the precision of these operations, you get energy savings, and you also reduce data movement, because you have to move less data. A lot of this work really focuses on reducing the number of MACs and the number of weights, primarily because those are easy to count.
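To make the filter-decomposition idea concrete, here's a quick back-of-the-envelope comparison of multiply-accumulate counts. The layer shape (56x56 feature map, 128 input and output channels, 3x3 kernel) is a hypothetical example, not one from the lecture:

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard convolution: every output pixel sees a k x k x c_in filter."""
    return h * w * c_out * k * k * c_in


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a depthwise (k x k per channel) plus pointwise (1 x 1) decomposition."""
    depthwise = h * w * c_in * k * k      # one 2D filter per input channel
    pointwise = h * w * c_in * c_out      # 1x1 conv mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 56x56 feature map, 128 -> 128 channels, 3x3 kernel
std = standard_conv_macs(56, 56, 128, 128, 3)
sep = depthwise_separable_macs(56, 56, 128, 128, 3)
print(round(std / sep, 1))  # 8.4, i.e. ~8x fewer MACs for this shape
```

The ratio works out to k^2 * c_out / (k^2 + c_out), so the saving grows with the number of output channels.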
But the question we should be asking, if we care about the system, is whether this actually translates into energy savings and reduced latency, because from a systems point of view those are the things we care about. When you're thinking about something running on your phone, you don't care about the number of MACs and weights; you care about how much energy it's consuming, because that's going to affect the battery life, and how quickly it reacts, which is basically a measure of latency. And again, hopefully you haven't forgotten that data movement is expensive, so it really depends on how you move the data through the system. The key takeaway from this slide is that if you remember where the energy comes from, which is the data movement, it's not about how many weights or how many MACs you have; it really depends on where the weight comes from. If it comes from a small memory, a register file that's nearby, it's going to be super cheap, as opposed to coming from off-chip DRAM. So all weights are basically not created equal, and all MACs are not created equal; it really depends on the memory hierarchy and the dataflow of the hardware itself. So we can't just look at the number of weights and the number of MACs and estimate how much energy is going to be consumed, and this is quite a difficult challenge. So within our group we've looked at developing tools that allow us to estimate the energy consumption of the neural network itself. For example, this particular tool, which is available on our website, basically takes in the DNN weights and the input data, including its sparsity; it knows the shapes of the different layers of the neural net, and it runs an optimization that figures out the memory accesses, how much energy is consumed by the data movement, and the energy consumed by the multiply-and-accumulate computations. The output is then a breakdown of the energy for the different layers.
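A toy version of that kind of energy model looks like this. The relative costs per access are made-up placeholders; the real tool uses measured numbers and models the full memory hierarchy, but the key property, the steep cost gradient from register file to DRAM, is the same:

```python
# Hypothetical relative energy costs, normalized so one MAC = 1 unit.
# The actual numbers depend on the process technology and memory sizes.
COST = {"mac": 1, "register_file": 1, "global_buffer": 6, "dram": 200}


def layer_energy(macs, accesses):
    """Estimate layer energy from the MAC count plus data-movement counts.

    `accesses` maps each memory level to the number of reads/writes
    for weights, inputs, and partial sums at that level.
    """
    energy = macs * COST["mac"]
    for level, count in accesses.items():
        energy += count * COST[level]
    return energy


# Two hypothetical layers with the SAME number of MACs but different
# data movement: the DRAM-heavy one costs roughly 9x more energy.
cheap = layer_energy(10_000, {"register_file": 30_000, "global_buffer": 500, "dram": 20})
pricey = layer_energy(10_000, {"register_file": 30_000, "global_buffer": 500, "dram": 2_000})
print(cheap, pricey)
```

The point of the sketch is exactly the one made above: two layers with identical MAC and weight counts can have wildly different energy, depending on where their data lives.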
Once you have this, you can figure out where the energy is going, so you can target your design to minimize that energy consumption. And when we do this, it should be no surprise what the key observation from this exercise is: the weights alone are not a good metric for energy consumption. If you take a look at GoogLeNet, for example, running on the Eyeriss architecture, the weights only account for 22 percent of the overall energy; in fact, a lot of the energy goes into moving the input feature maps and the output feature maps, and also into computation. So in general this is the same message as before: we shouldn't just look at the data movement of one particular data type; we should look at the energy consumption of all the different data types to get an overall view of where the energy is actually going. OK, and once we actually know where the energy is going, how can we factor that into the design of the neural networks to make them more efficient? So we talked about the concept of pruning: again, pruning was setting some of the weights of the neural net to zero, or you can think of it as removing some of the weights. What we want to do here is, now that we know where the energy is going, incorporate the energy into the design of the algorithm, for example to guide where we should actually remove the weights from. So for example, here on AlexNet, for the same accuracy across the different approaches: traditionally, people tend to remove the weights that are small, which we call magnitude-based pruning, and you can see that you get about a 2x reduction in energy consumption. However, we know that the value of the weight has nothing to do with its energy consumption. Ideally, what you'd like to do is remove the weights that consume the most energy.
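One simple way to operationalize "remove the most energy-hungry weights first" is to rank layers by their estimated energy and contrast that with the classic magnitude-based baseline. The per-layer energies and weight values below are hypothetical stand-ins for the tool's measured numbers:

```python
def energy_aware_prune_order(layer_energies):
    """Return layer names sorted from highest to lowest estimated energy.

    Energy-aware pruning visits the most expensive layers first,
    so each removed weight buys the largest energy saving.
    """
    return sorted(layer_energies, key=layer_energies.get, reverse=True)


def magnitude_prune(weights, fraction):
    """Classic baseline for contrast: zero out the smallest-magnitude weights."""
    k = int(len(weights) * fraction)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0
    return pruned


# Hypothetical per-layer energy breakdown (arbitrary units)
energies = {"conv1": 120, "conv2": 340, "conv3": 80, "fc1": 210}
print(energy_aware_prune_order(energies))        # ['conv2', 'fc1', 'conv1', 'conv3']
print(magnitude_prune([0.5, -0.1, 2.0, 0.05], 0.5))  # [0.5, 0, 2.0, 0]
```

The magnitude baseline looks only at weight values; the energy-aware ordering instead uses the per-layer energy breakdown, which is exactly the information the estimation tool provides.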
In particular, we also know that the more weights we remove, the more the accuracy goes down, so to get the biggest bang for your buck, you want to remove the weights that consume the most energy first. One way to do this is to take your neural network, figure out the energy consumption of each of its layers, sort the layers from high-energy layers to low-energy layers, and then prune the high-energy layers first. This is what we call energy-aware pruning, and by doing this you now get a 3.7x reduction in energy consumption, compared to 2x, for the same accuracy. And again, this is because we factored energy consumption into the design of the neural network itself; the pruned models are all available on the Eyeriss website. Another important thing we care about from a performance point of view is latency: for example, latency has to do with how long it takes, when I give it an image, before I get the result back. People are very sensitive to latency, but the challenge here is that latency, again, is not directly correlated with things like the number of multiplies and accumulates. Here is some data that was released by Google's mobile vision team. They're showing on the x-axis the number of multiply-accumulates, increasing as you go towards the left, and on the y-axis the latency, the actual measured delay it takes to get a result. What they're showing is that the number of MACs is not really a good approximation of latency: for example, given neural network layers with the same number of MACs, there can be a 2x range in latency, or, looking at it the other way, given layers or networks with the same latency, there can be a 3x swing in the number of MACs. All right, so the key takeaway here is that you can't
just count the number of MACs and say, oh, this is how quickly it's going to run; it's actually much more challenging than that. So what we want to ask is whether there is a way to take latency and use that, rather than MACs, to design the neural net directly. Together with Google's mobile vision team, we developed an approach called NetAdapt, and this is really a way to tailor your particular neural network for a given mobile platform under a latency or energy budget. It automatically adapts the neural net for that platform, and what's really driving the design is empirical measurements: measurements of how that particular network performs on that platform, for things like latency and energy. The reason we want to use empirical measurements is that you often can't generate models for all the different types of hardware out there. In Google's case, what they want is that if they have a new phone, they can automatically tune the network for that particular phone, without having to model the phone as well. OK, so how does this work? I'll walk you through it. You start off with a pretrained network, so this is a network that's, let's say, trained in the cloud for very high accuracy; great, start off with that, but it tends to be very large. You take that into the NetAdapt algorithm along with a budget; the budget tells you, I can only afford this much latency, or this amount of energy. What NetAdapt will do is generate a bunch of proposals, different options for how it might modify the network in terms of its dimensions; it's going to measure these proposals on the target platform that you care about, and then, based on these empirical measurements, NetAdapt will generate a new set of proposals, and it will just iterate like this until it meets
the budget. OK, and again, all of this is on the NetAdapt website. Just to give you a quick example of how this might work: let's say you start off with a neural network as your input that has the accuracy you want, but the latency is 100 milliseconds, and you would like it to be 80 milliseconds; you want it to be faster. So NetAdapt is going to generate a bunch of proposals, and what a proposal could involve is taking one layer of the neural net and reducing the number of channels until it hits the latency budget of 80 milliseconds; it can do that for each of the different layers. Then it's going to tune these different proposals and measure the accuracy. So let's say the one where I shortened the number of channels in layer one maintains the accuracy at 60 percent; that means I'm going to pick that one, and that's going to be the output of this iteration. That output, at 80 milliseconds and hitting 60 percent accuracy, becomes the input to the next iteration, and then I tighten the budget. OK, again, if you're interested, I invite you to take a look at the NetAdapt paper. But what is the impact of this particular approach? Well, it gives you a very much improved trade-off between latency and accuracy. If you look at this plot, on the x-axis is the latency, so to the left is better, lower latency, and on the y-axis is the accuracy, so higher is better; you want to be high and to the left. Shown in blue and green are various handcrafted neural network approaches, and you can see NetAdapt, which generates the red dots as it iterates through its optimization, gives you, for the same accuracy, up to 1.7x faster performance than a manually designed approach.
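The proposal-measure-pick loop described above can be sketched as one NetAdapt-style step. Everything here is a toy stand-in: the "network" is just a list of channel counts, and the latency model, proposal generator, and accuracy proxy are fake placeholders for the real empirical measurements and fine-tuning:

```python
def netadapt_step(network, budget, measure_latency, shrink_proposals, finetune_and_score):
    """One simplified NetAdapt-style iteration.

    Generate shrunken candidates, keep those meeting the (empirically
    measured) latency budget, and return the one with the best score.
    """
    candidates = [p for p in shrink_proposals(network) if measure_latency(p) <= budget]
    return max(candidates, key=finetune_and_score)


# Toy stand-ins: a "network" is just a list of channel counts per layer.
net = [64, 128, 256]
measure = lambda n: sum(n) // 5                        # fake latency model
proposals = lambda n: [n[:i] + [n[i] // 2] + n[i + 1:]  # halve one layer at a time
                       for i in range(len(n))]
score = lambda n: sum(n)                               # fake accuracy proxy
print(netadapt_step(net, 80, measure, proposals, score))  # [64, 64, 256]
```

Each real iteration would fine-tune the surviving candidates before scoring them, and the winner becomes the input to the next iteration under a tighter budget.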
This approach also falls under the umbrella of network architecture search, so it's in that flavor as well. But in general, the takeaway here is that if you're going to design efficient neural networks, networks that you want to run quickly or that you want to be energy-efficient, you should really put the hardware into the design loop and take accurate energy or latency measurements into the design of the neural network itself. This particular example was shown for an image classification task, meaning I give you an image and you can classify what's in the image. You can imagine that that type of approach is kind of reducing information: from a 2D image, you reduce it down to a label. This is very commonly used now, but we actually wanted to see if we could apply this approach to a more difficult task, something like depth estimation. In this case, I give you a 2D image, and the output is also a 2D image, where each pixel of the output shows the depth of the corresponding pixel at the input. This is often what we refer to as monocular depth: I give you just a single 2D image as input, and you can estimate the depth from it. The reason you want to do this is that 2D cameras, regular cameras, are pretty cheap, so it would be ideal to be able to do this. The way we do this is to use an autoencoder: the front half of the neural net is what we call the encoder, a reduction element, very similar to what you would do for classification, but the back end of the autoencoder is a decoder, which expands the information back out. And as I mentioned, this is going to be much more difficult than just classification, because now the output also has to be very dense. So we wanted to see if we could make this really fast with the approaches we just talked
about, for example NetAdapt. And indeed you can make it pretty fast: if you apply NetAdapt plus compact network design and then do some depthwise decomposition, you can actually increase the frame rate by an order of magnitude. So again, here I'm showing a plot where the x-axis is the frame rate on a Jetson TX2 GPU, measured with a batch size of one using 32-bit float, and on the vertical axis is the depth estimation accuracy in terms of the delta-1 metric, which means the percentage of pixels that are within 25 percent of the correct depth, so higher is better. You can see the various different approaches out there; the red star is our approach, called FastDepth, using all the efficient network design techniques we talked about, and you can see you get an order of magnitude, over a 10x, speedup while maintaining accuracy. The models and all the code to do this are available on the FastDepth website. We presented this at ICRA, which is a robotics conference, in the middle of last year, and we wanted to show some live footage there, so at ICRA we actually captured footage on an iPhone and showed real-time depth estimation running on the iPhone itself; you can achieve about 40 frames per second on an iPhone using FastDepth. So if you're interested in this particular type of application, or in efficient networks for depth estimation, I invite you to visit the FastDepth website.
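As an aside, the delta-1 accuracy metric used above is easy to state in code; the 25 percent tolerance corresponds to the standard threshold of 1.25, and the flat pixel lists here are a simplification of the 2D depth maps:

```python
def delta1(predicted, ground_truth, threshold=1.25):
    """Fraction of pixels whose predicted depth is within 25% of the true depth.

    Standard delta-1 definition: max(pred/true, true/pred) < threshold.
    """
    ok = sum(1 for p, t in zip(predicted, ground_truth)
             if max(p / t, t / p) < threshold)
    return ok / len(ground_truth)


pred = [1.0, 2.0, 5.0, 0.5]    # hypothetical predicted depths (meters)
true = [1.1, 2.0, 3.0, 0.52]   # hypothetical ground-truth depths
print(delta1(pred, true))      # 0.75: three of four pixels within tolerance
```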
OK, so that's the algorithmic side of things, but let's return to the hardware, to building specialized hardware that is efficient for neural network processing. So again, we saw that there are many different ways of making the neural network efficient, from network pruning to efficient network architectures to reduced precision. The challenge for the hardware designer, though, is that there's no guarantee as to which of these approaches someone might apply to the algorithm they're going to run on the hardware; if you only build the hardware, you don't know what kind of algorithm someone is going to run on it, unless you own the whole stack. So as a result, you really need flexible hardware, so that it can support all of these different approaches and translate them into improvements in energy efficiency and latency. Now, the challenge is that a lot of the specialized DNN hardware that exists out there often relies on certain properties of the DNN in order to achieve high efficiency. A very typical structure you might see is an array of multiply-and-accumulate units, a MAC array, which reduces memory accesses by amortizing reads across the array. What do I mean by that? If I read a weight once from the weight memory bus, I'm going to reuse it multiple times across the array: one read, and it can be used multiple times by multiple engines, for multiple MACs. Similarly for the activation memory: I read an input once and reuse it multiple times. The issue here is that the amount of reuse and the array utilization depend on the number of channels you have in your neural net, the size of the feature map, and the batch size; this is just showing two different variations of the reuse you get based on the number of filters, the number of input channels, the feature map size, and the batch size. And the problem now is that when we start looking at these efficient neural network models, they're not going to have as much reuse, particularly the compact ones. For example, a very typical approach is to use what we call depthwise layers; we saw this when you took that 3D filter and decomposed it into a 2D filter and a 1-by-1. As a result, you only have one channel, so you're not going to have much reuse across the input channels, and rather than filling this array with a lot of computation that it can process, you're only going to be able to utilize a very small
subset of the array, which I've highlighted here in green, for computation. So even though you put down a thousand, or ten thousand, multipliers, only a very small subset of them can actually do work, and that's not great. This is also an issue because as I scale up the array size, it becomes less efficient: ideally, after I put down more cores or processing elements, the system should run faster, since I'm paying for more cores, but it doesn't, because the data can't reach, or be reused by, all of those cores. It can also be difficult to exploit sparsity. So what you need here are two things. One is a very flexible dataflow, meaning that there are many different ways for the data to move through the array; you can imagine that row stationary is a very flexible way to map the neural network onto the array itself, and you can see here, in the Eyeriss row stationary case, that a lot of the processing elements can be used. The other thing is how you actually deliver the data for this varying degree of reuse. So here's the spectrum of on-chip networks, in terms of how I can deliver data from the global buffer to all those parallel processing engines. One use case is when I use these huge neural nets that have a lot of reuse: what I want there is multicast, meaning I read once from the global buffer and then reuse that data multiple times across all of my processing elements; you can think of it as broadcasting information out. The type of network you would use for that is shown here on the right-hand side: it's low bandwidth, so I'm only reading very little data, but high spatial reuse, since many, many engines are using it. On the other extreme, when I design these very efficient neural networks, I'm not going to have very much reuse, and so what I want is unicast, meaning I send out
unique information to each of the processing elements so that they can all work. That's shown here on the left-hand side: a case where you have very high bandwidth, because there's a lot of unique information going out, and low spatial reuse, because they're not sharing data. Now, it's very challenging to go across this entire spectrum. One solution would be what we call an all-to-all network that satisfies all of this, where all inputs are connected to all outputs, but that's going to be very expensive and not scalable. The solution that we have for this is what we call a hierarchical mesh: you break this problem into two levels, where at the lowest level you use an all-to-all connection, and at the higher level you use a mesh connection. The mesh allows you to scale up, while the all-to-all allows you to achieve a lot of different types of reuse, and with this type of network-on-chip you can basically support a lot of different delivery mechanisms for getting data from the global buffer to all the processing elements, so that all of your compute can be happening at the same time. And at its core, this is one of the key things that enables the second version of Eyeriss to be both flexible and efficient. So here are some results from the second version of Eyeriss. It supports a wide range of filter shapes, both very large and very compact, including convolutional, fully connected, and depthwise layers, and you can see in this plot that, depending on the shape, you can get up to an order of magnitude speedup. It also supports a wide range of sparsity levels, both dense and sparse; this is really important because some networks can be very sparse, because you've done a lot of pruning, and some are not, and you want to support all of them efficiently. You also want it to be scalable, so that as you increase the number of processing elements, the throughput also scales up.
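To see why utilization collapses for compact layers on a fixed MAC array, here's a toy mapping calculation. The array dimensions and the mapping rule (rows carry input channels, columns carry output channels, one common weight-stationary-style assignment) are hypothetical; real mappers are far more sophisticated:

```python
def array_utilization(rows, cols, input_channels, output_channels):
    """Fraction of PEs doing useful work when rows map to input channels
    and columns map to output channels (a simplified mapping rule)."""
    used = min(rows, input_channels) * min(cols, output_channels)
    return used / (rows * cols)


# A big standard conv layer fills a hypothetical 16x16 array completely...
print(array_utilization(16, 16, 128, 128))   # 1.0
# ...but a depthwise filter has a single input channel feeding a single
# output channel, so almost the entire array sits idle.
print(array_utilization(16, 16, 1, 1))       # 1/256, under half a percent
```

This is the utilization gap that the flexible dataflow and the hierarchical mesh network-on-chip are designed to close.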
As a result of this particular type of design, you get an order of magnitude improvement in both speed and energy efficiency. All right, so this is great, and this is one way you can speed up neural networks and make them more efficient, but it's also important to take a step back and look beyond just building the specialized hardware, the accelerator itself, both in terms of the algorithms and the hardware. So can we look beyond the DNN accelerator for acceleration? One good place to show this with an example is the task of super resolution. How many of you are familiar with super resolution? All right, for those of you who aren't, the idea is as follows: I want to generate a high resolution image from a low resolution image. Why would you want to do that? Well, there are a couple of reasons. One is that it can allow you to reduce the transmitted bandwidth: for example, if you have limited communication, I can send a low-res version of a video or image to your phone, and then the phone can make it high-res. Another reason is that screens in general are getting larger and larger; every year at CES they announce a higher resolution screen, but if you think about the movies we watch, a lot of them are still at a fixed resolution, 1080p for example, so again you want to generate a high resolution representation of that low resolution input. And the idea here is that the high resolution output is not just interpolation, because interpolation can be very blurry; there are ways to essentially hallucinate a high resolution version of the video or image, and that's what's called super resolution. But one of the challenges for super resolution is that it's computationally very expensive. Again, the state-of-the-art approaches for super resolution use deep neural nets, and a lot of the examples we just talked about involve input images of around 200 by 200
pixels; now imagine extending that to an HD image; it's going to be very, very expensive. So what we want to do is think of ways to speed up the super resolution process not just by making the DNNs faster, but by looking at the other components of the system and seeing if we can make those faster as well. One of the approaches we took is a framework called FAST, where we look at accelerating any super resolution algorithm by an order of magnitude, operating on compressed video. So, before I was a faculty member here, I worked a lot on video compression, and the video compression community looks at video very differently from people who do super resolution. Typically, when you're doing image processing like super resolution, when I give you a compressed video, you basically think of it as a stack of pixels, a bunch of images together. But if you asked a video compression person what a compressed video looks like, the answer is that a compressed video is actually a very structured representation of the redundancy in the video itself. The reason we can compress videos is that consecutive frames look very similar, so the compressed video is telling you which pixels in frame one are related to, or look like, which pixels in frame two; as a result, you don't have to send the pixels of frame two again, and that's where the compression comes from. So what a compressed video actually contains is a description of the structure of the video itself, and you can use this representation to accelerate super resolution. For example, rather than applying super resolution to every single low-res frame, which is the typical approach, where you apply it to each low-res frame and generate a bunch of high-res output frames, what you can actually do is apply super resolution to just one of the low
resolution frames, and then use that free information in the compressed video, the information that tells you the structure of the video, to transfer and generate all the other high resolution frames from it. So super resolution only needs to run on a subset of frames, and the complexity of reconstructing all those high resolution frames, once you have that structure information, is very low: for example, if I transfer to N frames, I get roughly an N-times speedup. To evaluate this, we showcased it on a range of videos; this range of videos is the dataset used to develop video compression standards, so it's quite broad. You can see first, on the left-hand side, that if I transfer to four frames, I get a 4x acceleration, and the PSNR, which indicates the quality, doesn't change: the same quality, but 4x faster. If I transfer to sixteen frames, for a 16x acceleration, there's a slight drop in quality, but you still get basically a 16x acceleration. So the key idea here, again, is that you want to look beyond the processing of the neural network itself, to what's around it, to see if you can speed it up. Usually with PSNR you can't really tell too much about the quality, so another way to look at it is to look at the video itself, the subjective quality. On the left-hand side here is the result of applying super resolution to every single frame, the traditional way of doing it; on the right-hand side is the result of just doing interpolation on every single frame. Where you can tell the difference is by looking at things like the text: you can see that the text is much sharper in the left video than in the right video. Now, running super resolution with FAST is somewhere in between: FAST actually has the same quality as the video on the left-hand side, but it's just as efficient, in terms of processing speed, as the approach on the right-hand side, so it kind of has the best of both worlds.
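The transfer step can be sketched as follows. The one-global-motion-vector-per-frame model here is a deliberate oversimplification of the per-block motion vectors and residuals that a real compressed bitstream carries, and the 2x2 "frame" is just for illustration:

```python
def transfer_super_resolution(anchor_hr, motion_vectors, scale=2):
    """Build high-res frames by shifting an already-upscaled anchor frame.

    `motion_vectors` is a toy stand-in for the motion information the
    compressed video already carries: one (dx, dy) offset per frame,
    applied to every pixel, with wraparound at the borders.
    """
    frames = []
    h, w = len(anchor_hr), len(anchor_hr[0])
    for dx, dy in motion_vectors:
        # Scale the low-res motion vector up to high-res coordinates.
        sx, sy = dx * scale, dy * scale
        frame = [[anchor_hr[(r - sy) % h][(c - sx) % w] for c in range(w)]
                 for r in range(h)]
        frames.append(frame)
    return frames


anchor = [[1, 2], [3, 4]]  # pretend this came from the expensive SR network
hr_frames = transfer_super_resolution(anchor, [(0, 0), (1, 0)], scale=1)
print(hr_frames[0])  # [[1, 2], [3, 4]]: identical to the anchor
print(hr_frames[1])  # [[2, 1], [4, 3]]: content shifted right with wraparound
```

The expensive network runs once on the anchor frame; every transferred frame costs only cheap pixel shuffling, which is where the N-times speedup comes from.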
to accelerate DNNs on a given processor, it's good to look beyond the hardware for the acceleration; you can look at things like the structure of the data that's entering the neural network accelerator. There might be opportunities there, for example here the temporal correlation, that allow you to further accelerate the processing. Again, if you're interested in this, all the code is on the website. So to end this lecture, I want to talk about things that are actually beyond deep neural nets. Neural nets are great, they're useful for many applications, but I think there are a lot of exciting problems outside the space of neural nets as well, which also require efficient computing. The first thing is what we call visual-inertial localization, or visual odometry. This is widely used for robots to figure out where they are in the world: you can imagine that for autonomous navigation, before you navigate the world, you have to know where you actually are in the world, and that's the localization. This is also widely used for things like AR and VR, so you know where you're actually looking in the environment. What does this actually mean? It means you take in a sequence of images, so you can imagine a camera mounted on the robot or the person, as well as an IMU, which gives accelerometer and gyroscope information, and visual-inertial odometry, which is a subset of SLAM, basically fuses this information together. The outcome of visual-inertial odometry is the localization: you can see here that you're trying to estimate where you are in 3D space, and the pose, based on in this case the camera feed, but you can also use the IMU information there as well. And if you're in an unknown environment, you can also generate a map. So this is a very key task in navigation, and the key question is: can you do it in a very energy-efficient way? We've looked at kind of building specialized
hardware to do localization. This is actually the first chip that performs complete visual-inertial odometry on chip; we call it Navion, and it was done in collaboration with Sertac Karaman. You can see the chip itself here: it's about four millimeters by five millimeters, smaller than a quarter, so you can imagine mounting it on a small robot. At the front end it processes the camera information, doing things like feature detection, tracking, and outlier elimination; it also does pre-integration on the IMU; and then on the back end it fuses this information together using a factor graph. And when you compare this particular design, this Navion chip, to mobile or desktop CPUs, you're talking about two to three orders of magnitude reduction in energy consumption, because you have a specialized chip to do it. So what is the key component of this chip that enables it to do so well? Again, sticking with the theme, the key thing is reduction in data movement; in particular, we reduce the amount of data that needs to be moved on and off chip, so all of the processing is located on the chip itself. Furthermore, because we want to reduce the size of the chip and the size of the memories, we do things like apply low-cost compression on the frames and also exploit sparsity, meaning the number of zeros, in the factor graph itself. All of this compression and exploiting of sparsity can reduce the storage cost down to about a megabyte of storage on chip to do this processing, and that allows us to achieve a really low power consumption of below 25 milliwatts. Another thing that really matters for autonomous navigation is, once you know where you are, where are you going to go next? This is a planning and mapping problem, and in the context of things like robot exploration, where you basically explore an unknown area, you can do this by doing what we call a
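The storage saving from exploiting sparsity can be sketched with made-up numbers. This is a generic count of CSR-style sparse storage versus dense storage, not the actual Navion compression scheme, and the matrix size and density are purely illustrative:

```python
# Generic illustration (NOT the Navion compression scheme): storing only the
# non-zeros of a sparse matrix, CSR-style, versus storing every cell densely.
import numpy as np

rng = np.random.default_rng(0)
shape, density = (256, 256), 0.05                # made-up factor-graph-like sizes
mask = rng.uniform(size=shape) < density         # ~5% of entries are non-zero
dense = rng.uniform(size=shape) * mask

nnz = int(np.count_nonzero(dense))
dense_bytes = dense.size * 4                     # fp32 for every cell, zeros included
# CSR-ish cost: 4-byte value + 2-byte column index per non-zero, plus row pointers.
sparse_bytes = nnz * (4 + 2) + (shape[0] + 1) * 4

print(dense_bytes, sparse_bytes)                 # sparse storage is far smaller at 5% density
```

At this density the sparse layout needs only a small fraction of the dense footprint, which is the kind of saving that lets the whole working set stay on chip.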
Shannon mutual information computation: basically, you want to figure out where you should go next such that you will discover the most new information compared to what you already know. So shown here is an occupancy map: the light colors show where there's free space, where nothing is occupied; the dark gray area is unknown; and the black lines are occupied things, like walls, for example. The question is: if this is my current occupancy map, where should I go and scan, say with a depth sensor, to figure out more information about the map itself? What you can do is compute what we call the mutual information of the map based on what you already know, then go to the location with the most information, scan it, and get an updated map. Shown here below is a miniature race car doing exactly that. Over here is the mutual information that's being computed, so it's trying to go to those light areas, the yellow areas, that have the most information; you can see that it's going to back up and come scan this region to figure out more information about it. Okay, so that's great, it's a very principled way of doing this. The problem with this kind of computation, the reason it's been challenging, is again the computation, and in particular the data movement. You can imagine that at any given position you're going to do a kind of 3D scan with your lidar across a wide range of neighboring regions with your beams, and each of the beams in your lidar scan can be processed with a different core, so they can all be processed in parallel. Parallelism, just like in the deep learning case, is very easily available here; the challenge is data delivery. What happens is that you're storing your occupancy map all in one memory, but now you have multiple cores that
are going to try to process the scans on this occupancy map. Typically, for these types of memories, you're limited to two ports, so if you want to have N cores, 16 cores, 32 cores, it's going to be a challenge to read data from this occupancy map and deliver it to the cores themselves. If we take a closer look at the memory access pattern, you can see here that as you scan, the numbers indicate which cycle you would use to read each of the locations on the map, and it's kind of a diagonal pattern. So the question is: can I break this map into smaller memories and then access those smaller memories in parallel? And if I can break it into smaller memories, how should I decide which part of the map goes into which of these memories? Shown here on the right-hand side, the different colors indicate different memories, or different banks of the memory, so they store different parts of the map. Again, if you think of the numbers as the cycle in which each location is accessed, what you'll notice is that for any given color, at most two numbers are the same, meaning I'm only going to access at most two locations in any given bank in any given cycle, so there's no conflict and I can process all of these beams in parallel. By doing this, you can compute the mutual information of the entire map, and this can be a very large map, say 200 meters by 200 meters at 0.1-meter resolution, in under a second. This is very different from before, where you could only compute the mutual information of a subset of locations and then try to pick the best one; now you can compute it on the entire map, so you know the absolute best location to go to to get the most information. This is a 100x speed-up compared to a CPU at one tenth of the power, on an FPGA. So that's another important example of how data
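The banking idea can be sketched in a few lines. This is a toy model, not the actual chip's layout: assume the diagonal schedule described above, where map cell (x, y) is read in cycle t = x + y, and interleave columns across banks. Counting the worst-case accesses any single bank sees in one cycle shows it never exceeds two, which is exactly what a dual-port memory can serve:

```python
# Toy sketch of conflict-free banking (illustrative sizes, not the real design).
from collections import Counter

GRID, BANKS = 16, 8       # 16x16 occupancy map split across 8 memory banks

def cycle_of(x, y):
    return x + y          # diagonal schedule: an anti-diagonal is read per cycle

def bank_of(x, y):
    return x % BANKS      # interleave columns across the banks

worst = 0
for t in range(2 * GRID - 1):
    # All cells scheduled for cycle t lie on the anti-diagonal x + y = t.
    cells = [(x, t - x) for x in range(GRID) if 0 <= t - x < GRID]
    per_bank = Counter(bank_of(x, y) for x, y in cells)
    worst = max(worst, max(per_bank.values()))

print(worst)  # -> 2: at most two accesses per bank per cycle, fits a dual-port SRAM
```

The longest anti-diagonal has 16 cells with consecutive x values, so each of the 8 banks is hit exactly twice, never more; every cycle's reads can therefore be served in parallel with standard two-port memories.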
movement is really critical in allowing you to process things very quickly, and of how specialized hardware can enable that. All right, so one last thing. We've talked about robotics and about deep learning, but there are also a lot of important applications where you can apply efficient processing to help a lot of people around the world, in particular monitoring neurodegenerative disorders. We know that dementias, things like Alzheimer's and Parkinson's, affect tens of millions of people around the world, and that number continues to grow; these are very severe diseases. One of the many challenges for these diseases is that the neurological assessments can be very time-consuming and require a trained specialist. Normally, if you are suffering from one of these diseases, or you might have one, you need to go see a specialist, and they'll ask you a series of questions, like a mini-mental state exam: what year is it, where are you now, can you count backwards, and so on. Or you might be familiar with patients being asked to draw a clock. You can imagine that going to a specialist to do these kinds of tests can be costly and time-consuming, so you don't go very frequently, and as a result the data that's collected is very sparse. It's also very qualitative: if you go to different specialists, they might come up with different assessments, so repeatability is very much an issue as well. What's been super exciting is that it's been shown in the literature that there's actually a quantitative way of evaluating these types of diseases, potentially using eye movements. Eye movements can be used as a quantitative way to evaluate the severity, progression, or regression of these particular diseases; you can imagine doing things like, you
know, if you're taking a certain drug, is your disease getting better or worse? Eye movement can give a quantitative evaluation of that. But the challenge is that to do these eye movement evaluations, you still need to go into a clinic: first, you need a very high-speed camera, which can be very expensive; often you need substantial head support so your head doesn't move and you can really detect the eye movement; and you might even need IR illumination so you can more clearly see the eye. So again, the challenge is that clinical measurements of what we call saccade latency, your eye movement latency or eye reaction time, are done in very constrained environments: you still have to go see the specialist, and they use very specialized and costly equipment. So in the vein of enabling efficient computing and bringing compute to various devices, our question was: can we actually do these eye movement measurements on a phone, which we all have? And indeed you can. You can develop algorithms that detect your eye reaction time using a consumer-grade camera, like the one on your phone or an iPad, and we've shown that you can replicate the quality of results you would get with a Phantom high-speed camera. Shown here in red are eye reaction times measured on a subject with an iPhone 6, which is obviously under $1,000, way cheaper, compared to a Phantom camera shown here in blue; you can see that the distributions of the reaction times are about the same. Why is this exciting? Because it enables us to do low-cost, in-home measurements. You can imagine a patient doing these measurements at home for many days, not just the day they go in, and then bringing in this information, which can give the physician or the specialist additional information to make the assessment. So this can be complementary, but it gives a much richer set of information to do
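To make the measurement concrete, here is a toy sketch of extracting saccade latency from a gaze trace, assuming we already have per-frame horizontal eye positions from the phone camera. The frame rate, stimulus-onset frame, and velocity threshold are made-up illustrative values, not those of the actual system:

```python
# Toy saccade-latency estimator (illustrative, not the published algorithm).
FPS = 240               # hypothetical slow-motion capture rate of the phone
ONSET_FRAME = 30        # frame at which the visual stimulus appears
VELOCITY_THRESH = 2.0   # gaze-position units per frame; made-up threshold

def saccade_latency_ms(gaze_x, onset=ONSET_FRAME, fps=FPS):
    """Latency from stimulus onset to the first frame where the eye moves fast."""
    for f in range(onset + 1, len(gaze_x)):
        if abs(gaze_x[f] - gaze_x[f - 1]) > VELOCITY_THRESH:
            return (f - onset) * 1000.0 / fps   # frames -> milliseconds
    return None                                  # no saccade detected in the trace

# Synthetic trace: fixation, stimulus at frame 30, eye starts moving at frame 78.
trace = [0.0] * 78 + [5.0 * (f - 77) for f in range(78, 90)]
print(saccade_latency_ms(trace))  # -> 200.0 ms
```

A real pipeline would first localize the eye in each frame to produce the gaze trace; the point here is only that the latency itself reduces to simple per-frame arithmetic once that trace exists.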
the diagnosis and evaluation. So we've been talking about computing, but there are also other parts of the system that burn power, in particular when we talk about things like depth estimation using time of flight. Time of flight is very similar to lidar: basically, you send a pulse and wait for it to come back, and how long it takes to come back indicates the depth of whatever object you're trying to detect. The challenge with depth estimation using time-of-flight sensors is that they can be very power-hungry: you're emitting a pulse and waiting for it to come back, so you're talking about up to tens of watts of power. The question is: can we also reduce the sensor power if we can do efficient computing? For example, can I reduce how often I turn on the depth sensor and recover the missing information using just a monocular RGB camera? Typically you have a pair of a depth sensor and an RGB camera; if at time 0 I turn both of them on, and at times 1 and 2 I turn the depth sensor off but keep my RGB camera on, can I estimate the depth at times 1 and 2? The key thing here is to make sure that the algorithm you're running to estimate the depth, without turning on the depth sensor itself, is super cheap. We actually have algorithms that can run on VGA at 30 frames per second on an ARM Cortex-A7, which is a super low-cost embedded processor. Just to give you an idea of how this looks: on the left is the RGB image; in the middle is the depth map, the ground truth, which is what it would look like if I always had the depth sensor on; and on the right-hand side is the estimated depth map. In this particular case, we're only turning on the sensor about eleven percent of the time, every ninth frame, and the mean relative error is only about 0.7 percent, so the accuracy or quality is pretty well aligned. Okay, so at a high level, what are the key takeaways I
want you guys to get from today's lecture? First, efficient computing is really important: it can extend the reach of AI beyond the cloud itself, because it can reduce reliance on the communication network, enable privacy, and provide low latency, so we can use AI for a wide range of applications ranging from robotics to healthcare. And achieving this energy-efficient computing really requires cross-layer design: not just focusing on the hardware, although specialized hardware plays an important role, but also on the algorithms themselves. This is going to be really key to enabling AI for the next decade and beyond. We also covered a lot of points in this lecture, so the slides are all available on our website. Also, because this is a deep learning seminar series, I want to point out some other resources you might be interested in if you want to learn more about efficient processing of neural nets. First, I want to point you to this survey paper that we developed with my collaborator Joel Emer; it really covers the different techniques that people are looking at and gives some insights into the key design principles. We also have a book coming soon, within the next few weeks. We also have slides from various tutorials that we've given on this particular topic; in fact, we also teach a course on this here at MIT, 6.825. If you're interested in updates on all these types of materials, I invite you to join the mailing list or the Twitter feed. The other thing is, if you're not an MIT student but you want to take a two-day course on this particular topic, I invite you to take a look at the MIT Professional Education options: we run short courses on the MIT campus over the summer, so you can come for two days and we can talk about the various different approaches that people use to build efficient deep learning systems. And then finally, if you're interested in video and
tutorial videos on this topic: at the end of November, during NeurIPS, I gave a 90-minute tutorial that goes really in depth on how to build efficient deep learning systems, so I invite you to look at that. We also have some talks at the MARS conference on robotics, and we have a YouTube channel where all of this is located. And then finally, I'd be remiss if I didn't acknowledge that a lot of the work here was done by all the students in our group, as well as my collaborators Joel Emer, Sertac Karaman, and Thomas Heldt, and all of our sponsors who make this research possible. So that concludes my talk. Thank you very much. [Applause]