Efficient Computing for Deep Learning, Robotics, and AI (Vivienne Sze) | MIT Deep Learning Series
WbLQqPw_n88 • 2020-01-23
We have Vivienne Sze here with us. She's a professor here at MIT working in the very important and exciting space of developing energy-efficient and high-performance systems for machine learning, computer vision, and other multimedia applications. This involves the joint design of algorithms, architectures, circuits, and systems to enable optimal trade-offs between power, speed, and quality of results. One of the important differences between the human brain and AI systems is the energy efficiency of the brain, and Vivienne is a world-class researcher at the forefront of discovering how we can close that gap. So please give her a warm welcome.

I'm really happy to be here to share some of our research and an overview of this area, efficient computing. What I'm going to talk about today is going to be a little bit broader than just deep learning: we'll start with deep learning, but we'll also move on to how we might apply this to robotics and other AI tasks, and why it's really important to have efficient computing to enable a lot of these exciting applications. I also want to mention that a lot of the work I'm going to present today was not done by myself but in collaboration with a lot of folks at MIT, and if you want access to the slides, they're available on our website. So, given that
it's the deep learning lecture series, I want to first start out talking a little bit about deep neural nets. We know that deep neural nets have generated a lot of interest and have many very compelling applications, but one of the things that has come to light over the past few years is the increasing need for compute. OpenAI actually showed that there has been a significant increase in the amount of compute required to perform deep learning, and in particular to do the training, over the past few years. It has grown exponentially; in fact, it has grown by over 300,000 times in terms of the amount of compute we need to drive increases in accuracy on a lot of the tasks we're trying to achieve. At the same time, if we start looking at the environmental implications, all of this processing can be quite severe. If we look at, for example, the carbon footprint of training neural nets, and compare it to the carbon footprint of flying across North America from New York to San Francisco, or the carbon footprint of an average human life, you can see that neural networks can be orders of magnitude greater. So the carbon-footprint implications of computing for deep neural nets can be quite severe as well.
now this is a lot having to do with
compute in the cloud another important
area where we want to do compute is
actually moving the compute from the
cloud to the edge itself
into the device where a lot of the data
is being collected so why would we want
to do that so there's a couple of
reasons first of all communication so in
a lot of places around the world and
just even a lot of just placing is
generally you might not have a very
strong communication infrastructure
right so you don't want to necessarily
to rely on a communication network in
order to do a lot of these applications
so again you know removing your
tethering from the cloud is important
another reason is a lot of the times
that we you know apply deep learning on
a lot of applications where the data is
very sensitive so you can think about
things like health care where you're
collecting very sensitive data and so
privacy and security again is really
critical and you would rather than
sending the data to the cloud you'd like
to bring the compute to the data itself
finally another compelling reason for
you know bringing the compute into the
device or into the robot is latency so
this is particularly true for
interactive applications so you can
think of things like autonomous
navigation robotics or self-driving
vehicles where you need to interact with
the real world you can imagine if you're
driving very quickly down the highway
and you detect an obstacle you might not
have enough time to send the data to the
cloud wait for it to be processed and
send the instruction back in so again
you want to move the compute into the
robot or into the vehicle itself okay so
hopefully this establishes why we want to move the compute to the edge. But one of the big challenges of doing processing in the robot or in the device has to do with power consumption itself. If we take the self-driving car as an example, it's been reported that it consumes over 2,000 watts of power just for the computation, just to process all the sensor data that it's collecting. This generates a lot of heat and takes up a lot of space: you can see in this prototype that all the compute is being placed in the trunk, and it often needs water cooling. So this can be a big cost and a logistical challenge for self-driving vehicles. You can imagine that this becomes much more challenging if we shrink down the form factor of the device to something that is portable, in your hands: think of smaller robots, or something like your smartphone. For these kinds of portable devices you have very limited energy capacity, because the battery itself is limited in terms of its size, weight, and cost, so you can't have a very large amount of energy on the device. Secondly, if you take a look at the embedded platforms currently used for embedded processing in these applications, they tend to consume over 10 watts, which is an order of magnitude higher than the power consumption you would typically allow for these handheld devices. In handheld devices you're typically limited to under a watt due to heat dissipation; for example, you don't want your cell phone to get super hot. Okay, so, in the past
decade or so, what we would do to address this challenge is wait for transistors to become smaller, faster, and more efficient. However, this has become a problem in recent years: transistors are not getting more efficient. Moore's law, which typically makes transistors smaller and faster, has been slowing down, and Dennard scaling, which made transistors more efficient, has also slowed down, or ended; you can see here that over the past ten years this trend has really flattened out. This is a particular challenge because we want more and more compute to drive deep neural network applications, but the transistors are not becoming more efficient. So what we need to turn to in order to address this is specialized hardware, to achieve the significant speed and energy-efficiency improvements that we require for a particular application. When we talk about designing specialized hardware, this is really about thinking about how we can redesign the hardware from the ground up, targeted particularly at these AI, deep learning, and robotics tasks that we're really excited about.
Okay, so this notion is not new; in fact, it has become extremely popular over the past few years. There have been a large number of startups and companies focused on building specialized hardware for deep learning. In fact, The New York Times reported, I guess two years ago, that there was a record number of startups looking at building specialized hardware for AI and deep learning. So we'll talk a little bit about what specialized hardware looks like for these particular applications.

Now, if you really care about energy and power efficiency, the first question you should ask is where the power is actually going in these applications. As it turns out, power is dominated by data movement. It's actually not the computations themselves that are expensive, but moving the data to the computation engine. For example, shown here in blue is a range of energy consumption for a variety of types of computation, for example multiplications and additions at various precisions, from floating point down to fixed point and integer. As you would expect, as you scale down the precision, the energy consumption of each of these operations is reduced. But what's really surprising is what you see lower down: the energy consumption of data movement, that is, delivering the input data to the multiplication and then moving the output of the multiplication somewhere into memory. It can be very expensive. For example, a 32-bit read from an 8-kilobyte SRAM, a very small memory that you would have on the processor or chip itself, already consumes 5 picojoules of energy, equivalent to or even more than a 32-bit floating-point multiply, and that's from a very small memory. If you need to read the data from off chip, outside the processor, for example from DRAM, it's going to be even more expensive; in this particular case we're showing 640 picojoules. And notice that the horizontal axis here is logarithmic, so we're talking about orders of magnitude of increase in energy for data movement compared to the compute itself. This is the key takeaway: if we really want to address the energy consumption of these types of processing, we really want to look at reducing data movement.
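To make the scale concrete, here's a tiny sketch using the two figures quoted above (illustrative numbers only; actual energies depend heavily on the process technology and memory design):

```python
# Illustrative per-access energies in picojoules, as quoted in the talk;
# real values vary with the process node and memory implementation.
energy_pj = {
    "32b SRAM read (8KB, on-chip)": 5.0,
    "32b DRAM read (off-chip)": 640.0,
}

ratio = (energy_pj["32b DRAM read (off-chip)"]
         / energy_pj["32b SRAM read (8KB, on-chip)"])
print(f"an off-chip DRAM read costs {ratio:.0f}x a small on-chip SRAM read")
```

Even against a tiny on-chip memory, the off-chip access is two orders of magnitude more expensive, which is why the rest of the talk focuses on data movement.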
But what's the challenge here? If we take a look at a popular AI or robotics application like autonomous navigation, the real challenge is that these applications use a lot of data. For example, one of the things you need to do in autonomous navigation is what we call semantic understanding: you need to be able to identify which pixel belongs to what. In this scene, you need to know that this pixel represents the ground, this pixel represents the sky, and this pixel represents a person. For this important type of processing, if you're traveling quickly you want to run at a very high frame rate, and you might need a large resolution: for HD images, you're typically talking about 2 million pixels per frame. And if you also want to detect objects at different scales, or see objects that are far away, you need to do what we call data expansion, for example building an image pyramid, and this can increase the number of pixels, the amount of data you need to process, by one or two orders of magnitude. So that's a huge amount of data you have to process, right off the bat. Another type of understanding you want for autonomous navigation is what we call geometric understanding: as you navigate, you want to build a 3D map of the world around you, and you can imagine that the longer you travel, the larger the map you're going to build, and again that's more data you're going to have to process and compute on. So this is a significant challenge for autonomous navigation in terms of the amount of data. Other aspects of autonomous navigation, and also other applications like AR and VR, involve understanding your environment. A typical thing you might need to do is depth estimation: given an image, can you estimate how far away a given pixel is? And also semantic segmentation, which we just talked about. These are important ways of understanding your environment when you're trying to navigate.

It should be no surprise to you that the state-of-the-art approaches for these types of processing utilize deep neural nets. But the challenge is that these deep neural nets often require several hundred million operations and weights to do the computation. If you compare that to something you already have on your phone, for example video compression, you're talking about a two to three orders of magnitude increase in computational complexity. This is a significant challenge, because if we'd like deep neural networks to be as ubiquitous as something like video compression, we really have to figure out how to address this computational complexity. We also know that deep neural networks are not just used for understanding the environment in autonomous navigation; they've really become the cornerstone of many AI applications, from computer vision to speech recognition, game play, and even medical applications, and I'm sure a lot of these have been covered in this course. So I'm just going to give a quick overview of some of the key components of deep neural nets, not because I doubt that all of you understand them, but because this area is so popular that the terminology can vary from discipline to discipline, so a brief overview will align us on the terminology. So, what are deep
neural nets? Basically, you can view a deep neural net as a way of, for example, understanding the environment. It's a chain of different layers of processing, where you can imagine that for an input image, at the earlier parts of the neural net you're trying to learn different low-level features, such as the edges in an image, and as you get deeper into the network, as you chain more of these computational layers together, you start being able to detect higher- and higher-level features, until you can, say, recognize a vehicle. The difference between this approach and more traditional ways of doing computer vision is that how we extract these features is learned from the data itself, as opposed to having an expert come in and say, hey, look for the edges, look for the wheels, and so on; the fact that these features are learned from data is the key to this approach.

Okay, what is it doing at each of these layers? It's actually a very simple computation (looking at the inference side of things): effectively, a weighted sum. You have the input values, which we'll color-code in blue here and try to stay consistent with throughout the talk; we apply certain weights to them, which are learned from the training data; and they generate an output, which is typically shown in red. It's basically a weighted sum, as you can see. We then pass this weighted sum through some form of nonlinearity: traditionally these used to be sigmoids, and more recently we use things like ReLUs, which basically set negative values to zero. The key takeaway here is that if you look at this computational kernel, the key operation in a lot of these neural networks is performing this multiply-and-accumulate to compute the weighted sum, and it accounts for over 90% of the computation. So if we really want to focus on accelerating neural nets or making them more efficient, we want to focus on minimizing the cost of this multiply-and-accumulate itself.
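As a minimal sketch, the weighted sum followed by a ReLU looks like this in plain Python (a toy illustration of the kernel, not how an accelerator actually implements it; the function name is mine):

```python
# One "neuron": a weighted sum computed as repeated multiply-and-accumulate
# (MAC) operations, followed by a ReLU nonlinearity.
def neuron(inputs, weights, bias=0.0):
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w          # the multiply-and-accumulate (MAC) operation
    return max(0.0, acc)      # ReLU: negative values are clamped to zero

print(neuron([1.0, 2.0, 3.0], [0.5, -1.0, 0.5]))  # 0.5 - 2.0 + 1.5 -> 0.0
```

Every output activation in the network is computed this way, which is why the MAC dominates the operation count.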
There are also various popular types of layers used in deep neural networks. They often vary in terms of how you connect up the different layers. For example, you can have feed-forward layers, where the inputs always flow toward the outputs; you can have feedback, where the outputs are connected back into the inputs; you can have fully connected layers, where all the outputs are connected to all the inputs; or sparsely connected layers. You might be familiar with some of these. Fully connected layers, just like we talked about, have all inputs connected to all outputs; they tend to be feed-forward, and when you put them together they're typically referred to as a multilayer perceptron. You have convolutional layers, which are also feed-forward but are sparsely connected, with weight-sharing connections; when you put them together they're referred to as convolutional neural networks, and they're typically used for image-based processing. You have recurrent layers, where there's a feedback connection, so the output is fed back to the input; when we combine recurrent layers they're referred to as recurrent neural nets, and these are typically used to process sequential data, so speech- or language-based processing. And then, most recently, what has become really popular are attention layers, or attention-based mechanisms; they often involve a matrix multiply, which again is multiply-and-accumulate, and when you combine these they're often referred to as transformers.
Okay, so let's first get an idea of why convolutional or deep learning is so much more computationally complex than other types of processing. We'll focus on convolutional neural nets as an example, although many of these principles apply to other types of neural nets. The first thing to look at is the computational kernel: how does it actually perform the convolution? Let's say you have a 2D input; at the input of the neural net it would be an image, while deeper in the neural net it would be an input feature map, composed of activations (or, for an image, pixels). We convolve it with, let's say, a 2D filter, which is composed of weights. In a typical convolution, you do an element-wise multiplication of the filter weights with the input feature map activations and sum them all together to generate one output value, which we refer to as an output activation. Then, because it's a convolution, we slide the filter across the input feature map and generate all the other output feature map activations. This kind of 2D convolution is pretty standard in image processing; we've been doing it for decades.
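Here's a minimal sketch of that sliding-window computation in plain Python (function and variable names are mine; as in most deep learning frameworks, this is cross-correlation, i.e. the filter is not flipped):

```python
# Slide an R x S filter over an H x W input feature map; each output
# activation is one filter-sized element-wise multiply-and-sum.
def conv2d(ifmap, filt):
    H, W = len(ifmap), len(ifmap[0])
    R, S = len(filt), len(filt[0])
    out = [[0] * (W - S + 1) for _ in range(H - R + 1)]
    for y in range(H - R + 1):
        for x in range(W - S + 1):
            # one output activation = one weighted sum over the window
            out[y][x] = sum(ifmap[y + i][x + j] * filt[i][j]
                            for i in range(R) for j in range(S))
    return out

print(conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]]))
```

A 3x3 input with a 2x2 filter yields a 2x2 output feature map, one value per valid filter position.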
What makes convolutional neural nets much more challenging is the increase in dimensionality. First of all, rather than doing just one 2D convolution, we often stack multiple channels; there's a third dimension called channels, and we need to do a 2D convolution on each of the channels and then add them all together. For an input image, these channels would be the red, green, and blue components, for example, and as you get deeper into the network the number of channels can increase: if you look at AlexNet, which is a popular neural net, the number of channels ranges from 3 to 192. So that already increases the dimensionality of the neural net in terms of processing. Another dimension along which we grow is that we actually apply multiple filters to the same input feature map. For example, you might apply M filters to the same input feature map, and then you would generate an output feature map with M channels: in the previous slide we showed that convolving one 3D filter generates one channel of the output feature map, so if we apply M filters, we're going to generate M output channels in the output feature map. Again, to give you an idea of the scale, in AlexNet we're talking about between 96 and 384 filters, and this increases to thousands for other, more modern neural nets. Finally, you often want to process more than one image at a given time, so we can extend this further: N input images become N output images, or N input feature maps become N output feature maps. We typically refer to this N, the number of images you're processing at the same time, as the batch size, and it can range from 1 to 256.
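Putting these dimensions together, here is a minimal sketch (a helper of my own, assuming stride 1 and no padding; real layers also use strides and padding) of how the input, channel, filter, and batch dimensions determine the output shape and the MAC count:

```python
# N images of C channels, H x W each, convolved with M filters of size
# C x R x S, produce N output feature maps of M channels each.
def conv_layer_shapes(N, C, H, W, M, R, S):
    E, F = H - R + 1, W - S + 1        # output height and width
    out_shape = (N, M, E, F)           # batch, channels, height, width
    # each output activation is one weighted sum over C * R * S inputs
    macs = N * M * E * F * C * R * S
    return out_shape, macs

# Hypothetical layer: one RGB image through 96 filters of size 3 x 11 x 11.
shape, macs = conv_layer_shapes(N=1, C=3, H=227, W=227, M=96, R=11, S=11)
print(shape, f"{macs / 1e6:.0f}M MACs")
```

Note how every dimension multiplies into the MAC count, which is why these layers get so expensive as networks grow.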
So these are all the various dimensions of the neural net, and really what someone does when they define what we call the network architecture is select, or define, the shape of the neural network for each of the different layers, that is, define all of these different dimensions. These shapes can vary across the layers. To give you an idea, if you look at MobileNet as an example, a very popular neural net, you can see that the filter sizes (the height and width of the filters), the number of filters, and the number of channels vary across the different blocks or layers. The other thing I want to mention is that when we look at popular DNN models, we can see important trends. Shown here are various models developed over the years that are quite popular, and there are a couple of interesting trends. One is that the networks tend to become deeper; you can see the convolutional layers getting deeper and deeper. The number of weights they use and the number of MACs are also increasing. So this is an important trend: DNN models are getting larger and deeper, and so they're becoming much more computationally demanding, and we need more sophisticated hardware to be able to process them. All right, so that's a quick introductory overview of the deep neural network space; I hope we're all aligned. The first thing I'm going to talk about is how we can actually build hardware to make the processing of these neural networks more efficient and faster; we often refer to this as hardware acceleration. All right,
so, we know these neural networks are very large and there's a lot of compute, but are there properties we can leverage to make processing these networks more efficient? The first thing, which is really friendly, is that they exhibit a lot of parallelism: all these multiplies and accumulates can actually be done in parallel. That's great, because it means high throughput, or high speed, is possible. What is difficult, and what should not be a surprise, is that the memory accesses are the bottleneck; delivering the data to the multiply-and-accumulate engine is what's really challenging. I'll give you some insight into why this is the case. Take this multiply-and-accumulate engine, what we call a MAC. It takes in three inputs for every MAC: the filter weight; the input image pixel, or, if you're deeper in the network, the input feature map activation; and the partial sum, which is the partially accumulated value from the previous multiply. It then generates an updated partial sum. So for every computation, for every MAC that you do, you need four memory accesses; it's a four-to-one ratio of memory accesses to compute. The other challenge, as we mentioned, is that moving data is going to be very expensive. In the absolute worst case, which you would always try to avoid, you read the data from DRAM, the off-chip memory, and every DRAM access is going to be two orders of magnitude more expensive than the computation of performing the MAC itself. That's really, really bad: if you look at AlexNet, which has around 700 million MACs, we're talking about roughly three billion DRAM accesses to do that computation. Okay,
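As a back-of-envelope sketch of that worst case (the MAC count is an approximate figure for AlexNet; treat it as illustrative):

```python
# Worst case described above: every MAC reads the weight, the input
# activation, and the partial sum, and writes the updated partial sum,
# i.e. four memory accesses per MAC, all served from DRAM.
ACCESSES_PER_MAC = 4            # 3 reads + 1 write
ALEXNET_MACS = 724e6            # ~700 million MACs (approximate figure)

dram_accesses = ALEXNET_MACS * ACCESSES_PER_MAC
print(f"~{dram_accesses / 1e9:.1f} billion DRAM accesses")
```

At 200x the energy of a MAC per access, this is exactly the scenario the memory hierarchy discussed next is designed to avoid.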
But again, all is not lost; there are some things we can exploit to help with this problem. One is what we call input data reuse opportunities: a lot of the data we read to perform these multiplies and accumulates is actually used by many multiplies and accumulates, so if we read the data once, we can reuse it multiple times across many operations. I'll show you some examples. First is what we call convolutional reuse: remember, we're taking a filter and sliding it across the input image, so as a result the activations from the feature map and the weights from the filter are going to be reused in different combinations to compute the different MACs. So there are a lot of convolutional reuse opportunities there. Another example: recall that we're going to apply multiple filters to the same input feature map, which means each activation in that input feature map can be reused multiple times, across the different filters. Finally, if we're going to process many images, or many feature maps, at the same time, a given weight in the filter can be reused multiple times across those input feature maps; that's what we call filter reuse. So there are a lot of these great reuse opportunities in the neural network itself.
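As a rough sketch, we can count these reuse opportunities for a single layer (a simplified model of my own, assuming stride 1 and no padding; interior activations see the maximum reuse, border ones slightly less):

```python
# Count how many MACs each piece of data participates in for one conv
# layer: N images, M filters of size R x S, H x W input feature maps.
def reuse_counts(N, M, H, W, R, S):
    E, F = H - R + 1, W - S + 1                 # output feature map size
    return {
        # a weight is used at every output position (convolutional reuse)
        # and for every image in the batch (filter reuse):
        "uses per weight": E * F * N,
        # an interior activation is used by up to R*S output positions
        # and by all M filters (input feature map reuse):
        "uses per activation (max)": R * S * M,
    }

print(reuse_counts(N=1, M=96, H=227, W=227, R=11, S=11))
```

Even with a batch of one, each weight and activation is touched tens of thousands or thousands of times, which is exactly what a memory hierarchy can exploit.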
So what can we do to exploit these reuse opportunities? What we can do is build what we call a memory hierarchy, which contains very low-cost memories that allow us to reduce the overall cost of moving this data. What do we mean here? We mean that if I build a multiply-and-accumulate engine, I'm going to place a very small memory right beside it, and by small I mean something on the order of under a kilobyte of memory, locally, beside that multiply-and-accumulate engine. Why do I want that? Because accessing that very small memory can be very cheap. For example, if performing a multiply-and-accumulate with the ALU costs 1x in energy, reading from this very small memory beside the multiply-and-accumulate engine is going to cost about the same amount of energy. I can also allow these processing elements (a processing element being the multiply-and-accumulate unit plus the small memory) to share data with each other, and reading from a neighboring processing element is going to be about 2x the energy. Then, finally, you can have a larger shared memory called a global buffer, shared across all the different processing elements; this tends to be larger, between 100 and 500 kilobytes, and more expensive, at about 6x the energy. And of course, if you go off chip to DRAM, that's going to be the most expensive, at about 200x the energy.
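The relative costs just described can be sketched as a toy energy model (the multipliers are the ones quoted above, normalized to one MAC; absolute values vary by design):

```python
# Energy per access at each level of the hierarchy, in MAC-equivalents,
# using the multiples quoted in the talk.
RELATIVE_ENERGY = {
    "local PE memory (<1KB)": 1,    # small memory beside the MAC unit
    "neighboring PE": 2,
    "global buffer (100-500KB)": 6,
    "off-chip DRAM": 200,
}

def access_energy(counts):
    """Total data-movement energy (MAC-equivalents) for a mix of accesses."""
    return sum(RELATIVE_ENERGY[level] * n for level, n in counts.items())

# Serving 100 accesses locally vs. entirely from DRAM (hypothetical mix):
print(access_energy({"local PE memory (<1KB)": 100}))  # 100
print(access_energy({"off-chip DRAM": 100}))           # 20000
```

The 200x gap between the two calls is the whole motivation for keeping reused data as close to the MAC engine as possible.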
So the big issue here, the way you can think about it, is that what you would ideally like to do is access all of the data from this very small local memory. But the challenge is that this very small local memory is only one kilobyte, while we're talking about neural networks with millions of weights. So how do we go about doing that? As an analogy for you to think through: accessing something from, let's say, your backpack is going to be much cheaper than getting it from your neighbor, or going back to your office somewhere on campus to get it, or going all the way back home. Ideally you'd like to access all of your data from your backpack, but if you have a lot of work to do, you might not be able to fit it all in your backpack. So the question is: how can I break up my large piece of work into smaller chunks so that I can access them all from this small memory? That's the big challenge, and there's been a lot of research in this area on the best way to break up the data and on what should be stored in this very small local memory.
so one approach is what we call a weight
stationary and the idea here is I'm
gonna store the weight information of
the neural net into this small local
memory okay and so as a result I really
minimize the weight energy but the
challenge here is that the other types
of data that you have in your system so
for example your input activations show
in the blue and then the partial sums
are shown in the red now those still
have to move through the rest of the
system itself so through their
networking from the global buffer okay
our typical types of work that are
popular that use this type of kind of
data flow or weight stationary data flow
which will be call because the weight
remains stationary are things like the
TPU from Google and the envy de la
accelerator from it video another
approach that people take or they will
they say well so the weight I only have
her have to read it but the partial sums
I have to read it and write it because
the partials I'm going to read
accumulate like update it and then write
it back to them so there's two memory
accesses for that partial sum data type
so what maybe I should put that partial
sum locally into that small memory
itself so this is what we call output
stationary because the accumulation of
the output is going to be local within
that one processing element that's not
going to move the trade-off of course is
the activations of weights now have to
move through the network and then
there's various different works called
like for example so we're from Katie
Leuven and some work from the Chinese
Academy of Sciences that are using this
approach another piece of work is saying
well you know forget about the inputs
and the or so the outputs and the wastes
themselves let's keep the input
stationary within this small membrane
it's called input stationary and some of
the work again from some research work
from Nvidia has examined this but all of
these different types of work really
focus on you know not moving one piece
of type of data right either focus on
minimizing weight energy or a partial
sum energy or input energy I think
what's important to think about is that
maybe you want to
reduce the data movement of all
different data types all types of energy
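As a rough sketch of how two of these dataflows differ, here are the corresponding loop orderings for a 1-D convolution in plain Python (a simplification of my own; real accelerators map these loops onto parallel hardware and the memory hierarchy, but the ordering determines which data stays put):

```python
# Weight stationary: each weight is fetched once and held while it is
# applied to every input it touches; partial sums move instead.
def weight_stationary(inputs, weights):
    out = [0.0] * (len(inputs) - len(weights) + 1)
    for s, w in enumerate(weights):        # outer loop: weights stay put
        for x in range(len(out)):          # inner loop: outputs updated
            out[x] += w * inputs[x + s]
    return out

# Output stationary: each partial sum stays local until fully
# accumulated; weights and inputs stream past it.
def output_stationary(inputs, weights):
    out = []
    for x in range(len(inputs) - len(weights) + 1):   # outer loop: outputs
        acc = 0.0
        for s, w in enumerate(weights):               # inner loop: weights
            acc += w * inputs[x + s]
        out.append(acc)
    return out

print(weight_stationary([1, 2, 3, 4], [1, 1]))  # [3.0, 5.0, 7.0]
print(output_stationary([1, 2, 3, 4], [1, 1]))  # [3.0, 5.0, 7.0]
```

Both orderings compute the same result; what changes is which operand is reused from the innermost (cheapest) memory.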
Right, so another approach, something we've developed within our own group, is what we call the row stationary data flow. Within each of the processing elements you're going to do one row of convolution, and this row is a mixture of all the different data types: you have filter information (so the weights of the filter), you have the activations of your input feature map, and then you also have your partial-sum information. So you're really trying to balance the data movement of all the different data types, not just one particular data type. This is just performing one row, but we just talked about the fact that a neural network is much more than a 1D convolution, so you can imagine expanding this to higher dimensions. This is just showing how you might expand this 1D convolution into a 2D convolution, and then there are other, higher dimensionalities that the architecture handles as well. I won't go through the details of this, but the key takeaway here is that you might not want to focus on one particular data type; you want to actually optimize for all the different types of data that you're moving around in your system. OK, and this just shows you some results in terms of how these different types of data flows perform. For example, in the weight stationary case, as expected, the weight energy (the energy required to move the weights, shown in green) is going to be the lowest, but then the red portion, which is the energy of the partial sums, and the blue part, which is the input feature map or input pixels, are going to be very high. Output stationary is another approach; as we talked about, you're trying to reduce the data movement of the partial sums, shown here in red, so the red part is really minimized, but you can see that the green part, which is the weight data movement, is going to be increased, and the blue, the inputs, is going to be increased. There's another option called no local reuse; we don't have time to talk about that, but you can see that row stationary, for example, really aims to balance the data movement of all the different data types. So the big takeaway here is that when you're trying to optimize a given piece of hardware, you don't want to just optimize for one particular type of data; you want to optimize overall for all the movement in the hardware itself.
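The row-stationary idea just described can be sketched in code. This is a simplified illustration, not the actual Eyeriss implementation: `pe_row_conv1d` plays the role of one processing element keeping a filter row, an input row, and its partial sums local, and `conv2d_from_rows` shows how the per-row results combine into a 2D convolution.

```python
# Sketch of the row-stationary data flow (illustrative, not Eyeriss itself).
def pe_row_conv1d(filter_row, input_row):
    """One PE: 1D convolution of one filter row over one input row.
    Weights, inputs, and partial sums all stay local for the whole row."""
    R = len(filter_row)             # filter row width
    W = len(input_row)              # input row width
    out = [0] * (W - R + 1)         # partial-sum row, local to the PE
    for x in range(W - R + 1):      # slide the filter across the input row
        for i in range(R):
            out[x] += filter_row[i] * input_row[x + i]
    return out

def conv2d_from_rows(filt, image):
    """2D (valid) convolution assembled from per-row 1D convolutions:
    the partial-sum rows from R PEs are accumulated for each output row."""
    R = len(filt)
    H = len(image)
    out = []
    for y in range(H - R + 1):
        acc = pe_row_conv1d(filt[0], image[y])
        for r in range(1, R):       # each filter row maps to one PE
            psum = pe_row_conv1d(filt[r], image[y + r])
            acc = [a + b for a, b in zip(acc, psum)]
        out.append(acc)
    return out
```

The point of the structure is that no single data type is favored: each PE reuses its filter row, its input row, and its partial-sum row locally before anything moves.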
OK, another thing that you can also exploit to save a bit of power is the fact that some of the data could be zero. We know that anything multiplied by zero is going to be zero, so if you know that one of the inputs to your multiply-and-accumulate is going to be zero, you might as well skip that multiplication; in fact, you might as well skip accessing the other input to that multiply-and-accumulate engine. By doing that, you can actually reduce the power consumption by almost 50 percent.
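As a rough sketch of that zero-gating idea (the counters are only for illustration; real hardware gates the datapath rather than branching in software):

```python
def gated_mac(activations, weights):
    """MAC with zero gating: skip the multiply (and the corresponding
    weight fetch) whenever the activation is zero."""
    acc = 0
    mults = 0      # multiplies actually performed
    skipped = 0    # multiplies and weight fetches avoided
    for a, w in zip(activations, weights):
        if a == 0:
            skipped += 1
            continue
        mults += 1
        acc += a * w
    return acc, mults, skipped

# With half the activations zero, half the multiplies and weight
# accesses are skipped, which is where the power saving comes from:
result = gated_mac([0, 2, 0, 3], [5, 5, 5, 5])   # (25, 2, 2)
```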
Another thing that you can do is that if you have a bunch of zeros, you can also compress the data. For example, you can use things like run-length encoding, where basically a run of zeros is represented compactly: rather than zero, zero, zero, zero, zero, you can just say there's a run of five zeros. This can actually reduce the amount of data movement by up to 2x in your system. And in fact, in neural nets there are many opportunities to generate zeros: first of all, if you remember, the ReLU sets negative values to zero, so it naturally generates zeros, and then there are other techniques, for example what we call pruning, which is setting some of the weights of the neural network to zero. So this can exploit all of that.
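A minimal sketch of the run-length encoding just described, where a run of zeros becomes a single `(0, count)` token; the exact bit-level format used on the chip differs, this just illustrates the idea:

```python
def rle_zeros(values):
    """Compress runs of zeros into (0, run_length) tokens."""
    out = []
    run = 0
    for v in values:
        if v == 0:
            run += 1
        else:
            if run:
                out.append((0, run))   # one token for the whole zero run
                run = 0
            out.append(v)
    if run:
        out.append((0, run))
    return out

def rle_decode(tokens):
    """Expand (0, run_length) tokens back into explicit zeros."""
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            out.extend([0] * t[1])
        else:
            out.append(t)
    return out

# Five zeros collapse into one token, so less data moves through the system:
encoded = rle_zeros([7, 0, 0, 0, 0, 0, 3])   # [7, (0, 5), 3]
```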
OK, so what is the impact of all these types of things? We actually looked at building hardware, in particular a customized chip that we called Eyeriss, to demonstrate these particular approaches, in particular the row stationary data flow and exploiting sparsity in the activation data. This Eyeriss chip has 14 by 12, so 168, processing elements; you can see that there's a shared buffer that's 100 kilobytes, and it has some compression and decompression for when data goes to off-chip DRAM, and again that's because accessing DRAM is the most expensive. Shown here on the right-hand side is a die photo of the fabricated chip itself, and this is 4 millimeters by 4 millimeters in terms of size. Using that row stationary data flow, it exploits a lot of data reuse, so it actually reduces the number of times we access this global buffer by a hundred x, and it also reduces the number of times we access the off-chip memory by over a thousand x. This is all because each of these processing elements has a local memory that it tries to read most of its data from, and it's also sharing data with other processing elements. So overall, when you compare it to a mobile GPU, you're talking about an order of magnitude reduction in energy consumption. If you'd like to learn a little bit more about that, I invite you to visit the Eyeriss project website. OK, so
this is great, we can build custom hardware, but what does this actually mean in terms of building a system that can efficiently compute neural nets? Let's take a step back: let's say we don't care anything about the hardware, we're a systems provider, we want to build an overall system, and what we really care about is the trade-off between energy and accuracy. That's the key thing we care about. Shown here is a plot; let's say this is for an object detection task. Accuracy is on the x-axis, and it's listed in terms of average precision, which is a metric that we use for object detection; it's on a linear scale, and higher is better. Vertically we have energy consumption; this is the energy that's being consumed per pixel, so you kind of average it (you can imagine a higher-resolution image would consume more energy), and it's on an exponential scale. So let's first start on the accuracy axis. Before neural nets had their resurgence around 2011-2012, state-of-the-art approaches actually used features called histogram of oriented gradients, a very popular approach known to be very efficient while being quite accurate for object detection, and we refer to it as HOG. The reason why neural nets really took off is because they really improved the accuracy: you can imagine AlexNet here almost doubled the accuracy, and then VGG further increased the accuracy, so it's super exciting there. But we want to look also at the vertical axis, which is the energy consumption, and I should mention that for each of these dots, the energy numbers are measured on specialized hardware that's been designed for that particular task. So we have a chip here, built in a 65-nanometer CMOS process, that does object detection using the HOG features, and then here's the Eyeriss chip that we just talked about; both use the same transistors of around the same size. I should also note that both of these chips were built in my group; the students who built these chips started designing them at the same time and taped out at the same time, so it's somewhat of a controlled experiment in terms of optimization. OK, so what
does this tell us? When we look on the energy axis, we can see that histogram of oriented gradients, or HOG, features are actually very efficient from an energy point of view. In fact, if we compare it to something like video compression, again something that you all have in your phone, HOG features are actually more efficient than video compression, meaning for the same energy that you would spend compressing a pixel, you could actually understand that pixel. That's pretty impressive. But if we start looking at AlexNet or VGG, we can see that the energy increases by two to three orders of magnitude, which is quite significant. I'll give you an example: if I told you that on your cell phone I'm going to double the accuracy of its recognition, but your phone would die three hundred times faster, who here would be interested in that technology? Right, exactly, nobody. That's because battery life is so critical to how we actually use these types of technologies. So we should not just look at the accuracy, the x-axis point of view; we should really also consider the energy consumption, and we really don't want the energy to be so high. We can see that even with specialized hardware we're still quite far away from making neural nets as efficient as something like video compression that you all have on your phones, so we really have to think about how we can further push the energy consumption down, without sacrificing accuracy of course. OK, so actually
there's been a huge amount of research in this space, because we know neural nets are popular and we know that they have a wide range of applications, but energy is really a big challenge. So people have looked at how we can design new hardware that can be more efficient, or how we can design algorithms that are more efficient, to enable energy-efficient processing of DNNs. In fact, within our own research group we spent quite a bit of time surveying the area, understanding the various different types of developments that people have been looking at, so if you're interested in this topic, we actually generated various tutorials on this material as well as overview papers. This is an overview paper that's about 30 pages, and we're currently expanding it into a book, so if you're interested in this topic I would encourage you to visit these resources. But the main thing that we learned as we were doing this survey of the area is that we identified various limitations in terms of how the research is approaching this problem. So first let's
look at the algorithm side. There again, there's a wide range of approaches that people are using to try and make the DNN algorithms or models more efficient. For example, we mentioned the idea of pruning: the idea here is you're going to set some of the weights to zero, and again, anything times zero is zero, so you can skip those operations; there's a wide range of research there. People are also looking at efficient network architectures, meaning rather than making my neural networks very large, with these big three-dimensional convolutions, can I decompose them into smaller filters? So rather than this 3D filter, can I make it a 2D filter, and perhaps follow it with a one-by-one convolution across the channels? Another very popular thing is reduced precision: rather than using the default of 32-bit float, can I reduce the number of bits down to eight bits or even binary? We saw before that as we reduce the precision of these operations, you get energy savings, and you also reduce data movement, because you have to move less data. A lot of this work really focuses
on reducing the number of MACs (multiply-and-accumulates) and the number of weights, primarily because those are easy to count, but the question that we should be asking, if we care about the system, is: does this actually translate into energy savings and reduced latency? Because from a systems point of view, those are the things that we care about. When you're thinking about something running on your phone, you don't care about the number of MACs and weights; you care about how much energy it's consuming, because that's going to affect the battery life, and how quickly it might react, which is basically a measure of latency. And again, hopefully you haven't forgotten, but basically data movement is expensive, so it really depends on how you move the data through the system. The key takeaway from this slide is that, if you remember where the energy comes from, which is the data movement, it's not about how many weights or how many MACs you have; it really depends on where the weight comes from. If it comes from a small memory, a register file that's nearby, it's going to be super cheap, as opposed to coming from off-chip. So all weights are basically not created equal, all MACs are not created equal; it really depends on the memory hierarchy and the data flow of the hardware itself.
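To make the "not all MACs are created equal" point concrete, here is a toy model with illustrative relative access costs. The numbers are rough, commonly cited normalized estimates (local register file = 1x), not measurements of any particular chip:

```python
# Illustrative relative energy cost per operand access, normalized to a
# local register file. Rough, commonly cited orders of magnitude only.
ACCESS_COST = {
    "register_file": 1,    # small local memory inside the PE
    "neighbor_pe":   2,    # fetched from an adjacent processing element
    "global_buffer": 6,    # shared on-chip SRAM
    "dram":          200,  # off-chip memory: by far the most expensive
}

def mac_energy(num_macs, source):
    """Relative data-movement energy when operands come from `source`."""
    return num_macs * ACCESS_COST[source]

# The same number of MACs can cost wildly different energy depending on
# where the data lives in the memory hierarchy:
local_cost = mac_energy(1000, "register_file")   # 1000 units
dram_cost = mac_energy(1000, "dram")             # 200000 units
```

This is why counting weights or MACs alone cannot predict energy: the data flow decides which row of this table each access falls into.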
OK, so we can't just look at the number of weights and the number of MACs and estimate how much energy is going to be consumed; this is quite a difficult challenge.
So within our group we've actually looked at developing different tools that allow us to estimate the energy consumption of the neural network itself. For example, in this particular tool, which is available on our website, we basically take in the DNN weights and the input data, including its sparsity; we know the shapes of the different layers of the neural net, and we run an optimization that figures out the memory accesses, how much energy is consumed by the data movement, and then the energy consumed by the multiply-and-accumulate computations. The output is going to be a breakdown of the energy for the different layers, and once you have this, you can figure out where the energy is going, so I can target my design to minimize that energy consumption. OK, and by doing this, it should be no surprise that one of the key observations from this exercise is that the weights alone are not a good metric for energy consumption. If you take a look at GoogLeNet, for example, running on the Eyeriss architecture, you can see that the weights only account for 22% of the overall energy; in fact, a lot of the energy goes into moving the input feature maps and the output feature maps, and also into computation. So in general, this is the same message as before: we shouldn't just look at the data movement of one particular data type; we should look at the energy consumption of all the different data types to give us an overall view of where the energy is actually going. OK, and so once we actually know where the energy is going, how can we factor that into the design of the neural networks to make them more efficient? So we talked about the concept
of pruning. Again, pruning is setting some of the weights of the neural net to zero, or you can think of it as removing some of the weights. What we want to do here is, now that we know where the energy is going, why don't we incorporate the energy into the design of the algorithm, for example to guide us in figuring out where we should actually remove the weights from? So, for example, let's say this is on AlexNet, for the same accuracy across the different approaches. Traditionally what happens is that people tend to remove the weights that are small; we call this magnitude-based pruning, and you can see that you get about a 2x reduction in terms of energy consumption. However, we know that the value of the weight has nothing to do with the energy consumption; ideally what you'd like to do is remove the weights that consume the most energy. In particular, we also know that the more weights we remove, the more the accuracy is going to go down, so to get the biggest bang for your buck you want to remove the weights that consume the most energy first. One way you can do this is you can take your neural network, figure out the energy consumption of each of the layers of the neural network, sort the layers from high-energy layers to low-energy layers, and then prune the high-energy layers first. This is what we call energy-aware pruning, and by doing this you actually now get a 3.7x reduction in energy consumption, compared to 2x, for the same accuracy. Again, this is because we factor energy consumption into the design of the neural network itself, and the pruned models are all available on the Eyeriss website. Another important thing that we
care about from a performance point of view is latency. For example, latency has to do with how long it takes, when I give it an image, to get the result back; people are very sensitive to latency. But the challenge here is that latency, again, is not directly correlated with things like the number of multiplies and accumulates. This is some data that was released by Google's mobile vision team; they're showing here on the x-axis the number of multiplies and accumulates (as you go towards the left, it's increasing), and on the y-axis the latency, so this is actually the measured latency, or delay, it takes to get a result. What they're showing here is that the number of MACs is not really a good approximation of latency. In fact, for example, given layers or neural networks with the same number of MACs, there can be a 2x range, or a 2x swing, in terms of latency; or, looking at it a different way, given layers or neural nets of the same latency, they can have a 3x swing in terms of the number of MACs. So the key takeaway here is that you can't just count the number of MACs and say, oh, this is how quickly it's going to run; it's actually much more challenging than that.
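The contrast is easy to see if you compute MAC counts from layer shapes. The helper below is the standard formula, and the two example shapes (hypothetical, not taken from the Google data) have identical MAC counts even though their measured latencies on a real device could differ substantially:

```python
def conv_macs(out_h, out_w, out_ch, k_h, k_w, in_ch):
    """Multiply-and-accumulates for one convolutional layer:
    one k_h x k_w x in_ch dot product per output element."""
    return out_h * out_w * out_ch * k_h * k_w * in_ch

# Two hypothetical layers with the exact same MAC count but very
# different shapes; only on-device measurement reveals their latency.
wide_spatial = conv_macs(56, 56, 64, 3, 3, 64)    # wide feature map
deep_channel = conv_macs(28, 28, 256, 3, 3, 64)   # smaller map, more channels
same = (wide_spatial == deep_channel)              # True
```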
And so what we want to ask is: is there a way that we can take latency and use that, again, to design the neural net correctly? So rather than looking at MACs, use latency. Together with Google's mobile vision team we developed this approach called NetAdapt, and this is really a way that you can tailor your particular neural network for a given mobile platform, for a latency or an energy budget; it automatically adapts the neural net for that platform itself. What's really driving the design is empirical measurements, so measurements of how that particular network performs on that platform, measurements of things like latency and energy. The reason why we want to use empirical measurements is that you often can't generate models for all the different types of hardware out there; in the case of Google, what they want is that if they have a new phone, they can automatically tune the network for that particular phone, without having to model the phone as well. OK, and so how does this work? I'll walk you through it. You'll start off with a pretrained network, so this is a network that's, let's say, trained in the cloud for very high accuracy; great, start off with that, but it tends to be very large. So what you're going to do is you're going to take that into the NetAdapt algorithm, and you're going to take a budget; a budget will tell you, like, oh, I can afford only this amount of latency or this amount of energy. What NetAdapt will do is generate a bunch of proposals, so different options of how it might modify the network in terms of its dimensions; it's going to measure these proposals on that target platform that you care about, and then, based on these empirical measurements, NetAdapt is going to generate a new set of proposals, and it will just iterate on this until it gets an adapted network. OK, and again, all of this is on the NetAdapt
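The loop just described can be sketched as follows. This is a heavily simplified illustration, not the actual NetAdapt algorithm: `shrink_options`, `measure`, and `fine_tune` are stand-ins for the real proposal generation, on-device measurement, and short-term fine-tuning, and the real method selects among proposals by accuracy, not just by measured cost.

```python
def netadapt_style(network, budget, shrink_options, measure, fine_tune):
    """Iteratively shrink `network` until `measure(network) <= budget`,
    driven only by empirical measurements of each proposal."""
    while measure(network) > budget:
        # generate proposals: each option modifies the network differently
        proposals = [opt(network) for opt in shrink_options]
        # keep only proposals that actually reduce the measured cost
        proposals = [p for p in proposals if measure(p) < measure(network)]
        if not proposals:
            break
        # pick the best-measured proposal, then fine-tune and iterate
        network = fine_tune(min(proposals, key=measure))
    return network

# Toy usage: the "network" is a list of layer widths, the measured cost
# is just their sum (a stand-in for on-device latency measurement).
def halve_layer(i):
    return lambda net: net[:i] + [net[i] // 2] + net[i + 1:]

toy = netadapt_style(
    network=[64, 128, 256],
    budget=200,
    shrink_options=[halve_layer(i) for i in range(3)],
    measure=sum,
    fine_tune=lambda n: n,      # real NetAdapt fine-tunes to recover accuracy
)
```

Swapping `measure=sum` for a wall-clock measurement on the target phone is exactly what makes this style of search platform-specific without needing a hardware model.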