We have Vivienne Sze here with us. She's a professor here at MIT working in the very important and exciting space of developing energy-efficient and high-performance systems for machine learning, computer vision, and other multimedia applications. This involves the joint design of algorithms, architectures, circuits, and systems to enable optimal trade-offs between power, speed, and quality of results. One of the important differences between the human brain and AI systems is the energy efficiency of the brain, and Vivienne is a world-class researcher at the forefront of discovering how we can close that gap. So please give her a warm welcome.

I'm really happy to be here to share some of our research and an overview of this area of efficient computing. What I'm going to talk about today is going to be a little bit broader than just deep learning. We'll start with deep learning, but we'll also move on to how we might apply this to robotics and other AI tasks, and why it's really important to have efficient computing to enable a lot of these exciting applications. I also want to mention that a lot of the work I'm going to present today was not done by myself but in collaboration with a lot of folks at MIT, and if you want the slides, they're available on our website.

Given that this is the deep learning lecture series, I want to first start by talking a little bit about deep neural nets. We know that deep neural nets have generated a lot of interest and have many very compelling applications, but one of the things that has come to light over the past few years is an increasing need for compute. OpenAI actually showed that there's been a significant increase in the amount of compute required to perform deep learning applications and to do the training for deep learning. It's actually grown exponentially over the past few years; it has
grown, in fact, by over 300,000 times in terms of the amount of compute we need to drive increases in accuracy on a lot of the tasks we're trying to achieve. At the same time, if we look at the environmental implications, all of this processing can be quite severe. If we look at the carbon footprint of training neural nets, and compare it with the carbon footprint of flying across North America from New York to San Francisco, or the carbon footprint of an average human life, you can see that neural networks can be orders of magnitude greater. So the carbon footprint implications of computing for deep neural nets can be quite severe as well.

Now, this mostly has to do with compute in the cloud. Another important direction is moving the compute from the cloud to the edge itself, into the device where a lot of the data is being collected. Why would we want to do that? There are a couple of reasons. First of all, communication: in a lot of places around the world you might not have a very strong communication infrastructure, so you don't necessarily want to rely on a communication network for these applications; removing the tether to the cloud is important. Another reason is that we often apply deep learning to applications where the data is very sensitive. You can think about things like health care, where you're collecting very sensitive data, so privacy and security are really critical; rather than sending the data to the cloud, you'd like to bring the compute to the data itself. Finally, another compelling reason for bringing the compute into the device or into the robot is latency. This is particularly true for interactive applications. You can think of things
like autonomous navigation, robotics, or self-driving vehicles, where you need to interact with the real world. You can imagine that if you're driving very quickly down the highway and you detect an obstacle, you might not have enough time to send the data to the cloud, wait for it to be processed, and get the instruction back. So again, you want to move the compute into the robot or into the vehicle itself.

OK, so hopefully this establishes why we want to move compute to the edge. But one of the big challenges of doing processing in the robot or in the device has to do with power consumption. If we take the self-driving car as an example, it's been reported that it consumes over 2,000 watts of power just for the computation itself, just to process all the sensor data it's collecting. This generates a lot of heat and takes up a lot of space; you can see in this prototype that all the compute is placed in the trunk, and it often needs water cooling. So this can be a big cost and logistical challenge for self-driving vehicles.

You can imagine this becomes much more challenging if we shrink down the form factor of the device to something that's portable, in your hands; think about smaller robots, or something like your smartphone. For these portable devices you have very limited energy capacity, because the battery itself is limited in terms of its size, weight, and cost, so you can't have a very large amount of energy on the device. Secondly, the embedded platforms currently used for these applications tend to consume over 10 watts, which is an order of magnitude higher than the power consumption you would typically allow for
these handheld devices. In handheld devices you're typically limited to under a watt due to heat dissipation; you don't want your cell phone to get super hot.

OK, so in past decades, what we would do to address this challenge is wait for transistors to become smaller, faster, and more efficient. However, this has become a challenge over the past few years: transistors are not getting more efficient. Moore's law, which typically makes transistors smaller and faster, has been slowing down, and Dennard scaling, which made transistors more efficient, has also slowed down or ended. You can see here that over the past 10 years this trend has really flattened out. This is a particular challenge, because we want more and more compute to drive deep neural network applications, but the transistors are not becoming more efficient. So what we have to turn to in order to address this is specialized hardware, to achieve the significant speed and energy efficiency that we require for a particular application. When we talk about designing specialized hardware, this is really about thinking about how we can redesign the hardware from the ground up, targeted at these AI, deep learning, and robotics tasks that we're really excited about.

This notion is not new; in fact, it's become extremely popular over the past few years. There have been a large number of startups and companies focused on building specialized hardware for deep learning. The New York Times reported, I guess two years ago, that there's a record number of startups looking at building specialized hardware for AI and deep learning. So we'll talk a little bit about what specialized hardware looks like for these applications.

Now, if you really care about energy and power efficiency, the first question you should ask is: where is the power actually going
for these applications? As it turns out, power is dominated by data movement. It's actually not the computations themselves that are expensive, but moving the data to the compute engine. For example, shown here in blue is a range of energy consumption for a variety of types of computations, for example multiplications and additions at various different precisions: floating point, fixed point, and integer. As you would expect, as you scale down the precision, the energy consumption of each of these operations is reduced. But what's really surprising is if you look below at the energy consumption of data movement; again, this is delivering the input data to the multiplication and then moving the output of the multiplication somewhere into memory. It can be very expensive. For example, a 32-bit read from an 8-kilobyte SRAM, a very small memory that you would have on the processor or on the chip itself, already consumes 5 picojoules of energy, equivalent to or even more than a 32-bit floating-point multiply, and that's from a very small memory. If you need to read the data from off-chip, outside the processor, for example from DRAM, it's going to be even more expensive; in this particular case we're showing 640 picojoules. Notice that the horizontal axis here is exponential, so we're talking about orders of magnitude increase in energy for data movement compared to the compute itself. That's the key takeaway: if we really want to address the energy consumption of this type of processing, we really want to reduce data movement.
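(Editor's note: a minimal sketch of the energy argument above. The per-operation numbers are illustrative estimates in the spirit of the figures cited in the talk, not exact measurements; the dictionary keys and the `mac_energy` helper are hypothetical names, and real costs vary with process technology.)

```python
# Illustrative per-operation energy costs in picojoules (roughly in line
# with commonly cited 45nm estimates); these exact values are assumptions.
ENERGY_PJ = {
    "int8_add": 0.03,
    "int8_mult": 0.2,
    "fp32_mult": 3.7,
    "sram_32b_read": 5.0,    # small (~8 KB) on-chip SRAM read
    "dram_32b_read": 640.0,  # off-chip DRAM read
}

def mac_energy(data_source: str) -> float:
    """Energy for one 8-bit multiply-accumulate, counting the three data
    reads (weight, activation, partial sum) from the given memory level."""
    compute = ENERGY_PJ["int8_mult"] + ENERGY_PJ["int8_add"]
    data = 3 * ENERGY_PJ[f"{data_source}_32b_read"]
    return compute + data

print(f"MAC fed from SRAM: {mac_energy('sram'):7.2f} pJ")
print(f"MAC fed from DRAM: {mac_energy('dram'):7.2f} pJ")
print(f"DRAM/SRAM ratio:   {mac_energy('dram') / mac_energy('sram'):.0f}x")
```

Even with cheap on-chip SRAM, data delivery already dwarfs the arithmetic; with DRAM it is two orders of magnitude worse, which is the point of the slide.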
OK, but what's the challenge here? If we take a popular AI or robotics application like autonomous navigation, the real challenge is that these applications use a lot of data. For example, one of the things you need to do in autonomous navigation is what we call semantic understanding: you need to be able to identify which pixel belongs to what. In this scene, you need to know that this pixel represents the ground, this pixel represents the sky, and this pixel represents a person. For this important type of processing, if you're traveling quickly you want to run at a very high frame rate, and you might need a large resolution; for HD images you're talking about 2 million pixels per frame. And if you also want to detect objects at different scales, or see objects that are far away, you need to do what we call data expansion, for example building an image pyramid, and this increases the amount of data you need to process by one or two orders of magnitude. So that's a huge amount of data you have to process right off the bat.

Another type of understanding you want for autonomous navigation is what we call geometric understanding: as you're navigating, you want to build a 3D map of the world around you. You can imagine that the longer you travel, the larger the map you're going to build, and again that's more data you're going to have to process and compute on. So this is a significant challenge for autonomous navigation in terms of the amount of data.

Other aspects of autonomous navigation, and also other applications like AR/VR, involve understanding your environment. A typical thing you might need to do is depth estimation: given an image, can you estimate how far away a given pixel is? And also semantic segmentation, which we just talked about. These are important ways to
understand your environment when you're trying to navigate. It should be no surprise that in order to do these types of processing, the state-of-the-art approaches utilize deep neural nets. The challenge is that these deep neural nets often require several hundred million operations and weights to do the computation. If you compare this to something you already have on your phone, for example video compression, you're talking about a two to three orders of magnitude increase in computational complexity. This is a significant challenge, because if we'd like deep neural networks to be as ubiquitous as something like video compression, we really have to figure out how to address this computational complexity. We also know that deep neural networks are not just used for understanding the environment or autonomous navigation; they've really become the cornerstone of many AI applications, from computer vision to speech recognition, game play, and even medical applications, and I'm sure a lot of these have been covered in this course.

So briefly, I'm going to give a quick overview of some of the key components of deep neural nets, not because I doubt you understand them, but because this area is so popular that the terminology can vary from discipline to discipline; a brief overview will align us on the terminology. So what are deep neural nets? You can view them, for example in the context of understanding the environment, as a chain of different layers of processing. For an input image, at the earlier parts of the neural net you're trying to learn different low-level features, such as the edges of an image, and as you get deeper into the network, as you chain more of these computational layers together, you start being able to detect higher and higher level features, until you can recognize, for example, a vehicle. And, you
know, the difference between this approach and more traditional ways of doing computer vision is that how we extract these features is learned from the data itself, as opposed to having an expert come in and say, hey, look for the edges, look for the wheels, and so on. The fact that these features are learned is key to the approach.

OK, what is it doing at each of these layers? It's actually a very simple computation; here we're looking at the inference side of things. Effectively, what it's doing is a weighted sum. You have the input values, which we'll color-code blue here and try to stay consistent with throughout the talk; we apply certain weights to them, and these weights are learned from the training data; and they generate an output, which is typically red here. It's basically a weighted sum, as we can see. We then pass this weighted sum through some form of non-linearity; traditionally these were sigmoids, and more recently we use things like ReLUs, which basically set negative values to zero. The key takeaway is that if you look at this computational kernel, the key operation in a lot of these neural networks is this multiply-and-accumulate that computes the weighted sum, and it accounts for over 90% of the computation. So if we really want to focus on accelerating neural nets or making them more efficient, we want to focus on minimizing the cost of this multiply-and-accumulate.

There are also various popular types of layers used for deep neural networks. They often vary in terms of how you connect up the different layers. For example, you can have feed-forward layers, where the inputs are always connected toward the outputs; you can have feedback, where the outputs are connected back to the inputs; you can have fully connected layers, where all the outputs are connected to all the inputs; or sparsely
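(Editor's note: the weighted-sum-plus-non-linearity kernel described above can be sketched in a few lines. The function names are illustrative, not from the lecture.)

```python
def relu(x):
    """The non-linearity: negative values are set to zero."""
    return max(0.0, x)

def neuron_output(inputs, weights, bias=0.0):
    """One neuron: a weighted sum followed by a non-linearity.
    The loop body is the multiply-and-accumulate (MAC) that the talk says
    accounts for over 90% of the computation."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w  # one MAC
    return relu(acc)

# ReLU(1*0.5 + 2*(-0.25) + 3*0.1) is approximately 0.3
print(neuron_output([1.0, 2.0, 3.0], [0.5, -0.25, 0.1]))
```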
connected. You might be familiar with some of these layers. Fully connected layers, just like we talked about, have all inputs and all outputs connected; they tend to be feed-forward, and when you put them together they're typically referred to as a multi-layer perceptron. You have convolutional layers, which are also feed-forward, but with sparsely connected, weight-sharing connections; when you put them together they're referred to as convolutional neural networks, and they're typically used for image-based processing. You have recurrent layers, with a feedback connection where the output is fed back to the input; when we combine recurrent layers they're referred to as recurrent neural nets, and these are typically used to process sequential data, so speech or language processing. And then most recently, what's become really popular is attention layers, or attention-based mechanisms; these often involve matrix multiplies, which are again multiply-and-accumulates, and when you combine them they're often referred to as transformers.

OK, so let's first get an idea of why convolutional neural nets, or deep learning, are so much more computationally complex than other types of processing. We'll focus on convolutional neural nets as an example, although many of these principles apply to other types of neural nets. The first thing to look at is the computational kernel: how does it actually perform convolution? Let's say you have a 2D input; at the input of the neural net it would be an image, and deeper in the neural net it would be an input feature map, composed of activations, or for an image, composed of pixels. We convolve it with, let's say, a 2D filter, which is composed of weights. In a typical convolution, you do an element-wise multiplication
of the filter weights with the input feature map activations, and you sum them all together to generate one output value, which we refer to as an output activation. Then, because it's a convolution, we slide the filter across the input feature map and generate all the other output feature map activations. This kind of 2D convolution is pretty standard in image processing; we've been doing it for decades. What makes convolutional neural nets much more challenging is the increase in dimensionality. First of all, rather than doing just a 2D convolution, we often stack multiple channels; there's a third dimension called channels, and we need to do a 2D convolution on each of the channels and then add them all together. For an input image, these channels would be the red, green, and blue components, for example, and as you get deeper into the network the number of channels can increase. If you look at AlexNet, which is a popular neural net, the number of channels ranges from 3 to 192. So that already adds one dimension to the neural net in terms of processing.

Another dimension we add is that we apply multiple filters to the same input feature map. For example, you might apply M filters to the same input feature map, and then you would generate an output feature map with M channels. In the previous slide we showed that convolving one 3D filter generates one output channel in the output feature map; if we apply M filters, we're going to generate M output channels in the output feature map. Again, to give you an idea of the scale: in AlexNet we're talking about between 96 and 384 filters, and it increases to thousands for more modern neural nets.
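(Editor's note: a minimal sketch of the multi-channel convolution just described: each input channel is convolved with its own 2D filter plane and the results are summed into one output channel. Stride 1, no padding; the function name is illustrative.)

```python
def conv2d_multichannel(ifmap, filt):
    """Sum of per-channel 2D convolutions producing ONE output channel.
    ifmap: C x H x W nested lists; filt: C x R x S nested lists."""
    C, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    R, S = len(filt[0]), len(filt[0][0])
    out = [[0.0] * (W - S + 1) for _ in range(H - R + 1)]
    for y in range(H - R + 1):          # slide the window vertically
        for x in range(W - S + 1):      # ...and horizontally
            acc = 0.0
            for c in range(C):          # sum across input channels
                for i in range(R):
                    for j in range(S):
                        acc += ifmap[c][y + i][x + j] * filt[c][i][j]
            out[y][x] = acc             # one output activation
    return out

# Two channels of a 3x3 input, all ones, convolved with a 2x2 all-ones
# filter per channel: every output value is 2 channels * 4 taps = 8.
ifmap = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
filt = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
print(conv2d_multichannel(ifmap, filt))
```

Applying M such filters to the same input, as described above, would simply repeat this for M different `filt` tensors, stacking the results into M output channels.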
Then finally, often you want to process more than one image at a given time, so we can extend this further: N input feature maps become N output feature maps. We typically refer to this as the batch size, the number of images you're processing at the same time, and this can range from 1 to 256.

So these are the various dimensions of the neural net, and what someone really does when they define what we call the network architecture is select, or define, the shape of the neural network for each of the different layers: all of these different dimensions. These shapes can vary across the layers. To give you an idea, if you look at MobileNet as an example, which is a very popular neural net, you can see that the filter sizes, meaning the height and width of the filters, and the number of filters and number of channels, vary across the different blocks or layers.

The other thing I want to mention is that when we look at popular DNN models we can see important trends. Shown here are various models developed over the years that are quite popular. One interesting trend is that the networks tend to become deeper; you can see the number of convolutional layers getting deeper and deeper. Also, the number of weights they're using and the number of MACs are increasing as well. So this is an important trend: DNN models are getting larger and deeper, they're becoming much more computationally demanding, and we need more sophisticated hardware to be able to process them.

All right, so that was a quick overview of the deep neural network space; I hope we're all aligned.
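(Editor's note: the layer dimensions just enumerated, batch N, filters M, channels C, input H x W, filter R x S, determine the output shape, weight count, and MAC count. A sketch, assuming stride 1 and no padding; the function name and the example layer are hypothetical, and in particular AlexNet's real first layer uses stride 4, which this simplification ignores.)

```python
def conv_layer_stats(N, M, C, H, W, R, S):
    """Shapes and work for one conv layer (stride 1, no padding).
    N: batch, M: # filters (output channels), C: input channels,
    H x W: input feature map size, R x S: filter size."""
    P, Q = H - R + 1, W - S + 1           # output feature map size
    macs = N * M * P * Q * C * R * S      # each output needs C*R*S MACs
    weights = M * C * R * S
    return {"output_shape": (N, M, P, Q), "macs": macs, "weights": weights}

# Hypothetical AlexNet-like layer: batch 1, 96 filters, 3 input channels,
# 227x227 input, 11x11 filters (ignoring the real layer's stride).
stats = conv_layer_stats(N=1, M=96, C=3, H=227, W=227, R=11, S=11)
print(stats["output_shape"], f'{stats["macs"]:,} MACs, {stats["weights"]:,} weights')
```

Even this one layer runs to over a billion MACs at stride 1, which is why the MAC kernel dominates the computation.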
So the first thing I'm going to talk about is how we can actually build hardware to make the processing of these neural networks more efficient and run faster; we often refer to this as hardware acceleration. We know these neural networks are very large and there's a lot of compute, but are there properties we can leverage to make processing these networks more efficient? The first thing that's really friendly is that they exhibit a lot of parallelism: all these multiplies and accumulates can actually be done in parallel. That's great, because it means high throughput, or high speed, is possible. What is difficult, and what should not be a surprise now, is that the memory access is the bottleneck: delivering the data to the multiply-and-accumulate engine is what's really challenging.

I'll give you some insight into why this is the case. Take this multiply-and-accumulate engine, what we call a MAC. It takes in three inputs for every MAC: the filter weight; the input image pixel, or if you're deeper in the network, the input feature map activation; and the partial sum, which is the partially accumulated value from the previous multiply it did. It then generates an updated partial sum. So for every MAC you perform, you need four memory accesses; it's a four-to-one ratio of memory accesses to compute. The other challenge, as we mentioned, is that moving data is very expensive. In the absolute worst case, which you would always try to avoid, you read the data from DRAM, the off-chip memory; every time you access data from DRAM, it's two orders of magnitude more expensive than the computation of performing the MAC itself. That's really, really bad. So if you can
imagine, looking again at AlexNet, which has around 700 million MACs, we're talking about three billion DRAM accesses to do that computation. OK, but all is not lost; there are things we can exploit to help with this problem. One is what we call input data reuse opportunities, which means that a lot of the data we're reading for these multiplies and accumulates is actually used by many multiplies and accumulates; if we read the data once, we can reuse it multiple times across many operations.

I'll show you some examples. First is what we call convolutional reuse: if you remember, we're taking a filter and sliding it across the input image, so the activations from the feature map and the weights from the filter are going to be reused in different combinations to compute the different MACs. So there's a lot of this convolutional reuse. Another example: recall that we apply multiple filters to the same input feature map, which means each activation in that input feature map can be reused multiple times across the different filters. Finally, if we process many images, or many feature maps, at the same time, a given weight in the filter can be reused multiple times across those input feature maps; that's what we call filter reuse.

So there are a lot of these great reuse opportunities in the neural network itself. What can we do to exploit them? What we can do is build what we call a memory hierarchy, which contains very low-cost memories that allow us to reduce the overall cost of moving this data. What do we mean here? If I build a multiply-and-accumulate engine, I'm going to place a very small memory right beside the multiply-and-accumulate engine.
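(Editor's note: the reuse opportunities just listed can be counted directly from the layer dimensions. A sketch, assuming stride 1 and no padding; the function and field names are illustrative, not from the lecture.)

```python
def reuse_counts(N, M, C, H, W, R, S):
    """How often each piece of data can be reused in a conv layer
    (stride 1, no padding). These counts are what a memory hierarchy
    tries to exploit: read once, use many times.
    N: batch, M: filters, C: channels, H x W: input, R x S: filter."""
    P, Q = H - R + 1, W - S + 1
    return {
        # each filter weight participates in every output position of
        # every image in the batch (filter + convolutional reuse)
        "uses_per_weight": N * P * Q,
        # each input activation is reused across all M filters (and also,
        # within one filter, by the sliding windows that overlap it)
        "uses_per_activation_across_filters": M,
        # each output value accumulates C*R*S products into one partial sum
        "accumulations_per_output": C * R * S,
    }

print(reuse_counts(N=4, M=96, C=3, H=227, W=227, R=11, S=11))
```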
By small, I mean something on the order of under a kilobyte of memory, locally, beside that multiply-and-accumulate engine. Why do I want that? Because accessing that very small memory is very cheap. For example, if performing a multiply-and-accumulate with the ALU costs 1x, reading from this very small memory beside the multiply-and-accumulate engine costs roughly that same amount of energy. I can also allow these processing elements, where a processing element is the multiply-and-accumulate plus the small memory, to share data with each other; reading from a neighboring processing element is going to cost about 2x the energy. Then you can have a shared, larger memory called a global buffer, which is shared across all the different processing elements; this tends to be larger, between 100 and 500 kilobytes, and more expensive, at about 6x the energy. And of course, if you go off-chip to DRAM, that's going to be the most expensive, at 200x the energy.

So the way you can think about this is: what you would ideally like to do is access all of the data from this very small local memory. The challenge is that this very small local memory is only 1 kilobyte, but we're talking about neural networks with millions of weights. So how do we go about doing that? As an analogy, you can imagine that accessing something from, let's say, your backpack is going to be much cheaper than getting it from your neighbor, or going back to your office somewhere on campus for the data, versus going all the way back home. Ideally you'd like to access all of your data from your backpack, but if you have a lot of work to do, you might not be able to fit it all in your backpack.
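(Editor's note: the 1x / 2x / 6x / 200x multipliers above can be turned into a toy cost model that shows why serving most reads locally matters. The level names, the access-mix example, and the helper function are hypothetical; only the multipliers come from the talk.)

```python
# Relative energy cost of delivering one value to the MAC
# (1x = the MAC computation itself), per the multipliers in the talk.
RELATIVE_COST = {
    "local_pe_memory": 1,   # <1 KB memory inside the processing element
    "neighbor_pe": 2,       # fetched from an adjacent processing element
    "global_buffer": 6,     # shared 100-500 KB on-chip buffer
    "dram": 200,            # off-chip memory: avoid whenever possible
}

def total_access_energy(access_counts):
    """Sum relative energy given how many reads hit each level."""
    return sum(RELATIVE_COST[level] * n for level, n in access_counts.items())

# Hypothetical mix: one million reads served mostly from the local
# memory, versus the same million reads served entirely from DRAM.
mostly_local = total_access_energy(
    {"local_pe_memory": 900_000, "global_buffer": 99_000, "dram": 1_000})
all_dram = total_access_energy({"dram": 1_000_000})
print(f"energy ratio: {all_dram / mostly_local:.0f}x")
```

The point of the backpack analogy: the same work, with the same number of reads, can differ by roughly two orders of magnitude in energy depending on which level serves the data.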
So the question is: how can I break up my large piece of work into smaller chunks so that I can access them all from this small memory? That's the big challenge, and there's been a lot of research in this area on the best way to break up the data and on what to store in this very small local memory.

One approach is what we call weight stationary. The idea is: I'm going to store the weight information of the neural net in this small local memory, and as a result I really minimize the weight energy. The challenge is that the other types of data in the system, the input activations shown in blue and the partial sums shown in red, still have to move through the rest of the system, through the network from the global buffer. Typical well-known designs that use this kind of data flow, called weight stationary because the weights remain stationary, are things like the TPU from Google and the NVDLA accelerator from NVIDIA.

Another approach people take is to say: well, the weight I only have to read, but the partial sum I have to both read and write, because I read the partial sum, accumulate into it, and write it back, so there are two memory accesses for that data type; maybe I should put the partial sum locally in that small memory instead. This is what we call output stationary, because the accumulation of the output stays local within one processing element and doesn't move. The trade-off, of course, is that the activations and weights now have to move through the network. There are various works, for example from KU Leuven and from the Chinese Academy of Sciences, that use this approach. Another line of work says, well,
forget about the outputs and the weights; let's keep the inputs stationary in this small memory. This is called input stationary, and some research work from NVIDIA has examined it. But all of these approaches really focus on not moving one particular type of data: they minimize either weight energy, or partial-sum energy, or input energy. What's important to think about is that maybe you want to reduce the data movement of all the different data types.

So another approach, something we've developed within our own group, is what we call the row stationary data flow. Within each of the processing elements you do one row of the convolution, and that row involves a mixture of all the different data types: you have filter information, so the weights of the filter; you have the activations of your input feature map; and you have your partial-sum information. So you're really trying to balance the data movement of all the different data types, not just one particular data type. This is just performing one row, but we just talked about the fact that a neural network is much more than a 1D convolution, so you can imagine expanding this to higher dimensions; this shows how you might expand the 1D convolution into a 2D convolution, and there are higher dimensions you can map onto the architecture as well. I won't go through the details, but the key takeaway is that you might not want to focus on one particular data type; you want to optimize for all the different types of data you're moving around in your system.

These results show how the different data flows compare. For example, in the weight stationary case, as expected, the weight energy, the energy required to move the weights, shown
in green, is the lowest, but the red portion, which is the energy of the partial sums, and the blue part, which is the input feature map or input pixels, are very high. Output stationary is another approach; as we discussed, you're trying to reduce the data movement of the partial sums, shown here in red, so the red part is really minimized, but you can see that the green part, the weight movement, increases, and the blue part, the inputs, increases too. There's another option called no local reuse, which we don't have time to talk about, but you can see that row stationary really aims to balance the data movement of all the different data types. The big takeaway is that when you're trying to optimize a given piece of hardware, you don't want to optimize for just one particular type of data; you want to optimize overall, for all the movement in the hardware itself.

Another thing you can exploit to save a bit of power is the fact that some of the data can be zero. We know that anything multiplied by zero is going to be zero, so if you know that one of the inputs to your multiply-and-accumulate is zero, you might as well skip that multiplication; in fact, you might as well skip reading the other input to that multiply-and-accumulate. By doing that, you can actually reduce the power consumption by almost 50 percent. Another thing you can do, if you have a bunch of zeros, is compress the data. For example, you can use things like run-length encoding, where a run of zeros is represented, rather than as "zero, zero, zero, zero, zero", as simply "a run of five zeros", and this can reduce the amount of data movement by up to 2x in your system.
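(Editor's note: a minimal sketch of the zero-aware run-length encoding just described. Real accelerator compression schemes use packed bit-level formats; this list-of-pairs version, and the function name, are simplifications for illustration.)

```python
def rle_compress(values):
    """Run-length encode the zeros: a run of zeros becomes the pair
    (0, run_length); nonzero values pass through as (value, 1)."""
    out = []
    run = 0
    for v in values:
        if v == 0:
            run += 1                 # extend the current run of zeros
        else:
            if run:
                out.append((0, run)) # flush the run before the nonzero
                run = 0
            out.append((v, 1))
    if run:
        out.append((0, run))         # flush a trailing run
    return out

data = [7, 0, 0, 0, 0, 0, 3, 0, 0, 1]
print(rle_compress(data))  # [(7, 1), (0, 5), (3, 1), (0, 2), (1, 1)]
```

Ten values shrink to five pairs here; the more zeros the ReLU or pruning produces, the better the compression, and hence the less data movement.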
In fact, in neural nets there are many opportunities to generate zeros. First of all, if you remember the ReLU, it sets negative values to zero, so it naturally generates zeros, and there are other techniques, for example what we call pruning, which sets some of the weights of the neural net to zero, so the hardware can exploit all of that. OK, so what is the impact of all of these techniques? We actually looked at building hardware, in particular a customized chip that we called Eyeriss, to demonstrate these approaches, in particular the row stationary dataflow and exploiting sparsity in the activation data. The Eyeriss chip has 14 by 12, so 168, processing elements. You can see there's a shared buffer that's 100 kilobytes, and it has compression and decompression before data goes to off-chip DRAM, and again that's because accessing DRAM is the most expensive. Shown here on the right-hand side is a die photo of the fabricated chip itself; it's 4 millimeters by 4 millimeters in size. Using the row stationary dataflow, it exploits a lot of data reuse: it reduces the number of times we access the global buffer by 100x, and it reduces the number of times we access off-chip memory by over 1000x. This is all because each of the processing elements has a local memory that it tries to read most of its data from, and it's also sharing data with other processing elements. Overall, when you compare it to a mobile GPU, you're talking about an order of magnitude reduction in energy consumption. If you'd like to learn a little bit more about that, I invite you to visit the Eyeriss project website. OK, so this is great, we can build custom hardware, but what does this actually mean in terms of building a system that can efficiently compute neural nets? Let's take a step back and say we don't care about the hardware at all; we're a systems provider and we want to build
an overall system, and what we really care about is the trade-off between energy and accuracy; that's the key thing. Shown here is a plot, let's say for an object detection task. Accuracy is on the x-axis, listed in terms of average precision, which is a metric we use for object detection; it's on a linear scale, and higher is better. Vertically we have energy consumption; this is the energy consumed per pixel, so you average it, and you can imagine a higher resolution image consumes more energy; it's on an exponential scale. Let's first start on the accuracy axis. Before neural nets had their resurgence, around 2011 or 2012, the state-of-the-art approaches actually used features called histograms of oriented gradients, which we refer to as HOG; this was a very popular approach that was very efficient for object detection. The reason neural nets really took off is because they really improved the accuracy: you can see AlexNet here almost doubled the accuracy, and then VGG further increased it, so that's super exciting. But we also want to look at the vertical axis, the energy consumption, and I should mention that for each of these dots, the energy numbers are measured on specialized hardware that's been designed for that particular task. So we have a chip here, built in a 65 nanometer CMOS process, so the same transistors of around the same size, that does object detection using the HOG features, and then here's the Eyeriss chip that we just talked about. I should also note that both of these chips were built in my group; the students who built them started designing the chips at the same time and taped out at the same time, so it's somewhat of a controlled experiment in terms of optimization. OK, so what does this tell us? When we look on the energy axis, we can see that histogram of oriented gradients, or HOG, features are actually very efficient from an energy point of view. In fact, if we compare it to something like video compression, again something you all have in your phone, HOG features are actually more efficient than video compression, meaning for the same energy you would spend compressing a pixel, you could actually understand that pixel. That's pretty impressive. But if we start looking at AlexNet or VGG, we can see that the energy increases by two to three orders of magnitude, which is quite significant. I'll give you an example: if I told you that on your cell phone I'm going to double the accuracy of its recognition, but your phone will die three hundred times faster, who here would be interested in that technology? Right, exactly, nobody, in the sense that battery life is so critical to how we actually use these types of technologies. So we should not just look at the accuracy, the x-axis; we should also consider the energy consumption, and we really don't want the energy to be so high. And we can see that even with specialized hardware, we're still quite far from making neural nets as efficient as something like the video compression you all have on your phones. So we really have to think about how we can push the energy consumption further down, without sacrificing accuracy, of course. OK, so there's actually been a huge amount of research in this space, because we know neural nets are popular and have a wide range of applications, but energy is really a big challenge. People have looked at how we can design new hardware that is more efficient, and how we can design algorithms that are more efficient, to enable energy-efficient processing of DNNs. In fact, within our own research group we've spent
quite a bit of time surveying the area and understanding the various developments people have been looking at. So if you're interested in this topic, we've generated various tutorials on this material as well as overview papers; there's an overview paper that's about 30 pages, which we're currently expanding into a book, so I'd encourage you to visit those resources. But the main thing we learned while doing this survey is that we identified various limitations in how the research is approaching this problem. So first let's look at the algorithm side. There's a wide range of approaches people are using to try to make DNN algorithms or models more efficient. For example, we mentioned the idea of pruning: the idea here is that you set some of the weights to zero, and again, anything times zero is zero, so you can skip those operations; there's a wide range of research there. There's also work on efficient network architectures, meaning rather than making the neural network's high-dimensional convolutions very large, can I decompose them into smaller filters? So rather than this 3D filter, can I make it a 2D filter plus a 1-by-1 filter that goes into the screen, along the channel dimension? Another very popular approach is reduced precision: rather than using the default 32-bit float, can I reduce the number of bits down to eight bits, or even binary? And we saw before that as we reduce the precision of these operations, you get energy savings, and you also reduce data movement, because you have to move less data. A lot of this work really focuses on reducing the number of MACs and the number of weights, primarily because those are easy to count.
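To make the filter-decomposition idea concrete, here's a quick back-of-the-envelope comparison of multiply-accumulate counts. The layer shape (56x56 feature map, 128 input and output channels, 3x3 kernel) is a hypothetical example, not one from the lecture:

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard convolution: every output pixel sees a k x k x c_in filter."""
    return h * w * c_out * k * k * c_in


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a depthwise (k x k per channel) plus pointwise (1 x 1) decomposition."""
    depthwise = h * w * c_in * k * k      # one 2D filter per input channel
    pointwise = h * w * c_in * c_out      # 1x1 conv mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 56x56 feature map, 128 -> 128 channels, 3x3 kernel
std = standard_conv_macs(56, 56, 128, 128, 3)
sep = depthwise_separable_macs(56, 56, 128, 128, 3)
print(round(std / sep, 1))  # 8.4, i.e. ~8x fewer MACs for this shape
```

The ratio works out to k^2 * c_out / (k^2 + c_out), so the saving grows with the number of output channels.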
But the question we should be asking, if we care about the system, is whether this actually translates into energy savings and reduced latency, because from a systems point of view those are the things we care about. When you're thinking about something running on your phone, you don't care about the number of MACs and weights; you care about how much energy it's consuming, because that's going to affect the battery life, and how quickly it reacts, which is basically a measure of latency. And again, hopefully you haven't forgotten that data movement is expensive, so it really depends on how you move the data through the system. The key takeaway from this slide is that if you remember where the energy comes from, which is the data movement, it's not about how many weights or how many MACs you have; it really depends on where the weight comes from. If it comes from a small memory, a register file that's nearby, it's going to be super cheap, as opposed to coming from off-chip DRAM. So all weights are basically not created equal, and all MACs are not created equal; it really depends on the memory hierarchy and the dataflow of the hardware itself. So we can't just look at the number of weights and the number of MACs and estimate how much energy is going to be consumed, and this is quite a difficult challenge. So within our group we've looked at developing tools that allow us to estimate the energy consumption of the neural network itself. For example, this particular tool, which is available on our website, basically takes in the DNN weights and the input data, including its sparsity; it knows the shapes of the different layers of the neural net, and it runs an optimization that figures out the memory accesses, how much energy is consumed by the data movement, and the energy consumed by the multiply-and-accumulate computations. The output is then a breakdown of the energy for the different layers.
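A toy version of that kind of energy model looks like this. The relative costs per access are made-up placeholders; the real tool uses measured numbers and models the full memory hierarchy, but the key property, the steep cost gradient from register file to DRAM, is the same:

```python
# Hypothetical relative energy costs, normalized so one MAC = 1 unit.
# The actual numbers depend on the process technology and memory sizes.
COST = {"mac": 1, "register_file": 1, "global_buffer": 6, "dram": 200}


def layer_energy(macs, accesses):
    """Estimate layer energy from the MAC count plus data-movement counts.

    `accesses` maps each memory level to the number of reads/writes
    for weights, inputs, and partial sums at that level.
    """
    energy = macs * COST["mac"]
    for level, count in accesses.items():
        energy += count * COST[level]
    return energy


# Two hypothetical layers with the SAME number of MACs but different
# data movement: the DRAM-heavy one costs roughly 9x more energy.
cheap = layer_energy(10_000, {"register_file": 30_000, "global_buffer": 500, "dram": 20})
pricey = layer_energy(10_000, {"register_file": 30_000, "global_buffer": 500, "dram": 2_000})
print(cheap, pricey)
```

The point of the sketch is exactly the one made above: two layers with identical MAC and weight counts can have wildly different energy, depending on where their data lives.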
Once you have this, you can figure out where the energy is going, so you can target your design to minimize that energy consumption. And when we do this, it should be no surprise what the key observation from this exercise is: the weights alone are not a good metric for energy consumption. If you take a look at GoogLeNet, for example, running on the Eyeriss architecture, the weights only account for 22 percent of the overall energy; in fact, a lot of the energy goes into moving the input feature maps and the output feature maps, and also into computation. So in general this is the same message as before: we shouldn't just look at the data movement of one particular data type; we should look at the energy consumption of all the different data types to get an overall view of where the energy is actually going. OK, and once we actually know where the energy is going, how can we factor that into the design of the neural networks to make them more efficient? So we talked about the concept of pruning: again, pruning was setting some of the weights of the neural net to zero, or you can think of it as removing some of the weights. What we want to do here is, now that we know where the energy is going, incorporate the energy into the design of the algorithm, for example to guide where we should actually remove the weights from. So for example, here on AlexNet, for the same accuracy across the different approaches: traditionally, people tend to remove the weights that are small, which we call magnitude-based pruning, and you can see that you get about a 2x reduction in energy consumption. However, we know that the value of the weight has nothing to do with its energy consumption. Ideally, what you'd like to do is remove the weights that consume the most energy.
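One simple way to operationalize "remove the most energy-hungry weights first" is to rank layers by their estimated energy and contrast that with the classic magnitude-based baseline. The per-layer energies and weight values below are hypothetical stand-ins for the tool's measured numbers:

```python
def energy_aware_prune_order(layer_energies):
    """Return layer names sorted from highest to lowest estimated energy.

    Energy-aware pruning visits the most expensive layers first,
    so each removed weight buys the largest energy saving.
    """
    return sorted(layer_energies, key=layer_energies.get, reverse=True)


def magnitude_prune(weights, fraction):
    """Classic baseline for contrast: zero out the smallest-magnitude weights."""
    k = int(len(weights) * fraction)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0
    return pruned


# Hypothetical per-layer energy breakdown (arbitrary units)
energies = {"conv1": 120, "conv2": 340, "conv3": 80, "fc1": 210}
print(energy_aware_prune_order(energies))        # ['conv2', 'fc1', 'conv1', 'conv3']
print(magnitude_prune([0.5, -0.1, 2.0, 0.05], 0.5))  # [0.5, 0, 2.0, 0]
```

The magnitude baseline looks only at weight values; the energy-aware ordering instead uses the per-layer energy breakdown, which is exactly the information the estimation tool provides.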
In particular, we also know that the more weights we remove, the more the accuracy goes down, so to get the biggest bang for your buck, you want to remove the weights that consume the most energy first. One way to do this is to take your neural network, figure out the energy consumption of each of its layers, sort the layers from high-energy layers to low-energy layers, and then prune the high-energy layers first. This is what we call energy-aware pruning, and by doing this you now get a 3.7x reduction in energy consumption, compared to 2x, for the same accuracy. And again, this is because we factored energy consumption into the design of the neural network itself; the pruned models are all available on the Eyeriss website. Another important thing we care about from a performance point of view is latency: for example, latency has to do with how long it takes, when I give it an image, before I get the result back. People are very sensitive to latency, but the challenge here is that latency, again, is not directly correlated with things like the number of multiplies and accumulates. Here is some data that was released by Google's mobile vision team. They're showing on the x-axis the number of multiply-accumulates, increasing as you go towards the left, and on the y-axis the latency, the actual measured delay it takes to get a result. What they're showing is that the number of MACs is not really a good approximation of latency: for example, given neural network layers with the same number of MACs, there can be a 2x range in latency, or, looking at it the other way, given layers or networks with the same latency, there can be a 3x swing in the number of MACs. All right, so the key takeaway here is that you can't
just count the number of MACs and say, oh, this is how quickly it's going to run; it's actually much more challenging than that. So what we want to ask is whether there is a way to take latency and use that, rather than MACs, to design the neural net directly. Together with Google's mobile vision team, we developed an approach called NetAdapt, and this is really a way to tailor your particular neural network for a given mobile platform under a latency or energy budget. It automatically adapts the neural net for that platform, and what's really driving the design is empirical measurements: measurements of how that particular network performs on that platform, for things like latency and energy. The reason we want to use empirical measurements is that you often can't generate models for all the different types of hardware out there. In Google's case, what they want is that if they have a new phone, they can automatically tune the network for that particular phone, without having to model the phone as well. OK, so how does this work? I'll walk you through it. You start off with a pretrained network, so this is a network that's, let's say, trained in the cloud for very high accuracy; great, start off with that, but it tends to be very large. You take that into the NetAdapt algorithm along with a budget; the budget tells you, I can only afford this much latency, or this amount of energy. What NetAdapt will do is generate a bunch of proposals, different options for how it might modify the network in terms of its dimensions; it's going to measure these proposals on the target platform that you care about, and then, based on these empirical measurements, NetAdapt will generate a new set of proposals, and it will just iterate like this until it meets
the budget. OK, and again, all of this is on the NetAdapt website. Just to give you a quick example of how this might work: let's say you start off with a neural network as your input that has the accuracy you want, but the latency is 100 milliseconds, and you would like it to be 80 milliseconds; you want it to be faster. So NetAdapt is going to generate a bunch of proposals, and what a proposal could involve is taking one layer of the neural net and reducing the number of channels until it hits the latency budget of 80 milliseconds; it can do that for each of the different layers. Then it's going to tune these different proposals and measure the accuracy. So let's say the one where I shortened the number of channels in layer one maintains the accuracy at 60 percent; that means I'm going to pick that one, and that's going to be the output of this iteration. That output, at 80 milliseconds and hitting 60 percent accuracy, becomes the input to the next iteration, and then I tighten the budget. OK, again, if you're interested, I invite you to take a look at the NetAdapt paper. But what is the impact of this particular approach? Well, it gives you a very much improved trade-off between latency and accuracy. If you look at this plot, on the x-axis is the latency, so to the left is better, lower latency, and on the y-axis is the accuracy, so higher is better; you want to be high and to the left. Shown in blue and green are various handcrafted neural network approaches, and you can see NetAdapt, which generates the red dots as it iterates through its optimization, gives you, for the same accuracy, up to 1.7x faster performance than a manually designed approach.
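The proposal-measure-pick loop described above can be sketched as one NetAdapt-style step. Everything here is a toy stand-in: the "network" is just a list of channel counts, and the latency model, proposal generator, and accuracy proxy are fake placeholders for the real empirical measurements and fine-tuning:

```python
def netadapt_step(network, budget, measure_latency, shrink_proposals, finetune_and_score):
    """One simplified NetAdapt-style iteration.

    Generate shrunken candidates, keep those meeting the (empirically
    measured) latency budget, and return the one with the best score.
    """
    candidates = [p for p in shrink_proposals(network) if measure_latency(p) <= budget]
    return max(candidates, key=finetune_and_score)


# Toy stand-ins: a "network" is just a list of channel counts per layer.
net = [64, 128, 256]
measure = lambda n: sum(n) // 5                        # fake latency model
proposals = lambda n: [n[:i] + [n[i] // 2] + n[i + 1:]  # halve one layer at a time
                       for i in range(len(n))]
score = lambda n: sum(n)                               # fake accuracy proxy
print(netadapt_step(net, 80, measure, proposals, score))  # [64, 64, 256]
```

Each real iteration would fine-tune the surviving candidates before scoring them, and the winner becomes the input to the next iteration under a tighter budget.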
This approach also falls under the umbrella of network architecture search, so it's in that flavor as well. But in general, the takeaway here is that if you're going to design efficient neural networks, networks that you want to run quickly or that you want to be energy-efficient, you should really put the hardware into the design loop and take accurate energy or latency measurements into the design of the neural network itself. This particular example was shown for an image classification task, meaning I give you an image and you can classify what's in the image. You can imagine that that type of approach is kind of reducing information: from a 2D image, you reduce it down to a label. This is very commonly used now, but we actually wanted to see if we could apply this approach to a more difficult task, something like depth estimation. In this case, I give you a 2D image, and the output is also a 2D image, where each pixel of the output shows the depth of the corresponding pixel at the input. This is often what we refer to as monocular depth: I give you just a single 2D image as input, and you can estimate the depth from it. The reason you want to do this is that 2D cameras, regular cameras, are pretty cheap, so it would be ideal to be able to do this. The way we do this is to use an autoencoder: the front half of the neural net is what we call the encoder, a reduction element, very similar to what you would do for classification, but the back end of the autoencoder is a decoder, which expands the information back out. And as I mentioned, this is going to be much more difficult than just classification, because now the output also has to be very dense. So we wanted to see if we could make this really fast with the approaches we just talked
about, for example NetAdapt. And indeed you can make it pretty fast: if you apply NetAdapt plus compact network design and then do some depthwise decomposition, you can actually increase the frame rate by an order of magnitude. So again, here I'm showing a plot where the x-axis is the frame rate on a Jetson TX2 GPU, measured with a batch size of one using 32-bit float, and on the vertical axis is the depth estimation accuracy in terms of the delta-1 metric, which means the percentage of pixels that are within 25 percent of the correct depth, so higher is better. You can see the various different approaches out there; the red star is our approach, called FastDepth, using all the efficient network design techniques we talked about, and you can see you get an order of magnitude, over a 10x, speedup while maintaining accuracy. The models and all the code to do this are available on the FastDepth website. We presented this at ICRA, which is a robotics conference, in the middle of last year, and we wanted to show some live footage there, so at ICRA we actually captured footage on an iPhone and showed real-time depth estimation running on the iPhone itself; you can achieve about 40 frames per second on an iPhone using FastDepth. So if you're interested in this particular type of application, or in efficient networks for depth estimation, I invite you to visit the FastDepth website.
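As an aside, the delta-1 accuracy metric used above is easy to state in code; the 25 percent tolerance corresponds to the standard threshold of 1.25, and the flat pixel lists here are a simplification of the 2D depth maps:

```python
def delta1(predicted, ground_truth, threshold=1.25):
    """Fraction of pixels whose predicted depth is within 25% of the true depth.

    Standard delta-1 definition: max(pred/true, true/pred) < threshold.
    """
    ok = sum(1 for p, t in zip(predicted, ground_truth)
             if max(p / t, t / p) < threshold)
    return ok / len(ground_truth)


pred = [1.0, 2.0, 5.0, 0.5]    # hypothetical predicted depths (meters)
true = [1.1, 2.0, 3.0, 0.52]   # hypothetical ground-truth depths
print(delta1(pred, true))      # 0.75: three of four pixels within tolerance
```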
OK, so that's the algorithmic side of things, but let's return to the hardware, to building specialized hardware that is efficient for neural network processing. So again, we saw that there are many different ways of making the neural network efficient, from network pruning to efficient network architectures to reduced precision. The challenge for the hardware designer, though, is that there's no guarantee as to which of these approaches someone might apply to the algorithm they're going to run on the hardware; if you only build the hardware, you don't know what kind of algorithm someone is going to run on it, unless you own the whole stack. So as a result, you really need flexible hardware, so that it can support all of these different approaches and translate them into improvements in energy efficiency and latency. Now, the challenge is that a lot of the specialized DNN hardware that exists out there often relies on certain properties of the DNN in order to achieve high efficiency. A very typical structure you might see is an array of multiply-and-accumulate units, a MAC array, which reduces memory accesses by amortizing reads across the array. What do I mean by that? If I read a weight once from the weight memory bus, I'm going to reuse it multiple times across the array: one read, and it can be used multiple times by multiple engines, for multiple MACs. Similarly for the activation memory: I read an input once and reuse it multiple times. The issue here is that the amount of reuse and the array utilization depend on the number of channels you have in your neural net, the size of the feature map, and the batch size; this is just showing two different variations of the reuse you get based on the number of filters, the number of input channels, the feature map size, and the batch size. And the problem now is that when we start looking at these efficient neural network models, they're not going to have as much reuse, particularly the compact ones. For example, a very typical approach is to use what we call depthwise layers; we saw this when you took that 3D filter and decomposed it into a 2D filter and a 1-by-1. As a result, you only have one channel, so you're not going to have much reuse across the input channels, and rather than filling this array with a lot of computation that it can process, you're only going to be able to utilize a very small
subset of the array, which I've highlighted here in green, for computation. So even though you put down a thousand, or ten thousand, multipliers, only a very small subset of them can actually do work, and that's not great. This is also an issue because as I scale up the array size, it becomes less efficient: ideally, after I put down more cores or processing elements, the system should run faster, since I'm paying for more cores, but it doesn't, because the data can't reach, or be reused by, all of those cores. It can also be difficult to exploit sparsity. So what you need here are two things. One is a very flexible dataflow, meaning that there are many different ways for the data to move through the array; you can imagine that row stationary is a very flexible way to map the neural network onto the array itself, and you can see here, in the Eyeriss row stationary case, that a lot of the processing elements can be used. The other thing is how you actually deliver the data for this varying degree of reuse. So here's the spectrum of on-chip networks, in terms of how I can deliver data from the global buffer to all those parallel processing engines. One use case is when I use these huge neural nets that have a lot of reuse: what I want there is multicast, meaning I read once from the global buffer and then reuse that data multiple times across all of my processing elements; you can think of it as broadcasting information out. The type of network you would use for that is shown here on the right-hand side: it's low bandwidth, so I'm only reading very little data, but high spatial reuse, since many, many engines are using it. On the other extreme, when I design these very efficient neural networks, I'm not going to have very much reuse, and so what I want is unicast, meaning I send out
unique information to each of the processing elements so that they can all work. That's shown here on the left-hand side: a case where you have very high bandwidth, because there's a lot of unique information going out, and low spatial reuse, because they're not sharing data. Now, it's very challenging to go across this entire spectrum. One solution would be what we call an all-to-all network that satisfies all of this, where all inputs are connected to all outputs, but that's going to be very expensive and not scalable. The solution that we have for this is what we call a hierarchical mesh: you break this problem into two levels, where at the lowest level you use an all-to-all connection, and at the higher level you use a mesh connection. The mesh allows you to scale up, while the all-to-all allows you to achieve a lot of different types of reuse, and with this type of network-on-chip you can basically support a lot of different delivery mechanisms for getting data from the global buffer to all the processing elements, so that all of your compute can be happening at the same time. And at its core, this is one of the key things that enables the second version of Eyeriss to be both flexible and efficient. So here are some results from the second version of Eyeriss. It supports a wide range of filter shapes, both very large and very compact, including convolutional, fully connected, and depthwise layers, and you can see in this plot that, depending on the shape, you can get up to an order of magnitude speedup. It also supports a wide range of sparsity levels, both dense and sparse; this is really important because some networks can be very sparse, because you've done a lot of pruning, and some are not, and you want to support all of them efficiently. You also want it to be scalable, so that as you increase the number of processing elements, the throughput also scales up.
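To see why utilization collapses for compact layers on a fixed MAC array, here's a toy mapping calculation. The array dimensions and the mapping rule (rows carry input channels, columns carry output channels, one common weight-stationary-style assignment) are hypothetical; real mappers are far more sophisticated:

```python
def array_utilization(rows, cols, input_channels, output_channels):
    """Fraction of PEs doing useful work when rows map to input channels
    and columns map to output channels (a simplified mapping rule)."""
    used = min(rows, input_channels) * min(cols, output_channels)
    return used / (rows * cols)


# A big standard conv layer fills a hypothetical 16x16 array completely...
print(array_utilization(16, 16, 128, 128))   # 1.0
# ...but a depthwise filter has a single input channel feeding a single
# output channel, so almost the entire array sits idle.
print(array_utilization(16, 16, 1, 1))       # 1/256, under half a percent
```

This is the utilization gap that the flexible dataflow and the hierarchical mesh network-on-chip are designed to close.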
As a result of this particular type of design, you get an order of magnitude improvement in both speed and energy efficiency. All right, so this is great, and this is one way you can speed up neural networks and make them more efficient, but it's also important to take a step back and look beyond just building the specialized hardware, the accelerator itself, both in terms of the algorithms and the hardware. So can we look beyond the DNN accelerator for acceleration? One good place to show this with an example is the task of super resolution. How many of you are familiar with super resolution? All right, for those of you who aren't, the idea is as follows: I want to generate a high resolution image from a low resolution image. Why would you want to do that? Well, there are a couple of reasons. One is that it can allow you to reduce the transmitted bandwidth: for example, if you have limited communication, I can send a low-res version of a video or image to your phone, and then the phone can make it high-res. Another reason is that screens in general are getting larger and larger; every year at CES they announce a higher resolution screen, but if you think about the movies we watch, a lot of them are still at a fixed resolution, 1080p for example, so again you want to generate a high resolution representation of that low resolution input. And the idea here is that the high resolution output is not just interpolation, because interpolation can be very blurry; there are ways to essentially hallucinate a high resolution version of the video or image, and that's what's called super resolution. But one of the challenges for super resolution is that it's computationally very expensive. Again, the state-of-the-art approaches for super resolution use deep neural nets, and a lot of the examples we just talked about involve input images of around 200 by 200
pixels; now imagine extending that to an HD image; it's going to be very, very expensive. So what we want to do is think of ways to speed up the super resolution process not just by making the DNNs faster, but by looking at the other components of the system and seeing if we can make those faster as well. One of the approaches we took is a framework called FAST, where we look at accelerating any super resolution algorithm by an order of magnitude, operating on compressed video. So, before I was a faculty member here, I worked a lot on video compression, and the video compression community looks at video very differently from people who do super resolution. Typically, when you're doing image processing like super resolution, when I give you a compressed video, you basically think of it as a stack of pixels, a bunch of images together. But if you asked a video compression person what a compressed video looks like, the answer is that a compressed video is actually a very structured representation of the redundancy in the video itself. The reason we can compress videos is that consecutive frames look very similar, so the compressed video is telling you which pixels in frame one are related to, or look like, which pixels in frame two; as a result, you don't have to send the pixels of frame two again, and that's where the compression comes from. So what a compressed video actually contains is a description of the structure of the video itself, and you can use this representation to accelerate super resolution. For example, rather than applying super resolution to every single low-res frame, which is the typical approach, where you apply it to each low-res frame and generate a bunch of high-res output frames, what you can actually do is apply super resolution to just one of the low
resolution frames, and then use that free information in the compressed video, the information that tells you the structure of the video, to transfer and generate all the other high resolution frames from it. So super resolution only needs to run on a subset of frames, and the complexity of reconstructing all those high resolution frames, once you have that structure information, is very low: for example, if I transfer to N frames, I get roughly an N-times speedup. To evaluate this, we showcased it on a range of videos; this range of videos is the dataset used to develop video compression standards, so it's quite broad. You can see first, on the left-hand side, that if I transfer to four frames, I get a 4x acceleration, and the PSNR, which indicates the quality, doesn't change: the same quality, but 4x faster. If I transfer to sixteen frames, for a 16x acceleration, there's a slight drop in quality, but you still get basically a 16x acceleration. So the key idea here, again, is that you want to look beyond the processing of the neural network itself, to what's around it, to see if you can speed it up. Usually with PSNR you can't really tell too much about the quality, so another way to look at it is to look at the video itself, the subjective quality. On the left-hand side here is the result of applying super resolution to every single frame, the traditional way of doing it; on the right-hand side is the result of just doing interpolation on every single frame. Where you can tell the difference is by looking at things like the text: you can see that the text is much sharper in the left video than in the right video. Now, running super resolution with FAST is somewhere in between: FAST actually has the same quality as the video on the left-hand side, but it's just as efficient, in terms of processing speed, as the approach on the right-hand side, so it kind of has the best of both worlds.
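The transfer step can be sketched as follows. The one-global-motion-vector-per-frame model here is a deliberate oversimplification of the per-block motion vectors and residuals that a real compressed bitstream carries, and the 2x2 "frame" is just for illustration:

```python
def transfer_super_resolution(anchor_hr, motion_vectors, scale=2):
    """Build high-res frames by shifting an already-upscaled anchor frame.

    `motion_vectors` is a toy stand-in for the motion information the
    compressed video already carries: one (dx, dy) offset per frame,
    applied to every pixel, with wraparound at the borders.
    """
    frames = []
    h, w = len(anchor_hr), len(anchor_hr[0])
    for dx, dy in motion_vectors:
        # Scale the low-res motion vector up to high-res coordinates.
        sx, sy = dx * scale, dy * scale
        frame = [[anchor_hr[(r - sy) % h][(c - sx) % w] for c in range(w)]
                 for r in range(h)]
        frames.append(frame)
    return frames


anchor = [[1, 2], [3, 4]]  # pretend this came from the expensive SR network
hr_frames = transfer_super_resolution(anchor, [(0, 0), (1, 0)], scale=1)
print(hr_frames[0])  # [[1, 2], [3, 4]]: identical to the anchor
print(hr_frames[1])  # [[2, 1], [4, 3]]: content shifted right with wraparound
```

The expensive network runs once on the anchor frame; every transferred frame costs only cheap pixel shuffling, which is where the N-times speedup comes from.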
to accelerate DNNs on a given processor, it's good to look beyond the hardware for the acceleration; you can look at things like the structure of the data that's entering the neural network accelerator. There might be opportunities there, for example here the temporal correlation, that allow you to further accelerate the processing. Again, if you're interested in this, all the code is on the website. So to end this lecture, I want to talk about things that are actually beyond deep neural nets. Neural nets are great, they're useful for many applications, but I think there are a lot of exciting problems outside the space of neural nets as well, which also require efficient computing. The first thing is what we call visual-inertial localization, or visual odometry. This is widely used for robots to figure out where they are in the world: you can imagine that for autonomous navigation, before you navigate the world, you have to know where you actually are in the world, and that's the localization. This is also widely used for things like AR and VR, so you know where you're actually looking in the environment. What does this actually mean? It means you take in a sequence of images, so you can imagine a camera mounted on the robot or the person, as well as an IMU, which gives accelerometer and gyroscope information, and visual-inertial odometry, which is a subset of SLAM, basically fuses this information together. The outcome of visual-inertial odometry is the localization: you can see here that you're trying to estimate where you are in 3D space, and the pose, based on in this case the camera feed, but you can also use the IMU information there as well. And if you're in an unknown environment, you can also generate a map. So this is a very key task in navigation, and the key question is: can you do it in a very energy-efficient way? We've looked at kind of building specialized
hardware to do localization. This is actually the first chip that performs complete visual-inertial odometry on chip; we call it Navion, and it was done in collaboration with Sertac Karaman. You can see the chip itself here: it's about four millimeters by five millimeters, smaller than a quarter, so you can imagine mounting it on a small robot. At the front end it processes the camera information, doing things like feature detection, tracking, and outlier elimination; it also does pre-integration on the IMU; and then on the back end it fuses this information together using a factor graph. And when you compare this particular design, this Navion chip, to mobile or desktop CPUs, you're talking about two to three orders of magnitude reduction in energy consumption, because you have a specialized chip to do it. So what is the key component of this chip that enables it to do so well? Again, sticking with the theme, the key thing is reduction in data movement; in particular, we reduce the amount of data that needs to be moved on and off chip, so all of the processing is located on the chip itself. Furthermore, because we want to reduce the size of the chip and the size of the memories, we do things like apply low-cost compression on the frames and also exploit sparsity, meaning the number of zeros, in the factor graph itself. All of this compression and exploiting of sparsity can reduce the storage cost down to about a megabyte of storage on chip to do this processing, and that allows us to achieve a really low power consumption of below 25 milliwatts. Another thing that really matters for autonomous navigation is, once you know where you are, where are you going to go next? This is a planning and mapping problem, and in the context of things like robot exploration, where you basically explore an unknown area, you can do this by doing what we call a
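The storage saving from exploiting sparsity can be sketched with made-up numbers. This is a generic count of CSR-style sparse storage versus dense storage, not the actual Navion compression scheme, and the matrix size and density are purely illustrative:

```python
# Generic illustration (NOT the Navion compression scheme): storing only the
# non-zeros of a sparse matrix, CSR-style, versus storing every cell densely.
import numpy as np

rng = np.random.default_rng(0)
shape, density = (256, 256), 0.05                # made-up factor-graph-like sizes
mask = rng.uniform(size=shape) < density         # ~5% of entries are non-zero
dense = rng.uniform(size=shape) * mask

nnz = int(np.count_nonzero(dense))
dense_bytes = dense.size * 4                     # fp32 for every cell, zeros included
# CSR-ish cost: 4-byte value + 2-byte column index per non-zero, plus row pointers.
sparse_bytes = nnz * (4 + 2) + (shape[0] + 1) * 4

print(dense_bytes, sparse_bytes)                 # sparse storage is far smaller at 5% density
```

At this density the sparse layout needs only a small fraction of the dense footprint, which is the kind of saving that lets the whole working set stay on chip.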
Shannon mutual information computation: basically, you want to figure out where you should go next such that you will discover the most new information compared to what you already know. So shown here is an occupancy map: the light colors show where there's free space, where nothing is occupied; the dark gray area is unknown; and the black lines are occupied things, like walls, for example. The question is: if this is my current occupancy map, where should I go and scan, say with a depth sensor, to figure out more information about the map itself? What you can do is compute what we call the mutual information of the map based on what you already know, then go to the location with the most information, scan it, and get an updated map. Shown here below is a miniature race car doing exactly that. Over here is the mutual information that's being computed, so it's trying to go to those light areas, the yellow areas, that have the most information; you can see that it's going to back up and come scan this region to figure out more information about it. Okay, so that's great, it's a very principled way of doing this. The problem with this kind of computation, the reason it's been challenging, is again the computation, and in particular the data movement. You can imagine that at any given position you're going to do a kind of 3D scan with your lidar across a wide range of neighboring regions with your beams, and each of the beams in your lidar scan can be processed with a different core, so they can all be processed in parallel. Parallelism, just like in the deep learning case, is very easily available here; the challenge is data delivery. What happens is that you're storing your occupancy map all in one memory, but now you have multiple cores that
are going to try to process the scans on this occupancy map. Typically, for these types of memories, you're limited to two ports, so if you want to have N cores, 16 cores, 32 cores, it's going to be a challenge to read data from this occupancy map and deliver it to the cores themselves. If we take a closer look at the memory access pattern, you can see here that as you scan, the numbers indicate which cycle you would use to read each of the locations on the map, and it's kind of a diagonal pattern. So the question is: can I break this map into smaller memories and then access those smaller memories in parallel? And if I can break it into smaller memories, how should I decide which part of the map goes into which of these memories? Shown here on the right-hand side, the different colors indicate different memories, or different banks of the memory, so they store different parts of the map. Again, if you think of the numbers as the cycle in which each location is accessed, what you'll notice is that for any given color, at most two numbers are the same, meaning I'm only going to access at most two locations in any given bank in any given cycle, so there's no conflict and I can process all of these beams in parallel. By doing this, you can compute the mutual information of the entire map, and this can be a very large map, say 200 meters by 200 meters at 0.1-meter resolution, in under a second. This is very different from before, where you could only compute the mutual information of a subset of locations and then try to pick the best one; now you can compute it on the entire map, so you know the absolute best location to go to to get the most information. This is a 100x speed-up compared to a CPU at one tenth of the power, on an FPGA. So that's another important example of how data
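The banking idea can be sketched in a few lines. This is a toy model, not the actual chip's layout: assume the diagonal schedule described above, where map cell (x, y) is read in cycle t = x + y, and interleave columns across banks. Counting the worst-case accesses any single bank sees in one cycle shows it never exceeds two, which is exactly what a dual-port memory can serve:

```python
# Toy sketch of conflict-free banking (illustrative sizes, not the real design).
from collections import Counter

GRID, BANKS = 16, 8       # 16x16 occupancy map split across 8 memory banks

def cycle_of(x, y):
    return x + y          # diagonal schedule: an anti-diagonal is read per cycle

def bank_of(x, y):
    return x % BANKS      # interleave columns across the banks

worst = 0
for t in range(2 * GRID - 1):
    # All cells scheduled for cycle t lie on the anti-diagonal x + y = t.
    cells = [(x, t - x) for x in range(GRID) if 0 <= t - x < GRID]
    per_bank = Counter(bank_of(x, y) for x, y in cells)
    worst = max(worst, max(per_bank.values()))

print(worst)  # -> 2: at most two accesses per bank per cycle, fits a dual-port SRAM
```

The longest anti-diagonal has 16 cells with consecutive x values, so each of the 8 banks is hit exactly twice, never more; every cycle's reads can therefore be served in parallel with standard two-port memories.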
movement is really critical in allowing you to process things very quickly, and of how specialized hardware can enable that. All right, so one last thing. We've talked about robotics and about deep learning, but there are also a lot of important applications where you can apply efficient processing to help a lot of people around the world, in particular monitoring neurodegenerative disorders. We know that dementias, things like Alzheimer's and Parkinson's, affect tens of millions of people around the world, and that number continues to grow; these are very severe diseases. One of the many challenges for these diseases is that the neurological assessments can be very time-consuming and require a trained specialist. Normally, if you are suffering from one of these diseases, or you might have one, you need to go see a specialist, and they'll ask you a series of questions, like a mini-mental state exam: what year is it, where are you now, can you count backwards, and so on. Or you might be familiar with patients being asked to draw a clock. You can imagine that going to a specialist to do these kinds of tests can be costly and time-consuming, so you don't go very frequently, and as a result the data that's collected is very sparse. It's also very qualitative: if you go to different specialists, they might come up with different assessments, so repeatability is very much an issue as well. What's been super exciting is that it's been shown in the literature that there's actually a quantitative way of evaluating these types of diseases, potentially using eye movements. Eye movements can be used as a quantitative way to evaluate the severity, progression, or regression of these particular diseases; you can imagine doing things like, you
know, if you're taking a certain drug, is your disease getting better or worse? Eye movement can give a quantitative evaluation of that. But the challenge is that to do these eye movement evaluations, you still need to go into a clinic: first, you need a very high-speed camera, which can be very expensive; often you need substantial head support so your head doesn't move and you can really detect the eye movement; and you might even need IR illumination so you can more clearly see the eye. So again, the challenge is that clinical measurements of what we call saccade latency, your eye movement latency or eye reaction time, are done in very constrained environments: you still have to go see the specialist, and they use very specialized and costly equipment. So in the vein of enabling efficient computing and bringing compute to various devices, our question was: can we actually do these eye movement measurements on a phone, which we all have? And indeed you can. You can develop algorithms that detect your eye reaction time using a consumer-grade camera, like the one on your phone or an iPad, and we've shown that you can replicate the quality of results you would get with a Phantom high-speed camera. Shown here in red are eye reaction times measured on a subject with an iPhone 6, which is obviously under $1,000, way cheaper, compared to a Phantom camera shown here in blue; you can see that the distributions of the reaction times are about the same. Why is this exciting? Because it enables us to do low-cost, in-home measurements. You can imagine a patient doing these measurements at home for many days, not just the day they go in, and then bringing in this information, which can give the physician or the specialist additional information to make the assessment. So this can be complementary, but it gives a much richer set of information to do
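To make the measurement concrete, here is a toy sketch of extracting saccade latency from a gaze trace, assuming we already have per-frame horizontal eye positions from the phone camera. The frame rate, stimulus-onset frame, and velocity threshold are made-up illustrative values, not those of the actual system:

```python
# Toy saccade-latency estimator (illustrative, not the published algorithm).
FPS = 240               # hypothetical slow-motion capture rate of the phone
ONSET_FRAME = 30        # frame at which the visual stimulus appears
VELOCITY_THRESH = 2.0   # gaze-position units per frame; made-up threshold

def saccade_latency_ms(gaze_x, onset=ONSET_FRAME, fps=FPS):
    """Latency from stimulus onset to the first frame where the eye moves fast."""
    for f in range(onset + 1, len(gaze_x)):
        if abs(gaze_x[f] - gaze_x[f - 1]) > VELOCITY_THRESH:
            return (f - onset) * 1000.0 / fps   # frames -> milliseconds
    return None                                  # no saccade detected in the trace

# Synthetic trace: fixation, stimulus at frame 30, eye starts moving at frame 78.
trace = [0.0] * 78 + [5.0 * (f - 77) for f in range(78, 90)]
print(saccade_latency_ms(trace))  # -> 200.0 ms
```

A real pipeline would first localize the eye in each frame to produce the gaze trace; the point here is only that the latency itself reduces to simple per-frame arithmetic once that trace exists.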
the diagnosis and evaluation. So we've been talking about computing, but there are also other parts of the system that burn power, in particular when we talk about things like depth estimation using time of flight. Time of flight is very similar to lidar: basically, you send a pulse and wait for it to come back, and how long it takes to come back indicates the depth of whatever object you're trying to detect. The challenge with depth estimation using time-of-flight sensors is that they can be very power-hungry: you're emitting a pulse and waiting for it to come back, so you're talking about up to tens of watts of power. The question is: can we also reduce the sensor power if we can do efficient computing? For example, can I reduce how often I turn on the depth sensor and recover the missing information using just a monocular RGB camera? Typically you have a pair of a depth sensor and an RGB camera; if at time 0 I turn both of them on, and at times 1 and 2 I turn the depth sensor off but keep my RGB camera on, can I estimate the depth at times 1 and 2? The key thing here is to make sure that the algorithm you're running to estimate the depth, without turning on the depth sensor itself, is super cheap. We actually have algorithms that can run on VGA at 30 frames per second on an ARM Cortex-A7, which is a super low-cost embedded processor. Just to give you an idea of how this looks: on the left is the RGB image; in the middle is the depth map, the ground truth, which is what it would look like if I always had the depth sensor on; and on the right-hand side is the estimated depth map. In this particular case, we're only turning on the sensor about eleven percent of the time, every ninth frame, and the mean relative error is only about 0.7 percent, so the accuracy or quality is pretty well aligned. Okay, so at a high level, what are the key takeaways I
want you guys to get from today's lecture? First, efficient computing is really important: it can extend the reach of AI beyond the cloud itself, because it can reduce reliance on the communication network, enable privacy, and provide low latency, so we can use AI for a wide range of applications ranging from robotics to healthcare. And achieving this energy-efficient computing really requires cross-layer design: not just focusing on the hardware, although specialized hardware plays an important role, but also on the algorithms themselves. This is going to be really key to enabling AI for the next decade and beyond. We also covered a lot of points in this lecture, so the slides are all available on our website. Also, because this is a deep learning seminar series, I want to point out some other resources you might be interested in if you want to learn more about efficient processing of neural nets. First, I want to point you to this survey paper that we developed with my collaborator Joel Emer; it really covers the different techniques that people are looking at and gives some insights into the key design principles. We also have a book coming soon, within the next few weeks. We also have slides from various tutorials that we've given on this particular topic; in fact, we also teach a course on this here at MIT, 6.825. If you're interested in updates on all these types of materials, I invite you to join the mailing list or the Twitter feed. The other thing is, if you're not an MIT student but you want to take a two-day course on this particular topic, I invite you to take a look at the MIT Professional Education options: we run short courses on the MIT campus over the summer, so you can come for two days and we can talk about the various different approaches that people use to build efficient deep learning systems. And then finally, if you're interested in video and
tutorial videos on this topic: at the end of November, during NeurIPS, I gave a 90-minute tutorial that goes really in depth on how to build efficient deep learning systems, so I invite you to look at that. We also have some talks at the MARS conference on robotics, and we have a YouTube channel where all of this is located. And then finally, I'd be remiss if I didn't acknowledge that a lot of the work here was done by all the students in our group, as well as my collaborators Joel Emer, Sertac Karaman, and Thomas Heldt, and all of our sponsors who make this research possible. So that concludes my talk. Thank you very much. [Applause]