Transcript
u6aEYuemt0M • Deep Learning for Computer Vision (Andrej Karpathy, OpenAI)
Kind: captions
Language: en
Yeah, so thank you very much for the introduction. Today I'll speak about deep learning, especially in the context of computer vision. So what you
saw in the previous talk is neural
networks. You saw that neural networks are organized into these layers, fully connected layers, where neurons within one layer are not connected to each other, but they're connected fully to all the neurons in the previous layer. And we saw that we basically have this layer-wise structure from input to output, and there are neurons and nonlinearities
etc. Now, so far we have not made too
many assumptions about the inputs. So,
in particular here, we just assume that
an input is some kind of a vector of
numbers that we plug into this neural
network. So that's both a bug and a feature to some extent, because in most real-world applications we can actually make some assumptions about the input that make learning much more efficient. So in
particular, usually we don't just want to plug vectors of numbers into neural networks; they actually have some kind of structure. So we don't just have vectors of numbers: these numbers are arranged in some kind of layout, like an n-dimensional array of numbers. So for example, spectrograms are two-dimensional arrays of numbers. Images are three-dimensional arrays of numbers. Videos would be four-dimensional arrays of numbers. Text you could treat as a one-dimensional array
of numbers. And so whenever you have
this kind of local connectivity uh
structure in your data then you'd like
to take advantage of it and
convolutional neural networks allow you
to do that. So before I dive into
convolutional neural networks and all
the details of the architectures I'd
like to uh briefly talk about a bit of
the history of how this field evolved
over time. So I like to start off
usually by talking about Hubel and Wiesel and the experiments that they performed in the 1960s. So what they were
doing is trying to study the
computations that happened in the early
visual cortex areas of a cat. So they had a cat and they plugged in electrodes that could record from the different neurons. And then they
showed the cat different patterns of
light and they were trying to debug
neurons effectively and try to show them
different patterns and see what they
responded to. And a lot of these
experiments uh inspired some of the
modeling that came in afterwards. So in
particular, one of the early models that tried to take advantage of some of the results of these experiments was the neocognitron from Fukushima in the 1980s. And what you saw here was this
architecture that again is layer-wise
similar to what you see in the cortex
where you have these simple and complex
cells where the simple cells detect
small things in the visual field and
then you have this local connectivity
pattern and the simple and complex cells
alternate in this layered architecture
throughout. And so this looks a bit like a convnet, because it has some of its features, like the local connectivity, but at the time this was not trained with backpropagation. These were heuristically chosen update rules, and this was unsupervised learning back then. So the first time
that we've actually used back
propagation to train some of these
networks was in experiments by Yann LeCun in the 1990s. So this is an example of one of the networks that was developed back then, in the 1990s, by Yann LeCun: LeNet-5. And this is what you would recognize today as a convolutional neural network. It has convolutional layers alternating in a similar kind of design to what you would see in Fukushima's neocognitron, but this was actually trained with backpropagation, end to end, using supervised learning. Now, this happened in roughly the 1990s,
and we're here in 2016 basically about
20 years later. Now, computer vision has for a long time worked on larger images, and a lot of these models back then were applied to very small settings, like recognizing digits and zip codes and things like that, and they were very successful in those domains. But at least when I entered computer vision in roughly 2011, a lot of people were aware of these models, but it was thought that they would not scale up naively to large, complex images, that they would be constrained to these toy tasks for a long time. Or I shouldn't say toy, because these were very important tasks, but certainly smaller visual recognition problems. And so in computer vision in roughly 2011 it was much more common to use these feature-based approaches, and they actually didn't work that well. So when I entered my PhD in 2011, working on computer vision, you would
run a state-of-the-art object detector on this image, and you might get something like this, where cars were detected in trees, and you would kind of
just shrug your shoulders and say,
"Well, that just happens sometimes." You
kind of just accept it as a as a
something that would just happen. Um and
of course this is a caricature. Things
actually were like relatively decent. I
I should say, but uh definitely there
were many mistakes that you would not
see today about four years uh in 2016,
five years later. And so a lot of uh
computer vision kind of looked much more
like this. When you looked into a paper that tried to do image classification, you would find a section in the paper on the features they used. So this is one page of features, things like GIST and so on. And then a second page of
features and all their hyperparameters.
So all kinds of different histograms and
you would extract this kitchen sink of
features and a third page here. And so
you end up with uh this very large
complex codebase because some of these
feature types are implemented in MATLAB,
some of them in Python, some of them in
C++. And you end up with this large
codebase of extracting all these
features, caching them and then
eventually plugging them into linear
classifiers to do some kind of visual
recognition task. So it was quite unwieldy, but it worked to some extent, though there was definitely room for improvement. And so a lot of this changed
in computer vision in 2012 with this paper from Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. So this is the first time that
someone took a convolutional neural network that is very similar to the one you saw from 1998 from Yann LeCun (and I'll go into the details of how they differ exactly), but they took that kind of
network they scaled it up they made it
much bigger and they trained on a much
bigger data set on GPUs and things
basically ended up working extremely
well, and this is the first time that the computer vision community really noticed these models and adopted them to work on larger images. So we saw that the performance of these models has
improved drastically. Here we are looking at the ImageNet ILSVRC visual recognition challenge over the years, and we're looking at the top-5 error, so low is good. You can see that from 2010, in the beginning, these were feature-based methods, and then in 2012 we had this huge jump in performance, and that was due to the first convolutional neural network entry in 2012. We've then managed to push that over time, and now we're down to about 3.57%. I think the results for the ImageNet challenge 2016 are actually due to come out today, but I don't think they've come out yet. I have this second tab here opened.
I was waiting for the result, but I don't think this is up yet. Okay. No,
nothing. All right. Well, we'll get to
find out very soon what happens right
here. Uh, so I'm very excited to see
that. Uh, just to put this in context,
by the way, because you're just looking
at numbers like 3.57. How good is that?
That's actually really, really good. So, something that I did about two years ago now is that I tried to measure human accuracy on this data set. And what I did for that is I developed this web interface where I would show myself ImageNet images from the test set. And then I had this interface here where I would have all the different classes of ImageNet (there are 1,000 of them) and some example images.
And then basically you go down this list
and you scroll for a long time and you
find what class you think that image
might be. And then I competed against the convnet at the time, which was GoogLeNet in 2014. So hot dog is a very simple class; you can do that quite easily. But why isn't the error 0%? Well, some of the things like hot dog seem very easy; why isn't it trivial for humans? Well, it turns out that
some of the images in the test set of ImageNet are actually mislabeled. But also, some of the images are just very difficult to guess. In particular, if you have this terrier: there are 50 different types of terriers, and it turns out to be a very difficult task to find exactly which type of terrier that is.
You can spend minutes trying to find it.
It turns out that convolutional neural networks are actually extremely good at this, and so this is where I would lose points compared to the convnet. So I estimate that human error based on this is roughly in the 2 to 5% range, depending on how much time you have, how much expertise you have, how many people you involve, and how much they really want to do this, which is not too much. So really we're doing extremely well; we're down to about 3%, and I think the error rate, if I remember correctly, was about 1.5%. So if we get below 1.5% I would be extremely suspicious on ImageNet; that seems wrong. So to summarize, basically what
we've done is: before 2012, computer vision looked somewhat like this, where we had these feature extractors and then we trained a small portion at the end of the feature extraction step. So we only trained this last piece on top of these features, which were fixed. And we've basically replaced the feature extraction step with a single convolutional neural network, and now we train everything completely end to end. And this turns out to work quite nicely. So I'm going to go into details
of how this works in a bit. Also, in terms of code complexity, we went from a setup that looks (whoops, I'm way ahead; okay) something like that in papers to something like: instead of extracting all these things, we just say apply 20 layers of 3x3 conv or something like that, and things work quite well. This is of course an exaggeration, but I think it's a correct first-order statement that we've reduced code complexity quite a lot, because these architectures are so homogeneous compared to what we did before. So, we had this reduction in complexity and we had this amazing performance on ImageNet.
One other thing that was quite amazing about the results in 2012, and a separate thing that did not have to be the case, is that the features you learn by training on ImageNet turn out to be quite generic, and you can apply them in different settings. In other words, this transfer learning works extremely well. And of course, I didn't
go into details of convolutional
networks yet, but uh we start with an
image and we have a sequence of layers
just like in a normal neural network.
And at the end, we have a classifier.
And when you pre-train this network on ImageNet, then it turns out that the
features that you learn in the middle
are actually transferable and you can
use them on different data sets and that
this works extremely well. And so that
didn't have to be the case. You might
imagine that you could have a
convolutional network that works extremely well on ImageNet, but when you try to run it on something else, like a birds data set or something, it might just not work well. But that is not the case, and that's a very interesting finding in my opinion. So people
noticed this back in roughly 2013 after
the first convolutional networks. They
noticed that you can actually take many computer vision data sets (it used to be that you would compete on all of these separately and maybe design features for each of them separately) and you can just shortcut all those steps: you take these pre-trained features that you get from ImageNet and you just train a linear classifier on every single data set on top of those features, and you obtain many state-of-the-art results across many different data sets. And so this was quite a remarkable finding back then, I believe.
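To make that recipe concrete, here is a minimal sketch of the idea in Python, assuming the Keras applications API plus scikit-learn for the linear classifier; the arrays X_imgs and y are hypothetical stand-ins for your own data set.

    import numpy as np
    from keras.applications.vgg16 import VGG16, preprocess_input
    from sklearn.linear_model import LogisticRegression

    # Stand-in data: in practice these would be your own images and labels.
    X_imgs = np.random.rand(10, 224, 224, 3).astype('float32') * 255.0
    y = np.random.randint(0, 2, size=10)

    # Pretrained ImageNet convnet used as a fixed feature extractor:
    # include_top=False drops the classifier, pooling='avg' gives one vector per image.
    extractor = VGG16(weights='imagenet', include_top=False, pooling='avg')
    features = extractor.predict(preprocess_input(X_imgs))

    # Train only a linear classifier on top of the frozen features.
    clf = LogisticRegression(max_iter=1000).fit(features, y)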
So things worked very well on ImageNet, things transferred very well, and the code complexity of course got much more manageable. So now all
this power is actually available to you
with very few lines of code. If you want
to just use a convolutional network uh
on images it turns out to be only a few
lines of code. If you use, for example, Keras, one of the deep learning libraries that I'm going to go into and mention again later in the talk, you basically just load a state-of-the-art convolutional neural network, you take an image, you load it, you compute your predictions, and it tells you that there is an African elephant inside that image. And this takes a couple hundred, or a couple tens of, milliseconds if you have a GPU.
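As a rough sketch of those few lines (not the exact snippet from the slide), assuming a local image file 'elephant.jpg' (hypothetical path) and the Keras applications API:

    import numpy as np
    from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
    from keras.preprocessing import image

    model = VGG16(weights='imagenet')                        # load a pretrained convnet
    img = image.load_img('elephant.jpg', target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = model.predict(x)                                 # forward pass -> 1,000 class scores
    print(decode_predictions(preds, top=3))                  # 'African_elephant' should rank near the top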
And so everything got much faster, much simpler, works really well, transfers really well. So this was really a huge advance in computer
vision. And so as a result of all these
nice properties, convnets today are everywhere. So here's a collection of some of the things I tried to find across different applications. For example, you can search Google Photos for different types of categories, like in this case Rubik's cubes. You can find house
numbers very efficiently. This is of course very relevant in self-driving cars, where we're doing perception in the car; convolutional networks are very relevant there. Medical image diagnosis, recognizing Chinese characters, doing all kinds of medical segmentation tasks, quite random tasks like whale recognition, and more generally many Kaggle challenges; satellite image analysis, recognizing different types of galaxies. You may have seen recently WaveNet from DeepMind, also a very interesting paper, where they generate music and they generate speech. So that's a generative model, and it's also just a convnet doing most of the heavy lifting: a convolutional network on top of sound. And other tasks like
image captioning. In the context of reinforcement learning and agent-environment interactions, we've also seen a lot of advances using convnets as the core computational building block. So when you want to play Atari
games, or you want to play AlphaGo or Doom or StarCraft, or if you want to get robots to perform interesting manipulation tasks, all of this uses convnets as a core computational block to do very impressive things. Not
only are we using them for a lot of different applications, we're also finding uses in art. So here are some examples from DeepDream. You can basically simulate what it looks like, what it feels like maybe, to be on some drugs. So you can take images and you can just hallucinate features using convnets. Or you might be familiar with neural style, which allows you to take arbitrary images and transfer arbitrary styles of different paintings, like Van Gogh, onto them. And this is all using
convolutional networks. The last thing
I'd like to note that I find also
interesting is that in the process of
trying to develop better computer vision
architectures and trying to basically optimize for performance on the ImageNet challenge, we've actually ended up
converging to something that potentially
might function something like your
visual cortex in some ways. And so these
are some of the experiments that I find
interesting, where they've studied macaque monkeys and they record from a subpopulation of the IT cortex. This is the part that does a lot of object recognition. So basically they take a monkey and they take a convnet, they show them images, and then you look at how those images are represented at the end of the network: inside the monkey's brain, or on top of your convolutional network.
And so you look at representations of
different images and then it turns out
that there's a mapping between those two
spaces that actually seems to indicate
to some extent that some of the things
we're doing somehow ended up converging
to something that the brain could be
doing as well in the visual cortex. Um
so that's just some intro. I'm now going to dive into convolutional networks and try to briefly explain how these networks work. Of course, there's an entire class on this that I taught, a convolutional networks class, and so I'm going to distill some of those 13 lectures into one lecture. So we'll see how that goes. I won't cover
everything of course. Okay. So
a convolutional neural network is really just a single function. It's a function from the raw pixels of some kind of an image: we take a 224x224x3 image, where three here is for the color channels RGB. You take the raw pixels, you put them through this function, and you get 1,000 numbers at the end, in the case of image classification, if you're trying to categorize images into 1,000 different classes. And really, functionally, all that's happening in a convolutional network is just dot products and max operations. That's everything. But
they're wired up together in interesting
ways so that you are basically doing
visual recognition. And in particular, this function f has a lot of knobs in it. These w's here, which participate in the dot products, in the convolutions, in the fully connected layers and so on, are all parameters of the network. Normally you might have on the order of 10 million parameters, and those are basically knobs that change this function. And we'd like to change those knobs, of course, so that when you put images through the function you get probabilities that are consistent with your training data. So that gives us a lot to tune, and it turns out we can do that tuning automatically with backpropagation, through that search process. Now, more concretely, a
convolutional neural network is made up
of a sequence of layers just as in the
case of normal neural networks. But we
have different types of layers that we
play with. So we have convolutional layers; here I'm using the rectified linear unit, ReLU for short, as the nonlinearity, and I'm making that its own explicit layer; pooling layers; and fully connected layers. The core computational building block of a convolutional network, though, is the convolutional layer, with nonlinearities interspersed. We are probably getting rid of things like pooling layers, so you might see them slowly going away over time, and fully connected layers are basically equivalent to convolutional layers as well. And so really it's just a sequence of conv layers in the simplest case. So let me explain the
convolutional layer because that's the
core computational building block here
that does all the heavy lifting.
So the entire convnet is this collection of layers, and these layers don't operate over vectors. They don't transform vectors as in a normal neural network; they operate over volumes. So a layer will take a volume, a three-dimensional volume of numbers, an array. In this case, for example, we have a 32x32x3 image. Those three dimensions are the width, the height, and what I'll refer to as the depth; we have three channels. That's not to be confused with the depth of a network, which is the number of layers in that network; this is just the depth of a volume. So this convolutional layer accepts a three-dimensional volume and it produces a three-dimensional volume using some weights. So the way it actually
produces this output volume is as
follows. We're going to have these
filters in a convolutional layer. These filters are always small spatially, say for example a 5x5 filter, but their depth always extends through the full input depth of the input volume. So since the input volume has three channels, the depth is three, our filters will always match that number: we have a depth of three in our filters as well. And then we can take those filters and we can basically convolve
them with the input volume. So what that amounts to is: we take this filter (and that's just the point, that the channels here must match), we take that filter and we slide it through all spatial positions of the input volume. And along the way, as we're sliding this filter, we're computing dot products: w transpose x plus b, where w is the filter, x is a small piece of the input volume, and b is the offset. So this is basically the convolution operation: you're taking this filter, you're sliding it through all spatial positions, and you're computing dot products. When you do this, you end up with an activation map. In this case we get a 28x28 activation map. 28 comes from the fact that there are 28 unique positions to place this 5x5 filter into this 32x32 space; so there are 28 by 28 unique positions you can place that filter in, and in every one of those you're going to get a single number for how well that filter likes that part of the input. So that carves out a single activation map. Now, in a convolutional layer we don't just have a single filter; we're going to have an entire set of filters. So here's another filter, a green filter. We're going to slide it through the input volume; it has its own parameters. There are 75 numbers here that basically make up a filter, and these are a different 75 numbers. We convolve it through, get a new activation map, and we continue doing this for all the filters in that convolutional layer. So, for example, if we had six filters in this convolutional layer, then we end up with 28x28 activation maps six times, and we stack them along the depth dimension to arrive at the output volume of 28x28x6. So really what we've done is we've re-represented the original image, which is 32x32x3, as a kind of new image that is 28x28x6, where this image basically has these six channels that tell you how well every filter matches, or likes, every part of the input image.
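Just to pin down the arithmetic, here is a rough numpy sketch of that operation (stride 1, no padding, shapes as in the example; real libraries do this far more efficiently):

    import numpy as np

    x = np.random.randn(32, 32, 3)       # input volume: 32x32x3
    w = np.random.randn(6, 5, 5, 3)      # six 5x5x3 filters
    b = np.random.randn(6)               # one bias per filter

    out = np.zeros((28, 28, 6))          # 28 = 32 - 5 + 1 unique positions per axis
    for f in range(6):
        for i in range(28):
            for j in range(28):
                patch = x[i:i+5, j:j+5, :]                  # 5x5x3 piece of the input
                out[i, j, f] = np.sum(w[f] * patch) + b[f]  # dot product: w transpose x + b
    # out is the 28x28x6 output volume: one activation map per filter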
So let's compare this operation to, say, using a fully connected layer as you would in a normal neural network.
In particular, we saw that we processed a 32x32x3 volume into a 28x28x6 volume. And one question you might want to ask is: how many parameters would this require if we wanted a fully connected layer with the same number of output neurons? So we'd want 28 * 28 * 6 neurons, fully connected. How many parameters would that be? It turns out that would be quite a few parameters, right? Because every single neuron in the output volume would be fully connected to all of the 32x32x3 numbers here. So basically every one of those 28x28x6 neurons is connected to 32x32x3 inputs, which turns out to be about 15 million parameters, and also on that order of multiplies. So you'd be doing a lot of compute and introducing a huge number of parameters into your network. Now, since we're doing convolution instead, think about the number of parameters that we've introduced with this example convolutional layer. We had six filters, and every one of them was a 5x5x3 filter. So we just have 5x5x3 filters, six of them; if you multiply that out, we have 450 parameters, and here I'm not counting the biases, just the raw weights. So compared to 15 million, we've introduced very few parameters. Also, how many multiplies have we done? Computationally, how many flops are we doing? Well, we have 28 by 28 by 6 outputs to produce, and every one of those numbers is computed by doing 5 * 5 * 3 multiplies over a 5x5x3 region of the original image. So you end up with only on the order of 350,000 multiplies. So we've gone from about 15 million down to a few hundred thousand: we're doing fewer flops and we're using fewer parameters.
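The back-of-the-envelope numbers quoted here work out like this (weights only, biases not counted):

    conv_params = 6 * (5 * 5 * 3)                # six 5x5x3 filters                 -> 450
    conv_mults  = (28 * 28 * 6) * (5 * 5 * 3)    # one 5x5x3 dot product per output  -> 352,800
    fc_params   = (28 * 28 * 6) * (32 * 32 * 3)  # fully connected equivalent        -> 14,450,688 (~15 million)
    print(conv_params, conv_mults, fc_params)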
And really, what we've done here is we've made assumptions, right? Because a fully connected layer could compute the exact same thing: a specific setting of those 15 million parameters would actually produce the exact output of this convolutional layer. But we've done it much more efficiently, and we've done that by introducing these biases. So in particular, we've made assumptions.
We've assumed, for example, that since
we have these fixed filters that we're
sliding across space, we've assumed that
if there's some interesting feature that
you'd like to detect in one part of the
image, like say top left, then that
feature will also be useful somewhere
else like on the bottom right because we
fix these filters and apply them at all
the spatial positions equally. You might
notice that this is not always something
that you might want. For example, if
you're getting inputs that are centered
face images and you're doing some kind
of a face recognition or something like
that, then you might expect that you
might want different filters at
different spatial positions. Like say
for eye regions you might want to have
some eye like filters and for mouth
region you might want to have mouth
specific features and so on. And so in
that case you might not want to use a convolutional layer, because those
features have to be shared across all
spatial positions. And the second assumption that we made is that these filters are locally small, so we don't have global connectivity. We have this local connectivity, but that's okay, because we end up stacking these convolutional layers in sequence, and so the neurons at the end of the convnet grow their receptive field as you stack convolutional layers on top of each other. So at the end of the convnet, those neurons end up being a function of the entire image eventually. So just to give you an idea
about what these activation maps look
like concretely, here's an example of an
image on the top left. This is a part of
a car, I believe. And we have these different filters; we have 32 different small filters here. And so if we were to convolve these filters with
this image, we end up with these
activation maps. So this filter if you
convolve it you get this activation map
and so on. So this one for example has
some orange stuff in it. So when we
convolve with this image you see that
this white here is denoting the fact
that that filter matches that part of
the image quite well. And so we get
these activation maps. You stack them up
and then that goes into the next
convolutional layer. So the way this looks then is that we've processed this with some kind of convolutional layer, we get some output, we apply a rectified linear unit, some kind of nonlinearity as usual, and then we just repeat that operation. So we keep plugging these volumes into the next convolutional layer, and they plug into each other in sequence, okay? And so we end up processing the image over time. So that's the convolutional
layer. Now you'll notice that there are
a few more layers. So in particular the
pooling layer, which I'll explain very briefly. The pooling layer is quite simple. If you've used Photoshop or something like that, you've taken a large image and you've resized it, you've downsampled the image. Well, pooling layers do basically exactly that, but they do it on every single
channel independently. So for every one of these channels independently in an input volume, we'll pluck out that activation map, we'll downsample it, and that becomes a channel in the output volume. So it's really just a downsampling operation on these volumes. For example, one of the common ways of doing this, especially in the context of neural networks, is to use a max pooling operation. In this case it would be common to, say, use 2x2 filters at stride two and do a max operation. So if this is an input channel in a volume, then what that amounts to is we're chopping it up into these 2x2 regions and taking a max over each set of four numbers to produce one piece of the output. Okay, so this is a very cheap operation that downsamples your volumes. It's really a way to control the capacity of the network: you don't want too many numbers, you don't want things to be too computationally expensive, and it turns out that a pooling layer lets you downsample your volumes, you end up doing less computation, and it turns out not to hurt the performance too much. So we use them basically as a way of controlling the capacity of these networks.
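As a small sketch of what 2x2 max pooling with stride two does to one channel (numpy, assuming an even-sized map):

    import numpy as np

    a = np.random.randn(28, 28)                          # one activation map from the volume
    pooled = a.reshape(14, 2, 14, 2).max(axis=(1, 3))    # max over each 2x2 region -> 14x14
    # Applied to every channel independently, a 28x28x6 volume becomes 14x14x6.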
And the last layer that I want to briefly mention, of course, is the fully connected layer, which is exactly
as what you're familiar with. So we have
these volumes throughout as we've
processed the image. At the end you're
left with this volume and now you'd like
to predict some classes. So what we do is we just take that volume, we stretch it out into a single column, and then we apply a fully connected layer, which really amounts to just a matrix multiplication, and that gives us probabilities after applying something like a softmax.
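A minimal sketch of that last step, with an assumed final volume size, would look like this:

    import numpy as np

    volume = np.random.randn(7, 7, 512)          # final volume before the classifier (assumed size)
    x = volume.reshape(-1)                        # stretch it out into a single column
    W = 0.01 * np.random.randn(1000, x.size)      # one row of weights per class
    b = np.zeros(1000)

    scores = W.dot(x) + b                         # fully connected layer = matrix multiplication
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # softmax -> probabilities over the 1,000 classes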
So let me now show you briefly a demo of what a convolutional network looks like. This is ConvNetJS, a deep
learning library for training
convolutional neural networks that is implemented in JavaScript. I wrote it maybe two years ago at this point. So here what we're doing is we're training a convolutional network on the CIFAR-10 data set. CIFAR-10 is a data set of 50,000 images; each image is 32x32x3, and there are 10 different classes. So here we are training this
network in the browser and you can see
that the loss is decreasing which means
that we're better classifying these
inputs. And uh so here's the network
specification which you can play with
because this is all done in the browser.
So you can just change this and play
with this. So this is an input image, and for this convolutional network I'm showing here all the intermediate activations, all the intermediate activation maps that we're producing. So here we have a set of filters; we're convolving them with the image and getting all these activation
maps. I'm also showing the gradients, but I don't want to dwell on that too much. Then you threshold (ReLU thresholding, so anything below zero gets clamped at zero), and then you pool; that's just a downsampling operation. And then another convolution, ReLU, pool, conv, pool, etc., until at the end we have a fully connected layer, and then we have our softmax so that we get probabilities out, and then we apply a loss to those probabilities and backpropagate. And so here we see that I've been training in this tab for the last maybe 30 seconds or a minute, and we're already getting about 30% accuracy on CIFAR-10. So these are test images from CIFAR-10, and these are the outputs of this convolutional network, and you can see that it has already learned that this is a car or something like that. So this
trains pretty quickly in JavaScript. Uh
so you can play with this and you can
change the architecture and so
on. Another thing I'd like to show you is this video, because it gives you again this very intuitive, visceral feeling of exactly what this is computing. There's a very good video by Jason Yosinski; I'm going to play it in a bit. This is from the Deep Visualization Toolbox, so you can download this code and play with it; it's this interactive convolutional network demo. ...neural networks have enabled computers to better see and understand the world. They can recognize school buses and... So what we're seeing here are activation maps, shown in real time as this demo is running. These are for the conv1 layer of an AlexNet, which we're going to go into in much more detail, but these are the different activation maps being produced at this point. ...a neural network called AlexNet running in Caffe. By interacting with the network, we can see what some of the neurons are
doing. For example, on this first layer,
a unit in the center responds strongly
to light to dark
edges. Its neighbor one neuron over
responds to edges in the opposite
direction, dark to light.
Using optimization, we can synthetically
produce images that light up each neuron
on this layer to see what each neuron is
looking for. We can scroll through every
layer in the network to see what it
does, including convolution, pooling,
and normalization layers. We can switch
back and forth between showing the
actual activations and showing images
synthesized to produce high activation.
By the time we get to the fifth
convolutional layer, the features being
computed represent abstract
concepts. For example, this neuron seems
to respond to faces. We can further
investigate this neuron by showing a few
different types of information. First,
we can artificially create optimized
images using new regularization
techniques that are described in our
paper. These synthetic images show that
this neuron fires in response to a face
and shoulders. We can also plot the
images from the training set that
activate this neuron the most as well as
pixels from those images most
responsible for the high activations
computed via the deconvolution
technique. This feature responds to
multiple faces in different locations.
And by looking at the deconvs, we can see that it would respond
more strongly if we had even darker eyes
and rosier lips. We can also confirm
that it cares about the head and
shoulders but ignores the arms and
torso.
We can even see that it fires to some extent for cat faces. Using backprop or deconv, we can see that this unit depends most strongly on a couple of units in the previous layer, conv4, and on about a dozen or so in conv3.
Now let's look at another neuron on this
layer. So what's this unit doing? From
the top nine images, we might conclude
that it fires for different types of
clothing. But examining the synthetic
images shows that it may be detecting
not clothing per se, but wrinkles. In
the live plot, we can see that it's
activated by my shirt. And smoothing out
half of my shirt causes that half of the
activations to
decrease. Finally, here's another
interesting
neuron. This one has learned to look for
printed text in a variety of sizes,
colors, and
fonts. This is pretty cool because we
never ask the network to look for
wrinkles or text or faces. The only
labels we provided were at the very last
layer. So the only reason the network
learned features like text and faces in
the middle was to support final
decisions at that last layer. For
example, the text detector may provide
good evidence that a rectangle is in
fact a book seen on edge. And detecting
many books next to each other might be a
good way of detecting a bookcase, which
was one of the categories we trained the
net to
recognize. In this video, we've shown
some of the features of the deep viz
toolbox. Okay, so I encourage you to play with that; it's really fun. So, I hope that gives you an idea of exactly what's going on: there are these convolutional layers, we downsample them from time to time, there's usually
some fully connected layers at the end,
but mostly it's just these convolutional
operations stacked on top of each other.
So, what I'd like to do now is I'll dive
into some details of how these
architectures are actually put together.
The way I'll do this is I'll go over all the winners of the ImageNet challenges and I'll tell you about the
architectures, how they came about, how
they differ, and so you'll get a
concrete idea about what these
architectures look like in practice. So
we'll start off with the AlexNet in 2012. So the AlexNet, just to give you an idea of the sizes of these networks and the images that they process, took 227x227x3 images. And the first layer of an AlexNet, for example, was a convolutional layer that had 11x11 filters applied with a stride of four, and there are 96 of them. Stride of four I didn't fully explain because I wanted to save some time, but intuitively it just means that as you're sliding this filter across the input, you don't have to slide it one pixel at a time; you can actually jump a few pixels at a time. So we have 11x11 filters with a stride, a skip, of four, and we have 96 of them. You can try to compute, for example, what the output volume is if you apply this sort of convolutional layer on top of this volume. I didn't go into the details of how you compute that, but basically there are formulas for this (you can look into the details in the class), and you arrive at a 55x55x96 volume as output. For the total number of parameters in this layer, we have 96 filters and every one of them is 11x11x3, because that's the input depth of these images. So it basically amounts to 11 * 11 * 3, and then you have 96 filters; so about 35,000 parameters in this very first layer.
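The formula alluded to here is just (input size - filter size) / stride + 1; as a quick sanity check on those numbers:

    input_size, filter_size, stride, n_filters, depth = 227, 11, 4, 96, 3

    out_size = (input_size - filter_size) // stride + 1        # (227 - 11) / 4 + 1 = 55 -> 55x55x96 output
    n_params = n_filters * filter_size * filter_size * depth   # 96 * 11 * 11 * 3 = 34,848, about 35K weights
    print(out_size, n_params)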
Then the second layer of an AlexNet is a pooling layer: we apply 3x3 filters at a stride of two, and they do max pooling. So you can again compute the output volume size after applying this to that volume, and if you do some very simple arithmetic you arrive at 27 by 27 by 96. So this is the downsampling
operation. You can think about what the number of parameters in this pooling layer is, and of course it's zero. Pooling layers compute a fixed function, a fixed downsampling operation; there are no parameters involved in the pooling layer. All the parameters are in the convolutional layers and the fully connected layers, which are to some extent equivalent to convolutional layers. So you can go ahead and, based on the description in the paper (although it's non-trivial, I think, from the description of this particular paper), decipher what the volumes are throughout, and you can look at the kinds of patterns that emerge in terms of how you increase the number of filters in higher convolutional layers. So we started off with 96, then we go to 256 filters, then to 384, and eventually 4096-unit fully connected layers.
You'll also see normalization layers here, which have since become somewhat deprecated; it's not very common anymore to use the normalization layers that were used at the time for the AlexNet architecture. What's interesting to note is how this differs from the 1998 Yann LeCun network. So in particular, I
usually like to think about four things
that hold back progress, at least in deep learning: the data as a constraint, compute, and then I like to differentiate between algorithms and infrastructure, algorithms being something that feels like research, and infrastructure being something where a lot of engineering has to happen. And we've had progress on all four of those fronts. So we
see that in 1998 the data you could get hold of would maybe be on the order of a few thousand examples, whereas now we have a few million, so we've had three orders of magnitude of increase in the amount of data. For compute, GPUs have become available and we use them to train these networks; they are roughly 20 times faster than CPUs, and of course the CPUs we have today are much faster than the CPUs they had back in 1998. I don't know exactly what that works out to, but I wouldn't be surprised if it's again on the order of three orders of magnitude of improvement. I'd like to
actually skip over algorithms and talk about infrastructure. In this case we're talking about Nvidia releasing the CUDA library, which allows you to efficiently run all these matrix-vector operations and apply them to arrays of numbers. So that's a piece of software that we rely on and take advantage of that wasn't available before. And finally, algorithms
is kind of an interesting one, because in those 20 years there's been much less improvement in algorithms than in all these other three pieces. In particular, what we've done with the 1998 network is we've made it bigger: you have more channels, you have a few more layers. And the two really new things algorithmically are dropout and rectified linear units. Dropout is a regularization technique developed by Geoff Hinton and colleagues, and rectified linear units are these nonlinearities that train much faster than sigmoids and tanhs. This paper actually had a plot that showed that the rectified linear units trained quite a bit faster than sigmoids, and that's intuitively because of the vanishing gradient problem: when you have very deep networks with sigmoids, those gradients vanish, as Hugo was talking about in the last lecture.
So what's also interesting to note, by the way, is that both dropout and ReLU are basically a one- or two-line code change each, so it's about a two-line diff total in those 20 years. And both of them consist of setting things to zero: with ReLU, you set things to zero when they're lower than zero, and with dropout, you set things to zero at random. So it's a good idea to set things to zero; apparently that's what we've learned. So if you're trying to find a new cool algorithm, look for one-line diffs that set something to zero. It'll probably work better, and we could add you to this list.
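In numpy, those two "set things to zero" one-liners are roughly the following (a sketch of inverted dropout, not the exact 2012 formulation):

    import numpy as np

    def relu(x):
        return np.maximum(0, x)                               # set things to zero when they're below zero

    def dropout(x, p_keep=0.5):
        mask = (np.random.rand(*x.shape) < p_keep) / p_keep   # set things to zero at random, rescale the rest
        return x * mask                                       # at test time you would just return x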
Now, comparing it again and giving you an idea of the hyperparameters that were in this architecture: it was the first use of rectified linear units; we hadn't seen that as much before. This
network used the normalization layers, which are not used anymore, at least in the specific way they used them in this paper. They used heavy data augmentation: you don't just pipe these images into the networks exactly as they come from the data set, you jitter them spatially a bit, you warp them, you change the colors a bit, and you do this randomly, because you're trying to build in some invariance to these small perturbations, and you're basically hallucinating additional data.
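A rough sketch of that kind of augmentation pipeline (random crops, flips, and a small color perturbation; the specific numbers here are placeholders, not the ones from the paper):

    import numpy as np

    def augment(img, pad=8):                     # img: H x W x 3 float array
        h, w, _ = img.shape
        i = np.random.randint(0, pad + 1)        # jitter the crop position spatially
        j = np.random.randint(0, pad + 1)
        crop = img[i:h - pad + i, j:w - pad + j, :]
        if np.random.rand() < 0.5:               # random horizontal flip
            crop = crop[:, ::-1, :]
        return crop * np.random.uniform(0.9, 1.1, size=3)   # slightly perturb the colors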
It was the first real use of dropout. And roughly, you see standard hyperparameters: batch sizes of roughly 128, stochastic gradient descent with momentum (usually 0.9 momentum), learning rates of 1e-2 which you reduce in the normal ways, roughly by a factor of 10 whenever the validation error stops improving, a little bit of weight decay (5e-4), and ensembling always helps: you train seven independent convolutional networks separately and then you just average their predictions, and that always gives you an additional 2% improvement. So this is AlexNet, the winner of 2012. In 2013
the winner was the ZF Net. This was developed by Matthew Zeiler and Rob Fergus in 2013, and it was an improvement on top of the AlexNet architecture. In particular, one of the bigger differences here was that in the first convolutional layer they went from 11x11 stride 4 to 7x7 stride 2, so you have slightly smaller filters and you apply them more densely. And they also noticed that if you make these convolutional layers in the middle larger, if you scale them up, then you actually gain performance. So they managed to improve a tiny bit. Matthew Zeiler then became the founder of Clarifai, and he worked on this a bit more inside Clarifai and managed to push the performance to 11%, which was the winning entry at the time, but we don't actually know what gets you from 14% to 11%, because Matthew never disclosed the full details of what happened there. He did say that it was more tweaking of these hyperparameters and optimizing a bit. So that was the 2013 winner. In 2014 we
saw a slightly bigger diff to this. One of the networks that was introduced then was the VGGNet from Karen Simonyan and Andrew Zisserman. They explored a few architectures here, and the one that ended up working best was this D column, which is why I'm highlighting it. What's beautiful about the VGGNet is that it's so simple. You might have noticed that in these previous networks you have these different filter sizes, different layers, different amounts of stride, and everything looks a bit hairy and you're not sure where these hyperparameters are coming from. VGGNet is extremely uniform: all you do is 3x3 convolutions with stride one, pad one, and 2x2 max poolings with stride two, and you do this throughout, a completely homogeneous architecture where you just alternate a few conv and a few pool layers, and you get top performance. So they managed to reduce the error down to 7.3% with the VGGNet, just with a very simple and homogeneous architecture. I've also written out this D architecture here just so you can see it; I'm not sure how instructive this is because it's kind of dense, but you can look at it offline perhaps, and you can see how these volumes develop and the sizes of these filters. So they're always 3x3, but the number of filters again grows: we started off with 64 and then we go to 128, 256, 512. So we're just doubling it over time.
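Written out as a recipe, the D configuration is roughly the following stack (3x3 convs, stride 1, pad 1; 2x2 max pools, stride 2); this is a paraphrase of the table in the paper rather than an exact copy:

    vgg_d = [
        'conv3-64',  'conv3-64',  'pool',
        'conv3-128', 'conv3-128', 'pool',
        'conv3-256', 'conv3-256', 'conv3-256', 'pool',
        'conv3-512', 'conv3-512', 'conv3-512', 'pool',
        'conv3-512', 'conv3-512', 'conv3-512', 'pool',
        'fc-4096', 'fc-4096', 'fc-1000', 'softmax',
    ]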
I also have a few numbers here just to give you an idea of the scale at which these networks normally operate. We have on the order of 140 million parameters. This is actually quite a lot; I'll show you in a bit that this can be about five or 10 million
parameters and it works just as well. And it's about 100 megabytes per image in terms of memory in the forward pass, and the backward pass also needs roughly that much. So those are roughly the numbers we're working with here. Also, you can note (and this is true of most convolutional networks) that most of the memory is in the early convolutional layers, while most of the parameters, at least when you use these giant fully connected layers at the top, would be there at the end. So the winner in
2014 was not the VGGNet (I only presented it because it's such a simple architecture); the winner was actually GoogLeNet, with a slightly hairier architecture, we should say. It's still a sequence of things, but in this case they've put inception modules in sequence, and this is an example inception module. I don't have too much time to go into the details, but you can see that it consists basically of convolutions with different kinds of strides and so on. So the GoogLeNet looks slightly hairier, but it turns out to be more efficient in several respects. For example, it works a bit better than VGGNet, at least at the time, and it only has five million parameters, compared to VGG's 140 million parameters, so a huge reduction. And you do that, by the way, by just throwing away the fully connected layers. You'll notice in this breakdown I did that these fully connected layers here have 100 million parameters and 16 million parameters; it turns out you don't actually need that, so if you take them away, that doesn't hurt the performance too much, and you get a huge reduction in parameters. We can also compare to the original AlexNet: compared to the original AlexNet, we have fewer parameters, a bit more compute, and a much better performance. So GoogLeNet was really optimized to have a low footprint, memory-wise, computation-wise, and parameter-wise, but it looks a bit uglier, whereas VGGNet is a very beautiful, homogeneous architecture, but there are some inefficiencies in it. Okay, so
that's 2014. Now, in 2015 we had a slightly bigger delta on top of the architectures. So far, if Yann LeCun had looked at these architectures, maybe back in 1998, he would still recognize everything: everything looks very simple, you've just played with hyperparameters. One of the first bigger departures, I would argue, was in 2015 with the introduction of residual networks. This is work from Kaiming He and colleagues at Microsoft Research Asia. They not only won the ImageNet challenge in 2015, they won a whole bunch of challenges, and this was all just by applying these residual networks that were trained on ImageNet and then fine-tuned on all these different tasks; you can basically crush lots of different tasks whenever you get a new awesome convnet. At this point the performance was basically 3.57% from these residual networks. So this is
2015. This paper also tried to argue that if you look at the number of layers, it goes up over the years, and they made the point that with residual networks, as we'll see in a bit, you can introduce many more layers, and that this correlates strongly with performance. We've since found that in fact you can make these residual networks quite a lot shallower, say on the order of 20 or 30 layers, and they work just as well. So it's not necessarily the depth here, but I'll go into that in a bit; you get a much better performance. What's
interesting about this paper is this plot here, where they compare these residual networks (I'll go into details of how they work in a bit) with what they call plain networks, which is everything I've explained until now. And the problem with plain networks is that when you try to scale them up and introduce additional layers, they don't get monotonically better. So if you take a 20-layer model (this is on CIFAR-10 experiments), if you take a 20-layer model and you run it, and then you take a 56-layer model, you'll see that the 56-layer model performs worse, and this is not just on the test data, so it's not just an overfitting issue. This is on the training data: the 56-layer model performs worse on the training data than the 20-layer model, even though the 56-layer model could imitate the 20-layer model by setting 36 of its layers to compute identities. So basically it's an optimization problem: you can't find the solution once your problem size grows that much bigger in this plain-net architecture. So in the residual
networks that they've proposed they
found that when you wire them up in a
slightly different way you monotonically
get a better performance as you add more
layers. So more layers always strictly
better and you don't run into these
optimization issues. So, comparing residual networks to plain networks: in plain networks, as I've explained already, you have this sequence of convolutional layers, where every convolutional layer operates over the volume before it and produces a volume. In residual networks, we have this first convolutional layer on top of the raw image, then there's a pooling layer, so at this point we've reduced the original image to 56x56x64, and from here on they have these residual blocks with these funny skip connections, and this turns out to be quite important.
So let me show you what these look like. The original paper from Kaiming He had this architecture here, shown under "original", so on the left you see the original residual network design. Since then they had an additional paper that played with the architecture and found that there's a better arrangement of layers inside this block that works better empirically. And the way this works (concentrate on the proposed one in the middle, since that works so well) is you have this pathway where you have this representation of the image, X, and then instead of transforming that representation X to get a new X to plug in later, we keep this X, we go off and we do some compute on the side (that's the residual block doing some computation), and then you add your result on top of X. So you have this addition operation here going into the next residual block. So you have this X and you always compute deltas to it.
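As a toy sketch of that wiring (nothing architecture-specific, just the idea of computing deltas on a stream):

    def plain_forward(x, layers):
        for layer in layers:
            x = layer(x)              # each layer replaces the representation outright
        return x

    def residual_forward(x, blocks):
        for block in blocks:
            x = x + block(x)          # each block computes a delta that is added back onto x
        return x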
And it's not obvious that this should work much better, or why it works much better. I think it
becomes a bit more intuitively clear if you actually understand the backpropagation dynamics and how backprop works. And this is why I always urge people to implement backprop themselves, to get an intuition for how it works, what it's computing, and so on. Because if you understand backprop, you'll see that the addition operation is a gradient distributor: you get a gradient from the top, and this gradient will flow equally to all the children that participated in that addition. So you have gradient flowing here from the supervision (you have supervision at the very bottom here in this diagram), and it flows upwards, it flows through these residual blocks and then gets added to this stream. But this addition distributes the gradient identically through, so what you end up with is this kind of gradient superhighway, as I like to call it, where the gradients from your supervision go directly to the original convolutional layer, and on top of that you get these deltas from all the residual blocks. So these blocks can come online and help out that original stream of information. This is also related, I think, to why LSTMs (long short-term memory networks) work better than plain recurrent neural networks: they also have these kinds of addition operations in the LSTM, and it just makes the gradients flow significantly better. Then there were some results on
top of residual networks that I thought
were quite amusing. Recently, for example, we had this result on deep networks with stochastic depth. The idea here was that the authors of this paper noticed that you have these residual blocks that compute deltas on top of your stream, and you can basically randomly throw out layers. So you have, say, 100 residual blocks, and you can randomly drop them out during training; at test time, similar to dropout, you include all of them and they all work at the same time, but you have to scale things a bit, just like with dropout.
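The idea, as described here, is roughly the following sketch (the keep probability is a made-up constant; the actual paper varies it with depth):

    import numpy as np

    def stochastic_depth_forward(x, blocks, p_keep=0.8, train=True):
        for block in blocks:
            if train:
                if np.random.rand() < p_keep:
                    x = x + block(x)          # this residual block participates in this pass
                # otherwise the block is skipped and x flows through unchanged
            else:
                x = x + p_keep * block(x)     # at test time every block is on, scaled like dropout
        return x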
But basically it's kind of an unintuitive result, because you can throw out layers at random, and I think it breaks the original notion we had of convnets as these feature transformers that compute more and more complex features over time, or something like that. And I think it
seems much more intuitive, at least to me, to think about these residual networks as some kind of dynamical system, where you have this original representation of the image X, and then every single residual block is kind of like a vector field, because it computes a delta on top of your signal. And so these vector fields nudge your original representation X towards a space where you can decode the answer Y, like the class of that X. And so if you drop some of these residual blocks at random, if you haven't applied one of these vector fields, then the other vector fields that come later can kind of make up for it; they pick up the slack and they nudge it along. Anyway, that's the image I currently have in mind of how these things work, much more like dynamical systems. In fact, another experiment that people are playing with that I also find interesting is that you can share these residual blocks, so it starts to look more like a recurrent neural network. These residual blocks would have shared connectivity, and then you really have this dynamical system where you're just running a single RNN, a single vector field, that you keep iterating over and over, and then your fixed point gives you the answer. So it's kind of interesting what's happening. It
looks very funny. Okay, we've had many
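[Editor's note: a toy illustration of that shared-block, fixed-point picture; my own sketch, not a real architecture. With a small, contractive delta, iterating one shared block drives the representation toward a fixed point.]

```python
import numpy as np

W = np.random.randn(8, 8) * 0.05        # shared weights: one "vector field"

def shared_block(x):
    return 0.5 * (np.tanh(x @ W) - x)   # a small nudge; zero at the fixed point

x = np.random.randn(1, 8)
for step in range(50):
    x = x + shared_block(x)             # same weights applied every iteration
# x is now close to a fixed point where shared_block(x) ~ 0
```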
Okay, we've had many more interesting results; people are playing a lot with these residual networks and improving on them in various ways. As I mentioned already, it turns out you can make these residual networks much shallower and wider: you introduce more channels, and that can work just as well, if not better. So it's not necessarily the depth that's giving you a lot of the performance; you can scale down the depth, and if you increase the width, that can actually work better, and they're also more efficient that way. There are more funny regularization techniques here: Swapout is a regularization technique that actually interpolates between plain nets, ResNets, and dropout, so that's also a fun paper. We have FractalNets; we actually have many more different types of nets. So people have really experimented with this a lot, and I'm really eager to see what the winning architecture will be in 2016 as a result of all this. One of the
things that has really enabled this rapid experimentation in the community is that, luckily, we've somehow developed a culture of sharing a lot of code among ourselves. Just as an example, Facebook has released residual networks code in Torch that is really good, which I believe a lot of these papers have adopted and built on top of, and that allowed them to really scale up their experiments and explore different architectures. So it's great that this has happened. Unfortunately, a lot of these papers are coming out on arXiv, and it's kind of chaos as they're being uploaded. So at this point I think it's natural to very briefly plug my arxiv-sanity.com. This is the best website ever. What it does is crawl arXiv, take all the papers, analyze the full text, and create tf-idf bag-of-words features for all the papers. Then you can do things like search for a particular paper, say the residual networks paper here, and look for similar papers on arXiv; this is a sorted list of basically all the residual networks papers most related to that paper. Or you can create a user account and build a library of papers that you like, and then arxiv-sanity will train a support vector machine for you, and you can look at which arXiv papers from the last month you would enjoy the most; that's computed by arxiv-sanity, so it's like a curated feed specifically for you. I use it quite a bit and find it useful, so I hope other people do as well.
Okay, so we saw convolutional neural networks. I explained how they work, I explained some of the background context, I've given you an idea of what they look like in practice, and we went through case studies of the winning architectures over time. But so far we've only looked at image classification specifically, where we're categorizing images into some number of bins. So I'd like to briefly talk about addressing other tasks in computer vision, and how you might go about doing that. The way to think about doing other tasks is that what we really have is this convolutional neural network as a block of compute: it has a few million parameters in it, it can express basically arbitrary functions that are very nice over images, and it takes an image and gives you some kind of features. Different tasks then basically look as follows: you want to predict some kind of thing, and in different tasks that will be a different thing; you always have a desired thing, you want to make the predicted thing closer to the desired thing, and you backpropagate. That's usually the only part that changes from task to task. You'll see that these convnets don't change too much; what changes is your loss function at the very end, and that's what actually lets you transfer a lot of these winning architectures. You usually use these pre-trained networks, and you don't worry too much about the details of the architecture, because you're only worried about adding a small piece at the top, changing the loss function, or substituting a new data set, and so on.
So just to make this slightly more concrete: in image classification, we apply this compute block and get these features, and then if I want to do classification, I would basically predict 1,000 numbers that give me the log probabilities of the different classes. Then I have a predicted thing and a desired thing, the particular class, and I can backprop.
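[Editor's note: a minimal NumPy sketch of that classification head, added for illustration; a 1,000-class problem is assumed, and all names are made up.]

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Predicted thing vs. desired thing for classification.
    logits: (num_classes,) scores; label: index of the true class."""
    logits = logits - logits.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(probs[label])
    dlogits = probs.copy()
    dlogits[label] -= 1.0                        # gradient to backprop into the net
    return loss, dlogits

features = np.random.randn(512)                  # output of the compute block
W = np.random.randn(512, 1000) * 0.01            # classifier layer
loss, dlogits = softmax_cross_entropy(features @ W, label=42)
```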
If I'm doing image captioning, it also looks very similar. Instead of predicting a single vector of class scores, I now have, for example, 10,000 words in some kind of vocabulary, and I'd be predicting 10,000 numbers, and a sequence of them. So I can use a recurrent neural network, which you'll hear much more about, I think, in Richard's lecture just after this. I produce a sequence of 10,000-dimensional vectors, and that's just the description: they indicate the probabilities of different words being emitted at different time steps.
Or, for example, if you want to do localization: again, most of the block stays unchanged, but now we also want some kind of extent in the image. Suppose we don't just want to classify this as an airplane, but we want to localize it with x, y, width, height bounding box coordinates. If we also make the specific assumption that there's always a single thing in the image, like a single airplane in every image, then you can afford to just predict that directly. So we predict the softmax scores just like before and apply the cross-entropy loss, and then we predict x, y, width, height on top of that, using something like an L2 loss or a Huber loss. So you just have a predicted thing and a desired thing, and you just backprop.
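[Editor's note: a small sketch of that two-headed loss, under the single-object assumption; my own illustration, using the plain L2 option he mentions.]

```python
import numpy as np

def localization_loss(class_logits, box_pred, label, box_true):
    """Classification head plus box-regression head on shared features.
    box_pred / box_true: (4,) arrays holding x, y, width, height."""
    z = class_logits - class_logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    cls_loss = -np.log(probs[label])               # cross-entropy, as before
    box_loss = np.sum((box_pred - box_true) ** 2)  # L2 on the 4 coordinates
    return cls_loss + box_loss                     # backprop through the sum

features = np.random.randn(512)
Wc = np.random.randn(512, 1000) * 0.01             # class head
Wb = np.random.randn(512, 4) * 0.01                # box head
loss = localization_loss(features @ Wc, features @ Wb,
                         label=7, box_true=np.array([0.3, 0.4, 0.2, 0.1]))
```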
If you want to do reinforcement learning, because you want to play different games, then again the setup is that you just predict some different thing, and it has some different semantics. In this case we would, for example, predict eight numbers that give us the probabilities of taking different actions; say there are eight discrete actions in Atari, so we predict eight numbers. We then train this in a slightly different manner, because in reinforcement learning you don't actually know what the correct action to take is at any point in time. But you can still get a desired thing eventually, because you run these rollouts over time and see what happens, and that helps inform what the correct answer, the desired thing, should have been at any point in any of those rollouts. I don't want to dwell on this too much in this lecture; it's outside the scope, and you'll hear much more about reinforcement learning in a later lecture.
If you want to do segmentation, for example, then you don't predict a single vector of numbers for the whole image; every single pixel has its own category that you'd like to predict. So a data set will actually be colored like this, with different classes in different areas, and instead of predicting a single vector of class scores, you predict an entire array: 224 x 224, since that's the extent of the original image, times 20 if you have 20 different classes. So you basically have 224 x 224 independent softmaxes; that's one way you could pose it, and then you backpropagate. This one would be slightly more difficult, because you see I have deconv layers mentioned here, and I didn't explain deconvolution layers. They're related to convolutional layers: they do a very similar operation, but kind of backwards in some way. A convolutional layer does downsampling operations as it computes; a deconv layer does upsampling operations as it computes these convolutions. In fact, you can implement a deconv layer using a conv layer: the deconv forward pass is the conv layer's backward pass, and the deconv backward pass is the conv layer's forward pass, basically. So they're essentially the same operation; the question is just whether you're upsampling or downsampling. You can use deconv layers, or you can use hypercolumns, and there are different things people do in the segmentation literature, but the rough idea is that you're just changing the loss function at the end.
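[Editor's note: a sketch of my own to pin down the shapes in the per-pixel formulation, using the 224 x 224, 20-class example from the talk.]

```python
import numpy as np

# Per-pixel classification: one independent softmax at every location.
H, W, C = 224, 224, 20
scores = np.random.randn(H, W, C)              # class-score map from the net
target = np.random.randint(0, C, size=(H, W))  # desired class at every pixel

scores = scores - scores.max(axis=-1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Mean cross-entropy over all 224*224 positions.
rows, cols = np.arange(H)[:, None], np.arange(W)[None, :]
loss = -np.log(probs[rows, cols, target]).mean()
```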
If you want to do autoencoders, so you want to do some unsupervised learning or something like that, well, you're just trying to predict the original image: you're trying to get the convolutional network to implement the identity transformation. The trick that makes this non-trivial, of course, is that you're forcing the representation to go through a representational bottleneck of 7 x 7 x 512, so the network must find an efficient representation of the original image in order to decode it later. That would be an autoencoder: you again have an L2 loss at the end, and you backprop. Or, if you want to do variational autoencoders, you have to introduce a reparameterization layer and append an additional small loss that pulls your posterior towards your prior; it's just an additional layer, and then you have an entire generative model and can actually sample images as well.
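[Editor's note: a toy sketch of the autoencoder setup; my own illustration with linear maps and tiny dimensions so it runs, where a real network would use conv/deconv stacks with a 224 x 224 x 3 input and a 7 x 7 x 512 bottleneck.]

```python
import numpy as np

rng = np.random.default_rng(0)
D, B = 768, 64                           # input dim, bottleneck dim (B << D)
We = rng.standard_normal((D, B)) * 0.03  # "encoder" weights
Wd = rng.standard_normal((B, D)) * 0.03  # "decoder" weights

x = rng.standard_normal(D)               # a flattened "image"
z = x @ We                               # squeeze through the bottleneck
x_hat = z @ Wd                           # attempt to reconstruct the input
l2_loss = np.sum((x_hat - x) ** 2)       # predicted vs. desired thing; backprop this
```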
If you want to do detection, things get a little hairier, perhaps, compared to localization. One of my favorite detectors to explain is the YOLO detector, because it's perhaps the simplest one. It doesn't work the best, but it's the simplest to explain, and it has the core idea of how people do detection in computer vision. The way this works is that we've reduced the original image to a 7 x 7 x 512 feature map, so really there are 49 discrete locations. At every single one of these 49 locations, YOLO is going to predict a class; that's shown here on the top right, so every one of the 49 locations will have some kind of softmax. Additionally, at every single position we're going to predict some number B of bounding boxes. Say B is 10; then we're going to be predicting 50 numbers, and the five comes from the fact that every bounding box has five numbers associated with it: you have to describe the x, y, the width, and the height, and you also have to indicate some kind of confidence for that bounding box, so the fifth number is a confidence measure. So you end up predicting these bounding boxes; they have positions, they have a class, they have confidence, and then you have some true bounding boxes in the image. You know there are certain true boxes with certain classes, and what you do is match up the desired thing with the predicted thing. Say, for example, you had one ground-truth bounding box for a cat: you would find the closest predicted bounding box, mark it as a positive, try to make the associated grid cell predict cat, and nudge that prediction slightly towards the cat box. All of this can be done with simple losses, and you just backpropagate, and then you have a detector.
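[Editor's note: a sketch of my own showing how those predictions could be laid out in one output tensor; the 7 x 7 grid and B = 10 boxes follow the talk, while the 20-class count is an assumption.]

```python
import numpy as np

S, B, C = 7, 10, 20   # SxS grid; B boxes per cell ("say B is 10");
                      # C classes (20 here is assumed, not from the talk)

# One prediction tensor per image: at each of the S*S = 49 cells,
# C class scores plus 5 numbers (x, y, w, h, confidence) per box.
pred = np.random.randn(S, S, C + 5 * B)

cell = pred[3, 4]                  # one of the 49 grid cells
class_scores = cell[:C]            # a softmax over these gives the class
boxes = cell[C:].reshape(B, 5)     # each row: x, y, w, h, confidence
```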
Or, if you want to get much fancier, you could do dense image captioning; this is a combination of detection and image captioning, and it's a paper with my equal co-author Justin Johnson, and Fei-Fei Li, from last year. What we did here is: an image comes in, and it becomes much more complex. I don't want to go into it too much, but the first-order approximation is that it's basically detection, except instead of predicting fixed classes, we predict a sequence of words; we use a recurrent neural network there. So you can take an image and then both detect and describe everything in a complex visual scene. So that's just some overview of the different tasks people care about. Most of them consist of just changing this top part: you put in a different loss function and a different data set, but you'll see that this computational block stays relatively unchanged from task to task. And that's why, as I mentioned, when you do transfer learning, you just want to take these pre-trained networks and mostly use whatever works well on ImageNet, because a lot of that does not change too much.
Okay. So in the last part of the talk, let me just make sure we're good on time... okay, we're good. In the last part of the talk, I just wanted to give some hints on practical considerations when you want to apply convolutional networks in practice. The first consideration you might have, if you want to run these networks, is: what hardware do I use? Some of the options available to you: first of all, you can just buy a machine. For example, Nvidia has these DIGITS DevBoxes that you can buy; they have Titan X GPUs, which are strong GPUs. If you're much more ambitious, you can buy a DGX-1, which has the newest Pascal P100 GPUs. Unfortunately, the DGX-1 is about $130,000, so it's kind of an expensive supercomputer, but the DevBox, I think, is more accessible. So that's one option you can go with. Alternatively, you can look at the specs of a DevBox, and those are good specs, and then buy all the components yourself and assemble it like Lego. That's prone to mistakes, of course, but you can definitely reduce the price, maybe by a factor of two, compared to the Nvidia machine; of course, the Nvidia machine just comes with all the software installed and all the hardware ready, and you can just do work. There are a few GPU offerings in the cloud, but unfortunately it's actually not in a good place right now; it's quite difficult to get good GPUs in the cloud. Amazon AWS has these GRID K520s; they're not very good GPUs, they're not fast, and they don't have much memory, which is actually kind of a problem. Microsoft Azure is coming up with its own offering soon; I think they've announced it, and it's in some kind of beta stage, if I remember correctly, and those would be more powerful K80 GPUs. At OpenAI, for example, we use Cirrascale, which is a slightly different model: you can't spin up GPUs on demand, but they allow you to rent a box in the cloud. What that amounts to is that we have these boxes somewhere in the cloud; I just have the DNS name, I SSH to it, and it's a Titan X box. So you can just do work that way. Those are the options available hardware-wise.
In terms of software, there are of course many different frameworks you could use for deep learning; these are some of the more common ones you might see in practice. Different people have different recommendations on this. My personal recommendation right now, for most people who just want to apply this in practical settings: 90% of the use cases are probably addressable with something like Keras. So Keras would be my number one thing to look at. Keras is a layer over TensorFlow or Theano; basically, it's a higher-level API over either of those. For example, I usually use Keras on top of TensorFlow, and it's a much higher-level language than raw TensorFlow. You can also work in raw TensorFlow, but you'll have to do a lot of low-level stuff. If you need all that freedom, that's great, because it allows you much more freedom in how you design everything, but it can be slightly more wordy; for example, you have to assign every single weight a name, and so on. So you can work at that level, but for most applications I think Keras would be sufficient. And I've used Torch for a long time; I still really like Torch. It's very lightweight and interpretable, and it works just fine. So those are the options I would currently consider, at least.
Another practical consideration: you might be wondering what architecture to use for your problem. My answer here, and I've already hinted at this, is: don't be a hero. Don't go crazy and design your own neural networks and convolutional layers; you probably don't want to do that. The algorithm is actually very simple: look at whatever is currently the latest released thing that works really well in ILSVRC, download that pre-trained model, and then potentially add or delete some layers on top, because you want to do some other task. That usually requires some tinkering at the top, and then you fine-tune it on your application. So it's actually a very straightforward process.
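[Editor's note: a rough sketch of that recipe using the Keras he recommends above, written against the current tf.keras applications API; my own illustration, with a made-up 5-class task.]

```python
import tensorflow as tf

# Download a strong pre-trained model, chop off its classifier,
# and bolt a new head on top for your own task (5 classes here,
# an arbitrary example).
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False,
    pooling="avg", input_shape=(224, 224, 3))
base.trainable = False                        # freeze the pre-trained block

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.5),             # the main knob he suggests tuning
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # typical paper value
              loss="sparse_categorical_crossentropy")
# model.fit(train_images, train_labels, epochs=5)        # then fine-tune
```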
To a first degree, I think, for most applications: don't tinker with it too much, you're going to break it. But of course you can also take CS231n, and then you might become much better at tinkering with these architectures. Second: how do I choose the hyperparameters? My answer here, again, would be: don't be a hero. Look into papers and look at what hyperparameters they use. For the most part, you'll see that all papers use the same hyperparameters; they look very similar. When you use Adam for optimization, it's always a learning rate of 1e-3 or 1e-4; you can also use SGD with momentum, and it's always the similar kinds of learning rates. So don't go too crazy designing this. The thing you probably want to play with the most is the regularization, and in particular not the L2 regularization but the dropout rates; that's what I would advise instead, because you might have a smaller or a much larger data set. If you have a much smaller data set, then overfitting is a concern, so you want to make sure you regularize properly with dropout. Then, as a second-degree consideration, you might want to tune the learning rate a tiny bit, but that usually doesn't have as much of an effect. So really there are like two hyperparameters, you take a pre-trained network, and that's 90% of the use cases, I would say.
Compare that to computer vision in 2011, where you might have had hundreds of hyperparameters. Okay, and in terms of distributed training: if you want to work at scale, because you want to train ImageNet or some large-scale data set, you might want to train across multiple GPUs. Just to give you an idea, most of these state-of-the-art networks are trained on the order of a few weeks across multiple GPUs, usually four or eight. These GPUs are roughly on the order of $1,000 each, and then you also have to house them, which of course adds additional cost. But you almost always want to train on multiple GPUs if possible. Usually you don't end up training across machines; that's much rarer, I think. What's much more common is that you have a single machine with eight Titan Xs or something like that, and you do distributed training on those eight Titan Xs. There are different ways to do distributed training. If you're feeling fancy, you can try to do some model parallelism, where you split your network across multiple GPUs. I would instead advise some kind of data parallelism architecture. What you usually see in practice is: you have eight GPUs, so I take my batch of, say, 256 images, split it equally across the GPUs, do the forward pass on those GPUs, and then basically add up all the gradients and propagate that through. You're just distributing the batch; mathematically, you're doing the exact same thing as if you had one giant GPU, you're just splitting that batch across different GPUs, and you're still doing synchronous training with SGD as normal. That's what you'll see most in practice, and I think it's the best thing to do right now for most normal applications.
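[Editor's note: a NumPy cartoon of mine of that synchronous data-parallel step, with a made-up quadratic loss; real setups sum gradients across devices with an all-reduce, but the arithmetic is the same.]

```python
import numpy as np

def grad_on_shard(w, shard):
    # Stand-in for forward/backward on one GPU's slice of the batch:
    # gradient of a toy loss ||x @ w - 1||^2 summed over the shard.
    return 2 * shard.T @ (shard @ w - 1.0)

w = np.zeros(8)
batch = np.random.randn(256, 8)
shards = np.split(batch, 8)                    # one slice per "GPU"

grads = [grad_on_shard(w, s) for s in shards]  # in parallel, in real life
total = np.sum(grads, axis=0)                  # add up all the gradients
w -= 1e-3 * total                              # one synchronous SGD step

# Mathematically identical to a single giant-GPU step:
assert np.allclose(total, grad_on_shard(w + 1e-3 * total, batch))
```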
Other kinds of considerations that sometimes enter, and that you could maybe worry about, are the bottlenecks to be aware of. In particular, the CPU-to-disk bottleneck: you have a giant data set sitting on some disk, and you probably want that disk to be an SSD, because these GPUs process data very quickly, and the loading itself can actually be a bottleneck. So in many applications, you might want to pre-process your data to make sure it's read out contiguously, in very raw form, from something like an HDF5 file or some other binary format. Another bottleneck to be aware of is the CPU-GPU bottleneck: the GPU is doing the heavy lifting of the neural network, and the CPU is loading the data, so you might want to use things like prefetching threads, where, while the network is doing its forward and backward passes on the GPU, your CPU is busy loading the data from disk, maybe doing some pre-processing, and making sure it can ship it off to the GPU at the next time step.
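[Editor's note: a minimal version of that prefetching pattern in Python; my own sketch, with placeholders standing in for disk I/O and the training step.]

```python
import threading
import queue

prefetch = queue.Queue(maxsize=4)       # small buffer of ready batches

def loader():
    # Runs on the CPU: read and pre-process batches while the GPU trains.
    for i in range(100):
        batch = f"batch-{i}"            # placeholder for disk I/O + preprocessing
        prefetch.put(batch)             # blocks if the consumer falls behind
    prefetch.put(None)                  # sentinel: no more data

threading.Thread(target=loader, daemon=True).start()

while (batch := prefetch.get()) is not None:
    pass                                # placeholder for the forward/backward pass
```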
Those are some of the practical considerations I could come up with for this lecture. If you want to learn much more about convolutional neural networks and a lot of what I've been talking about, I encourage you to check out CS231n. We have lecture videos, notes, slides, and assignments; everything is up and available, so you're welcome to check it out. And that's it. Thank you.
[Applause]
So I guess I can take some questions.
Yeah.
Hello. Hello.
Hi, I'm Kyle Far from Lumna. I'm using a lot of convolutional nets for genomics. One of the problems we see is that our genomic sequences tend to be of arbitrary length. Right now we're padding with a lot of zeros, but we're curious what your thoughts are on using CNNs for things of arbitrary size; we can't just downsample to 227 by 227.

Yep. So is this a genomic sequence of, like, ATCG, that kind of sequence?

Yeah, exactly.

Yeah. So some of the options would be: recurrent networks might be a good fit, because they allow arbitrarily sized contexts. Another option, I would say, is to look at the WaveNet paper from DeepMind: they have audio, they're using convolutional networks to process it, and I would basically adopt that kind of architecture. They have this clever way of doing what's called à trous, or dilated, convolutions, which allows you to capture a lot of context with few layers. The WaveNet paper has the details, and there's an efficient implementation of it on GitHub that you should be aware of, so you might be able to just drag and drop the fast WaveNet code into that application. That gives you much larger context, though of course not the infinite context you might have with a recurrent network.

Yeah, we're definitely checking those out. We also tried RNNs; they're quite slow for these things. Our main problem is that the genes can be very short or very long, but the whole sequence matters. So I think that's one of the challenges we're looking at with this type of problem.

Interesting. Yeah, so those would be the two options I would play with, basically; I think those are the two I'm aware of.

Yeah, thank you.
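[Editor's note: for reference, a dilated convolution just spaces the filter taps apart, so stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially. A 1-D NumPy sketch of mine, not the WaveNet code:]

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """1-D convolution whose taps are `dilation` positions apart."""
    span = (len(w) - 1) * dilation        # receptive field of one layer
    out = np.zeros(len(x) - span)
    for i in range(len(out)):
        taps = x[i : i + span + 1 : dilation]
        out[i] = np.dot(taps, w)
    return out

x = np.random.randn(32)
w = np.random.randn(3)                    # 3-tap filter
y = dilated_conv1d(x, w, dilation=4)      # each output sees 9 input positions
```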
Thanks for a great lecture. My question is: is there a clear mathematical or conceptual understanding of how people decide how many hidden layers should be part of their architecture?

Yeah. So, with a lot of this, the answer to 'is there a mathematical understanding' will likely be no, because we're in very early phases of doing a lot of empirical, guess-and-check kind of work, and theory is, in some ways, lagging behind a bit. I would say that with residual networks, having more layers usually works better, and you can take these layers out or put them in; it's mostly a computational consideration of how much you can fit. Our usual consideration is: you have a GPU, it has maybe 16 or 12 gigs of RAM, I want a certain batch size, and those constraints upper-bound the number of layers, or how big they can be. So I use the biggest thing that fits on my GPU; that's mostly the way you choose this. And then you regularize it very strongly. So if you have a very small data set, you might end up with a pretty big network for your data set, and you want to make sure you're tuning those dropout rates properly so you're not overfitting.
I have a question. My understanding is that recent convolutional networks don't use pooling layers, right? So the question is: why don't they use pooling layers, and is there still a place for pooling?

Yeah. So, certainly, if you saw the residual network example at the end, there was a single pooling layer at the very beginning, but mostly they went away; you're right. Let me see if I can find the slide... I wonder if it's a good idea to try to find the slide... that's probably okay, let me just find it. Okay, so this was the residual network architecture. You see that they do a first conv, and then there's a single pool right there, but certainly the trend has been to throw pooling away over time. There's also a paper called 'Striving for Simplicity: The All Convolutional Net,' and the point of that paper is: look, you can actually do strided convolutions, throw away the pooling layers altogether, and it works just as well. So pooling layers are, I would say, a bit of a historical vestige: people needed things to be efficient, to control the capacity, and to downsample quite a lot, so we're kind of throwing them away over time. They're not doing anything super useful; they're a fixed operation, and you want to learn as much as possible, so maybe you don't actually want to throw away that information. So it's probably more appealing, I would say, to throw them away.

You mentioned there's a sort of cognitive, or brain, analogy: that the brain is doing pooling?

Yeah, so I think that analogy is stretched by a lot. I'm not sure the brain is doing
[Laughter]
pooling. Yeah.
A question about image compression: not just for classification, but can we use neural networks for image compression? Do we have any examples?

Sorry, I couldn't hear the question.

Instead of classification for images, can we use neural networks for image compression?

Image compression. Yeah, I think there's actually really exciting work in this area. One example I'm aware of is recent work from Google, where they're using convolutional networks and recurrent networks to come up with variably sized codes for images. And certainly a lot of these generative models are very related to compression, so there's definitely a lot of work in that area that I'm excited about. Also, for example, super-resolution networks: you saw the recent acquisition of Magic Pony by Twitter. They were doing something that basically allows you to compress: you can send low-resolution streams, because you can upsample on the client. So there's a lot of work in that area. Yeah.
I had one question... one more, but maybe after you.

Can you please comment on scalability with respect to the number of classes? What does it take if we go up to 10,000 or 100,000 classes?

Mhm, yes. So if you have a lot of classes, you can of course grow your softmax, but that becomes inefficient at some point, because you're doing a giant matrix multiply. Some of the ways people address this in practice, I believe, involve things like hierarchical softmax: you decompose your classes into groups, and then you predict one group at a time, and you kind of converge that way. I see these papers, but I'm not an expert on exactly how this works; I do know that hierarchical softmax is something people use in this setting. It's often used in language models, for example, because you have a huge number of words and you still need to predict them somehow, and I believe Tomas Mikolov, for example, has some papers on using hierarchical softmax in this context.
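[Editor's note: a rough two-level illustration of the idea, my own sketch rather than any specific paper's construction: factor p(class) = p(group) * p(class | group), so each prediction touches two small softmaxes instead of one giant one.]

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# 10,000 classes split into 100 groups of 100: score the groups first,
# then only the classes inside the chosen group.
h = np.random.randn(512)                          # features
W_group = np.random.randn(512, 100) * 0.01        # group scorer
W_within = np.random.randn(100, 512, 100) * 0.01  # one scorer per group

p_group = softmax(h @ W_group)
g = int(np.argmax(p_group))                       # pick (or sample) a group
p_within = softmax(h @ W_within[g])
p_class = p_group[g] * p_within                   # probs for that group's classes
```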
Could you talk a little bit about the convolutional functions, like what considerations you should make in selecting the functions used in the convolutional filters?

Selecting the functions that are used in the convolutional filters... so these filters are just parameters, right? We train those filters; they're just numbers that we train with backpropagation. Are you talking about the nonlinearities, perhaps, or...?

Yeah, I'm just wondering: when you're trying to train the network to understand different features within an image, what are those filters actually doing?

Oh, I see, you're talking about understanding exactly what those filters are looking for. So there's a lot of interesting work there. For example, Jason Yosinski has this DeepVis toolbox, and I've shown you that you can kind of debug the network that way a bit. There's an entire lecture in CS231n, which I encourage you to watch, on visualizing and understanding convolutional networks. People use things like deconv, or guided backpropagation, or you backpropagate to the image and try to find a stimulus that maximally activates any arbitrary neuron. So different ways of probing it have been developed, and there's a lecture about it, so I would check that out.

Great, thanks.
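[Editor's note: that last trick is just gradient ascent on the input with the weights held fixed; a toy version of mine, with a single linear unit standing in for the neuron.]

```python
import numpy as np

w = np.random.randn(64)            # stand-in for one neuron's weights

x = np.zeros(64)                   # start from a blank "image"
for _ in range(100):
    activation = np.dot(w, x)      # the neuron we want to excite
    grad = w                       # d(activation)/dx for this linear unit
    x += 0.1 * grad                # nudge the image, not the weights
x /= np.linalg.norm(x) + 1e-8      # in practice this is regularized/normalized
```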
I had a question regarding the size of the fine-tuning data set. For example, is there a ballpark number, if you're trying to do classification, of how many examples you'd need to fine-tune to your sample set?

So, how many data points do you need to get good performance, is the question. Okay, so this is the most boring answer, I think, because the more the better, always, and it's really hard to say how many you actually need. One heuristic people sometimes follow is to look at the number of parameters and want the number of examples to be on the order of the number of parameters; that's one way people sometimes break it down, even for fine-tuning.

But we'd have an ImageNet model, so I was hoping most of that would be taken care of there, and then you're just fine-tuning, so you might need a lower order.

I see. So when you say fine-tuning, are you fine-tuning the whole network, or freezing some of it, or just the top classifier?

Just the top classifier.

Yeah. So another way to look at it is: you have some number of parameters, you can estimate the number of bits you think each parameter carries, and then you count the number of bits in your data; those are the kinds of comparisons you would do. But really, I have no good answer. The more the better, and you have to try, regularize, cross-validate, and see what performance you get over time, because it's too task-dependent for me to say something stronger.
Hi, I'd like to know how you think convnets will work in the 3D case. Is it just a simple extension of the 2D case, or do we need some extra tweaks for 3D?

So are you talking specifically about, say, videos, or some 3D...?

Actually, I'm talking about images that have depth information.

Oh, I see, so say you have RGB-D input and things like that. So I'm not too familiar with what people do, but I do know, for example, that one thing you can do is just treat depth as a fourth channel, or maybe you want a separate convnet on top of the depth channel and do some fusion later. So I don't know exactly what the state of the art in treating that depth channel is right now.

Maybe just one more question: what do you think about 3D object recognition?

3D object recognition. So what is the output that you'd like?

The output is still the class probability, but we're not treating a 2D image; we have a 3D representation of the object.

I see. So you have a mesh or a point cloud? Yeah, so this is also not exactly my area, unfortunately, but the problem with these meshes and so on is that there's this rotational degree of freedom, and I'm not sure what people do about that, honestly. I'm actually not an expert on this, so I don't want to comment. There are some obvious things you might want to try, like plugging in all the possible ways you could orient the object and then averaging over them at test time. Those would be some of the obvious things to play with, but I'm not actually sure what the state of the art is.

Okay, thank you. I have one more question.
Okay. So, coming back to distributed training: is it possible to do even the classification in a distributed way? My question is, in the future, can I imagine our cell phones doing these things together for one query?

Our cell phones... oh, I see, you're trying to get cell phones to do distributed training.

Yes. Training, and also, a radical idea, inference for one cell phone user.

Very radical idea. So a related thought I had recently: I have ConvNetJS, which trains networks in the browser, and I was thinking about similar questions, because you could imagine shipping this off as an ad equivalent: people just include it in the JavaScript, and then everyone's browsers are kind of training a small network.

So I think that's a related question, but do you think there's too much communication overhead, or could it actually be distributed in an efficient way?

Yes. So the problem with distributing it a lot is actually the stale gradients problem. When you look at some of the papers Google has put out about distributed training, and you plot the number of workers in asynchronous SGD against the performance improvement you get, it plateaus quite quickly, after something like eight workers, which is quite small. So I'm not sure there are ways of dealing with thousands of workers. The issue is that every worker has a specific snapshot of the weights, pulled from the master; you have a set of weights you're using, you do your forward and backward, and then you send an update. But by the time you've done your forward-backward and sent that update, the parameter server has already applied lots of updates from thousands of other workers, so your gradient is stale: you evaluated it at the wrong, old location, it's now an incorrect direction, and everything breaks. That's the challenge, and I'm not sure what people are doing about this.
Yeah. I was wondering about applications of convolutional nets to two inputs at a time. Let's say you have two pictures of jigsaw puzzle pieces, and you're trying to figure out whether they fit together, or whether one object compares to the other in a specific way. Have you heard of any implementation of this kind?

Yes. So you have two inputs instead of one. The common way of dealing with that is to put a convnet on each, and then do some kind of fusion eventually, to merge the information.

Right, I see. And the same for recurrent neural networks, if you had variable input? For example, in the context of videos, where you have frames coming in?

Yeah. So some of the approaches are: you have a convolutional network on each frame, and then at the top you tie it in with a recurrent neural network. You reduce the image to some kind of lower-dimensional representation, and that's the input to a recurrent neural network at the top. There are other ways to play with this; for example, you can actually make every single neuron in the convnet recurrent. Right now, when a neuron computes its output, it's only a function of a local neighborhood below it, but you can also make it, in addition, a function of its own activation, and perhaps its neighborhood's activations, at the previous time step. So the neuron is not just computing a dot product with the current patch; it's also incorporating a dot product with its own, and maybe its neighbors', activations at the previous frame. That's like a small RNN update hidden inside every single neuron. Those are the things I think people play with, but I'm not familiar with what's currently working best in this area.

Pretty awesome. Thank you. Yeah.
Hi, thanks for the great talk. I have a question regarding the latency of models trained with multiple layers. Especially at prediction time: as we add more layers, the forward pass takes more time, so the latency increases. What are the numbers we're presently seeing, if you can share them, for the prediction time, the latency of the forward pass?

So you're worried, for example, that you want to run prediction very quickly; would this be on an embedded device, or in the cloud?

Suppose it's a cell phone, and you're identifying objects or doing some image analysis.

Yeah. So there's definitely a lot of work on this. One way you would approach it is: you have this network that you've trained using floating-point arithmetic, say 32 bits, and there's a lot of work on taking that network, discretizing all the weights into, say, ints, making it much smaller, and pruning connections. One of the works related to this, for example: Song Han here at Stanford has a few papers on getting rid of spurious connections, reducing the network as much as possible, and then making everything very efficient with integer arithmetic. So basically you achieve this by discretizing all the weights and all the activations, and by pruning the network. There are some tricks like that that people play.
tricks like that that people play. Um
that's mostly what you would do on an
embedded device. And then the challenge
of course is you've changed the network
and now you just kind of are crossing
your fingers that it works well. And so
I think what's uh interesting for uh re
from research standpoint is you'd like
to do you'd like your test time to
exactly match your training time, right?
So then you get the best performance and
so the question is how do we train with
low precision arithmetic and there's a
lot of work on this as well. So say from
Yoshua Benjio's lab as well and um uh so
that's exciting directions of how you
train in low precision regime. Do do you
have any numbers I mean that you can
share for the you know state-of-the-art
how much time does it take? Um yes so I
see the papers but I'm not sure if I
remember the the exact reductions. It's
on the order of okay I don't want to say
because basically I don't know. Thanks a
I don't want to try to guess this. All
right. Thank you. All right. We're out
of time. Let's thank
Andre. Lunch is outside and we'll
restart at 12:45.