Deep Learning for Computer Vision (Andrej Karpathy, OpenAI)
u6aEYuemt0M • 2016-09-27
Thank you very much for the introduction. Today I'll speak about deep learning, especially in the context of computer vision. What you saw in the previous talk is neural networks: they are organized into these fully connected layers, where neurons within one layer are not connected to each other, but are connected fully to all the neurons in the previous layer. We have this layer-wise structure from input to output, with neurons and nonlinearities and so on.

Now, so far we have not made too many assumptions about the inputs. In particular, we just assumed that an input is some kind of vector of numbers that we plug into the neural network. That's both a bug and a feature to some extent, because in most real-world applications we actually can make some assumptions about the input that make learning much more efficient. In particular, we usually don't just want to plug plain vectors of numbers into neural networks; the numbers actually have some kind of structure, arranged in some kind of layout like an n-dimensional array. For example, spectrograms are two-dimensional arrays of numbers, images are three-dimensional arrays of numbers, videos would be four-dimensional arrays of numbers, and text you could treat as a one-dimensional array of numbers. Whenever you have this kind of local structure in your data, you'd like to take advantage of it, and convolutional neural networks allow you to do that.

Before I dive into convolutional neural networks and all the details of the architectures, I'd like to briefly talk about the history of how this field evolved over time. I usually like to start off with Hubel and Wiesel and the experiments they performed in the 1960s. They were trying to study the computations that happen in the early visual cortex areas of a cat. They took a cat and plugged in electrodes that could record from the different neurons, and then they showed the cat different patterns of light, effectively trying to debug the neurons and see what they responded to. A lot of these experiments inspired some of the modeling that came afterwards. In particular, one of the early models that tried to take advantage of the results of these experiments was the Neocognitron, from Fukushima in the 1980s. It was a layer-wise architecture, similar to what you see in the cortex, with alternating simple and complex cells throughout, where the simple cells detect small things in the visual field and there is a local connectivity pattern. This looks a bit like a ConvNet because it shares some of its features, like the local connectivity, but at the time it was not trained with backpropagation; it used specific, heuristically chosen updates, and it was unsupervised learning back then. The first time backpropagation was actually used to train some of these networks was in the work of Yann LeCun in the 1990s.
This is an example of one of the networks developed back then, in the 1990s, by Yann LeCun: LeNet-5. This is what you would recognize today as a convolutional neural network. It has alternating convolutional layers, a similar kind of design to Fukushima's Neocognitron, but it was actually trained end to end with backpropagation, using supervised learning.

So this happened in roughly the 1990s, and here we are in 2016, about 20 years later. Computer vision has for a long time worked on larger images, and a lot of these models back then were applied to very small settings, like recognizing digits and zip codes, and they were very successful in those domains. But when I entered computer vision, in roughly 2011, a lot of people were aware of these models, yet it was thought that they would not naively scale up to large, complex images; that they would be constrained to these smaller visual recognition problems for a long time. I shouldn't say toy problems, because these were very important tasks, but certainly smaller ones. In computer vision in roughly 2011 it was much more common to use feature-based approaches, and they actually didn't work that well. When I entered my PhD in 2011, working on computer vision, you would run a state-of-the-art object detector on an image and you might get something like this, where cars were detected in trees, and you would kind of just shrug your shoulders and say, "Well, that just happens sometimes." You accepted it as something that would just happen. Of course this is a caricature; things were actually relatively decent, I should say, but there were definitely many mistakes that you would not see today, in 2016, five years later.

A lot of computer vision looked much more like this. When you looked at a paper that tried to do image classification, you would find a section on the features that they used. This is one page of features, GIST and so on; then a second page of features and all their hyperparameters, all kinds of different histograms; you would extract this kitchen sink of features, and a third page here. You would end up with a very large, complex codebase, because some of these feature types were implemented in MATLAB, some in Python, some in C++, and you'd be extracting all these features, caching them, and eventually plugging them into linear classifiers to do some kind of visual recognition task. It was quite unwieldy. It worked to some extent, but there was definitely room for improvement.

A lot of this changed in computer vision in 2012, with the paper from Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. This was the first time someone took a convolutional neural network very similar to the one you saw from Yann LeCun in 1998 (I'll go into the details of how they differ exactly), scaled it up, made it much bigger, and trained it on a much bigger dataset on GPUs, and things basically ended up working extremely well. This was the first time the computer vision community really noticed these models and adopted them to work on larger images.
We saw that the performance of these models improved drastically. Here we are looking at the ImageNet ILSVRC visual recognition challenge over the years, specifically at the top-5 error, so low is good. You can see that in the beginning, from 2010, these were feature-based methods; then in 2012 we had this huge jump in performance, due to the first convolutional neural network entry, and we've managed to push that down over time, and now we're at about 3.57%. I think the results for the ImageNet 2016 challenge are actually due to come out today, but I don't think they've come out yet. I have this second tab open here; I was waiting for the result, but I don't think it's up yet. Okay. No, nothing. All right, well, we'll get to find out very soon, so I'm very excited to see that.

Just to put this in context, because you're just looking at numbers like 3.57%: how good is that? That's actually really, really good. Something I did about two years ago is that I tried to measure human accuracy on this dataset. For that, I developed a web interface where I would show myself ImageNet images from the test set, together with all the different classes of ImageNet (there are 1,000 of them) and some example images, and you go down this list, scroll for a long time, and find which class you think the image might be. I then competed against the ConvNet of the time, which was GoogLeNet in 2014. Hot dog is a very simple class; you can get that quite easily. But then why is the error not 0%? Some classes like hot dog seem very easy, so why isn't this trivial for humans? Well, it turns out that some of the images in the ImageNet test set are actually mislabeled, but also some of the images are just very difficult to guess. In particular, if you have this terrier: there are 50 different types of terriers, and it turns out to be a very difficult task to find exactly which type of terrier that is. You can spend minutes trying to find it. Convolutional neural networks turn out to be extremely good at this, and this is where I would lose points compared to the ConvNet. I estimate that human error based on this is roughly in the 2 to 5% range, depending on how much time you have, how much expertise you have, how many people you involve, and how much they really want to do this, which is not too much. So really, we're doing extremely well: we're down to about 3%, and the error rate, if I remember correctly, was about 1.5%. So if we get below 1.5% on ImageNet, I would be extremely suspicious; that seems wrong.

To summarize, before 2012 computer vision looked somewhat like this: we had these feature extractors, and we trained only a small portion at the end, on top of features that were fixed. We've basically replaced the feature extraction step with a single convolutional neural network, and now we train everything completely end to end. This turns out to work quite nicely. I'll go into the details of how this works in a bit. Also, in terms of code complexity, we kind of went from a setup that looks... whoops, I'm way ahead. Okay.
We went from a setup that looks something like that in papers to something where, instead of extracting all those things, we just say: apply 20 layers of 3x3 convolutions, or something like that, and things work quite well. This is of course an exaggeration, but I think the correct first-order statement is that we've reduced code complexity quite a lot, because these architectures are so homogeneous compared to what we had before.

So it's remarkable: we had this reduction in complexity, and we had this amazing performance on ImageNet. One other thing that was quite amazing about the 2012 results, and a separate thing that did not have to be the case, is that the features you learn by training on ImageNet turn out to be quite generic, and you can apply them in different settings; in other words, transfer learning works extremely well. I haven't gone into the details of convolutional networks yet, but we start with an image, we have a sequence of layers just like in a normal neural network, and at the end we have a classifier. When you pre-train this network on ImageNet, the features you learn in the middle turn out to be transferable: you can use them on different datasets, and this works extremely well. That didn't have to be the case. You might imagine a convolutional network that works extremely well on ImageNet but that, when you run it on something else, like a birds dataset, just doesn't work well; but that is not what happens, and that's a very interesting finding in my opinion. People noticed this in roughly 2013, after the first convolutional networks. It used to be that you would compete on many computer vision datasets separately, maybe designing features for each of them separately; but you can shortcut all of those steps. You can take the pre-trained features you get from ImageNet and just train a linear classifier on top of them for every single dataset, and you obtain many state-of-the-art results across many different datasets. That was quite a remarkable finding back then, I believe.

So things worked very well on ImageNet, things transferred very well, and the code complexity got much more manageable. All this power is now available to you with very few lines of code. If you want to use a convolutional network on images, it turns out to be only a few lines of code, if you use for example Keras, one of the deep learning libraries that I'll mention again later in the talk. You basically just load a state-of-the-art convolutional neural network, you take an image, you load it, and you compute your predictions, and it tells you that there is an African elephant in that image. This takes a couple hundred milliseconds, or a couple of tens of milliseconds if you have a GPU. So everything got much faster and much simpler, works really well, and transfers really well; this was really a huge advance in computer vision.
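To make "a few lines of code" concrete, here is a minimal sketch of what such a script looks like in Keras. The talk doesn't show the exact code, so treat the specific calls as illustrative of current TensorFlow-backed Keras, and elephant.jpg as a placeholder file name:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet")   # load a pretrained state-of-the-art ConvNet

img = image.load_img("elephant.jpg", target_size=(224, 224))  # placeholder image file
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)            # 1,000 ImageNet class probabilities
print(decode_predictions(preds, top=3)[0])  # e.g. [('...', 'African_elephant', 0.9), ...]
```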
As a result of all these nice properties, ConvNets today are everywhere. Here's a collection of some of the things I tried to find across different applications. For example, you can search Google Photos for different categories, like, in this case, Rubik's cubes. You can find house numbers very efficiently. This is of course very relevant in self-driving cars, where we're doing perception in the car; convolutional networks are very relevant there. Medical image diagnosis, recognizing Chinese characters, all kinds of medical segmentation tasks, quite random tasks like whale recognition, and more generally many Kaggle challenges: satellite image analysis, recognizing different types of galaxies. You may have seen WaveNet recently from DeepMind, a very interesting paper where they generate music and speech; that's a generative model, and a ConvNet is doing most of the heavy lifting there too, a convolutional network on top of sound. And there are other tasks like image captioning. In the context of reinforcement learning and agent-environment interaction, we've also seen a lot of advances that use ConvNets as the core computational building block: when you want to play Atari games, or play AlphaGo, or Doom, or StarCraft, or if you want robots to perform interesting manipulation tasks, all of this uses ConvNets as a core computational block to do very impressive things.

Not only are we using ConvNets for a lot of different applications, we're also finding uses for them in art. Here are some examples from DeepDream: you can basically simulate what it looks like, what it feels like maybe, to be on some drugs; you take images and just hallucinate features using ConvNets. Or you might be familiar with neural style, which allows you to take arbitrary images and transfer arbitrary styles from different paintings, like a Van Gogh, onto them. This is all done with convolutional networks.

The last thing I'd like to note, which I also find interesting, is that in the process of trying to develop better computer vision architectures and optimize performance on the ImageNet challenge, we've actually ended up converging on something that potentially might function a bit like your visual cortex in some ways. These are experiments I find interesting, in which researchers studied macaque monkeys and recorded from a subpopulation of the IT cortex, the part that does a lot of object recognition. Basically, they take a monkey and a ConvNet, show both of them images, and look at how those images are represented at the end of the network, whether inside the monkey's brain or on top of your convolutional network. Looking at the representations of different images, it turns out there's a mapping between those two spaces that seems to indicate, to some extent, that some of what we're doing has somehow converged on something the brain could be doing as well in the visual cortex.

So that's just the intro. I'm now going to dive into convolutional networks and try to explain briefly how they work. Of course, there's an entire class on this that I taught, a convolutional networks class, and I'm going to distill some of those 13 lectures into one lecture, so we'll see how that goes; I won't cover everything, of course. Okay. A convolutional neural network is really just a single function, a function from the raw pixels of some kind of image; so we take a 224x224x3 image.
The 3 here is for the RGB color channels. You take the raw pixels, you put them through this function, and you get 1,000 numbers at the end, in the case of image classification, where you're trying to categorize images into 1,000 different classes. Functionally, all that's happening in a convolutional network is dot products and max operations; that's everything. But they're wired up together in interesting ways, so that you're basically doing visual recognition. In particular, this function f has a lot of knobs in it: these Ws, which participate in the dot products, in the convolutions and fully connected layers and so on, are all parameters of the network. Normally you might have on the order of 10 million parameters, and those are basically knobs that change the function. We'd like to set those knobs so that, when you put images through the function, you get probabilities that are consistent with your training data. That gives us a lot to tune, and it turns out we can do that tuning automatically with backpropagation.

More concretely, a convolutional neural network is made up of a sequence of layers, just as with normal neural networks, but we have different types of layers to play with: convolutional layers; the rectified linear unit, ReLU for short, as a nonlinearity, which I'm making explicit as its own layer; pooling layers; and fully connected layers. The core computational building block of a convolutional network, though, is the convolutional layer, with nonlinearities interspersed. We're probably getting rid of things like pooling layers, so you might see them slowly going away over time, and fully connected layers are basically equivalent to convolutional layers as well. So really, in the simplest case, it's just a sequence of convolutional layers.

Let me explain the convolutional layer, since that's the core computational building block that does all the heavy lifting. The entire ConvNet is a collection of layers, and these layers don't operate on vectors, as in a normal neural network; they operate on volumes. A layer takes a three-dimensional volume of numbers, an array. In this case, for example, we have a 32x32x3 image: the three dimensions are the width and height, and I'll refer to the third dimension as the depth. We have three channels. That's not to be confused with the depth of a network, which is the number of layers in that network; this is just the depth of a volume. The convolutional layer accepts a three-dimensional volume and produces a three-dimensional volume using some weights. The way it produces the output volume is as follows. We have a set of filters in a convolutional layer. The filters are always small spatially, say for example 5x5, but their depth always extends through the full depth of the input volume. So since the input volume has three channels, a depth of three, our filters will always match that: we have a depth of three in our filters as well. We then take those filters and convolve them with the input volume; what that amounts to is as follows (note, again, that the channel counts must match).
We take the filter and slide it through all the spatial positions of the input volume, and along the way, as we slide it, we compute dot products: w^T x + b, where w is the filter, x is a small piece of the input volume, and b is the bias. That is the convolution operation: you take the filter, slide it over all spatial positions, and compute dot products. When you do this, you end up with an activation map. In this case we get a 28x28 activation map: 28 comes from the fact that there are 28 unique positions along each axis to place a 5x5 filter in a 32x32 input, so there are 28 by 28 unique positions you can place that filter in, and in every one of them you get a single number expressing how well that filter likes that part of the input. That carves out a single activation map.

Now, in a convolutional layer we don't just have a single filter; we have an entire set of filters. Here's another filter, a green one. We slide it through the input volume, and it has its own parameters: a filter here is made up of 75 numbers (5 x 5 x 3), and this one has its own, different 75 numbers. We convolve it through, get a new activation map, and we keep doing this for all the filters in the convolutional layer. For example, if we had six filters in this convolutional layer, we'd end up with six 28x28 activation maps, and we stack them along the depth dimension to arrive at an output volume of 28x28x6. So really, what we've done is re-represent the original image, which was 32x32x3, as a kind of new image that is 28x28x6, where the six channels tell you how well every filter matches, or likes, every part of the input image.

Let's compare this operation to using a fully connected layer, as in a normal neural network. We just processed a 32x32x3 volume into a 28x28x6 volume. One question you might ask is: how many parameters would this require if we wanted a fully connected layer with the same number of output neurons, i.e. 28 * 28 * 6 of them? It turns out that would be quite a few parameters, because every single neuron in the output volume would be fully connected to all of the 32x32x3 input numbers. So every one of those 28x28x6 neurons connects to 32x32x3 inputs, which works out to about 15 million parameters, and also on that order of multiplies. You'd be doing a lot of compute and introducing a huge number of parameters into your network.

Now think about the number of parameters we've introduced with this example convolutional layer instead. We had six filters, and every one of them is a 5x5x3 filter, so multiplying that out, we have just 450 parameters (I'm not counting the biases, just the raw weights). Compared to 15 million, we've introduced very few parameters. Also, how many multiplies have we done? Computationally, how many FLOPs are we doing? We have 28 x 28 x 6 outputs to produce, and every one of those numbers is a function of a 5x5x3 region in the original image, so each is computed with 5 * 5 * 3 multiplies. You end up with only on the order of 350,000 multiplies, down from roughly 15 million.
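Here is a minimal NumPy sketch of that sliding-filter computation, under the dimensions just described (32x32x3 input, six 5x5x3 filters, stride 1, no padding). It's written for clarity, not speed:

```python
import numpy as np

# Dimensions from the example: 32x32x3 input, six 5x5x3 filters, stride 1, no padding.
x = np.random.randn(32, 32, 3)        # input volume (height, width, channels)
w = np.random.randn(6, 5, 5, 3)       # six filters; 6 * 5 * 5 * 3 = 450 weights total
b = np.random.randn(6)                # one bias per filter

out = np.zeros((28, 28, 6))           # (32 - 5) + 1 = 28 unique positions per axis
for k in range(6):                    # for every filter...
    for i in range(28):               # ...slide over all spatial positions
        for j in range(28):
            patch = x[i:i+5, j:j+5, :]                  # 5x5x3 piece of the input
            out[i, j, k] = np.sum(patch * w[k]) + b[k]  # the dot product w^T x + b

# Cost: 28 * 28 * 6 outputs, each needing 5 * 5 * 3 multiplies = 352,800 multiplies,
# versus roughly (28*28*6) * (32*32*3) ~ 14.5 million weights (and multiplies)
# for a fully connected layer with the same number of outputs.
```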
So we're doing fewer FLOPs and using fewer parameters. Really, what we've done here is make assumptions. A fully connected layer could compute the exact same thing: some specific setting of those 15 million parameters would produce the exact output of this convolutional layer. But we've done it much more efficiently, by building in these biases. In particular, since we have fixed filters that we slide across space, we've assumed that if there's some interesting feature you'd like to detect in one part of the image, say the top left, then that feature will also be useful somewhere else, like the bottom right, because we apply the same filters at all spatial positions equally. You might notice that this is not always something you want. For example, if your inputs are centered face images and you're doing some kind of face recognition, you might actually want different filters at different spatial positions: for the eye regions you might want eye-like filters, and for the mouth region mouth-specific features, and so on. In that case you might not want to use a convolutional layer, because its features have to be shared across all spatial positions. The second assumption we've made is that the filters are small and local, so we don't have global connectivity, only local connectivity. But that's okay, because we stack these convolutional layers in sequence, and the neurons grow their receptive fields as you stack convolutional layers on top of each other (I'll show a quick numerical sketch of this in a moment). So at the end of the ConvNet, those neurons end up being a function of the entire image eventually.

To give you an idea of what these activation maps look like concretely, here's an example image in the top left; this is part of a car, I believe. We have 32 different small filters here, and if we convolve them with this image, we end up with these activation maps: this filter, if you convolve it, gives this activation map, and so on. This one, for example, has some orange stuff in it, so when we convolve it with the image, the white here denotes that the filter matches that part of the image quite well. So we get these activation maps, we stack them up, and they go into the next convolutional layer.

The way this looks, then, is that we process the input with a convolutional layer, we get some output, we apply a rectified linear unit or some other nonlinearity as usual, and then we just repeat that operation: we keep plugging these volumes into the next convolutional layer, and they plug into each other in sequence, so we end up processing the image over time. That's the convolutional layer. Now, you'll notice there are a few more layer types; in particular, the pooling layer, which I'll explain very briefly.
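First, here's that quick receptive-field sketch: assuming stride-1 convolutions and no pooling, each additional k x k conv layer adds k - 1 pixels to the receptive field:

```python
# Receptive field of stacked stride-1 convolutions (no pooling):
# each additional k x k conv layer adds (k - 1) pixels.
def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(receptive_field([5]))        # 5: a single 5x5 conv sees a 5x5 patch
print(receptive_field([3, 3, 3]))  # 7: three stacked 3x3 convs see a 7x7 patch
```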
The pooling layer is quite simple. If you've used Photoshop or something like it, you've taken a large image and resized it, downsampling it. Pooling layers do basically exactly that, but on every single channel independently: for each channel of the input volume, we pluck out that activation map, downsample it, and that becomes the corresponding channel of the output volume. So it's really just a downsampling operation on these volumes. One of the most common ways of doing this, especially in the context of neural networks, is the max pooling operation. Here it would be common, for example, to use 2x2 filters at stride 2 with a max operation. If this is an input channel in a volume, what that amounts to is that we chop it into 2x2 regions and take a max over each group of four numbers to produce one piece of the output. So this is a very cheap operation that downsamples your volumes, and it's really a way of controlling the capacity of the network: you don't want too many numbers, you don't want things to be too computationally expensive, and it turns out pooling lets you downsample your volumes and do less computation without hurting performance too much. So we use pooling basically as a way of controlling the capacity of these networks.

The last layer I want to briefly mention is, of course, the fully connected layer, which is exactly what you're already familiar with. We have these volumes throughout as we process the image; at the end you're left with a volume, and now you'd like to predict some classes. So we take that volume, stretch it out into a single column, and apply a fully connected layer, which really amounts to a matrix multiplication; that gives us probabilities after applying a softmax or something like that.

Let me now briefly show you a demo of what a convolutional network looks like. This is ConvNetJS, a deep learning library for training convolutional neural networks that is implemented in JavaScript; I wrote it maybe two years ago at this point. Here we're training a convolutional network on the CIFAR-10 dataset. CIFAR-10 is a dataset of 50,000 images; each image is 32x32x3, and there are 10 different classes. We're training this network in the browser, and you can see that the loss is decreasing, which means we're classifying these inputs better and better. Here's the network specification, which you can play with, since this is all running in the browser: you can just change it and experiment. This is an input image, and for this convolutional network I'm showing all the intermediate activations, all the intermediate activation maps we're producing. Here we have a set of filters; we convolve them with the image and get all these activation maps. I'm also showing the gradients, but I don't want to dwell on that too much. Then you threshold: the ReLU clamps anything below zero to zero. And then you pool.
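As a quick aside, here's what that max pooling step does numerically: a minimal NumPy sketch of 2x2, stride-2 max pooling on a single channel (height and width assumed even):

```python
import numpy as np

# 2x2 max pooling with stride 2 on one channel.
def max_pool_2x2(a):
    h, w = a.shape
    # carve the map into 2x2 blocks and take the max of each block
    blocks = a.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

a = np.arange(16).reshape(4, 4)
print(max_pool_2x2(a))   # 4x4 map -> 2x2 map; each output is a max over 4 numbers
```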
So this is just a downsampling operation, and then another convolution, ReLU, pool, conv, pool, and so on, until at the end we have a fully connected layer and then our softmax, so that we get probabilities out, and then we apply a loss to those probabilities and backpropagate. You can see that I've been training in this tab for the last maybe 30 seconds or a minute, and we're already getting about 30% accuracy on CIFAR-10. These are test images from CIFAR-10, and these are the outputs of the convolutional network; you can see it has already learned that this is a car, or something like that. So this trains pretty quickly in JavaScript, and you can play with it, change the architecture, and so on.

Another thing I'd like to show you is this video, because it gives you a very intuitive, visceral feeling for exactly what these networks compute. It's a very good video by Jason Yosinski, from the Deep Visualization Toolbox; you can download the code and play with this interactive convolutional network demo yourself. [The video plays:] "Neural networks have enabled computers to better see and understand the world. They can recognize school buses and..." What we're seeing here are activation maps shown in real time as the demo runs; these are for the conv1 layer of an AlexNet, which we'll go into in much more detail, and these are the different activation maps being produced at this point. "...a neural network called AlexNet running in Caffe. By interacting with the network, we can see what some of the neurons are doing. For example, on this first layer, a unit in the center responds strongly to light-to-dark edges. Its neighbor, one neuron over, responds to edges in the opposite direction, dark to light. Using optimization, we can synthetically produce images that light up each neuron on this layer to see what each neuron is looking for. We can scroll through every layer in the network to see what it does, including convolution, pooling, and normalization layers. We can switch back and forth between showing the actual activations and showing images synthesized to produce high activation. By the time we get to the fifth convolutional layer, the features being computed represent abstract concepts. For example, this neuron seems to respond to faces. We can further investigate this neuron by showing a few different types of information. First, we can artificially create optimized images using new regularization techniques that are described in our paper. These synthetic images show that this neuron fires in response to a face and shoulders. We can also plot the images from the training set that activate this neuron the most, as well as the pixels from those images most responsible for the high activations, computed via the deconvolution technique. This feature responds to multiple faces in different locations, and by looking at the deconv, we can see that it would respond more strongly if we had even darker eyes and rosier lips. We can also confirm that it cares about the head and shoulders but ignores the arms and torso. We can even see that it fires to some extent for cat faces. Using backprop or deconv, we can see that this unit depends most strongly on a couple of units in the previous layer, conv4, and on about a dozen or so in conv3. Now let's look at another neuron on this layer. So what's this unit doing?"
"From the top nine images, we might conclude that it fires for different types of clothing. But examining the synthetic images shows that it may be detecting not clothing per se, but wrinkles. In the live plot, we can see that it's activated by my shirt, and smoothing out half of my shirt causes that half of the activations to decrease. Finally, here's another interesting neuron. This one has learned to look for printed text in a variety of sizes, colors, and fonts. This is pretty cool, because we never asked the network to look for wrinkles or text or faces. The only labels we provided were at the very last layer, so the only reason the network learned features like text and faces in the middle was to support final decisions at that last layer. For example, the text detector may provide good evidence that a rectangle is in fact a book seen on edge, and detecting many books next to each other might be a good way of detecting a bookcase, which was one of the categories we trained the net to recognize. In this video, we've shown some of the features of the Deep Viz Toolbox." Okay, so I encourage you to play with that; it's really fun. I hope that gives you an idea of exactly what's going on: there are these convolutional layers, we downsample from time to time, there are usually some fully connected layers at the end, but mostly it's just convolutional operations stacked on top of each other.

What I'd like to do now is dive into some details of how these architectures are actually put together. I'll do this by going over the winners of the ImageNet challenges: I'll tell you about the architectures, how they came about, and how they differ, so you'll get a concrete idea of what these architectures look like in practice.

We'll start off with the AlexNet in 2012. Just to give you an idea of the sizes of these networks and the images they process, the AlexNet took 227x227x3 images. The first layer of the AlexNet, for example, was a convolutional layer with 11x11 filters applied at a stride of 4, and there were 96 of them. I didn't fully explain stride because I wanted to save some time, but intuitively it just means that as you slide the filter across the input, you don't have to move it one pixel at a time; you can jump a few pixels at a time. So we have 11x11 filters with a stride, a skip, of 4, and we have 96 of them. You can try to compute what the output volume is if you apply this convolutional layer to that input volume; I didn't go into the details of how you compute that, but there are formulas for it, and you can look into the details in the class. You arrive at a 55x55x96 output volume. As for the total number of parameters in this layer: we have 96 filters, and every one of them is 11x11x3, because 3 is the input depth of these images. So it amounts to 11 * 11 * 3, times 96 filters: about 35,000 parameters in this very first layer. The second layer of the AlexNet is a pooling layer: we apply 3x3 filters at a stride of 2 and do max pooling. You can again compute the output volume size after applying this to that volume, and with some very simple arithmetic you arrive at 27x27x96. So this is the downsampling operation. You can also think about the number of parameters in this pooling layer, and of course it's zero.
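Those formulas are simple enough to sketch here. The standard output-size rule is out = (W - F + 2P) / S + 1 for input width W, filter size F, stride S, and padding P; applied to these two AlexNet layers (with P = 0):

```python
# Output size of a conv or pool layer: (W - F + 2P) / S + 1
def out_size(W, F, S, P=0):
    return (W - F + 2 * P) // S + 1

# AlexNet conv1: 227x227x3 input, 96 filters of 11x11x3, stride 4
print(out_size(227, 11, 4))   # 55 -> output volume 55x55x96
print(11 * 11 * 3 * 96)       # 34,848 weights (~35K), plus 96 biases

# AlexNet pool1: 3x3 max pooling at stride 2 -> no parameters at all
print(out_size(55, 3, 2))     # 27 -> output volume 27x27x96
```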
So pooling layers compute a fixed downsampling operation; there are no parameters involved in a pooling layer. All the parameters are in the convolutional layers and the fully connected layers, which are to some extent equivalent to convolutional layers. You can go ahead and, based on the description in the paper (although I think it's non-trivial from the description in this particular paper), decipher what the volumes are throughout, and you can look at the patterns that emerge in how the number of filters increases in the higher convolutional layers: we start off with 96, then go to 256 filters, then to 384, and eventually to 4,096-unit fully connected layers. You'll also see normalization layers here, which have since become somewhat deprecated; it's no longer common to use the normalization layers that were used at the time in the AlexNet architecture.

What's interesting to note is how this differs from the 1998 Yann LeCun network. I usually like to think about four things that hold back progress, at least in deep learning: data, compute, and then I like to differentiate between algorithms and infrastructure, algorithms being something that feels like research and infrastructure being something that feels like a lot of engineering. We've had progress on all four of those fronts. We see that in 1998, the data you could get hold of was maybe on the order of a few thousand examples, whereas now we have a few million: three orders of magnitude more data. For compute, GPUs became available and we use them to train these networks; they're roughly 20 times faster than CPUs, and of course the CPUs we have today are much, much faster than the CPUs of 1998. I don't know exactly what that works out to, but I wouldn't be surprised if it's again on the order of three orders of magnitude of improvement. Let me skip over algorithms for a moment and talk about infrastructure: here we're talking about NVIDIA releasing the CUDA library, which lets you efficiently run all these matrix-vector operations on arrays of numbers. That's a piece of software we rely on and take advantage of that wasn't available before. And finally, algorithms is an interesting one, because in those 20 years there's been much less improvement in algorithms than in the other three pieces. What we've done with the 1998 network is mostly make it bigger: more channels, and somewhat more layers. The two really new things algorithmically are dropout and rectified linear units. Dropout is a regularization technique developed by Geoff Hinton and colleagues, and rectified linear units are nonlinearities that train much faster than sigmoids and tanhs. This paper actually had a plot showing that rectified linear units trained a good bit faster than sigmoids, and intuitively that's because of the vanishing gradient problem: in very deep networks with sigmoids, the gradients vanish, as Hugo was discussing in the last lecture. What's also interesting to note is that both dropout and ReLU are basically one- or two-line code changes, so it's about a two-line diff in total over those 20 years, and both of those diffs consist of setting things to zero: with ReLU, you set things to zero when they're below zero, and with dropout, you set things to zero at random.
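Concretely, here is roughly what those two one-line diffs look like in NumPy. This is a sketch, not the original papers' code; the inverted-dropout rescaling shown here is the common modern convention (the original formulation instead rescaled at test time):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)                          # set negative values to zero

def dropout(x, p=0.5, train=True):
    if not train:
        return x                                     # no-op at test time
    mask = (np.random.rand(*x.shape) > p) / (1 - p)  # zero out a random subset, rescale
    return x * mask
```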
So apparently it's a good idea to set things to zero; that's what we've learned. If you're trying to find a cool new algorithm, look for a one-line diff that sets something to zero. It will probably work better, and we could add you to this list.

Now, to give you an idea of the hyperparameters in this architecture: it was the first major use of rectified linear units; it used normalization layers, which are not used anymore, at least not in the specific form from this paper; and it used heavy data augmentation, meaning you don't pipe the images into the network exactly as they come from the dataset, but instead jitter them spatially, warp them, and change the colors a bit, all at random, because you're trying to build in some invariance to these small perturbations, and you're basically hallucinating additional data. It was also one of the first real uses of dropout. And you see roughly standard hyperparameters: batch sizes of roughly 128; stochastic gradient descent with momentum, usually 0.9; learning rates of 1e-2, reduced in the normal way, roughly by a factor of 10 whenever the validation error stops improving; and just a bit of weight decay, 5e-4. And ensembling always helps: you train seven independent convolutional networks separately and just average their predictions, which always gives you an additional 2% improvement. So this is AlexNet, the winner of 2012.

In 2013, the winner was the ZFNet, developed by Matthew Zeiler and Rob Fergus, and it was an improvement on top of the AlexNet architecture. In particular, one of the bigger differences was in the first convolutional layer, where they went from 11x11 stride 4 to 7x7 stride 2: slightly smaller filters, applied more densely. They also noticed that if you make the convolutional layers in the middle larger, you actually gain performance. So they managed to improve a bit. Matthew Zeiler then became the founder of Clarifai, worked on this a bit more there, and managed to push the performance to 11%, which was the winning entry at the time; but we don't actually know what gets you from 14% to 11%, because Matthew never disclosed the full details of what happened there. He did say that it was more tweaking and optimizing of these hyperparameters. So that was the 2013 winner.

In 2014 we saw a slightly bigger diff. One of the networks introduced then was the VGGNet, from Karen Simonyan and Andrew Zisserman. They explored a few architectures, and the one that ended up working best was this "D" column, which is why I'm highlighting it. What's beautiful about the VGGNet is that it's so simple. You might have noticed that in the previous networks you have all these different filter sizes, different layers, different amounts of stride; everything looks a bit hairy, and you're not sure where the hyperparameters are coming from. The VGGNet is extremely uniform.
All you do is 3x3 convolutions with stride 1 and pad 1, and 2x2 max pooling with stride 2, and you do this throughout, a completely homogeneous architecture: you just alternate a few conv layers and a pool layer, and you get top performance. They managed to reduce the error down to 7.3% with the VGGNet, just with this very simple and homogeneous architecture. I've also written out this "D" architecture here so you can see it; I'm not sure how instructive it is because it's kind of dense, and perhaps you can look at it offline, but you can see how the volumes develop and the sizes of these filters. They're always 3x3, but the number of filters again grows: we start off with 64, then go to 128, 256, 512, so we're just doubling it over time.

I also have a few numbers here to give you an idea of the scale at which these networks normally operate. We have on the order of 140 million parameters, which is actually quite a lot; I'll show you in a bit that this can be about five or ten million parameters and work just as well. It's about 100 megabytes of memory per image in the forward pass, and the backward pass also needs roughly that much. So those are roughly the numbers we're working with here. You can also note, and this is true of convolutional networks in general, that most of the memory is in the early convolutional layers, while most of the parameters, at least when you use these giant fully connected layers at the top, are at the end.

The winner in 2014 was actually not the VGGNet; I presented it only because it's such a simple architecture. The winner was GoogLeNet, with a slightly hairier architecture, we should say. It's still a sequence of things, but in this case they put inception modules in sequence, and this is an example inception module. I don't have too much time to go into the details, but you can see it consists basically of convolutions of different sizes and strides and so on. GoogLeNet looks slightly hairier, but it turns out to be more efficient in several respects. For example, it worked a bit better than the VGGNet, at least at the time, and it has only five million parameters, compared to the VGGNet's 140 million: a huge reduction. You get that, by the way, by just throwing away the fully connected layers. You'll notice in this breakdown that the fully connected layers here have 100 million and 16 million parameters; it turns out you don't actually need that, and taking them away doesn't hurt performance too much, so you can get a huge reduction in parameters. We can also compare to the original AlexNet: fewer parameters, a bit more compute, and much better performance. So GoogLeNet was really optimized to have a low footprint, memory-wise, computation-wise, and parameter-wise, whereas the VGGNet is a very beautiful, homogeneous architecture with some inefficiencies in it.
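To make the VGG recipe concrete, here's a Keras-style sketch of that repeating pattern: 3x3 convolutions with stride 1 and "same" padding (the pad-1 equivalent), doubling the filter count, with 2x2/stride-2 max pooling between stages. The stage layout below follows the "D" configuration's convolutional part, but treat this as an illustration rather than a faithful reimplementation:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(keras.Input(shape=(224, 224, 3)))

# The whole network is just this pattern repeated, doubling the filters each stage:
for filters, reps in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
    for _ in range(reps):
        # 3x3 conv, stride 1, "same" padding keeps the spatial size fixed
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
    model.add(layers.MaxPooling2D(pool_size=2, strides=2))  # 2x2 max pool, stride 2

model.summary()  # ends at a 7x7x512 volume; VGG then adds fully connected layers
```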
Okay, so that's 2014. In 2015 we had a slightly bigger delta on top of these architectures. Up to this point, if Yann LeCun had looked at these architectures with his 1998 eyes, he would still have recognized everything: everything looks very simple, and we had mostly just played with the hyperparameters.

One of the first really bigger departures, I would argue, came in 2015 with the introduction of residual networks. This is work from Kaiming He and colleagues at Microsoft Research Asia. They not only won the ImageNet challenge in 2015, they won a whole bunch of challenges, all just by applying these residual networks trained on ImageNet and fine-tuned on the different tasks; you can basically crush lots of different tasks whenever you get a new, awesome ConvNet. At this point the performance was down to 3.57%, from these residual networks; this is 2015. The paper also made the point that the number of layers keeps going up, and that with residual networks, as we'll see in a bit, you can introduce many more layers, which correlates strongly with performance. We've since found that you can in fact make these residual networks quite a lot shallower...
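The core idea behind those residual networks: each block computes a small correction F(x) and adds it back to its input, y = F(x) + x, so gradients can flow straight through the skip connections even in very deep stacks. Here's a minimal Keras-style sketch of one such block; it's an illustration of the idea, not the paper's exact block (which also uses batch normalization, and projections when shapes change):

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    # F(x): two 3x3 convolutions (the real blocks also include batch normalization)
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same")(f)
    # The skip connection: the block's output is F(x) + x.
    # (Assumes x already has `filters` channels so the shapes match.)
    out = layers.Add()([f, x])
    return layers.Activation("relu")(out)
```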