Today we'll talk about how to make machines see: computer vision. We will present a competition that, unlike DeepTraffic, which is designed to explore ideas and teach you the concepts of deep reinforcement learning, is at the very cutting edge: SegFuse, the deep dynamic driving scene segmentation competition that I'll present today. Whoever does well in this competition is likely to produce a publication, or ideas that would lead the world in the area of perception, perhaps together with the people running this class, perhaps on your own, and I encourage you to do so. Even more cats today. Computer vision today, as it stands, is deep learning: the majority of the successes in how we interpret, form representations of, and understand images and videos utilize neural networks to a significant degree, the very ideas we've been talking about. That applies for supervised, unsupervised, and reinforcement learning; the supervised case is the focus of today. The process is the same, and the data is essential. There's annotated data, where a human provides the labels that serve as the ground truth in the training process; then the neural network goes through that data, learning to map from the raw sensory input to the ground-truth labels, and then generalizes to the testing data set. And the raw sensor data we're dealing with is numbers. I'll say this again and again: we humans take this particular aspect of our visual ability for granted, taking in raw sensory information through our eyes and interpreting it, but to a machine it's just numbers. Whether you're an expert computer vision person or new to the field, something you have to always go back and meditate on is what the machine is given, what data it is tasked to work with in order to perform the task you're asking it to do. Perhaps the data it's given is highly insufficient for what you want it to do. That's the
question that will come up again and again: are images enough to understand the world around you? Given these numbers, sometimes with one channel, sometimes with three (RGB), where every single pixel has three different color values, the task is to classify or regress: produce a continuous variable, or one of a set of class labels. As before, we must be careful about our intuition of what is hard and what is easy in computer vision. Let's take a step back to the inspiration for neural networks, our own biological neural networks, because the human vision system and the computer vision system are somewhat similar in this regard. The visual cortex is arranged in layers, and as information passes from the eyes to the parts of the brain that make sense of the raw sensory information, higher and higher order representations are formed. This is the inspiration, the idea behind using deep neural networks for images: higher and higher order representations form through the layers, the early layers taking in the very raw sensory information, then extracting edges, connecting those edges, composing them into more complex features, and finally arriving at the higher-order semantic meaning we hope to get from these images. In computer vision, deep learning is hard. Illumination variability is the biggest challenge, or at least one of the biggest challenges, in driving for visible-light cameras. Pose variability: as I'll discuss with some of the advances of Geoff Hinton and capsule networks, neural networks as they're currently used in computer vision are not good at representing variable pose. Objects in images, in this 2D plane of color and texture, look very different numerically when the object is rotated or deformed: the deformable, truncated cat. Intraclass variability: for the classification task, which will be our running example today
to introduce some of the networks from the past decade that have seen success, and the intuition and insight that made those networks work. In classification there is a lot of variability inside the classes and very little variability between the classes: all of the images on top are cats, all of those on the bottom are dogs, and they look very different. The other, I would say the second biggest, problem in driving perception with visible-light cameras is occlusion. When part of an object is occluded, due to the three-dimensional nature of our world, some objects are in front of others and they occlude the objects behind them, and yet we're still tasked with identifying the object when only part of it is visible. Sometimes that part (I told you there'd be cats) is barely visible: here we're tasked with classifying a cat with just the ears visible, or just a leg. And on a philosophical level, as we'll talk about with the motivation for our competition: here's a cat dressed as a monkey, eating a banana. Most of us understand what's going on in this scene. In fact, a neural network today can successfully classify this video as a cat, but the context, the humor of the situation (in fact, you could argue it's a monkey) is missing. What else is missing is the dynamic information, the temporal dynamics of the scene. That's what's missing in a lot of the perception work done to date in the autonomous vehicle space with visible-light cameras, and we're looking to expand on that. That's what SegFuse is all about. The image classification pipeline: there's a bin for each category, each class: cat, dog, mug, hat. Inside those bins there are a lot of examples of each, and your task, when a new example comes along that you've never seen before, is to put that image in a bin. It's the same as the machine learning tasks we've seen before, and everything relies on data that's been ground-truthed, that's been labeled by human beings. MNIST is a toy data set of
handwritten digits, often used as an example, and COCO, CIFAR, ImageNet, Places, and a lot of other incredible, rich data sets of hundreds of thousands or millions of images are out there, representing scenes, people's faces, and different objects. Those all provide ground-truth data for testing algorithms and for competing architectures to be evaluated against each other. CIFAR-10, one of the simplest, almost toy, data sets, of tiny images in ten categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), is commonly used to explore some of the basic convolutional neural networks we'll discuss. So let's come up with a very trivial classifier to explain the concept. In fact, if you started to think about how to classify an image without knowing any of these techniques, this is perhaps the approach you would take: you would subtract images. In order to know that an image of a cat is different from an image of a dog, you have to compare them. Given those two images, how do you compare them? One way is to just subtract them and then sum all the pixel-wise differences: subtract the intensity pixel by pixel and sum it up. If that difference is really high, the images are very different. Using that metric we can build a classifier on CIFAR-10: based on this difference function, for a new image I'm going to pick the one of the 10 bins containing the image with the lowest difference; find the image in the data set that is most like the image I have, and put mine in the same bin that image is in. There are 10 classes, so if we just guess at random, the accuracy of our classifier will be 10%. Using our image-difference classifier we can actually do much better than random, much better than 10%: we can get 35 to 38 percent accuracy. That's our first classifier.
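This pixel-difference classifier can be sketched in a few lines of NumPy. This is a toy illustration, not CIFAR-10 itself: the tiny 2x2 "images" and the "dark"/"bright" labels are made up for the example.

```python
import numpy as np

def l1_distance(a, b):
    # Sum of absolute pixel-wise intensity differences between two images.
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

def nearest_neighbor_classify(image, train_images, train_labels):
    # Put the new image in the same bin as the most similar training image.
    distances = [l1_distance(image, t) for t in train_images]
    return train_labels[int(np.argmin(distances))]

# Toy "training set": two 2x2 grayscale images with known labels.
train_images = [np.array([[0, 0], [0, 0]]), np.array([[255, 255], [255, 255]])]
train_labels = ["dark", "bright"]
print(nearest_neighbor_classify(np.array([[10, 0], [5, 0]]), train_images, train_labels))  # dark
```

On real CIFAR-10 the same idea, just with 32x32x3 arrays and ten bins, is what gets the 35 to 38 percent accuracy mentioned above.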
K-nearest neighbors: let's take our classifier to a whole new level. Instead of trying to find the one image that's closest in our data set, we try to find the K closest and ask what class the majority of them belong to, and we can take that K and increase it from 1 to 2 to 3 to 4 to 5 and see how that changes the result. With seven nearest neighbors, which is optimal under this approach for CIFAR-10, we achieve about 30% accuracy. Human level is about 95% accuracy, and with convolutional neural networks we'll get very close to 100%; that's where neural networks shine at this very task of binning images. It all starts with the basic computational unit: signals come in, each of the signals is weighted, they are summed, a bias is added, and the result is put through a nonlinear activation function that produces an output. The nonlinear activation function is key. All of these put together, with more and more hidden layers, form a deep neural network, and that deep neural network is trained, as we've discussed, by taking a forward pass on examples with ground-truth labels, seeing how close the outputs are to the real ground truth, and then punishing the weights that resulted in incorrect decisions and rewarding the weights that resulted in correct decisions. For the MNIST case, the input is handwritten digits, and we want our network to classify what is in an image of a handwritten digit: is it 0, 1, 2, and so on through 9. The way it's often done is that there are ten outputs of the network, and each of the neurons on the output is responsible for getting really excited when its number is called, while everybody else is supposed to stay not excited. Therefore the number of classes is the number of outputs; that's how it's commonly done, and you assign a class to the input image based on the neuron which produces the highest output. But that's for a fully connected network, which we discussed on Monday. In deep learning there are a lot of tricks that make things work, that make training much more efficient.
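The ten-output scheme just described, one output neuron per digit, with the most excited neuron winning, can be sketched as follows. The weights here are random placeholders standing in for a trained network, and the 784-element input stands for a flattened 28x28 digit; none of this is an actual trained MNIST model.

```python
import numpy as np

def dense_layer(x, weights, bias):
    # The basic computational unit, vectorized: weighted sum plus bias.
    return x @ weights + bias

def predict_digit(x, weights, bias):
    # One output per class (0-9); the neuron with the highest output wins.
    outputs = dense_layer(x, weights, bias)
    return int(np.argmax(outputs))

rng = np.random.default_rng(0)
x = rng.random(784)              # a flattened 28x28 handwritten digit
weights = rng.random((784, 10))  # untrained placeholder weights
bias = np.zeros(10)
print(predict_digit(x, weights, bias))  # some class in 0..9
```

Training would adjust `weights` and `bias` so the right neuron fires; here we only illustrate the forward pass and the argmax class assignment.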
On large-class problems, where there are a lot of classes, and on large data sets, when the representation the neural network is tasked with learning is extremely complex, that's where convolutional neural networks step in. The trick they use is spatial invariance: the idea that a cat in the top left corner of an image is the same as a cat in the bottom right corner, so we can learn the same features across the image. That's where the convolution operation steps in. Instead of the fully connected networks, here there's a third dimension of depth: the blocks in this neural network take 3D volumes as input and produce 3D volumes as output. We take a slice of the image, a window, and slide it across, applying the same exact weights. We'll go through an example: the same kind of weights that, in the fully connected network, sit on the edges mapping input to output are here used to map this window of the image to the output. And you can make many such convolutional filters, many layers, many different options for what kind of features you look for in an image, what kind of window you slide across, in order to extract all kinds of things: all kinds of edges, all kinds of higher-order patterns. The very important thing is that the parameters of each of these filters, applied to these windows, are shared: if the feature that defines a cat is useful in the top left corner, it's useful in the top right corner, it's useful in every part of the image. This is the trick that lets convolutional neural networks save a lot of parameters, reduce the parameter count significantly: the reuse, the spatial sharing, of features across the space of the image. The depth of these 3D volumes is the number of filters; the stride is the skip of the filter, the step size, how many pixels you skip when you apply the filter to the input; and the padding is the zero padding on the outside of the input to a
convolutional layer. Let's go through an example (the slides are now available online, so you can follow along). On the left here is an input volume with three channels: the three squares in the left column are the three channels, with numbers inside them. Then we have filters in red, two of them, each with a slice per input channel and a bias, and each filter is three by three. What we do is take those three-by-three filters, which are to be learned (these are our variables, our weights), and slide them across the image to produce the output on the right, in green. By applying the filters in red, we go from the input volume on the left to the output volume in green on the right. You can pull up the slides yourself if you can't see the numbers on the screen, but these operations are performed on the input to produce the single value highlighted in green in the output, and we slide this convolutional filter along the image with a stride, in this case, of two, skipping along, summing into the two-channel output in green. That's it: the convolution operation, what's called a convolutional layer in neural networks. The parameters here, besides the biases, are the red values in the middle; that's what we're trying to learn. There are a lot of interesting tricks we'll discuss today on top of this, but this is the core: the spatially invariant sharing of parameters that makes convolutional neural networks able to efficiently learn and find patterns in images. To build your intuition a little bit more about convolution: here's an input image on the left, and the identity filter produces the output you see on the right.
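The sliding-window arithmetic in that example can be written directly. This is a minimal single-channel sketch (the slide's example uses three input channels and two filters, but the per-window operation is the same): slide the kernel, multiply element-wise, sum.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the kernel across the image, computing a weighted sum at each step.
    # The same weights are applied at every position: spatial sharing.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(16).reshape(4, 4).astype(float)
identity_kernel = np.array([[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
print(conv2d(image, identity_kernel))  # picks out the center pixel of each 3x3 window
```

Changing `stride` to 2 makes the window skip along, shrinking the output, exactly the stride-2 behavior in the slide's example; padding would be zero-padding `image` before the loop.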
Then there are different kinds of edges you can extract, with the resulting activation maps shown on the right: when applying those edge-detection filters to the image on the left, the parts shown in white are where the filters activate. You can use any kind of filter (that's what we're trying to learn): any kind of edge, any kind of pattern. You slide the window along the image in the way shown here and produce the output you see on the right, and depending on how many filters you have at every level, you have many such slices: the input on the left, the outputs on the right. If you have dozens of filters, you have dozens of images on the right, each with different results showing where each individual filter's pattern was found. We learn which patterns are useful to look for in order to perform the classification task; that's the task for the neural network, to learn these filters. And the filters form higher and higher orders of representation, going from the very basic edges to high-level semantic meaning that spans entire images. The ability to span images can be achieved in several ways, but traditionally it has been done successfully through max pooling: taking the output of the convolution operation and reducing its resolution by condensing that information, for example by taking the maximum values, the maximum activations, thereby reducing the spatial resolution. That has detrimental effects, as we'll discuss for scene segmentation, but it's beneficial for finding higher-order representations in the images, representations that bring features together to form the entity we're trying to identify and classify. Okay, so that forms a convolutional neural network: such convolutional layers stacked on top of each other are the only addition to a neural network that makes for a convolutional neural network.
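Max pooling as just described, keeping only the strongest activation in each window and thereby shrinking the spatial resolution, can be sketched as follows (toy activation values, non-overlapping 2x2 windows):

```python
import numpy as np

def max_pool(activations, size=2):
    # Keep the maximum activation in each non-overlapping size x size window,
    # halving the spatial resolution when size=2.
    h, w = activations.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            out[i, j] = activations[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [7., 2., 9., 8.],
                 [1., 0., 3., 4.]])
print(max_pool(fmap))  # [[6. 5.] [7. 9.]]
```

Note how the exact position of each strong activation inside its window is discarded; that is precisely the information loss that hurts pixel-level segmentation later.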
Then at the end, fully connected layers, or other architectures, allow us to apply it to particular domains. Let's take ImageNet as a case study: ImageNet the data set, and ImageNet the challenge, where the task is classification. As I mentioned in the first lecture, ImageNet is one of the largest image data sets in the world, with 14 million images, 21,000 categories, and a lot of depth in many of the categories: as I mentioned, 1,200 Granny Smith apples. These allow the neural networks to learn rich representations across pose, lighting variability, and intraclass variation for particular classes like Granny Smith apples. So let's look through the various networks, discuss them, and see the insights. It started with AlexNet, the first really big successful GPU-trained neural network on ImageNet, which achieved a significant boost over the previous year, and moved on to VGGNet, GoogLeNet, ResNet, CUImage, and SENet in 2017. The accuracy numbers I'll show are based on the top-five error rate: you get five guesses, and if one of the five is correct, you get credit for that particular image; otherwise it's an error. When a human tries to perform the same task as the machine, the error is 5.1 percent. The human annotation of the images is performed as binary classification (Granny Smith apple or not; cat or not), while the actual task that the machine, and the human competing with it, has to perform is: given an image, provide one of the many classes. Human error on that task is 5.1%, which was surpassed in 2015 by ResNet, achieving roughly 4 percent error. So let's start with AlexNet; I'll zoom in on the later networks, which have some interesting insights, but AlexNet and VGGNet both follow a very similar architecture, very uniform throughout their depth: VGGNet in
2014 is convolution, convolution, pooling; convolution, pooling; convolution, pooling; and fully connected layers at the end. There's a certain beautiful simplicity and uniformity to these architectures: you can just make them deeper and deeper, which makes them very amenable to implementation in a layer-stacked way in any of the deep learning frameworks; it's clean and easy to understand. In the case of VGGNet it was 16 or 19 layers with 138 million parameters, without many optimizations, and therefore the parameter count is much higher than in the networks that followed, despite the layer count not being that large. GoogLeNet introduced the inception module, starting to do some interesting things with small modules inside these networks, which allowed the training to be more efficient and effective. The idea behind the inception module, shown here with the previous layer on the bottom and the module's output produced on top, is that different convolution sizes provide different value for the network: smaller convolutions are able to capture and propagate forward features that are very local, high-resolution in texture, while larger convolutions are better able to represent and capture highly abstracted, higher-order features. So the inception module says: as opposed to choosing which convolution size to go with in a hyperparameter-tuning or architecture-design process, why not do all of them together? In the case of the GoogLeNet model there are the one-by-one, three-by-three, and five-by-five convolutions, with our old trusty friend max pooling still left in there as well (which has lost favor more and more over time for the image classification task). The result is that fewer parameters are required: if you place these inception modules correctly, the number of parameters required to achieve a given performance is much lower.
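The "do all filter sizes together" idea can be illustrated at the level of shapes. This sketch uses zero-filled placeholder feature maps rather than real branch computations, and the channel counts are made up; the point is only that each branch produces a map of the same height and width, so the branches can be concatenated along the channel axis for the next layer to consume.

```python
import numpy as np

def inception_concat(branches):
    # Each branch output has shape (height, width, channels); concatenating
    # along the channel axis lets later layers see all filter sizes at once.
    return np.concatenate(branches, axis=-1)

h, w = 8, 8
branch_1x1 = np.zeros((h, w, 16))   # 1x1 convolutions: local, high-resolution features
branch_3x3 = np.zeros((h, w, 32))   # 3x3 convolutions
branch_5x5 = np.zeros((h, w, 8))    # 5x5 convolutions: wider, more abstract context
branch_pool = np.zeros((h, w, 8))   # the max-pooling branch
out = inception_concat([branch_1x1, branch_3x3, branch_5x5, branch_pool])
print(out.shape)  # (8, 8, 64)
```

In the real GoogLeNet module, 1x1 convolutions also sit in front of the larger branches to shrink channel counts first, which is where much of the parameter saving comes from.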
ResNet, one of the most popular architectures still to date, which we'll discuss in scene segmentation as well, came along and used the idea of a residual block. The initial inspiring observation, which doesn't necessarily hold true as it turns out, was that network depth increases representational power; these residual blocks allow you to have much deeper networks, and I'll explain why in a second, but the thought was that they work so well because the networks are so much deeper. The key thing that makes these blocks so effective, an idea reminiscent of the recurrent neural networks I hope we'll get a chance to talk about, is that training them is much easier: they take a simple block, repeated over and over, and they pass the input along without transformation, alongside the ability to transform it, to learn the filters, the weights. So every layer is allowed not only to take on the processing of previous layers, but to take in the raw, untransformed data and learn something new. The ability to learn something new allows you to have much deeper networks, and the simplicity of this block allows for more effective training. The state of the art in 2017, the winner, is Squeeze-and-Excitation networks (SENet). Unlike the previous year's winner, CUImage, which simply took ensemble methods and combined a lot of successful approaches for a marginal improvement, SENet got a significant improvement, at least in percentage terms: I think it was roughly a 25% reduction in error, from around 4 percent to around 3 percent, using a very simple idea that I think is important to mention. It added a parameter to each channel in the convolutional block, so the network can now adjust the weighting on each channel, for each feature map, based on the content of the input to the network. This is a takeaway to think about for any of the networks we talk about, any of the architectures.
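That per-channel re-weighting can be sketched as follows. This is a minimal NumPy illustration of the squeeze-and-excitation idea, not the SENet code: the two small weight matrices are random placeholders standing in for the learned excitation parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(fmaps, w1, w2):
    # Squeeze: global average pooling reduces each channel to a single number.
    z = fmaps.mean(axis=(0, 1))                # shape (channels,)
    # Excite: a tiny two-layer network produces one weight in (0, 1) per channel.
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)  # shape (channels,)
    # Scale: re-weight every channel based on the content of the input.
    return fmaps * s

rng = np.random.default_rng(0)
fmaps = rng.random((8, 8, 4))   # height x width x channels
w1 = rng.random((4, 2))         # bottleneck down to 2 units...
w2 = rng.random((2, 4))         # ...and back up to one weight per channel
out = squeeze_excite(fmaps, w1, w2)
print(out.shape)  # (8, 8, 4)
```

Because the block only touches the channel weighting, it can be dropped into essentially any convolutional architecture, which is the point made above.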
A lot of the time, recurrent neural networks and convolutional neural networks have tricks that significantly reduce the number of parameters, the low-hanging fruit: they use spatial invariance or temporal invariance to reduce the number of parameters needed to represent the input data. But they also leave certain things unparameterized, things the network is not allowed to learn. Here, letting the network learn the weighting on each of the individual channels, each of the individual filters, alongside the filters themselves, gives a huge boost. The cool thing is that this kind of block, the squeeze-and-excitation block, is applicable to any architecture, because it simply parameterizes the ability to choose which filters to emphasize based on the content. It's a subtle but crucial thing, and I think it's pretty cool. For future research it inspires you to think about what else can be parameterized in your own networks, what else can be controlled as part of the learning process, including higher-order hyperparameters: which aspects of the training and the architecture of the network can themselves be part of the learning? That's what this network inspires. Another network has been in development since the '90s in the ideas of Geoff Hinton, but was only published and received significant attention in 2017: capsule networks. I won't go into detail here (we're going to release an online-only video about capsule networks; it's a little bit too technical), but they inspire a very important point that we should always think about with deep learning, whenever it's successful, as I mentioned with the cat eating a banana: on a philosophical and a mathematical level, you have to consider what assumptions these networks make, and what, through those assumptions, they throw away.
Convolutional neural networks, due to their spatial invariance, throw away information about the relationships, the hierarchies, between simple and complex objects. So the face on the left and the face on the right look the same to a convolutional neural network: the presence of eyes and a nose and a mouth is the central aspect of what makes the classification work, so the network will fire and say this is definitely a face, but the spatial relationship between those parts is lost, ignored. There are a lot of implications to this, but for things like pose variation, that information is thrown away completely, and we're hoping that the pooling operation performed in these networks is able to mesh everything together: to take the features firing for the different parts of the face and come up with the overall classification that it's a face, without really representing the relationships between those features at the low and high levels of the hierarchy, at the simple and the complex level. This is a super exciting area now, and hopefully it will spark developments in how we design neural networks that can learn rotational and orientational invariance as well. Okay. So, as I mentioned, you take these convolutional neural networks and chop off the final layer in order to apply them to a particular domain, and that is what we'll do with fully convolutional neural networks, the ones we task with segmenting the image at the pixel level. As a reminder, these networks, through the convolutional process, are really producing a heat map: different parts of the network get excited by different aspects of the image. So they can be used not just to classify the image but to localize the object, and they can do so at a pixel level. The convolutional layers are doing the encoding process: they're taking the rich raw sensory information in the image and
encoding it into an interpretable set of features, a representation that can then be used for classification. But we can also then decode, upsample, that information and produce a map like this. Fully convolutional neural networks: semantic scene segmentation, image segmentation. The goal is, as opposed to classifying the entire image, to classify every single pixel: pixel-level segmentation, where you color every single pixel with which object that pixel belongs to in the 2D space of the image, the 2D projection of a three-dimensional world. The thing is, there's been a lot of advancement in the last three years, but it's still an incredibly difficult problem: if you think about the amount of data used for training, and the task of assigning a single label to each of the millions of pixels in megapixel images, it's extremely hard. Why is this an interesting and important problem to try to solve, as opposed to putting bounding boxes around cats? Well, whenever precise boundaries of objects matter: certainly in medical applications, for example detecting tumors in medical imaging of different organs; and in driving and robotics, in a dynamic scene of vehicles, pedestrians, and cyclists, we need not just a loose estimate of where objects are but their exact boundaries. Then, potentially through data fusion, we can fuse this rich textural information about pedestrians, cyclists, and vehicles with lidar data that provides a three-dimensional map of the world, and have both the semantic meaning of the different objects and their exact three-dimensional locations. A lot of the work in semantic segmentation started with the "Fully Convolutional Networks for Semantic Segmentation" (FCN) paper from November 2014; that's where the name FCN comes from.
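Before going through the papers, it helps to see what the decoder's "upsampling" means in its most naive form: just repeating values. This toy sketch shows why the naive result is blocky and coarse, which is exactly the problem the tricks below (skip connections, learned upscaling filters) were introduced to fix.

```python
import numpy as np

def upsample_nearest(fmap, factor=2):
    # Repeat each value `factor` times in both spatial dimensions:
    # cheap, but produces the blocky, coarse maps described in the text.
    return np.repeat(np.repeat(fmap, factor, axis=0), factor, axis=1)

coarse = np.array([[0.1, 0.9],
                   [0.8, 0.2]])   # a tiny 2x2 "heat map" from the encoder
print(upsample_nearest(coarse))   # blocky 4x4 version of the same map
```

Going from a 1/8-resolution feature map back to full resolution this way reveals nothing new about object boundaries; the information was discarded during encoding, which is why skip connections and learned upscaling matter.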
I'll now go through a few papers to give you some intuition about where the field has gone and how that takes us to SegFuse, the segmentation competition. FCN repurposed the ImageNet-pretrained networks, the nets trained to classify what's in an entire image: it chopped off the fully connected layers and added decoder parts that upsample the features to produce a heat map, shown here with a tabby cat, of where the cat is in the image. It's a much coarser resolution than the input image, one-eighth at best. Skip connections to improve the coarseness of upsampling: there are a few tricks here, because with the most naive approach the upsampling is going to be extremely coarse. That's the whole point of the encoding part of the network: you throw away all the useless data, keeping only the most essential aspects that represent the image, so you're throwing away a lot of the information needed to form a high-resolution output. So one trick is to route connections from a few of the later pooling layers, in a way reminiscent of a residual block, toward the output, producing a higher and higher resolution heat map at the end. SegNet in 2015 applied this to the driving context, taking it to the KITTI data set, showed a lot of interesting results, and really explored the encoder-decoder formulation of the problem, solidifying the place of the encoder-decoder framework for the segmentation task. Dilated convolutions: I'm taking you through a few components which are critical to the state of the art here. The convolution operation, like the pooling operation, reduces resolution significantly, and the dilated convolution has a certain kind of gridding, as visualized there, that maintains the local high-resolution texture while still capturing the necessary spatial window. It's called a dilated convolutional layer, and in a 2015
paper it proved to be much better at upsampling to a high-resolution output. DeepLab (v1, v2, and now v3) added conditional random fields, the final piece of the state-of-the-art puzzle here. Many of the successful segmentation networks today (not all) post-process using CRFs, conditional random fields, which smooth the upsampled segmentation that comes out of the FCN by looking at the underlying image intensities. So those are the key aspects of the successful approaches today: the encoder-decoder framework of a fully convolutional neural network, which replaces the fully connected layers with convolutional and deconvolutional layers; and as the years progressed from 2014 to today, the underlying networks, from AlexNet to VGGNet and now to ResNet, have been one of the big reasons for the improvement in segmentation performance, naturally mirroring the ImageNet challenge results as those networks were adapted. So the state of the art uses ResNet or similar networks, conditional random fields for smoothing based on the input image intensities, and dilated convolutions, which maintain the computational cost but increase the resolution of the upsampling throughout the intermediate feature maps. And that takes us to the state-of-the-art network we used to produce the images for the competition. It uses, first, DUC, dense upsampling convolution: instead of bilinear upsampling, you make the upsampling learnable, you learn the upscaling filters. That's really the key part that made it work, and there should be a theme here: sometimes the biggest addition is to parameterize an aspect of the network that had been taken for granted and let the network learn it. The other addition (I'm not sure how important it is to the success, but it's a cool little addition) is hybrid dilated convolution.
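A dilated convolution reads the same number of input values as an ordinary 3x3 filter but spreads the sampling grid apart, widening the receptive window without losing resolution through pooling. A minimal sketch:

```python
import numpy as np

def dilated_conv2d(image, kernel, dilation=1):
    # A k x k kernel with dilation d covers a window of size (k-1)*d + 1,
    # sampling the input on a spread-out grid; dilation=1 is ordinary convolution.
    kh, kw = kernel.shape
    span_h, span_w = (kh - 1) * dilation + 1, (kw - 1) * dilation + 1
    out_h = image.shape[0] - span_h + 1
    out_w = image.shape[1] - span_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i+span_h:dilation, j:j+span_w:dilation]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(25).reshape(5, 5).astype(float)
kernel = np.ones((3, 3))
# With dilation=2 the 3x3 kernel spans the whole 5x5 input in one window.
print(dilated_conv2d(image, kernel, dilation=2))
```

The hybrid scheme mentioned above varies the dilation rate from layer to layer so that no input pixels are systematically favored by the fixed sampling grid.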
As I showed in that visualization, the dilated convolution's sampling grid is spread apart a little from the input to the output. When the step size of the dilated convolution filter is varied from layer to layer, it produces a smoother result, because when it's kept the same, certain input pixels get a lot more attention than others; removing that favoritism is what a variable dilation rate achieves. Those are the two tricks, but really the biggest one is the parameterization of the upscaling filters. Okay, so that's what we used to generate the data, and that's the code we provide you if you're interested in competing in SegFuse. The other aspect here is that in everything we've talked about, from classification to segmentation to making sense of images, the information about time, the temporal dynamics of the scene, is thrown away. For the driving context, for the robotics context, and for what we'd like to do with SegFuse, the dynamic scene segmentation competition, when you try to interpret what's going on in a scene over time, that temporal information is essential. The movement of pixels through time, understanding how objects move in 3D space through the 2D projection of an image, is fascinating, and there's a large set of open problems there. So optical flow is very helpful as a starting point for understanding how these pixels move. Dense optical flow is the computation of our best approximation of where each pixel in one image moved in the temporally following image. There are two images: at 30 frames a second, one image at time zero and another 33.3 milliseconds later, and dense optical flow is our best estimate of where each pixel in the first image moved to in the second. The optical flow for every pixel produces a direction of where we think that pixel moved and the magnitude of how far it moved. That allows us
to take information that we detected about the first frame and try to propagate it forward. This is the competition: to try to segment an image and propagate that information forward, instead of relying on manual annotation of every image. This kind of coloring-book annotation, where you color every single pixel, takes about 1.5 hours, 90 minutes, per image in Cityscapes, the state-of-the-art dataset for driving. That's an extremely long time, and it's why there doesn't exist today, and why in this class we're going to create, a dataset of segmentation of these images through time, through video: long videos where every single frame is fully segmented. That's still an open problem that we need to solve, and flow is a piece of it.

We also provide you the flow, computed with the state-of-the-art FlowNet 2.0. FlowNet 1.0, in May 2015, used neural networks to learn dense optical flow, and it did so with two kinds of architectures: FlowNetS, FlowNet Simple, and FlowNetC, FlowNet Correlation. So what's the task here? There are two images that follow each other in time, 33.3 milliseconds apart, and from those two images your task is to produce the dense optical flow as output. The simple architecture just stacks the two images together; each is RGB, so that produces a six-channel input to the network. There's a lot of convolution, and finally it's the same kind of process as in fully convolutional networks to produce the optical flow. Then there is the FlowNet correlation architecture, where you perform some convolution on each image separately before using a correlation layer to combine the feature maps. Both are effective on different datasets and in different applications. FlowNet 2.0, from December 2016, is one of the state-of-the-art frameworks, the codebase we used to generate the data I'll show. It combines the FlowNetS and FlowNetC architectures and improves over
the initial FlowNet, producing a smoother flow field, preserving the fine motion detail along the edges of objects, and running extremely efficiently: depending on the architecture variant, anywhere from 8 to 140 frames a second. The process there is one that's common across various applications of deep learning: stacking these networks together. A very interesting aspect here, one we're still exploring and again applicable to all of deep learning, is that there seemed to be a strong effect from training on multiple small, sparse datasets: the order in which those datasets were used in the training process mattered a lot. That's very interesting.

So, using FlowNet 2.0, here's the dataset we're making available for SegFuse, the competition, at cars.mit.edu/segfuse. First, the original video, us driving around Cambridge, in high-definition 1080p, plus 8K 360 video. For the training set, we're providing the ground truth: for every single frame, 30 frames a second, the segmentation, frame to frame to frame, segmented on Mechanical Turk. We're also providing the output of the state-of-the-art segmentation network I mentioned, which is pretty close to the ground truth, but still not there. And here's the interesting thing: our task is to take the output of this network and use other networks to help you propagate that information better. What this segmentation network does is operate frame by frame by frame; it's not using the temporal information at all. So the question is: can we figure out tricks to use temporal information to improve this segmentation, so it looks more like the ground-truth segmentation? We're also providing the optical flow from frame to frame to frame, the optical flow based on FlowNet
2.0, of how each of the pixels moved. Okay, and that forms the SegFuse competition: 10,000 images. The task is to submit code; we have starter code in Python on GitHub. Take in the original video; take in, for the training set, the ground truth, the segmentation from the state-of-the-art segmentation network, and the optical flow from the state-of-the-art optical flow network; and put that together to improve the segmentation on the bottom left, to try to achieve the ground truth on the top right.

Okay, with that, I'd like to thank you. Tomorrow at 1 p.m. is Waymo in Stata 32-123. The next lecture, next week, will be on deep learning for sensing the human, understanding the human, and we will release an online-only lecture on capsule networks and GANs, generative adversarial networks. Thank you very much.
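The six-channel FlowNetS input described above, two consecutive RGB frames stacked along the channel axis, can be sketched in a few lines. The frames here are random stand-ins, not real video, and this shows only the input construction, not the network itself.

```python
import numpy as np

# FlowNetS ("simple") concatenates the two consecutive RGB frames,
# captured 33.3 ms apart at 30 fps, along the channel axis. The result
# is a 6-channel input; the network then convolves down and, as in
# fully convolutional segmentation networks, upsamples back to a
# 2-channel per-pixel (dx, dy) flow field.
frame_t0 = np.random.rand(1080, 1920, 3).astype(np.float32)
frame_t1 = np.random.rand(1080, 1920, 3).astype(np.float32)  # next frame

stacked = np.concatenate([frame_t0, frame_t1], axis=-1)
print(stacked.shape)  # (1080, 1920, 6)
```

FlowNetC differs at exactly this point: instead of stacking raw frames, it convolves each frame separately and merges the resulting feature maps with a correlation layer.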
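The core idea of the competition, carrying a frame's segmentation forward along the flow, can be sketched as a toy forward warp. This is an illustrative sketch, not the provided starter code: a real pipeline has to handle occlusions, holes, and sub-pixel motion, all of which are ignored here.

```python
import numpy as np

def propagate_labels(labels, flow):
    """Move each labeled pixel along its (dx, dy) flow vector."""
    h, w = labels.shape
    out = np.zeros_like(labels)                      # 0 = unlabeled
    ys, xs = np.mgrid[0:h, 0:w]                      # pixel coordinates
    new_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    new_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    mask = labels > 0                                # splat only labeled pixels
    out[new_y[mask], new_x[mask]] = labels[mask]
    return out

labels = np.zeros((4, 4), dtype=int)
labels[1, 1] = 7                            # one labeled pixel, class 7
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[1, 1] = (2, 1)                         # that pixel moves 2 right, 1 down
print(propagate_labels(labels, flow)[2, 3]) # 7: the label rode the flow
```

The competition then asks how to fuse such a propagated estimate with the per-frame segmentation network's output, which ignores time entirely.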
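To measure how far a predicted segmentation is from the ground truth, a standard quantity is per-class intersection-over-union (IoU). The lecture does not restate the competition's actual scoring rule, so treat this as an illustrative metric only, on a tiny hypothetical label map.

```python
import numpy as np

def iou(pred, gt, cls):
    """Intersection-over-union for one class between two label maps."""
    p, g = pred == cls, gt == cls
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union else 1.0  # empty class counts as perfect

pred = np.array([[1, 1],
                 [2, 2]])
gt   = np.array([[1, 2],
                 [2, 2]])
print(iou(pred, gt, 1))  # 0.5: classes agree on 1 of the 2 pixels in the union
```

Averaging this over all classes gives mean IoU, the figure commonly reported for Cityscapes-style segmentation benchmarks.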