Nuts and Bolts of Applying Deep Learning (Andrew Ng)
F1ka6a13S9I • 2016-09-27
So, you know, when we were organizing this workshop, my co-organizers initially asked me, "Hey Andrew, at the end of the first day, go give a visionary talk." So until several hours ago my talk was advertised as a visionary talk. But as I was preparing this presentation over the last several days, I tried to think about what would be the most useful information to you: the things you could take back to work on Monday and use to do something different at your job next Monday. For context, as Pieter mentioned, I lead Baidu's AI team, a team of about a thousand people working on vision, speech, NLP, lots of applications of machine learning. So what I thought I'd do instead, rather than present the shiniest pieces of deep learning that I know, is take the lessons I've seen at Baidu that are common across so many different areas and applications (autonomous cars, augmented reality, advertising, web search, medical diagnosis) and share the common lessons, the simple, powerful ideas, that I've seen drive a lot of machine learning progress at Baidu. The patterns I see across a lot of projects, I thought, might be the patterns most useful to you as well, whatever you're working on in the next several weeks or months.

One common theme that will appear in this presentation is that the workflow of organizing machine learning projects feels like it's changing in the era of deep learning. For example, one of the ideas I'll talk about is bias and variance. This is a super old idea, and many of you, maybe all of you, have heard of bias and variance, but in the era of deep learning I feel like there have been some changes to the way we think about them. So I want to talk about some of these ideas, which maybe aren't even deep learning per se, but which have been slowly shifting as we apply deep learning to more and more of our applications. Oh, and instead of holding all your questions until the end, if you have a question in the middle, feel free to raise your hand; I'm very happy to take questions in the middle, since this is a more informal whiteboard talk. And also, we want to say hi to all the home viewers. Hi!

So one question I still get asked sometimes, and Andrej alluded to this earlier, is this: a lot of the basic ideas of deep learning have been around for decades, so why are they taking off just now? Why is it that these neural networks we've known about for decades are working so well now? I think the one biggest trend in deep learning is scale: scale drives deep learning progress. Andrej mentioned scale of data and scale of computation, and let me draw a picture that illustrates that concept a little more. If I plot a figure where the horizontal axis is the amount of data we have for a problem and the vertical axis is performance (say, the x-axis is the amount of spam data you've collected and the y-axis is how accurately you can classify spam), then if you apply traditional learning algorithms, what we found was that performance often starts to plateau after a while. It was as if the older generations of learning algorithms, support vector machines, logistic regression, didn't know what to do with all the data we finally had. And what happened over the last ten or twenty years, with the rise of the internet, the rise of mobile, the rise of IoT, was that as a society we sort of marched to the right of this curve, for many problems, not all problems. And so, with all the buzz and all the hype about deep learning, in my opinion the
number one reason deep learning algorithms work so well is this: if you train what I'll call a small neural net, maybe you get slightly better performance; if you train a medium-sized neural net, maybe you get even better performance; and it's only if you train a large neural net, a model with the capacity to absorb all this data we have access to, that you get the best possible performance. I feel like this is a trend we've seen in many verticals, many application areas.

A couple of comments on this. First, when I draw this picture, some people ask me: does this mean a small neural net always dominates a traditional learning algorithm? And the answer is, not really. Technically, if you look at the small-data regime, the left end of this plot, the relative ordering of these algorithms is not that well defined. It depends on who's more motivated to engineer the features better. If the SVM person is more motivated to spend more time engineering features, they might beat out the neural network, because when you don't have much data, a lot of the knowledge in the algorithm comes from hand engineering. But this trend is much more evident in the regime of big data, where you just can't hand-engineer enough features, and a large neural net combined with a lot of data tends to outperform.

The implication of this figure is that in order to get the best performance, in order to hit that target, you need two things: you need to train a very large (or at least reasonably large) neural network, and you need a large amount of data. This in turn has created pressure to train large neural nets as well as to get huge amounts of data. One of the other interesting trends I've seen is that, increasingly, I'm finding it makes sense to build an AI team as well as a computer systems team, and have the two teams sit next to each other. When we started Baidu research, we organized our team that way, and other teams are also organized this way; I think Pieter mentioned that OpenAI also has a systems team and a machine learning team. The reason we're starting to organize our teams that way, I think, is that some of the computer systems work we do (we have an HPC team, a high-performance computing, supercomputing team, at Baidu) involves extremely specialized knowledge that is just incredibly difficult for an AI researcher to learn. Some people are super smart; maybe Jeff Dean is smart enough to learn everything; but it's just difficult for any one human to be sufficiently expert in HPC and sufficiently expert in machine learning. And so we've been finding (and Shubho, actually, one of the co-organizers, is on our HPC team) that bringing talent and knowledge from these multiple sources, multiple communities, allows us to get our best performance.

You've heard a lot of fantastic presentations today, and I want to draw one other picture, which is how I mentally bucket work in deep learning. This might be a useful categorization: when you look at a talk, you can mentally put it into one of the buckets I'm about to draw. I feel like there's a lot of work on what I'll call general DL, general models: basically the type of model Hugo Larochelle talked about this morning, where you have really densely connected, fully connected (FC) layers. There's a huge bucket of models there. Then I think a second bucket is sequence models, 1D sequences, and this is where I'd bucket a lot of the work on RNNs, you know, LSTMs, GRUs,
some of the attention models, which I guess Yoshua Bengio may talk about tomorrow, or maybe others, maybe Quoc, I'm not sure. So the 1D sequence models are another huge bucket. The third bucket is the image models. This is really 2D, and maybe sometimes 3D, and this is where I'd tend to bucket all the work on CNNs, convolutional nets. And then in my mental bucketing there's a fourth one, which is "other," and this includes unsupervised learning, reinforcement learning, as well as lots of other creative ideas being explored: things like slow feature analysis, sparse coding, various models in the "other" category that I still find super exciting.

It turns out that if you look across industry today, almost all the value today is driven by the first three buckets. What I mean is that those three buckets of algorithms are driving much better products, or monetizing very well; they're just incredibly useful for lots of things. In some ways, I think the fourth bucket might be the future of AI. I find unsupervised learning especially super exciting, so I'm actually very excited about this as well, although I think that if on Monday you have a job and you're trying to build a product or whatever, the chance of you using something from one of the first three buckets will be highest. But I definitely encourage you to contribute to research in the fourth as well.

So I said major trend one of deep learning is scale. What I'd say is major trend two (of two trends; this is not going to go on forever) is the rise of end-to-end deep learning, especially for rich outputs. I'll say a little more in a second about exactly what I mean by that, but the examples I'm going to talk about are all from one of the three buckets: general DL, sequence models, and image (2D/3D) models. Let me best illustrate it with a few examples. Until recently, a lot of machine learning used to output just real numbers. So in Richard's example, you have a movie review (actually I had prepared totally different examples, but I was editing mine earlier to be more coherent with the speakers before me): you have a movie review and you output the sentiment, is this a positive or a negative review? Or you might have an image and want to do ImageNet object recognition, so this would be a 0/1 output, or maybe an integer from 1 to 1,000. So until recently, a lot of machine learning was about outputting a single number, maybe a real number, maybe an integer. And I think the number-two major trend, which I'm really excited about, is end-to-end deep learning: algorithms that can output much more complex things than numbers. One example you've seen is image captioning, where instead of taking an image and saying "this is a cat," you can now take an image and output an entire string of text, using an RNN to generate that sequence. I guess Andrej, who spoke just now, Oriol Vinyals, people at Baidu, a whole bunch of people have worked on this problem. Another, which my collaborator Adam Coates will talk about tomorrow (maybe Quoc as well, not sure), is speech recognition, where you take audio as input and directly output the text transcript. When we first proposed using this kind of end-to-end architecture to do speech recognition (we were building on the work of Alex Graves), it was very controversial. The idea of actually putting this in a production speech system was very, very controversial when we first said we wanted to do it, but I think the whole community is coming around to this point
of view. More recently, there's machine translation, say going from English to French, with Quoc and others working on it, and a lot of teams now. Or, given some parameters, synthesize a brand-new image; you saw some examples of image synthesis. So I feel like the second major trend of deep learning that I find very exciting, and that is allowing us to build transformative things we just couldn't build three or four years ago, is this trend toward learning algorithms that output not just a number but very complicated things: a sentence, a caption, a French sentence, an image, or, as in the recent WaveNet paper, audio.

Despite all the excitement about end-to-end deep learning, I think that, sadly, end-to-end deep learning is not the solution to everything. I want to give you some rules of thumb for deciding what exactly end-to-end learning is, when to use it, and when not to use it. The trend toward end-to-end deep learning has been this idea that instead of engineering a lot of intermediate representations, maybe you can go directly from your raw input to whatever you want to predict. For example (I'm going to use speech as a recurring example), for speech recognition, one previously used to go from the audio to hand-engineered features like MFCCs or something, then maybe extract phonemes, and then eventually try to generate the transcript. Oh, for those of you who aren't sure what a phoneme is: if you listen to the word "cat" and the word "kick," the "c" and the "k" are the same sound. Phonemes are these basic units of sound, hypothesized by linguists to be the fundamental units of speech, so "k," "ae," "t" would be maybe the three phonemes that make up the word "cat." Traditional speech systems used to work this way, and I think in 2011 Li Deng and Geoff Hinton made a lot of progress in speech recognition by saying we can use deep learning to do that first step. But the end-to-end approach would be to say: let's forget about phonemes, let's just have a neural net input the audio and output the transcript. So one end is the input, the other end is the output; the phrase "end-to-end deep learning" refers to just having a neural net, or some learning algorithm, go directly from input to output. This end-to-end formula makes for great PR, and it's actually very simple, but it only works sometimes.

Maybe I'll tell an interesting story: this end-to-end story really upset a lot of people. When we were doing this work, I used to go around saying, "I think phonemes are a fantasy of linguists, and we should do away with them." I still remember there was a meeting at Stanford (some of you know who it was) where a linguist was yelling at me in public for saying that. We turned out to be right, though.

The Achilles' heel of a lot of end-to-end deep learning is that you need tons of labeled data. If this is your x and that's your y, then for end-to-end deep learning to work you need a ton of labeled input-output data, (x, y) pairs. To take an example where one may or may not want end-to-end deep learning (this is a problem I learned about just last week from Curtis Langlotz and colleagues, one of whom is in the audience, I think): imagine you want to use X-ray pictures of a hand to predict a child's age. This is a real thing; doctors actually care to look at an X-ray of a child's hand in order to
predict the age of the child. So let me draw an X-ray image. This is the child's hand, and these are the bones (I guess this is why I'm not a doctor, okay, but that's a hand and you can see the bones). A more traditional algorithm might input an image and first extract the bones: figure out, oh, there's a bone here, there's a bone here, there's a bone here, and then measure the lengths of those bones. So, bone lengths. Then maybe apply some formula, some regression, some simple averaging, to go from the bone lengths to an estimate of the age of the child. That's a non-end-to-end approach to solving this problem. An end-to-end approach would be to take the image, run a convnet or whatever, and just try to output the age of the child directly. And I think this is one example of a problem where it's very challenging to get end-to-end deep learning to work, because you just don't have enough data: you just don't have enough X-rays of children's hands annotated with ages. Instead, where we see deep learning coming in is in the first step: going from the image to figuring out where the bones are; use deep learning for that. The advantage of this non-end-to-end architecture is that it lets you hand-engineer in more information about the system, such as how bone lengths map to age, which you can get tables for. There are a lot of examples like this, and I think one of the unfortunate things about deep learning is that, let's see, for suitably sexy values of x and y you can almost always train a model and publish a paper, but that doesn't always mean it's actually a good idea. Pieter? Yes, that's true: Pieter is pointing out that in practice, if the formula is a fixed function f, you could backprop all the way from the age back to the image. Yeah, that's a good idea, actually. Who was it that just said, "you'd better do it quickly"?

Let me give a couple of other examples where it might be harder to backprop all the way through. Take self-driving cars. Most teams are using an architecture where you input an image of what's in front of the car, let's say, and then detect other cars, and also use the image to detect pedestrians (self-driving cars are obviously more complex than this). Then, now that you know where the other cars and the pedestrians are relative to your car, you have a planning algorithm to come up with a trajectory, and now that you know what trajectory you want your car to drive through, you can compute the steering direction, let's say. This is actually the architecture most self-driving car teams are using. There have also been interesting approaches that say: I'm going to input an image and output a steering direction. And I think this is an example where, at least with today's data and technology, I'd be very cautious about the second approach. I think if you had enough data the second approach would work, and you could even prove a theorem showing that it will work, I think. But I don't know that anyone today has enough data to make the second approach really, really work well. And I think Pieter made a great comment just now: some of these components will be incredibly complicated. The planner could be an explicit search, and you could design a really complicated path planner to generate the trajectory, and your ability to hand-code that still has a lot of value. So this is one thing to watch out for. I have seen project teams say, "I can get x, I can get y, I'm going to train deep learning on it," but unless you actually have the data, you know,
some of these things make for great demos if you cherry-pick the examples, but it can be challenging to get them to work at scale. I should say, for self-driving cars this debate is still open. I'm cautious about this; I don't think it will necessarily fail, I just think the data needed to do it will be really immense. So I'd be very cautious about it right now, but it might work if you have enough data.

So, one of the themes that comes up in machine learning: if you work on a machine learning project, one thing that will often come up is that you develop a learning system, train it, and maybe it doesn't work as well as you were hoping yet, and the question is, what do you do next? This is a very common part of a machine learning researcher's or machine learning engineer's life: you train a model, it doesn't do what you want it to yet, so what do you do next? This happens to us all the time. And you face a lot of choices: you could collect more data, maybe train longer, maybe try a different neural network architecture, maybe try regularization, maybe a bigger model, maybe buy some more GPUs. You have a lot of decisions, and I think a lot of the skill of a machine learning researcher or machine learning engineer is knowing how to make these decisions. Your skill at picking between, say, training a bigger model versus trying regularization will have a huge impact on how rapidly you can make progress on an actual machine learning problem.

So I want to talk a bit about bias and variance, since that's one of the most basic concepts in machine learning, and I feel like it's evolving slightly in the era of deep learning. As a motivating example, let's say the goal is to build a human-level speech recognition system. What we would typically do, especially in academia, is get a dataset with a lot of examples, shuffle it, and randomly split it: 70/30 train/test, or maybe 70% train, 15% dev, and 15% test. (Some people use the term "validation set"; I'll just use "dev set," short for development set; it means the same thing as validation set.) That's pretty common. And what I would encourage you to do, if you aren't already, is to measure the following things. First, human-level error. Let me illustrate with an example: let's say that human-level error is 1%. Let's say your training set error is 5%. And let's say your dev set error (the dev set is a proxy for the test set, except that you tune to the dev set) is 6%. This is really a basic step in developing a learning algorithm that I encourage you to do if you aren't already: figure out these three numbers, because these three numbers really help tell you what to do next. In this example, you see that you're doing much worse than human-level performance: there's a huge gap from 1% to 5%, and I'm going to call that gap the bias of your learning algorithm. (For the statisticians in the room: I'm using the terms bias and variance informally, and this doesn't correspond exactly to the way they're defined in textbooks, but I find them useful concepts for deciding how to make progress on your problem.) So I'd say that in this example you have a high-bias classifier: try training a bigger model, maybe try training longer; we'll come back to this in a second. For a different example, suppose human-level error is 1% and training-set
error was 2%, and dev set error was 6%. Then you really have a high-variance problem, an overfitting problem, and this really tells you what to try: try adding regularization, or try early stopping, or, even better, get more data. And then there's also a third case: if you have 1% human-level error, 5% training error, and 10% dev set error, then you have high bias and high variance. High bias and high variance, you know, sucks for you.

So I feel like, when I talk to applied machine learning teams, there's one really simple workflow, almost a flow chart, that is enough to help you make a lot of decisions about what you should be doing on your machine learning application. (If you're wondering why I'm talking about this and what it has to do with deep learning, I'll come back to that in a second: does this change in the era of deep learning?) First, ask yourself: is your training error high? (Oh, and I hope I'm writing big enough that people can see; if you have trouble reading it, let me know and I'll read it back out.) Are you even doing well on your training set? If your training error is high, then you have high bias, and so you have the standard tactics: train a bigger model, a bigger neural network; or maybe try training longer, and make sure your optimization algorithm is doing a good enough job; and then there's also this magical one, a new model architecture, which is a hard one. I'll come back to that in a second. You keep doing that until you're doing well at least on your training set. Once your training error is no longer unacceptably high, then ask: is your dev set error high? If the answer is yes, then you have a high-variance problem, an overfitting problem, and the solutions are: try to get more data, or add regularization, or try a new model architecture. And you keep doing this until you're doing well on both your training set and your dev set, and then, hopefully, you're done.

I think one of the nice things about this era of deep learning is that no matter where you're stuck, with modern deep learning tools we have a clear path for making progress, in a way that was not true, or at least was much less true, in the era before deep learning. In particular, no matter whether your problem is overfitting or underfitting, high bias or high variance or maybe both, you always have at least one action you can take: bigger model, or more data. So in the deep learning era, relative to, say, the logistic regression era or the SVM era, it feels like we more often have a way out of whatever problem we're stuck in. And so I feel like these days people talk less about the bias-variance trade-off (you might have heard that term: bias-variance trade-off, underfitting versus overfitting). The reason we talked a lot about that in the past was that a lot of the moves available to us, like tuning regularization, really did trade off bias against variance: it was zero-sum, you could improve one but that made the other worse. But in the era of deep learning, really, one of the reasons I think deep learning has been so powerful is that the coupling between bias and variance can be weaker, and we now have better tools.
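The three-number diagnosis above fits in a few lines of code. A minimal sketch, assuming error rates given as fractions; the 1% gap threshold and the exact advice strings are my own illustrative choices, not from the talk:

```python
def diagnose(human_err, train_err, dev_err):
    """Map the three measured error rates to suggested next steps.

    bias     ~ gap between training error and human-level error
    variance ~ gap between dev set error and training error
    The 0.01 cutoff is an arbitrary illustrative threshold.
    """
    advice = []
    bias = train_err - human_err
    variance = dev_err - train_err
    if bias > 0.01:
        advice.append("high bias: bigger model / train longer / new architecture")
    if variance > 0.01:
        advice.append("high variance: more data / regularization / new architecture")
    return advice or ["looking good: done, hopefully"]

# The three worked examples from the talk (human, train, dev):
print(diagnose(0.01, 0.05, 0.06))   # the high-bias case
print(diagnose(0.01, 0.02, 0.06))   # the high-variance case
print(diagnose(0.01, 0.05, 0.10))   # both at once
```

Note that both branches can fire at once, which is exactly the "high bias and high variance" case above.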
We can reduce bias without increasing variance, or reduce variance without increasing bias. And really the big one is that you can always train a bigger model, a bigger neural network, in a way that was harder when you were training logistic regression and the move was to come up with more and more features; that was just harder to do. (I'm going to add more to this diagram at the bottom in a second.) And by the way, I've been surprised, honestly: this "new model architecture" move is really hard; it takes a lot of experience. But even if you aren't super experienced with a variety of deep learning models, the things in the blue boxes, bigger model and more data, you can often do, and that will drive a lot of progress. If you do have experience with how to tune a convnet versus a ResNet versus whatever, by all means try those things as well; I definitely encourage you to keep mastering those. But this dumb formula of bigger model plus more data is enough to do very well on a lot of problems.

So, bigger models put pressure on systems, which is why we have a high-performance computing team. More data has led to another interesting set of investments. A lot of us have always had this insatiable hunger for data: we use crowdsourcing for labeling, and we try to come up with all sorts of clever ways to get data. One area where I'm seeing more and more activity (it feels a little bit nascent, but I'm seeing a lot of activity) is automatic data synthesis. Here's what I mean. Once upon a time, people used to hand-engineer features, and there was a lot of skill in hand-engineering features like SIFT or HOG to feed into an SVM. Automatic data synthesis is this little area that is small but feels like it's growing, where some hand engineering is needed, but where I'm seeing quite a lot of progress on multiple problems enabled by hand-engineering synthetic data to feed into the giant maw of your neural network. Let me best illustrate it with a couple of examples. One of the easy ones is OCR. Let's say you want to train an optical character recognition system (and actually I've been surprised that this has tons of users; it's actually one of the most useful APIs that we have). If you imagine firing up Microsoft Word, downloading a random picture off the internet, choosing a random Microsoft Word font, choosing a random word from an English dictionary, typing that word into Word in the random font, and pasting it with a transparent background on top of the random image, then you've just synthesized a training example for OCR. This gives you access to essentially unlimited amounts of data. It turns out the simple idea I just described won't work in its naive form; you actually need to do a lot of tuning to blur the synthesized text into the background and to make sure the color contrast matches your training distribution. So I've found that in practice it can be a lot of work to fine-tune how you synthesize data, but I've seen in many verticals that if you do that engineering work (and sadly, it's painful engineering) you can actually get a lot of progress. Tao Wang, who was a student here at Stanford, engineered this for months with very little progress, and then suddenly he got the parameters right, had huge amounts of data, and was able to build one of the best OCR systems in the world at that time.
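The Word-based recipe above amounts to sampling a (word, font, background, position) tuple whose label comes for free. A minimal sketch of just the sampling step; the function name and field layout are my own, and the hard part described above (rendering, blurring, and contrast-matching the text into the background) is deliberately left to a separate, hypothetical render() step:

```python
import random

def sample_ocr_example(words, fonts, backgrounds, rng=None):
    """Pick the ingredients for one synthetic OCR training example:
    a random dictionary word, a random font, a random background image,
    and a random paste position. The ground-truth label is simply the
    word we chose, which is what makes the data "free"."""
    rng = rng or random.Random()
    word = rng.choice(words)
    return {
        "label": word,                              # ground-truth transcript
        "font": rng.choice(fonts),
        "background": rng.choice(backgrounds),
        "position": (rng.random(), rng.random()),   # fractional (x, y) offset
    }

# Synthesize an effectively unlimited stream of labeled examples:
rng = random.Random(0)
batch = [sample_ocr_example(["cat", "kick", "listen"],
                            ["Arial", "Courier"],
                            ["beach.jpg", "street.jpg"], rng)
         for _ in range(3)]
```

The filenames and font names here are placeholders; in a real pipeline they would come from a large font library and a crawl of background images.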
clean relatively noises audio and take random background sounds and just synthesize what that person's voice would sound like in the presence of that background noise right and this turns out to work remarkably well so if you recall a lot of car noise what the inside of your car sounds like and record a lot of clean audio of someone speaking in a quiet environment um the mathematical operation is actually addition it's superposition of sound but you basically add the two waveforms together and then you get an audio clip that sounds like that person talking in the car and you feed this your learning algorithm and so this has a dramatic effect in in terms of amplifying the training set for speech recognition and has a huge effect can have a we found a huge effect on um performance um and then also NLP you know here here's here's one example actually done by some Stanford students which is um using entend deep learning to do grammar correction so input a ungrammatical English sentence you know maybe written by non-native speaker right and can you automatically have a have a I guess attention RNN input an ungrammatical sentence and correct the grammar just edit the sentence for me um and it turns out that you can synthesize huge amounts of this type of data automatically and so that'll be another example where data synthesis um works very well um and oh and I think uh uh video games in RL right really one of the um well let me just games broadly right one of the most powerful um uh applications of RL deep RL these days is video games and I think if you think supervised learning has an insatable hunger for data wait till you work on AO algorithms right I think the the hunger for data is even greater but when you play video games the advantage of that is you can synthesize almost infinite amounts of data to to feed this even greater more right even greater need that our ARS have um so just one note of caution data synthesis has a lot of limits um I'll tell you one other 
story. You know, let's say you want to recognize cars. There are a lot of video games — I need to play more video games; what's a video game with cars in it? Oh, GTA, Grand Theft Auto. So there are a bunch of cars in Grand Theft Auto; why don't we just take pictures of cars from Grand Theft Auto? You can synthesize lots of cars, lots of orientations there, and give that as training data. It turns out that's difficult to do, because to the human perceptual system there might be 20 car models in a game and it looks great to you — you can't tell whether there are 20 car models in the game or a thousand. So there are situations where the synthetic data set looks great to you, because 20 car models in a video game is plenty — it turns out you don't need a hundred different cars for a human to think it looks realistic — but from the perspective of the learning algorithm this is a very impoverished, very poor data set. So I think there's a lot still to be sorted out for data synthesis.

For those of you that work in companies, one practice I would strongly recommend is to have a unified data warehouse. What I mean is that if your engineering teams and your research teams have to go around trying to accumulate data from lots of different organizations in your company, that's just going to be a pain; it's going to be slow. So at Baidu our policy is: it's not your data, it's the company's data, and if it's user data it goes into one user data warehouse. We should have a discussion about user access rights, privacy, and who can access what data, but at Baidu I felt very strongly about this, so we mandate that data needs to come into one logical warehouse — it's physically distributed across lots of data centers, but it should be in one system. What we should discuss is access rights; what we should not discuss is whether or not to bring data together into as unified a data warehouse as possible. And so this is
another practice that I found makes access to the data much smoother and allows teams to drive performance. So really, if your boss asks, tell them that I said to build a unified data warehouse.

So, I want to take the train/test, you know, bias/variance picture and refine it. It turns out this idea of a 70/30 split — train/test or whatever — was common in machine learning in the past, when, frankly, most of us in academia were working on relatively small data sets. I know there used to be this thing called the UC Irvine repository of machine learning data sets — it drove amazing results at the time, but by today's standards it's quite small — and so you'd download a data set, shuffle it, and split it into train, dev, test, whatever. In production machine learning today, it's much more common for your train and your test distributions to come from different distributions, and this creates new problems and new ways of thinking about bias and variance. So let me briefly talk about that.

Here's a concrete example, and this is a real example from Baidu. Baidu built a very effective speech recognition system, and then — actually quite some time back now — we wanted to launch a new product that uses speech recognition: a speech-enabled rearview mirror. So if you have a car that doesn't have a built-in GPS unit — this is a real product in China — we wanted to let you take out your rearview mirror and put in a new AI-powered, speech-enabled rearview mirror, because it's an easier aftermarket installation. You can speak to the rearview mirror and say, "Dear rearview mirror, navigate me to wherever." So this is a real product. So how do you build a speech recognition system for this in-car, speech-enabled rearview mirror?

Here's our status: we have, you know, let's
call it 50,000 hours of speech recognition data from all sorts of places — a lot of data; we bought some, some is user data that we have permission to use — but collected from all sorts of places, not from your in-car rearview mirror scenario. And then our product managers can go around and, through quite a lot of work — for this example let's say — collect 10 more hours of data from exactly the rearview mirror scenario: you install this thing in a car, drive around, talk to it, and collect 10 hours of data from exactly the distribution you want to test on.

So the question is, what do you do now? Do you throw the 50,000 hours of data away because it's not from the right distribution, or can you use it in some way? In the older, pre-deep-learning days, people used to build very separate models: it was more common to build one speech model for the rearview mirror, one model for the maps voice query, one model for search, and so on. In the era of deep learning it's becoming more and more common to just pour all the data into one model and let the model sort it out, and so long as your model is big enough you can usually do this. If you get the details right, you can usually pile all the data into one model and often see gains, and usually not see any losses.

But the question is: given this data set, how do you split it into train/dev/test? So here's one thing you could do, which is call this your training set, this your dev set, and this your test set. It turns out this is a bad idea — I would not do this — and one of the best practices we've derived is: make sure your development set and test set are from the same distribution. I've been finding that this is one of the tips that really boosts the effectiveness of a machine learning team. So in particular I would make the 50,000 hours the training set, and then of my 10
hours, let me expand this a little bit: much smaller data sets — maybe five hours of dev, five hours of test. The reason for this is that your team will be working to tune things on the dev set, and the last thing you want is for them to spend three months working against the dev set and then realize, when they finally test, that the test set is totally different and a lot of work was wasted. To make an analogy: having different dev and test set distributions is a bit like if I tell you, "Hey everyone, let's go north," and then a few hours later, when all of you are in Oakland, I say, "Wait, why are you here? I wanted you to be in San Francisco." And you go, "What? Why did you tell me to go north? Tell me to go to San Francisco!" So I think having the dev and test sets be from the same distribution is one of the ideas that I found really optimizes a team's efficiency, because the development set — which is what your team is going to be tuning its algorithms to — is really the problem specification. If your problem specification tells them to go here, but you actually want them to go there, you're going to waste a lot of effort. So when possible, have dev and test from the same distribution — it isn't always possible, there are some caveats — but when it's reasonable to do so, this really improves the team's efficiency.

And another thing: once you specify the dev set, that's like your problem specification; once you set the test set, that's your problem specification. Your team might go and collect more training data, or change the training set, or synthesize more training data, but you shouldn't change the test set, if the test set is your problem specification.

So in practice, what I actually recommend is splitting the training set as follows: carve off a small part of your training set — let me just say 20 hours of data — to form what I'm going to call the training-dev set, or train-dev
set but that's basically a development set that's from the same distribution as your training Set uh and then you have your depth set and your test set right so these are what you actually from the distribution you actually care about and these you have your training set $50,000 of all sorts of data and maybe we aren't even entirely sure what data this is uh but split off just a small part of this so I guess this is now what 49980 hours and 20 hours um and then here's the generalization of the bias variance concept um actually let me use this board and and but has say the the um the fact that training and test sets don't match is one of the problems that um Academia doesn't study much there's some work on domain adaptation there is some literature on it but it turns out that when you train and test on different distributions you know it it sometimes it's just random is a little bit luck whether you generalize well to a totally different test set so that's made it hard to study systematically which is why I think um Academia has not studied this particular problem as much as I feel it is important for to to those of us building production systems um but there is some work but but not no no no very widely deployed Solutions yet would be would my sense um but so I think our best practice is if if you now generalize what I was describing just now to the following which is um measure human level performance measure your training set performance measure your training death performance measure your death set performance and measure your test set performance right so now you have kind of five numbers so to take an example let's say human level is 1% error um and I'm going to use very obvious examples for illustration if your training set performance is 10% you know and this is 10.1% right uh 10.1% you know 10.2% right in this example then it's quite clear that you have a huge gap between human level performance and training set performance and so you have a huge bias right 
And so you'd use the bias-fixing types of solutions. I find that in machine learning one of the most useful things is to look at the aggregate error of your system — which in this case is your dev set or test set error — and then break it down into components, to figure out how much of the error comes from where, so you know where to focus your attention. So this difference here, human-level to training error, is maybe 9% of bias, which is a lot, so I would work on bias reduction techniques. This gap here, training to train-dev, is really the variance. This gap here, train-dev to dev, is due to your train/test distribution mismatch. And this one, dev to test, is overfitting of the dev set.

So, just to be really concrete, here's an example where you have high train/test mismatch: human-level performance is 1%, your training error is 2%, your train-dev error is 2.1%, and then on your dev set the error suddenly jumps to 10% — sorry, my x-axis doesn't perfectly line up. If there's a huge gap there, then I would say you have a huge train/test mismatch problem.

So at this basic level of analysis, in the recipe for machine learning from before, instead of "dev" I would replace this with "train-dev," and then in the rest of this recipe I would ask: is your dev error high? If yes, then you have a train/test mismatch problem, and there the solution would be to try to get more data that's similar to the test set, or maybe data synthesis or data augmentation — try to tweak your training set to make it look more like your test set. And then there's always the Hail Mary, I guess, which is a new architecture.

And then finally, just to finish this up — there's not that much more — hopefully if
you're done, hopefully your test set error will be good. And if you're doing well on your dev set but not your test set, it means you've overfit your dev set, so just get some more dev set data — actually, I'll just write this, I guess: test set error high? If yes, then get more dev data, and then done. Sorry if this is not too legible; what I wrote here is: if your dev set error is not high but your test set error is high, it means you've overfit your dev set, so get more dev set data.

One of the effects I've seen with bias and variance is that it sounds so simple, but it's actually much more difficult to apply in practice than it sounds when I talk about it or when you read it in text. So, a tip: for a lot of problems, just calculate these numbers, and this can help drive your analysis in terms of deciding what to do. And I find that it takes surprisingly long to really grok, to really understand, bias and variance deeply, but people who understand bias and variance deeply are often able to drive very rapid progress in machine learning applications. I know it's much sexier to show you some cool new network architecture, but this really helps our teams make rapid progress on things.

So there's one thing I kind of snuck in here without making it explicit, which is that in this whole analysis we were benchmarking against human-level performance. That's another trend, another thing that has been different — again, I'm looking across a lot of projects I've seen in many areas and trying to pull out the common trends — I find that comparing to human-level performance is a much more common theme now than several years ago, with, I guess, Andrej being the human-level benchmark for ImageNet. And at Baidu we really do compare our speech system to human-level performance and
try to exceed it, and so on. So why is that? Why is human-level performance such a common theme in applied deep learning? It turns out that if the x-axis is time — as in how long you've been working on a project — and the y-axis is accuracy, and this line is human-level performance, human-level accuracy on some task, you find that for a lot of projects your teams will make rapid progress up until they get to human-level performance; then often they will maybe surpass human-level performance a bit, and then progress often gets much harder after that. This is a common pattern I see on a lot of problems.

There are multiple reasons why this is the case. I'm curious — why do you think this is the case? Any guesses? [Audience: the labels are coming from humans.] Cool, yep, labels coming from humans. Anything else? [Several more audience guesses follow, among them that neural networks are still very far from the human brain and that human capability represents some natural limit, with some back-and-forth between Ng and the audience.] All right, so — I think those are all lots of great answers, and there are several good reasons for this type of effect. One of them is that for a lot of problems there is some theoretical limit of performance: some fraction of the data is just noisy. In speech recognition a lot of audio clips are just noisy — someone picked up a phone and they're at a rock concert or something, and it's just impossible to figure out what on earth they were saying.
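The "theoretical limit" point can be shown with a toy simulation (my own illustration, not from the talk): if some fraction of examples carry hopelessly noisy labels, then even a model that always predicts the true label is scored wrong at roughly that fraction — an error floor no amount of training can cross.

```python
import random

random.seed(0)

def measured_error_of_perfect_model(n=100_000, noise_rate=0.05):
    """Score a hypothetical perfect model against noisy labels.

    The model always predicts the true label, but the recorded
    (annotator) label is flipped with probability `noise_rate`,
    so the measured error floors near noise_rate.
    """
    wrong = 0
    for _ in range(n):
        true_label = random.random() < 0.5
        # with probability noise_rate the recorded label is flipped
        noisy_label = (not true_label) if random.random() < noise_rate else true_label
        wrong += (true_label != noisy_label)
    return wrong / n

err = measured_error_of_perfect_model()  # close to 0.05, never 0.0
```

So if 5% of your audio clips are genuinely indecipherable, a measured error near 5% may already be at the limit, and the human-level benchmark is a practical proxy for where that limit sits.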
Or some images, you know, are just