Transcript
LqGTFqPEXWs • Jeremy Howard: Very Fast Training of Neural Networks | AI Podcast Clips
Kind: captions
Language: en
There's some magic on learning rate that you played around with?

Yeah, interesting. Yeah, so this is all work that came from a guy called Leslie Smith. Leslie's a researcher who, like us, cares a lot about just the practicalities of training neural networks quickly and accurately, which I think is what everybody should care about but almost nobody does. And he discovered something very interesting, which he calls super convergence, which is that there are certain networks that, with certain settings of hyperparameters, could suddenly be trained ten times faster by using a ten times higher learning rate.
Now, no one published that paper, because it's not an area of active research in the academic world. No academics recognized this is important. And also, deep learning in academia is not considered an experimental science. So unlike in physics, where you could say, like, I just saw a subatomic particle do something which the theory doesn't explain, you could publish that without an explanation, and then in the next sixty years people can try to work out how to explain it.
We don't allow this in the deep learning world. So it's literally impossible for Leslie to publish a paper that says, I've just seen something amazing happen, this thing trained ten times faster than it should have, I don't know why. And so the reviewers were like, we can't publish that, because you don't know why.

So that's important to pause on, because there are so many discoveries that would need to start like that.

Every other scientific field I know of works that way. I don't know why ours is uniquely disinterested in publishing unexplained experimental results, but there it is. So it wasn't published.
Having said that, I read a lot more unpublished papers than published papers, because that's where you find the interesting insights. So I absolutely read this paper, and I was just like, this is astonishingly mind-blowing and weird and awesome. And like, why isn't everybody talking about this? Because, like, if you can train these things ten times faster, they also generalize better, because you're doing fewer epochs, which means you look at the data less, and you get better accuracy. So I've been kind of studying that ever since, and eventually Leslie kind of figured out a lot of how to get it done, and we added some minor tweaks.
And a big part of the trick is starting at a very low learning rate and very gradually increasing it. So as you're training your model, you take very small steps at the start, and gradually you make them bigger and bigger, until eventually you're taking much bigger steps than anybody thought was possible. There are a few other little tricks to make it work, but basically we can now reliably get super convergence. And so for the DAWNBench thing, we were using just much higher learning rates than people expected to work.
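To make the warm-up idea concrete, here is a minimal sketch of a one-cycle-style schedule in PyTorch. The model, the data loader, and all of the numbers (max_lr, pct_start, epochs) are illustrative placeholders, not the settings used for DAWNBench; torch.optim.lr_scheduler.OneCycleLR implements the start-low, ramp-up, then-anneal policy described above.

import torch
from torch import nn, optim

# Placeholder model and data; swap in your own.
model = nn.Linear(10, 2)
train_loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(100)]

epochs = 5
opt = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One-cycle policy: begin at a small fraction of max_lr, ramp up,
# then anneal back down -- small steps first, much bigger steps later.
sched = optim.lr_scheduler.OneCycleLR(
    opt,
    max_lr=1.0,                       # peak learning rate, far above a typical flat schedule
    steps_per_epoch=len(train_loader),
    epochs=epochs,
    pct_start=0.3,                    # fraction of training spent warming up
    div_factor=25.0,                  # initial lr = max_lr / 25
)

loss_fn = nn.CrossEntropyLoss()
for _ in range(epochs):
    for xb, yb in train_loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        sched.step()                  # advance the schedule once per batch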
What do you think the future of... I mean, it makes so much sense for that to be a critical hyperparameter, the learning rate that you vary. What do you think the future of learning rate magic looks like?

Well, there's been a lot of great work in the last twelve months in this area, and people are increasingly realizing that, up until now, we've just had no idea really how optimizers work. And the combination of weight decay, which is how we regularize optimizers, and the learning rate, and then other things like the epsilon we use in the Adam optimizer, they all work together in weird ways.
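As a point of reference, here is where those knobs sit in a typical PyTorch setup; this is just a minimal sketch with illustrative values, not a recommendation, and AdamW is used here as the variant of Adam with decoupled weight decay. The three settings mentioned above all live on the optimizer, and changing any one of them shifts the effective step sizes the others produce.

from torch import nn, optim

model = nn.Linear(10, 2)  # placeholder model

# The interacting knobs: learning rate, weight decay, and Adam's epsilon.
opt = optim.AdamW(
    model.parameters(),
    lr=1e-3,            # learning rate
    weight_decay=1e-2,  # regularization strength, decoupled from the gradient in AdamW
    eps=1e-8,           # epsilon added to the denominator of the adaptive step
)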
And different parts of the model, this is another thing we've done a lot of work on: research into how different parts of the model should be trained at different rates in different ways. So we do something we call discriminative learning rates, which is really important, particularly for transfer learning.
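A minimal sketch of the discriminative-learning-rates idea, using plain PyTorch parameter groups; the layer split and the specific values are illustrative assumptions, not fastai's exact implementation. In real transfer learning the "body" would be a pretrained backbone and the "head" a freshly initialized classifier.

from torch import nn, optim

# Stand-in two-part model; in practice the body would be pretrained.
body = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 50), nn.ReLU())
head = nn.Linear(50, 10)
model = nn.Sequential(body, head)

# Discriminative learning rates: earlier (pretrained) layers take much
# smaller steps than the new head, via separate parameter groups.
opt = optim.SGD(
    [
        {"params": body.parameters(), "lr": 1e-4},  # body: small learning rate
        {"params": head.parameters(), "lr": 1e-2},  # head: larger learning rate
    ],
    momentum=0.9,
)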
So really, I think in the last twelve months a lot of people have realized that all of this stuff is important. There's been a lot of great work coming out, and we're starting to see algorithms here which have very, very few dials, if any, that you have to touch. So I think what's going to happen is the idea of a learning rate, well, it almost already has disappeared in the latest research, and instead it's just like, you know, we know enough about how to interpret the gradients, and the change of gradients we see, to know how to set every parameter.