Transcript
L1sHcj3qDNc • Torch Tutorial (Alex Wiltschko, Twitter)
Kind: captions
Language: en
so I'm gonna tell you about machine
learning with torch and with torch-autograd
so the description of the
talk isn't entirely correct I'm gonna do
practical stuff for the first half and
then what I want to do is dive into
torch-autograd and some of the concepts
that are behind it and those concepts
also happen to be shared amongst all
deep learning libraries so I really want
to give you a perspective of the common
thread that links all deep learning
software you could possibly use and then
also talk a bit about what makes each of
the libraries different and why there's
I will I will hypothesize why there's so
many and the different choices so one
thing I want to try there's been a lot
of questions and we've gone over time
but if there's not questions that go
over time in the room there's a lot of
people watching online and if there's
extra time we'll of course prioritize
people here but if you ask a question
with the DL school hashtag or if you
tweet at me directly I will try to
answer those questions from online and
I'll certainly answer them offline as
well so ask if you're watching at home
maybe that will kind of increase you
know meaningful participation for people
watching through the stream that aren't
here today umm a lot of this material
was developed with Soumith Chintala at
Facebook he's kind of the czar of the
torch ecosystem these days and Hugo
Larochelle who you heard from yesterday
and also Ryan Adams who's at Twitter
with us and all of this material is
available on this github repository that
you got actually on a printed sheet for
installing torch so all the examples
that I'll show you will be in one
notebook and then there's a separate
notebook which I actually won't
reference in the talk that's a full
end-to-end walkthrough of how to train a
convolutional neural network on CIFAR-10
so that's kind of a self-paced tutorial
notebook that you can work through on
your own time but I'm going to focus on
the basics on the fundamentals and
hopefully give you some of the concepts
and vocabulary that you can use to
really dive into torch on your own time
so let's let's get going so torch is an
array programming language for Lua right
so it's like numpy it's like MATLAB but
it's in the Lua language so torch is to
Lua as numpy is to Python
right so what you can do in torch you
can do in you know any language this is
the absolute minimum basics you can grab
strings and print them you can put
things in associative data types in
Python there's tuples and lists and sets
and dictionaries in lua there's just one
data type called a table so you'll see
that a lot but you can do all those
things that I mentioned before with
a table and you've got for loops and if
statements the core type of torch is the
tensor just like in in numpy when you
have the ND array which is a way of
shaping sets of numbers into matrices or
tensors we have the tensor and you can
fill it up with random numbers you can
multiply them standard stuff but the
tensor is the core data type of torch
and we've got plotting functionality
going over at a very high level I'll
show you some more specific code in a
moment so you can do all the kind of
standard stuff that you'd do in any
other array based language there's all
the tensor functions that you'd
like to use including all the linear
algebra and convolutions and you
know BLAS functions and I'm leaving
this link here when the slides get
uploaded you can follow this and kind of
dive into the documentation and see
exactly what what kind of tools you have
at your disposal in the notebook and
the iTorch notebook which is
something that Soumith put together you
can prepend any torch function with a
question mark and that gives you the
help for that function so it makes it
really nice to discover functionality in
the torch library in the notebook so why
is it in Lua alright it's kind of a
maybe a strange maybe esoteric language
to write things in Lua is
unreasonably fast for how convenient it
is to use especially a flavor of Lua
called LuaJIT for loops in LuaJIT are
basically the same speed as C so this
for loop here is actually in production
code in master in torch it's not C code
but this is perfectly fast enough right
so that's a really nice aspect of Lua is
you can depend on super
high-performance c-code and then on top
of it you've got this very convenient
glue layer but you don't pay much of a
speed penalty to use that glue layer so
that's one of the reasons why we've used
Lua another advantage that some people
might see as a plus is the language
itself is quite small so there's 10,000
lines of C code that define the whole
language of Lua so you can really sit
down with the manual in an afternoon
and understand most of the language on
your own that same day another aspect
which is pretty critical for deep
learning but also for other fields is
that it's really easy to interoperate
with C libraries it was designed
originally to be embedded so Lua was a
language that was designed to run inside
of another C program but have a little
scripting layer inside of it so it's
very easy to call into C it's very easy
for C to call into Lua so this is
another reason why it's kind of an
appropriate choice for deep learning
libraries the FFI call
signature and the idea has been copied
into many other languages so cffi in
Python is a Python version of the Lua
FFI Julia has something similar as well
and as I mentioned it was originally
designed to be embedded and it's in all
kinds of crazy places that you maybe
wouldn't expect Lua to be so in World of
Warcraft all the graphics are in C++ or
whatever they wrote it in but like the
boss battles or the quests so like when
you go give the gem to the blacksmith or
whatever and they give you back the
magic sword the scripting of those
events that happens in Lua and if you
write scripts for world of warcraft to
make your own quests that's Lua Adobe
Lightroom is a photo processing app all
the image processing is done in C++ but
all the UI and everything was done in
Lua so again it was used to bind
together high-performance code with a
with kind of a scripting layer and Redis
and nginx which are kind of workhorses
in the field of web development are both
scriptable with Lua and in fact if you
go to github pages like my page on github.io
if somebody's hosting a web page on
github that's served in part by Lua the
apocryphal story of why it was originally
chosen maybe you could correct me is
Clément Farabet was trying to build an
embedded machine learning application
some device he could wear as a
helmet and classify the world with a
CNN when he was a young student and he
was trying to do this with Python and
it's incredibly frustrating to get
Python to run on embedded chips maybe
it's easier now with raspberry pi but
that just wasn't the case and then he
stumbled upon Lua
and turns out people had been building
Lua into embedded applications for years
before that and so that kind of was the
snowballing effect so that's that's the
hearsay for how we arrived at Lua but
maybe there's there's another story
another really nice feature of torch is
we have first-class support for GPU
computation interactive GPU computation
so it's very very easy to get some data
from the CPU to the GPU and then
everything that you do with that data
happens on the GPU without you having to
worry about writing CUDA kernels right
so this has been a feature of Lua torch
which is becoming maybe a little bit
less unique now but this was this was a
pretty solid feature when it first came
out so interactive GPU computing and
I'll go very quickly over some of the
basic features and all of these examples
again are in a notebook which you can do
kind of at your own pace if you'd like
so there's all the basic arithmetic like
creating matrices and and doing
arithmetic between them taking maxes of
numbers and arrays clamping building
tensors out of ranges boolean operations
over entire arrays special functions
this is supported through a wrapper
around the Cephes library this is what
numpy uses to support things like tanh
and atan2 and other kinds of functions
that I guess are in the special class
and then Soumith again has wrapped the
BokehJS library which is originally
just for python but it provides really
nice and beautiful plots in the iTorch
notebook and so we can you know
draw random numbers from our favorite
distributions and make nice histograms
of these so you can do nice data
exploration in the iTorch notebook
along with deep learning so one feature
that is attractive to some folks but
just an interesting feature of the torch
ecosystem is that although there's a lot
of industry
support it is not industry owned so at
Twitter and at Facebook AI Research and
at NVIDIA
we all contribute a lot to the torch
community but we don't own it we can't
really steer it to go one way or the
other definitively and there's a ton of
other people that participate
academically in this ecosystem and
that's a really nice feature and along
with I guess because of the really nice
habits of people in deep learning when a
paper comes out there's often a high
quality code implementation that follows
it not always but very often at
least compared with with other fields
and torch is one of the environments in
which you'll often see high quality
implementations of really cutting-edge
stuff so if you just browsed through
github and you kind of follow
researchers on github you can see really
high quality implementations of image
captioning of neural style transfer so
you can just clone this github
repository and run this yourself
seq2seq models kind of
whatever is the state of the art there's
usually a torch implementation of it
some of the recent work in generating
very realistic synthetic images with
generative adversarial networks also has
great torch code implementing it so
given that there's this active community
on github in deep learning for torch how
does that stack up against other
communities just to give you some
context so the Python data science
community is pretty enormous and its
focuses are also very very varied
if you enter into the data science
community in torch and lua you'll likely
find deep learning people but not a lot
of other people so its strength in
deep learning compared to its size is
actually quite enormous and for those
that are kind of thinking of switching
between Python and Lua and giving torch
a try the effort to switch from Python
to Lua you can probably do that in a day
if you've tried some Python programming
so I was a Python programmer for a while
and getting started on Lua took took me
maybe a couple days and I was you know
actually productive at work and maybe a
week or so but you can actually run your
code and understand and write new things
pretty quickly if you've worked in a
scripting language like MATLAB or
or Python so if you were intimidated or
waiting to try it you should just dive
in so how does torch compare to other
deep learning libraries specifically as
opposed to languages and the first thing
I'll say is there's really no silver
bullet right now there are a lot of deep
learning libraries out there I say
tensorflow is by far the largest and
this is a plot that was made by a
colleague of Soumith's and I wish it kind
of had confidence intervals on it
because it's not strictly that these are
like you know points in in deep learning
space but maybe this is a good guess of
where things kind of fit it seems as if
tensorflow was engineered to be very
good in an industrial production setting
and it seems like it's really fulfilling
that Theano
seems to have always had a research goal
in mind and has been really awesome in
the research community for some time
torch tends to be more towards research
than industry I think Twitter maybe has
pulled it a little bit towards
production we maybe are the only example
I'd love to learn of others but we're
maybe the only example of a large
company that uses torch in production to
serve models so every piece of media
that comes in to Twitter goes through a
torch model at this point so we're
really dealing with an enormous amount
of data in a live setting the
development of torch just to give you a
sense of how we think about how it was
built and how we're extending it is
there's some kind of tenets of our core
philosophy and really the first is
and this isn't
necessarily good or bad but this is
our choice whenever you hit enter on a
particular line in your iTorch
notebook or on the command line you
should get an answer back and this is
something that we've we've tried to
stick to pretty pretty tightly so no
compilation time imperative programming
right so just write your code and you
know each each line of code executes
something and passes it to the next line
and minimal abstraction what I mean by
minimal abstraction is if you want to
reason about how your code is performing
it shouldn't take you that many jumps to
go to the C code that's actually being
run in fact it usually is one or two
jumps from the file that defines the
function that you care about to the
actual C code so if you want to reason
about performance or really understand
what's going on it's quite
easy to do so in torch
I want to take a little bit of a detour
and tell you about how torch thinks
about its objects how it thinks about
the tensor because this can help you
also reason about performance a lot of
the reason why people come to torch is
to build high-performance models very
quickly and easily so I mentioned
tensors before so a
tensor is an N dimensional array and a
tensor is actually just a pointer it's a
view into your memory into your data
that's sitting in memory all right so
it's just a it's a shape it's um it's a
view into into what's actually being
stored in your RAM and it's stored in a
row major way
so that means if I go to the first
element of my tensor in memory and I
move over one I'm moving over one in a
row and not one in a column column major
memory storage does exist it's just less
common today so you often see row major
so this tensor is defined by its link to
some storage and its size 4 by 6 and
its stride 6 by 1 and 6 by 1
means if I move one down in the column
direction I actually have to skip six
elements in memory right whereas the one
here means if I move over one in the
second axis the row axis I have to go
over one in memory so if I take a slice
of this tensor using the select command
so I select along the first dimension
the third element what it gives me back
is a new tensor it doesn't give me
new memory this is a thing that happens
a lot in torch is you'll deal with views
into memory you won't do memory copies
right so usually working with kind of
the raw data in RAM and so this creates
a new tensor with the size of six
because there's six elements a stride of
one because we've pulled out a row not a
column and an offset of 13 that means I
have to go 13 elements from the
beginning of the original storage to
find that piece of memory so if I pull
out a column then something different
happens which is I
have a size of four here and my stride
is now six because in order to grab each
element of the column I have to skip six
and then the offset of three is because
I grab the third element there all right
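A small Python sketch of the size, stride and offset bookkeeping described above. It is not Torch's API: `StridedView`, `select` and `fill` are hypothetical names chosen for illustration. Python counts from 0, so the row and column offsets come out as 12 and 2 rather than the 1-based 13 and 3 from the talk.

```python
# Hypothetical StridedView class: a sketch of the idea, not Torch's implementation.
class StridedView:
    def __init__(self, storage, size, stride, offset=0):
        self.storage = storage   # flat list shared between views, never copied
        self.size = size         # extent of each dimension
        self.stride = stride     # storage steps per unit step in each dimension
        self.offset = offset     # 0-based start position in storage

    def _position(self, index):
        return self.offset + sum(i * s for i, s in zip(index, self.stride))

    def __getitem__(self, index):
        return self.storage[self._position(index)]

    def select(self, dim, i):
        """Fix dimension `dim` at index i (0-based) and return a smaller view."""
        size = self.size[:dim] + self.size[dim + 1:]
        stride = self.stride[:dim] + self.stride[dim + 1:]
        return StridedView(self.storage, size, stride,
                           self.offset + i * self.stride[dim])

    def fill(self, value):
        """Write `value` through the view into the shared storage."""
        def walk(dim, pos):
            if dim == len(self.size):
                self.storage[pos] = value
                return
            for i in range(self.size[dim]):
                walk(dim + 1, pos + i * self.stride[dim])
        walk(0, self.offset)


storage = list(range(24))                   # 24 numbers in one flat block
a = StridedView(storage, (4, 6), (6, 1))    # 4 by 6, row major

row = a.select(0, 2)   # third row
col = a.select(1, 2)   # third column
print(row.size, row.stride, row.offset)     # (6,) (1,) 12
print(col.size, col.stride, col.offset)     # (4,) (6,) 2

row.fill(-1)           # writes into the same storage that `a` looks at
print(a[(2, 0)], a[(2, 5)], a[(1, 0)])      # -1 -1 6
```

Filling the row view changes what `a` sees because both views index the same list, which is the shared-memory behaviour the talk points out.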
so that's kind of a view of the of the
memory model and if we actually
run something like this like we
instantiate a tensor of
double values inside of the
tensor and fill it with you know a
uniform distribution and print it we can
see the values here and then we grab a
slice B and print it it's just this row
and then we can fill B with just some
number and print it now it's filled with
that number now if we go back and print
a we've actually overwritten the values
there so this is something you see a lot
in torch is working on one big piece
of shared memory and as I mentioned
before working with CUDA is really
really easy so if you just require cutorch
which is installed automatically
if you have a CUDA GPU using the
instructions on the github repository
you can instantiate a tensor on the GPU
and do the same thing and it will just
work so now I want to talk a bit about
the frameworks that you'll use to
actually train neural networks in torch
so this is a schematic kind of cartoon
of the pieces we typically
need to train a neural network so we've
got our data stored on you know hard
drive or on a big distributed file
system and we have some system for
loading that data off of that file
system which goes into a nice queue and
then some training code which
orchestrates a neural network so the
thing actually making the prediction a
cost function which is a measure of how
good our neural network is at any point
in our training and an optimizer which
is going to take the gradient of the
cost with respect to the parameters in
the neural network and try to make the
neural network better so in the torch
ecosystem we've got some packages that
tackle each one of these separately so I
won't talk about threads here there's
actually several different libraries
that will do this there's actually
several different libraries that will do
each one of these things but this one is
maybe the most common or the easiest to
start with and and then here we'll cover
both the specification of the neural
network and the cost function as well as
the mechanisms to push data through the
neural network in the cost function and
pull the gradients back from the cost to
the parameters and then the optimizer
which we've heard mentioned several
times today stochastic gradient
descent or
Adagrad so let me talk about nn
first give you a flavor of kind of how
it works and what the pieces are so nn
is a package for building feed-forward
neural networks mostly feed-forward
neural networks but kind of clicking
Lego blocks together
right so you might start with your input
and then click together a fully
connected layer and then another fully
connected layer and then maybe some
output right so here I've defined a
sequential container which is going to
be a container for all my Lego blocks
and then I might click in a spatial
convolution so I'm going to be working
with images maybe a non-linearity some
max pooling some other layers as well to
kind of complete the whole neural
network and then I might add a log
softmax at the end to compute class
probabilities so this is kind of the
structure that you'll build neural
networks with in nn is define a
container and then one by one add pieces
down a processing hierarchy and I
mentioned the sequential container which
is starting from inputs and then
proceeding linearly there's two other
types of containers that you might use
but generally nn shines when your
architecture is linear right not when
it's got some crazy branches or anything
like that there's not a lot of API
to the nn package so if you learn
these couple functions which will be in
the slides for later if you want to
refer to them back you will understand
all the mechanisms that you need to know
to push data through a neural network
and then to push it through a criterion
or a loss function and then to pull
those gradients back in order to make a
gradient update to your model so these
are really the API the levers that
you need to know to kind of drive your
neural network and of course we have a
CUDA back-end for nn so in the same way
that you'll just call CUDA on some data
you can call CUDA on a container and
that will move the whole model onto the
GPU and then anything that you do with
that model will occur on the GPU so it's
kind of a one-liner to start training
models on a graphics processor so for
doing feed-forward neural networks nn
is pretty great but for starting to try
weirder architectures like Richard
Socher yesterday mentioned a pretty
complicated NLP model that starts with
GloVe vectors which are kind of like
shallow neural networks and then a
recursive neural network and then an
attention mechanism and all these things
were interacting in strange ways that's
actually pretty hard to specify in nn at
Twitter we have a package called
torch-autograd which makes these kinds of
gluing different model pieces together
really easy and in fact the pieces can
be as small as addition division
multiplication and subtraction so you
can glue together any size piece of
computation and still get a correct
model out and I'll talk more about that
in a moment
the optim package is what you need in
order to train models with like
stochastic gradient descent or Adagrad
or Adadelta whatever your optimizer
is that's your favorite
the API is pretty straightforward but
maybe a little bit different for people
kind of coming from the Python world
it's got a bit of a functional approach
where you'll actually
pass a function to optim that will
evaluate your neural network and pass
back the gradients so that's just
something to be aware of it's a little
bit of a different style another gotcha
with optim that you might run into and
you'll see in some of the notebooks that
are online is your parameters should be
linear in memory so if you want to
optimize two neural networks that are
interacting in some way you actually
need to first bring their parameters
together into one tensor and then pass
that to optim that's just something to
be aware of so I want to talk for the
rest of the talk about torch-autograd
but also about some of the ideas that
are behind torch-autograd and how those
link all the deep learning libraries
that you possibly could choose so first
I want to take a step back and
just appreciate the wonderful stable
abstractions that we have in scientific
computing right so Fortran you know back
in 57 I don't think anybody uses Fortran
57 but people might actually still use
Fortran 90 the idea of an array
didn't exist on a computer and it really
took some pretty crazy thinking I think
to build a system that made arrays
something we take for granted same with
linear algebra over about a 20-year
period starting in the late 70s people
decided oh maybe we should think about
linear algebra in a systematic way and
now we don't really worry about this if
you want to multiply two matrices that
used to be you know a phd's worth of
work to do that at scale and now we just
you know we don't even actually import
BLAS there's so many wrappers of BLAS
that we don't even think about this
anymore so this is another abstraction
and also the idea that we should have
all of the routines that we would
possibly want to call in one place
available that we don't have to write
that was kind of invented I would say by
MATLAB in the mid-80s and then really
popularized in the open-source community
by numpy and we should take them for
granted we should totally forget about
them because they make us faster
they make us better for us to assume
these things will work so machine
learning has other abstractions besides
these computational ones that we take
for granted all gradient based
optimization that includes neural nets
as a subset relies on automatic
differentiation to calculate those
gradients right and I like this
definition from Barak Pearlmutter
automatic differentiation mechanically
calculates derivatives of functions
expressed as computer programs right so
it doesn't derive things I write on a
piece of paper with a pencil it derives
computer programs at machine precision
and with complexity guarantees those
last two clauses differentiate it from
finite differences where you take the
input to a program you perturb it
slightly and you measure the gradient
that way that's a very bad way to
measure gradients it's it's numerically
very unstable and it's not symbolic
differentiation so it's not writing down
the symbolic expression of a neural
network putting it in Mathematica or
Maple and then asking for the
derivative because your expression might
go from this to this so you get
expression swell when you do naive
symbolic differentiation and you don't
get that with automatic differentiation
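A small Python sketch of why perturbing the input is a fragile way to measure a gradient. The function `sin` and the step sizes are arbitrary choices for illustration, not taken from the talk. The exact derivative `cos` stands in for what automatic differentiation would return up to rounding.

```python
import math

def f(x):
    return math.sin(x)

def exact_derivative(x):
    # Known analytically; automatic differentiation would match this
    # to machine precision.
    return math.cos(x)

def finite_difference(func, x, h):
    # Perturb the input slightly and measure the change in the output.
    return (func(x + h) - func(x)) / h

x = 1.0
steps = (1e-2, 1e-5, 1e-8, 1e-11, 1e-14)
errors = [abs(finite_difference(f, x, h) - exact_derivative(x)) for h in steps]
for h, err in zip(steps, errors):
    print(f"h = {h:.0e}   error = {err:.2e}")
```

With a large step the estimate is off because the function curves. With a tiny step the subtraction loses most of its significant digits, so the error grows again. There is no step size that is safe for every function.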
so automatic differentiation I would say
is the abstraction for gradient based
machine learning it's been rediscovered
several times there's a review by
Widrow and Lehr
I think the first implementation where
it actually operates on a computer
program was by Bert Speelpenning in
1980 although it has been described back
you know in 1964 by Wengert in neural
networks Rumelhart is the one that I
suppose popularized it as back
propagation although back propagation is
a special case of autodiff this I
think is important in nuclear science
and computational fluid dynamics and in
weather modeling these people have been
using autodiff for years decades and
their tools in many ways are much more
sophisticated than we have in machine
learning there's a lot of ideas that we
have yet to import from people that
model the weather that would really
benefit our ability to train larger and
larger models and I would clarify that
our abstraction in machine learning is
actually reverse mode automatic
differentiation there's two different
types two extremes I should say forward
mode and reverse mode you never hear
about forward mode
in machine learning
because it's a very bad idea to try
forward mode in machine learning and
I'll show you why so here is a cat
picture from the internet and my job at
my job is to decide that that is in fact
a cat picture this is actually something
that we do do at Twitter what I am doing
is passing this cat through successive
layers of transformations then
eventually producing a probability over
classes I'm getting it wrong my
classifier thinks it's a dog so I'd like
to train my neural net to think it's a
cat so I have a loss a gradient of my
loss and I have it with respect to my
parameters and this is my gradient that
will let me update my parameters and it
is composed of multiple pieces and using
the chain rule I know that I can fold
this together to actually compute the
gradient I want which is the gradient of the
loss with respect to the
parameters the issue is I can do it
either left to right or right to left so
going from left to right looks like this
whoops that was very fast okay I'll do
two big matrix matrix multiplies so this
is bad this is not good because we had
these huge matrix matrix products that
we're keeping around it's actually worse
than this and I'll show you in another
view of forward mode so say I have a
computer program so no longer a symbolic
representation of a neural net this is
just some computer program and let's say
I'd like to optimize a right a is the
single parameter of my neural net it's a
very silly trivial example but I think
it will help illustrate the point so I
can execute this program and look at all
of the arithmetic operations that occur
and build what's called a trace
so I'll define say a is 3 I'll define B
is 2 C is 1 and then I'll start
executing the code I'm actually going to
look if B is greater than C and choose a
branch to operate on but then ignore it
in my trace so I've chosen
one of those branches which
is the first because B is greater than C
and I have some output value D and I'll
return the output value all right so
this is a trace execution of my program
given some inputs so to calculate in
forward mode the derivative of my output
D with respect to a I'll define a is 3
and then initialize a gradient of a with
respect to itself and the idea is I
eventually want the derivative of D with
respect to a and I'll build it up
sequentially da/da and then I'll do db/da
and then dc/da and dd/da so I'm
moving from the left to the right
building up my gradient I can't do much
about the derivative of B with respect
to a right now so I'll define C and
the derivative of C with respect to a and then I
have my value D and then I can define my
target value which is the gradient of D
with respect to a so if I wanted the
gradient of D with respect to B so if I
had a two parameter neural network and I
wanted optimize both at once I would
have to execute this whole thing again
and initialize this guy here as db/db
as one right so if you have a million
parameters in your neural network or
tens of millions you have to do a
million evaluations of forward mode or
tens of millions of evaluations of forward
mode so it is a very bad idea to try
forward mode automatic differentiation
on a neural network and that's why you
probably never heard of it so now you
can forget about it but the alternative
is reverse mode and that's starting from
the right to the left so now I've got
these nice matrix-vector
products which are much smaller and
the complexity is much better and
there's an interesting difference when I
actually go to do this in computer code
and you'll see these words are closer
together and that's because for reverse
mode I actually have to evaluate the
whole program before I can start
deriving because I'm starting with the
derivative of D with respect to D and
then decrementing derivative of D with
respect to C with respect to B with
respect to a so I'm going the other way
but I have to have all the information
first before I start that so now I can
initialize derivative of D with respect
to D and I can walk backwards and return
both the value and the gradient what's
really nice about this is you'll notice
here I actually have all the information
I need to calculate the derivatives of D
with respect to these other parameters
so that's why we really like reverse
mode autodiff aka back propagation for
neural nets is if you have a million of
these guys you really want to be ready
to compute them all at once right and
doing these with matrices is very
efficient thing to do on the computer so
we've implemented this trace based
automatic differentiation in a package
called autograd and this is the
entirety of a neural network so this is
how you would specify and train a neural
network in autograd so I'll initialize
my parameters we'll just be some random
numbers and then here is my neural
network function I'm multiplying my you
know image that I'm passing in by my
weight matrix and adding a bias
non-linearity doing it again and then
returning some probabilities and I have
a loss which will take in an image and
return a prediction so just using this
function and then I'll just take the
mean squared error or it's the sum
squared error in order to get the
gradients of this function the
derivative of the loss with respect to
these parameters all I have to do is
import this autograd package and then
call grad on this function this returns
a new function that returns the
gradients of my original function so
it's a what's called a higher-order
function its inputs and its outputs are
a function so whenever you see that
nabla that upside-down triangle
grad triangle this is the coding
equivalent of that and then to train
we'll just call our dloss function on
our parameters our image and our label
which I'm just pretending like you
already have a system to get here then
we have our gradients and then we're
updating with stochastic gradient
descent here all right so it's a very
thin it's it's really just this this is
the interface with which you talk with
autograd so what's actually happening
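A minimal Python sketch of the interface described above: `grad` takes a loss function and returns a new function that computes gradients, which then drive a stochastic gradient descent update. This is a scalar toy, not torch-autograd itself. The `Var` class, the one-weight linear model, the data and the learning rate are all assumptions made for illustration.

```python
class Var:
    """A scalar that remembers how it was computed (hypothetical helper)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (input Var, local partial derivative)
        self.grad = 0.0

    @staticmethod
    def wrap(x):
        return x if isinstance(x, Var) else Var(x)

    def __add__(self, other):
        other = Var.wrap(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        other = Var.wrap(other)
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        other = Var.wrap(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def grad(f):
    """Higher-order function: takes f, returns a function computing df/dparams."""
    def df(params, *args):
        variables = {name: Var(v) for name, v in params.items()}
        out = f(variables, *args)
        # Order the nodes so each is processed after everything that uses it.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(out)
        out.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad
        return {name: v.grad for name, v in variables.items()}, out.value
    return df


def loss(params, x, y):
    prediction = params["w"] * x + params["b"]
    error = prediction - y
    return error * error              # squared error

dloss = grad(loss)                    # a new function that returns gradients

params = {"w": 0.0, "b": 0.0}
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]    # points on y = 2x + 1
for _ in range(1000):
    for x, y in data:
        grads, value = dloss(params, x, y)
        for name in params:
            params[name] -= 0.05 * grads[name]  # stochastic gradient descent
print(params)
```

Returning the loss value alongside the gradients is a convenience chosen for this sketch. The point is that the user only writes `loss`, and `grad(loss)` supplies the derivative.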
so here's my simple function as we
evaluate it
we're actually keeping track of
everything that you're doing in order to
be able to reverse it so we're actually
building that trace list that I
described before and keeping track of it
internally so we'll start on line I guess
that's five so we'll multiply some
things we'll keep track of the fact you
multiplied and the inputs we'll keep
track of the addition and the inputs and
also the output of addition we'll keep
track of inputs outputs and the function
every time and we'll kind of walk down
this function and build your compute
graph just in time so as you're running
your code we're learning what you've
done and the way we track that and I
won't go into details we actually
replace every function in torch with
like a spy function so instead of
just running torch.sum our spy
function says oh I hear you're running
torch.sum let me remember the
parameters you gave me let me run sum
on those parameters remember the output
and then return it like nothing happened
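
The spy-function idea described here can be sketched in a few lines of plain Python. This is only an illustration of the wrapping pattern, not the torch-autograd implementation: the names `spy`, `TRACE`, `traced_sum` and `traced_sqrt` are made up for this sketch, and Python's built-in `sum` and `math.sqrt` stand in for torch functions.

```python
import math

# Hypothetical global trace list; each entry is (function name, arguments, output).
TRACE = []

def spy(name, fn):
    """Wrap fn so that every call is recorded before the result is returned."""
    def wrapped(*args):
        out = fn(*args)                  # run the real function on the real arguments
        TRACE.append((name, args, out))  # remember what was called and what came back
        return out                       # hand the result back like nothing happened
    return wrapped

# Stand-ins for library functions such as torch.sum (assumption: plain Python here).
traced_sum = spy("sum", sum)
traced_sqrt = spy("sqrt", math.sqrt)

result = traced_sqrt(traced_sum([9.0, 16.0]))  # 5.0, with two entries left in TRACE
```

After the call, `TRACE` holds the operations in execution order, which is the list that a reverse pass would walk backwards.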
but internally we're remembering all
those things and the way we do this to
actually compute the gradients is we're
walking back this list like I described
before and every time we get to a point
where we need to calculate a partial
derivative we look it up so we've
written all of the partial derivatives
for torch functions and really every
neural network library is going to do
this at some level of granularity so let
me walk you through another couple
examples just to show you what it could
do so this is kind of a pretty vanilla
one we can you know add and multiply
scalars and get the correct gradient
this is where things get a little bit
more interesting if there's an if
statement all right so this control flow
can be a little bit difficult or awkward
in a lot of existing deep learning
libraries because we just listen to what
arithmetic functions get run we ignore
control flow so we just go right through
this stuff all right so we can get the
correct gradient even with if statements
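
A minimal sketch of the ideas so far, assuming scalars only and just two operations (add and multiply): a `grad` higher-order function that records a trace while the function runs and then walks it backwards. The `Var` class, `lift` helper and the example function `f` are hypothetical names for this illustration and are not the torch-autograd API.

```python
class Var:
    """A scalar that records every arithmetic operation applied to it."""
    def __init__(self, value, tape):
        self.value = value
        self.tape = tape
        self.grad = 0.0

    def _record(self, value, parents):
        out = Var(value, self.tape)
        # parents is a list of (input Var, partial derivative of out w.r.t. that input)
        self.tape.append((out, parents))
        return out

    def __add__(self, other):
        other = lift(other, self.tape)
        return self._record(self.value + other.value,
                            [(self, 1.0), (other, 1.0)])
    __radd__ = __add__

    def __mul__(self, other):
        other = lift(other, self.tape)
        return self._record(self.value * other.value,
                            [(self, other.value), (other, self.value)])
    __rmul__ = __mul__


def lift(x, tape):
    """Turn a plain number into a Var so it can take part in the trace."""
    return x if isinstance(x, Var) else Var(float(x), tape)


def grad(f):
    """Higher-order function: takes f, returns a function computing df/dx."""
    def df(x):
        tape = []
        x_var = Var(float(x), tape)
        out = f(x_var)                 # forward pass builds the trace just in time
        out.grad = 1.0
        for node, parents in reversed(tape):   # walk the trace backwards
            for parent, local in parents:
                parent.grad += local * node.grad
        return x_var.grad
    return df


def f(x):
    # The `if` looks at the concrete value; only the branch that actually
    # runs is recorded, so control flow never appears in the trace.
    if x.value > 0:
        return x * x * 3.0
    return x * -2.0 + 1.0


df = grad(f)
```

Here `df(2.0)` follows the first branch and returns `12.0` (the derivative of `3x^2` at 2), while `df(-1.0)` follows the second branch and returns `-2.0`.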
we actually care about tensors when
we're doing optimization or machine
learning so everything I've shown you
that works with scalars also works with
tensors just as easily this is in the
notebook that is on the github
repository if you want to play with it
this is where things get a little bit
interesting for loops also work just
fine and not just for loops that have a
fixed length which is something that is
perhaps easy to unroll but for loops
whose duration can depend on data you
just computed right or while loops whose
stopping condition can depend on a
computation that occurs in the while
loop we don't really care we're building
your graph dynamically and when it's
done and when you return some value we'll
calculate the derivatives of
the graph that we have you can turn any
for loop into a recursive function this
is kind of wacky I mean I don't know how
you would actually use this in practice
but you can cook up a lot of crazy
things you might try with autograd and
they just work so here we have a
function f if B is at some stopping
condition we'll return a otherwise we'll
call f and we're gonna differentiate
this right so we're gonna differentiate
a fully recursive function and it works
just fine another aspect which is coming
up more and more as papers are coming
out that basically disrespect the
sanctity of the partial you know of the
derivative of the gradient and people
are computing synthetic gradients
they're you know clipping
gradients or people are messing with
kind of the internals of back
propagation or of autodiff it's
actually pretty easy to start to engage
with that in autograd so say I'm going to
sum the floor of a to the third power so
the floor operation is piecewise
constant so the derivative is zero
almost everywhere except for where it's
undefined why would I want to do this
for instance if you wanted to build a
differentiable JPEG encoder or
differentiable MPEG encoder in
compression algorithms like that there's
often a quantization step that will
floor or round or truncate numbers and if
you wanted to differentiate through that
to build like a neural JPEG
algorithm or something you need to pass
gradients through something that
ordinarily does not and so if we look at
what the gradient is it's zero everywhere
I won't go into the details but you can
ask autograd to use your own gradient
for anything so if you have a new module
that you want to define and either
you've written high-performance code for
it and you want to use it or you want to
redefine or overwrite you know the
gradients that we have there's a pretty
easy mechanism for doing that and then
when you call your special dot floor you
can propagate gradients through it right
and here I was just saying basically
ignore the gradient of floor so this is
a toy example but there are real places
where you have a non differentiable
bottleneck inside of your compute graph
and you want to either hop over it or
find some approximation and autograd
has a mechanism for very easily plugging
those types of things in so that's a bit
of what autograd is and what it can do
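
The floor example can be sketched without any library. This is an illustration under stated assumptions, not the torch-autograd mechanism: `PRIMITIVES`, `grad_of_chain` and the `overrides` argument are hypothetical names, and the sketch only handles a chain of one-argument functions. It shows a table of partial derivatives that is looked up on the reverse pass, and what changes when the entry for floor is replaced by a pass-through gradient.

```python
import math

# Table of partial derivatives: name -> (forward function, derivative w.r.t. input).
PRIMITIVES = {
    "cube":  (lambda a: a ** 3, lambda a: 3 * a ** 2),
    # The true derivative of floor is zero almost everywhere.
    "floor": (lambda a: float(math.floor(a)), lambda a: 0.0),
}

def grad_of_chain(op_names, x, overrides=None):
    """Apply the named ops in order to x and return (value, d value / d x)."""
    table = dict(PRIMITIVES)
    table.update(overrides or {})   # user-supplied gradients win over the defaults
    # Forward pass: remember the input that each op saw.
    inputs = []
    value = x
    for name in op_names:
        inputs.append(value)
        value = table[name][0](value)
    # Reverse pass: multiply the looked-up partial derivatives, last op first.
    g = 1.0
    for name, a in zip(reversed(op_names), reversed(inputs)):
        g *= table[name][1](a)
    return value, g

# floor(a^3) with the real gradient of floor: the gradient is zero.
value, g_true = grad_of_chain(["cube", "floor"], 1.5)

# Same function, but floor is told to pass gradients straight through.
pass_through = {"floor": (lambda a: float(math.floor(a)), lambda a: 1.0)}
_, g_custom = grad_of_chain(["cube", "floor"], 1.5, overrides=pass_through)
```

With the default table the gradient at 1.5 is `0.0`; with the pass-through override it is `6.75`, the derivative of the cube alone, while the forward value stays `3.0` in both cases.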
and I want to turn our attention to how
autograd relates to other deep learning
libraries and maybe how they're common
and how they're similar and how they're
different so one big difference that I
found between different deep learning
libraries is the level of granularity at
which you are allowed to specify your
neural network so there's a lot of
libraries where you say you get a
convnet or you get a feed-forward
neural network and that's it right so
the menu is two items long and that's
fine I think Andrej really hit it on
the head where if you want to solve a
problem don't be a hero use somebody
else's network so maybe this is VGG that
you've downloaded from from the model
Zoo or something like that right so this
is the don't be a hero regime on the
left in the middle there's a lot of
really convenient neural net specific
libraries like torch nn and Keras
and Lasagne and you get to put together
big layers and you don't really get to
see what's inside those layers but you
get to click together linear layers or
convolutions and usually that's kind of
what you want to do and on the far end
of the spectrum the things you can click
together are the
numeric functions in your kind of host
scientific computing library right like
add multiply subtract and these are
features of projects like autograd and
Theano and tensorflow
and the reason why these boundaries are
made is because the developers have
chosen to give you partial derivatives
at these interfaces all right so this is
how they've defined their api's and
these are the interfaces you know
across which you as a user cannot pass
if you want a new one of these modules
for the type on the left or the type in
the middle you have to go in and build a
whole new model and actually implement
the partial derivatives but with the
types of libraries on the right you can
build your own modules by
composing primitive operations all right
so that's one difference that you can
find in practice how these things are
implemented under the hood usually means
this is the totally shrink-wrap stuff
and maybe they implemented this whole
thing by hand usually these guys in the
middle are wrappers they're wrapping some
other library and the guys on the right
are usually actually implementing
automatic differentiation so autograd
and theano and tensorflow all implement
autodiff and the guys in the middle
are taking advantage of that to make
more convenient wrappers so another
aspect that's different is how these
graphs are built so I'll remind you in
autograd we build these things just in
time by listening to what you're doing
and recording it but that's not how all
neural network libraries are built and
this is an axis along which I think that
they are differentiated meaningfully so
there's a lot of libraries that build
these graphs explicitly where you say
I'm going to click this Lego block into
this Lego block or I'm going to give
you this YAML specification file the
graph is totally static and you really
have no opportunity for compiler
optimizations there and then there are
the just-in-time libraries so autograd
and Chainer is another one where you
get any graph the graph can be anything
it can change from sample to sample it
can be you know the length of the
graph can be determined by the compute
that occurs in the graph you have very
little opportunity for compiler
optimizations there so speed can be an
issue sometimes and in the middle
there's ahead-of-time libraries like
tensorflow and Theano where you
construct your graph using a
domain-specific language you hand it off
to their runtime and then they can do
crazy stuff to make it faster the
problem with that is
it can be awkward to work with I guess
that got cut off it can be awkward to
work with control flow and I think
there's a reason why it can be awkward
to work with control flow and it's
because of the types of graphs that
these libraries are actually
manipulating so we say compute graph a
lot we say data flow graph a lot data
flow graph has a pretty restricted
meaning and it means that the nodes in
your graph do computation and the edges
are data and there's no room for control
flow in a graph that is a data flow
graph right so static data flow is the
type of graph that nn and Caffe use
because all the ops are the nodes and
the edges are just the data and the
graph can't change JIT data flow just in
time compiled data flow like autograd
and Chainer has the same
characteristics but the graph can change
from iteration to iteration because we
wait until you're done computing the
forward pass to build the graph in the
middle there's kind of a hybrid and I
don't know what to call that graph type
the ops are nodes the edges are data but
then there's special information that
the runtime gets in order to expand
control flow or for loops so scan in
Theano is an instance of this where the
Theano runtime has special information
that allows it to make scan work but
it's kind of conspiring
with the graph data type to do that
there's actually another graph type that
naturally expresses control flow and
data flow together that I haven't seen
implemented in a deep learning library
it's called sea of nodes from Cliff
Click's thesis in the mid-90s it seems
like a really natural thing to try and
maybe that's something that comes up
in the future
but that's kind of a big question mark
maybe one of you will try that out
and see how well it works so in practice
this level of granularity can sometimes
slow us down having to work with
addition and multiplication can be nice
if you want to try crazy stuff but if
you know you want to make a convnet
why don't you just rush all the way over
to the left if you want to take you know
inception and add another layer then
you want to use the type in the middle
and autograd allows you to do that so
I'll just kind of walk through writing a
neural net three ways very quickly and
then close and take
questions shortly thereafter so using
the fully granular approach there's a
lot of text on the screen but the top
half is basically let's instantiate our
parameters the way that we want to and
then here just like I've showed you in
previous slides let's do a multiply and
let's do an addition and put it through
non-linearity we're being very explicit
right so we're breaking all the
abstraction boundaries and we're just
using primitive operations we can use
the layer based approach so in autograd
we have a facility to turn all of the nn
modules of which there are a lot maybe
an exhaustive list for what you'd
want to use for standard deep learning
applications you can turn them into
functions and then just use them so
linear one on the linear parameters and
your input and some activation you can
go through your neural network this way
so you can use a layer based approach if
you want and if you just want your
network just a feed-forward neural
network we've got a couple of these kind
of standard models just ready to go so
you can just say give me a neural
network give me log softmax and a loss
and let me glue these guys together so
you can do it any of those three ways
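
The three levels of granularity can be compared side by side in a small sketch. Everything here is plain Python on lists rather than torch tensors, and the helper names (`matvec`, `linear`, `make_mlp` and so on) are invented for the illustration; the slides use torch and autograd functions instead. The point is only that the same network can be written from primitive operations, from layer functions, or requested as a ready-made model.

```python
import math

def matvec(W, x):
    """Matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tanh_vec(v):
    return [math.tanh(a) for a in v]

# 1) Fully granular: explicit multiply, add and nonlinearity.
def net_primitive(params, x):
    h = [a + b for a, b in zip(matvec(params["W1"], x), params["b1"])]
    h = tanh_vec(h)
    return [a + b for a, b in zip(matvec(params["W2"], h), params["b2"])]

# 2) Layer based: each layer is a function built from its parameters.
def linear(W, b):
    return lambda x: [a + bi for a, bi in zip(matvec(W, x), b)]

def net_layers(params, x):
    layer1 = linear(params["W1"], params["b1"])
    layer2 = linear(params["W2"], params["b2"])
    return layer2(tanh_vec(layer1(x)))

# 3) Ready-made model: ask for a whole feed-forward network at once.
def make_mlp(layer_params):
    def model(x):
        for i, (W, b) in enumerate(layer_params):
            x = linear(W, b)(x)
            if i < len(layer_params) - 1:   # nonlinearity between layers only
                x = tanh_vec(x)
        return x
    return model

params = {
    "W1": [[0.1, -0.2], [0.3, 0.4]], "b1": [0.0, 0.1],
    "W2": [[1.0, -1.0]],             "b2": [0.5],
}
x = [1.0, 2.0]

out_primitive = net_primitive(params, x)
out_layers = net_layers(params, x)
out_model = make_mlp([(params["W1"], params["b1"]),
                      (params["W2"], params["b2"])])(x)
```

All three produce the same output for the same parameters; what differs is how much of the inside you get to see and change.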
autograd at Twitter has had a pretty
cool impact we use nn for a lot of stuff
and we use autograd as well but being
able to reach for autograd to try
something totally crazy and just knowing
that you're going to get the right
gradients has really accelerated the
pace of high risk potentially high
payoff attempts that we make so one
crazy thing you might want to try is
experiment with loss functions so
instead of I have a hundred image
classes and I want to have my
convolutional neural network be good at
classifying these hundred image classes
maybe you have a taxonomy of classes
maybe you have a vehicle and then a bus
a car and a motorcycle and if you guess
any one of those you kind of want
partial credit for vehicle or if you
guess motorcycle you want partial credit
for car so building that kind of a
tree loss is actually really
straightforward in autograd and you can
do that in just one sitting but it might
be more complicated to do that in other
libraries where you have to crack open the
abstraction barrier write your own
partial derivatives glue it back
together and then use that module that
you've built we've trained models that
are in production in autograd so this
is something that's battle-tested in a
sense and is running on a
large amount of media at Twitter in a sense
autograd doesn't actually matter when
you're running in production because you
just you have your function definition
for your prediction of your neural
network and then the gradient part just
goes away so all the fancy stuff
where we replace torch functions with our secret you
know listener functions all that just
goes away and you just have some
numerical code so there's actually no
speed penalty at test time at all and we
have an optimized mode which does a
little bit of compiler stuff still a work
in progress but for the average model
it's as fast sometimes faster than nn
and for really complicated stuff if you
wrote that by hand you'd probably be
faster but the time to first model fit
using autograd is dramatically reduced
because you don't have to worry about
correctness so this is a big wall of
text but it's meant to put in your head
some ideas of things from automatic
differentiation from that world that we
don't have yet that we really want right
to be able to train models faster and
better so the first is checkpointing
this is not checkpointing where you
save your model every 10 iterations this
is checkpointing where on your forward
pass in normal reverse
mode automatic differentiation you have
to remember every single piece of
computation you do because you might
need it to calculate the derivatives in
checkpointing you just delete them you
let them go away because you think that
some of those might actually be easier
to recompute than to store alright so
for point wise nonlinearities for
instance it might be easier once you've
loaded your data just to recompute the
ReLU as opposed to saving the result of
ReLU and loading that back in again
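
A minimal sketch of that trade, assuming a chain of scalar steps `relu(w * a)`: the first version saves every intermediate value, the second keeps only every fourth one and recomputes the rest during the reverse pass. The function names and the "number of saved values" used as a memory measure are choices made for this illustration, not something from the talk's slides.

```python
def relu(a):
    return a if a > 0 else 0.0

def step(w, a):
    return relu(w * a)

def dstep(w, a):
    """Derivative of step with respect to its input a."""
    return w if w * a > 0 else 0.0

def grad_store_all(ws, x):
    """Plain reverse mode: remember the input of every step."""
    saved = []
    a = x
    for w in ws:
        saved.append(a)
        a = step(w, a)
    g = 1.0
    for w, a_in in zip(reversed(ws), reversed(saved)):
        g *= dstep(w, a_in)
    return g, len(saved)

def grad_checkpointed(ws, x, every=4):
    """Keep one saved input per segment and recompute inside each segment."""
    checkpoints = {}
    a = x
    for i, w in enumerate(ws):
        if i % every == 0:
            checkpoints[i] = a           # everything else is thrown away
        a = step(w, a)
    g = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        segment = ws[start:start + every]
        inputs = []
        a = checkpoints[start]
        for w in segment:                # recompute this segment's forward pass
            inputs.append(a)
            a = step(w, a)
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, a_in in zip(reversed(segment), reversed(inputs)):
            g *= dstep(w, a_in)
    return g, peak

ws = [0.5, 2.0] * 8                      # 16 steps
g_all, n_all = grad_store_all(ws, 1.0)
g_ckpt, n_ckpt = grad_checkpointed(ws, 1.0, every=4)
```

Both versions return the same gradient; the checkpointed one holds at most 8 values at a time instead of 16, at the cost of running the forward computation a second time.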
mixing forward and reverse mode is
something that you can imagine being
important for kind of complicated
architectures although I don't really
know how much impact that would have so
in the chain rule you can either go from
left to right or you could start in the
middle and go out you can do all kinds
of crazy stuff if you want and we really
just do reverse mode for diamond shaped
graphs where your computation explodes
out and then comes back in it might be
useful to start with forward mode and
then finish with the reverse mode or for an
hourglass you might want to start with
reverse mode and end with forward mode
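
For contrast with the reverse mode used everywhere else in this talk, forward mode can be sketched with dual numbers: each value carries its derivative along with it, and one pass is needed per input. The `Dual` class and `forward_mode_grad` are hypothetical names for this sketch, which supports only add and multiply and does not attempt the mixed forward and reverse scheme mentioned above.

```python
class Dual:
    """A value together with its derivative with respect to one chosen input."""
    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule, applied as the computation runs forward.
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)
    __rmul__ = __mul__


def forward_mode_grad(f, xs):
    """Gradient of f at xs using one forward pass per input."""
    grads = []
    for i in range(len(xs)):
        # Seed a tangent of 1 on input i and 0 on every other input.
        args = [Dual(x, 1.0 if j == i else 0.0) for j, x in enumerate(xs)]
        grads.append(f(*args).tangent)
    return grads


def f(a, b):
    return a * b + a

g = forward_mode_grad(f, [3.0, 4.0])   # [5.0, 3.0]
```

Forward mode costs one pass per input, while reverse mode costs one pass per output, which is why a loss with millions of parameters and a single scalar output favours reverse mode.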
stencils are a generalization of
convolutions that people use a lot in
computer graphics automatically
calculating
really efficient derivatives of image
processing just general image processing
algorithms is under active investigation
in the graphics world and in the
computer vision world so these are two
references that are kind of neat papers
source to source transformations is
something that hasn't really made it it
basically has kind of been dormant for
about ten or fifteen years so the gold
standard used to be you take a piece of
code as text and you output another
piece of code as text what we're doing
now in deep learning is we're always
building runtimes we're always building
some domain-specific layer that depends
on you actually running code it used to
be that you just read that text and kind
of like a compiler spit out the gradient
this this was the gold standard it might
not be now but I think it's worth
reinvestigating and then higher order
gradients so Hessian vector products and
kind of Hessian based optimization maybe
doesn't always have full payoff I
actually don't recall hearing anything
about this at this school so far because
it's very expensive and difficult to do
expensive computationally the Hessian is
just if you take the grad of f it gives
you the gradients if you want the second
derivative right so you take grad of grad
of f so there's efficient ways to do
this it's still kind of an open problem
but there are libraries out there the
Python version of autograd does as well
DiffSharp and Hype both also do this as
well so to kind of close out you should
just try it out it's really easy to get
it if you have anaconda if you use
Python we've made it so that Lua is
fully installable with anaconda so if
you're already using it it's very very
easy to get all of the tools that I've
showed you today and that's kind of the
single line to interface with it and if
you have any questions you can find me
on Twitter or email or github but I'm
happy to to answer any questions that
you have
oh yeah
I have no idea
thanks thanks for the great talk
oh yeah I was wondering what's the state
of the data visualization facilities in
Lua compared to say Python if I'm frank
it's not as good python has been at
this for you know five ten years really
actively building matplotlib and you
know Seaborn and all these other
libraries and in Lua we're importing
other people's work so Bokeh.js is
really the best that i've seen so far
and that's something you can use in a
notebook so you have the full suite of
that of that particular library yeah
hey thanks for the talk is it possible
to convert a model trained with torch
into a C model that's deployable in you
know production we just run torch in
production we use a Lua model but you
want to run it in C so the whole
layer of torch that's actually doing the
work is in C and calling torch from C I
don't have a specific website I can
point you to but you can very easily
call and execute a Lua script from C
it's like three or four lines of code in
C thank you a follow-up to the question
about C just now just like if I'm
gonna compile I mean I want to have torch
in my C++ code what kind of
overhead do I see
as you mentioned yourself like I have a
10,000 line Lua just-in-time compiler
I need to put that in there right or
can I avoid that because for example I
think about if I'm going to put this
in an embedded system that has a small
amount of resources during inference
time so I'm sorry yeah during
inference time there's there's no
appreciable overhead if I'm
understanding your question right so you
you are importing Lua so in your C
code you're going to basically say Lua
please run this Lua script and that's
going to call out into other C code
so all this overhead I talked about with
autograph that's training time that
doesn't exist at test time at all so so
during test time but the thing is I
still need to have Lua compiled into my C
code right yeah so this is something
people have been doing for like 15 20
years it's pretty mature so Lua is in
like microwaves for instance people have
done very embedded applications of Lua
yeah I think the binary for Lua is like I
don't want to it's like around it's
kilobytes it's very very small it's
10,000 lines of code so it compiles
down small
so there's a question from the twitters
it says I'm using a combination of Keras
and tensorflow why should I use torch or
autograd if you're happy then you know
that's great
I guess so people tend to reach for
torch when they would like to be able to
reason very easily about performance the
more compiler
infrastructure that gets added to a deep
learning environment the harder it can
be for the end user away from the
people that originally made the library
it can be harder for the end user to reason
why is this slow why is this not working
you might eventually see some github
issue later my network is slow in these
conditions and then it gets closed a
year after you had to have shipped your
project right I mean these things can
happen it's not the fault of anybody
it's just that torch was designed to
basically be very thin a thin layer over
C code so if that's something that you
care about torch is a really good thing
to reach for if Keras and tensorflow
is working great for you then keep deep
learning you know that's awesome so I'm
trying to see
it's hard to filter
where will the slides be posted it's not
a deep learning question but they will
be posted that's the answer to that
question I have a question now how do I
access through so normally all the web
services in production generally are
you know a Flask-based application
in Python or you know Java based Web
Services right or maybe in you know in
the cellphone through Android which is
also Java right so how do you call these
models which were you know trained in
torch how would you actually access
those there's a couple different ways
you can do that if you're using a
feed-forward neural network writing the
Java code to do the matrix multiplies
can be pretty straightforward and we've
actually done that before where it's just
simpler to just write the deep learning
code load in the weights we'll serialize
it however you know it needs to be
loaded that's one approach it's kind of
you know hacky and short term at Twitter
we've engineered a system where we
actually have Lua virtual machines
running inside of Java and we talk
over the JNI so we have like a more
permanent solution for that but if
you're using standard model
architectures you might try to serialize
your weights and then use the native
deep learning library that exists to
load up those weights and then run forward
and that with some debugging I think
that's a perfectly fair approach if you
have this split between testing and kind
of deployment where you're constrained
by language or environment that's
generally the thing that you know I mean
you do basically just you know serialize
your model and then try to read it what
about the latency actually so related to
you know this so when you serialize
that hackish way at least you can get
you know that latency thing sorted out
but is there any plan basically to have
you know interfaces available for other
languages so that you know you don't
have to do this extra step of
serializing and then you know loading
into language if you if you don't like
in your case you were mentioning that in
Twitter you have
Lua available inside your Java JVM or
access to the JVM using JNI so what
what what impact does it have on the
latency and by latency you mean time to
ship the model not the latency of how
long it takes to make predictions oh that's
gonna be very engineering dependent so
if you're calling torch from C code the
latency is not appreciable over if
you're just running Lua code and that
can be extremely fast if you're going
through some wrapper like through the
JNI or something like that you will
incur an overhead and you should just
try to pick the interfaces that reduce
that as much as possible even if you incur
engineering overhead to do so I don't
know if that answers your question I'm a
little bit distant from the server side
so I can't give you I just don't know
but generally I think what I can say
that's fair is we're constrained by
machine learning you know model
complexity latency we are not
constrained by overhead of like figuring
out how to actually get those
predictions like to an HTTP request for
instance serving which you know which is
kind of sort of solving this problem
yeah not that I'm aware of
again the torch community is not
centralized and so people could be
working on a totally awesome
you know complement to the the
tensorflow server but I am not aware of
it thank you okay we're going to take a
short break of 15 minutes
let's thanks Alex again