Theano Tutorial (Pascal Lamblin, MILA)
OU8I1oJ9HhI • 2016-09-27
Kind: captions
Language: en
okay, so today I'm going to briefly introduce you to Theano, how to use it, and go over the basic principles behind the library. If you paid attention during yesterday's presentation of TensorFlow, some concepts will be familiar to you, and if you paid attention to Hugo Larochelle's introductory talk, you'll see some similar concepts as well. There are going to be four main parts. The first one is, well, this slide: an introduction to what the concepts of Theano are. There is a companion IPython notebook that's on GitHub, so if you go to that page or clone that GitHub repository, there is an IPython notebook that basically has all the code snippets from these slides, so that you can run them at the same time. Then we're going to have a more hands-on example, basically applying logistic regression to the MNIST digits dataset, and then if we have time we'll go quickly over two more examples: the basic LeNet architecture, and an LSTM for character-level generation of text. So Theano is, we
can say, a mathematical symbolic expression compiler. So what does that mean? It means that it makes it possible to define expressions that represent mathematical expressions using NumPy syntax, so it's easy to use, and it supports all the kinds of basic mathematical operations, like min, max, addition, subtraction, all the basic things, not only larger blocks like layers of neural nets, whole networks, or things like that. It makes it possible to manipulate those expressions through graph substitutions, cloning and replacement, things like that, and also makes it possible to go through that graph and perform things like automatic differentiation (symbolic differentiation, actually), or the R-operator for forward differentiation, and to apply some optimizations for increased numerical stability. Then it's possible to use that optimized graph and Theano's runtime to actually compute some output values given inputs. We also have a couple of tools that help debug both Theano's code and the user's code, and to inspect and understand better what's actually happening when you're using Theano.
So, Theano is currently more than 8 years old. It started small, with only a couple of contributors from the ancestor of MILA, which was called LISA at the time, and it grew a lot. We now have contributors from all over the world, users from all over the world, and it's been used to drive a lot of research papers and prototypes for industrial applications, in startups and in larger companies. Theano has also been the base of other software projects that build on top of it. For instance, Blocks, Keras and Lasagne are machine learning / deep learning libraries that use Theano as a backend and provide a user interface that is higher level, with concepts of layers, of training algorithms, this kind of thing, whereas Theano is more of a backend. sklearn-theano as well, which is nice because it has a converter to load Caffe models from the Caffe model zoo and use them in Theano, and it does a lot of other things too. PyMC3 actually uses Theano not to do machine learning but for probabilistic programming. And we have two other libraries, Platoon, which MILA is developing, and Theano-MPI, developed as well, which are layers on top of Theano to help train on multiple machines and multiple GPUs, with some level of model parallelism and data parallelism. So, how
to use Theano? Well, first of all, we are working with symbolic expressions and symbolic variables that will make up a computation graph, so let's see how to do that. We define the expression first, then we compile a function, and then we execute that function on values. To define the expression, we start by defining inputs. The inputs are symbolic variables that have some type, so you have to define in advance whether a given variable is, say, a vector or a matrix, and what its data type is (floating point, integer, and so on). So things like the number of dimensions have to be known in advance, but the shape is not fixed, the memory layout is not fixed, so you could have shapes that change between one mini-batch and the next, or between different calls to the function in general. So x and y here are purely symbolic variables; we will give them values later, but for now they're just empty.
There's another kind of input variable, which is shared variables. They're symbolic, but they also hold a value, and that value is persistent across function calls; it's shared between different Theano functions. They're usually used, for instance, for storing the parameters of the model that you want to learn, and these values can be updated as well. So here we create two shared variables from values: this one has two dimensions, because its initial value has two dimensions, and this one has only one, so that's basically a weight matrix and a bias. We can name variables by assigning to the name attribute. Shared variables do not have a fixed size either; they are usually kept fixed in most models, but it's not a requirement.
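As a toy emulation of what a shared variable gives you (this is my own illustrative Python, not Theano's implementation, and the names and sizes are made up):

```python
import numpy as np

class Shared:
    """Toy stand-in for a Theano shared variable: symbolic in the graph,
    but also holding a persistent value that survives across calls."""
    def __init__(self, value, name=None):
        self._value = np.asarray(value)
        self.name = name

    def get_value(self):
        return self._value.copy()

    def set_value(self, new_value):
        self._value = np.asarray(new_value)

# a 2-d weight matrix and a 1-d bias, as in the slides
w = Shared(np.zeros((4, 3)), name='w')
b = Shared(np.zeros(3), name='b')

# the value persists and can be updated between "function calls"
b.set_value(b.get_value() + 1.0)
```

In real Theano, `get_value` and `set_value` are exactly how you read and write the contents of a shared variable from Python.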
Then, from these inputs, we can define expressions that will build new variables, intermediate variables, which are the result of some computation. So for instance, here we can take the product of x and W, add the bias, apply the sigmoid function on that, and say this is our output variable, and from the output out and from y we can define, say, the squared-error cost. Those new variables are connected to the previous ones through the operations that we define, and we can visualize the graph structure by using, for instance, pydotprint, which is a helper function. So variables are those square boxes, and we have other nodes, called apply nodes, that represent the mathematical operations connecting them. Input variables and shared variables do not have any ancestors, they don't have any arrows pointing into them, but then you see the intermediate results, and more of them.
Usually when we visualize, we don't necessarily care about all the intermediate variables unless they have a name or something, and so this is a simplified version of exactly the same graph, where we hide the unnamed intermediate variables, but you can still see all the operations, and you can actually see the types on the edges. So once you have defined some graph, say the forward computation of your model, you want to be able to use backpropagation to get your gradients. This is just the basic concept of the chain rule: we have a scalar cost, we have intermediate variables that here are vectors, and here is just the general rule, starting from the cost. The whole derivative of, say, that function g is actually a whole Jacobian matrix that's m-by-n, if the intermediate variables are vectors of size n and m.
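Written out (in my notation: C is the scalar cost and g maps an n-vector x to an m-vector), the chain rule and Jacobian being described are:

```latex
\frac{\partial C}{\partial x_j} \;=\; \sum_{i=1}^{m} \frac{\partial C}{\partial g_i(x)}\,\frac{\partial g_i(x)}{\partial x_j},
\qquad
J = \left[\frac{\partial g_i}{\partial x_j}\right] \in \mathbb{R}^{m \times n}
```

so backpropagation only ever needs the product of a row vector with J, never J itself.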
And usually you don't need that; it's actually usually a bad idea to compute it explicitly, unless you need it for some other purpose. The only thing you need is an expression that, given any vector representing the gradient of the cost with respect to the output, will compute the gradient of the cost with respect to the input: so, basically, the dot product between that vector and the whole Jacobian matrix. That's also sometimes called the L-operator, and almost all operations in Theano implement a method that returns that, and it actually returns not numbers, not a numerical value for that, but a symbolic expression that represents that computation.
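Here's a NumPy sketch of that idea for an elementwise sigmoid (the names are mine, and everything here is numeric, whereas a Theano op's grad method returns symbolic expressions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, -1.0, 2.0])
s = sigmoid(x)

# explicit Jacobian of the elementwise sigmoid: a diagonal n-by-n matrix
J = np.diag(s * (1.0 - s))

# the "L-operator": the dot product of an upstream gradient vector with
# the Jacobian, computed without ever building J explicitly
v = np.array([1.0, 2.0, 3.0])   # gradient of the cost w.r.t. the output
lop = v * s * (1.0 - s)         # gradient of the cost w.r.t. the input

# both routes agree, but the second never materializes the Jacobian
assert np.allclose(lop, v @ J)
```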
Again, this is usually done without having to explicitly represent or define that whole Jacobian matrix. So you can call theano.grad, which will backpropagate through the graph, from the cost towards the inputs that you give it, and along the way it will call that grad method of each operation. Backpropagating means starting from one for the cost, going back through the whole graph, accumulating when the same variable is used more than once, and so on. And again, here, this is the gradient with respect to W, and this is the gradient with respect to b; they are symbolic expressions, the same as if you had manually defined the gradient expressions using Theano operations like the dot product, the sigmoid and so on that we've seen earlier. So we have non-numerical values at that point, and they are part of the computation graph: the computation graph was extended to add these variables, and we can continue extending the graph from these variables, for instance to compute update expressions corresponding to gradient descent, something like that,
like we do here. So for instance, this is what the extended graph for the gradient looks like. You see there are a lot of small operations that have been inserted, and as outputs you actually have here the gradient with respect to the bias, which is both an output and an intermediate result that helps compute the gradient with respect to the weights. And here's the graph for the update expressions: you have as intermediate variables the gradients that we had on the previous slide, and then this uses the scaled version, with the constant 0.1 that's somewhere in the graph.
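Numerically, the scaled update being described is just this (a NumPy sketch; the 0.1 is the learning rate from the slide, and the actual numbers are made up):

```python
import numpy as np

learning_rate = 0.1
w = np.array([0.5, -0.3])          # current value of a shared variable
grad_w = np.array([0.2, -0.1])     # pretend gradient of the cost w.r.t. w

# the "update expression": the new value the shared variable will take
w_new = w - learning_rate * grad_w
```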
So once we have defined the whole graph, the whole expression that we actually care about, from the inputs and initial weights to the weight updates for our training algorithm, we want to compile a function that will be able to actually compute those numbers given inputs, and perform the weight updates. To compute values, what we do is call theano.function, and you provide it with the input variables that you want to feed and the output variables that you want to get. You don't necessarily have to provide values for all the inputs that you might have declared, especially if you don't want to go all the way through to the end of the graph; you can have a function that only computes a subset of expressions, for a subset of the graph. For instance, we can have a predict function here that goes only from x to out: we don't need values for y, and the gradients and so on will not be computed; it's just going to take a small part of the graph and make a function out of it. So that's it: you can compile it, then call it to get values. You have to provide values for all the input variables that you defined; you don't have to provide values for the shared variables W and b that we declared earlier, because they are implicit inputs to all the functions, and their value will automatically be fetched when it's needed. We can declare other functions, like a monitoring function that computes both the output and the cost, so you have two outputs, and you also need the second input y. And you can compile a function that does not start from the beginning: for instance, if I want an error function that only computes the mismatch between the prediction and the actual targets, then I don't have to start from the input, I can just start from the prediction and compute the cost.
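In plain Python terms, compiling different functions over subsets of the same graph looks roughly like this (an emulation with ordinary functions, not Theano; the shapes and values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.full((3, 2), 0.1)   # stand-ins for the shared variables
b = np.zeros(2)

# "predict": only needs x, and stops partway through the graph
def predict(x):
    return sigmoid(x @ w + b)

# "monitor": needs the second input y too, and computes two outputs
def monitor(x, y):
    out = predict(x)
    cost = ((out - y) ** 2).mean()
    return out, cost

x = np.ones((4, 3))
out, cost = monitor(x, np.zeros((4, 2)))
```

The real theano.function does the same selection automatically: it walks back from the requested outputs and only includes the part of the graph they depend on.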
Then the next thing that you might want to do is update your variables; for training it's necessary. Again, you can pass to theano.function an updates argument: a list of updates, where updates are pairs of a shared variable and the symbolic expression that will compute the new value for that shared variable. So you can see updated W and updated b here as implicit outputs of the function: just like W and b were implicit inputs, updated W and updated b are implicit outputs that will be computed at the same time as the cost, and then, after all the outputs are computed, the updates actually take effect and the values are updated. So here, if we print the value of b before and after having called the train function, we see the value has changed. What also happens during graph compilation is that the graph that we selected for that particular function gets optimized. What we mean by that is that it's going to be rewritten in parts; there are some expressions that will be substituted and so on, and there are different goals for that.
Some are quite simple: for instance, if we have the same computation defined twice, we only want it to be executed once, and if you have expressions that are not necessary, you don't want to compute them at all. For instance, if you have x divided by x, and x is not used anywhere else, we just want to replace that by one. There are numerical stability optimizations: for instance, well, log of one plus x can underflow if x is really small, and this would give 0 whereas the result should be close to x; and things like log of softmax get optimized into a more stable log-softmax operation. It's also the time when in-place and destructive operations are inserted: for instance, if an operation is the last to be executed on some numbers, it can, instead of allocating output memory, just work in place on its input, and so on. Also, the transfer of the graph expressions to the GPU is done during the optimization phase. By default, Theano tries to apply most of the optimizations, so that you have a runtime that's almost as fast as possible, except for a couple of checks and assertions; but if you're iterating and want fast feedback, and don't care that much about the runtime speed, then you have a couple of ways of enabling and disabling some sets of optimizations, and you can do that either globally or function by function. So, to
have a look at, for instance, what happens during the graph optimization phase: here's the original, unoptimized graph going from the inputs x and W to the output prediction; it's the same one that we've seen before. If we compare that with the compiled function that goes from these input variables to out, which was called predict, this is what we have. I won't go into detail about what's happening in there, but here you have a Gemm operation, which basically calls an optimized BLAS routine that can do the multiplication and accumulation at the same time, and we have a sigmoid operation here that will work in place, destructively, on its input, which is denoted by the red arrow. If you have a look at, for instance, the optimized graph computing the expressions for the updated W and b: this was the original one, and the optimized one is much smaller. It also has in-place operations, and it has fused elementwise operations. For instance, if you have a whole tensor, and then you do an elementwise addition with a constant, and then a sigmoid, and then something else, and so on, you want to loop only once through the array, apply all these scalar operations on each element, and then go to the next, and so on, and not iterate over the whole array each time you want to apply a new operation. Those kinds of things happen often when you have automatically generated gradient expressions. Oh, and here you see the updates for the shared variables, which are inputs, and you see the cost and the implicit outputs for the updated W and b here and here. Another graph-visualization tool that exists is debugprint, which basically prints a text-based, tree-like structure of the graph, assigning arbitrary ids and printing the variable names and so on. So here you can see the structure in more detail, and you see the inputs of the Gemv and the scaling parameters
and so on. So, when the function is compiled, then we can actually run it: Theano functions are callable Python objects that we can call, and we've seen those examples here, for instance, where we call train and so on. But what happens, to have, say, an optimized runtime? It's not only the graph optimizations; we also generate C++ or CUDA code. For instance, for the elementwise loop fusion that I mentioned, we can't know in advance which elementwise operations will occur, in which order, in any graph that the user might define, so we have on-the-fly code generation for that: we generate a Python module written in C++ or in CUDA that gets compiled and imported back, so that we can use it from Python. The runtime environment then calls, in the right order, the different operations that have to be executed, from the inputs to the outputs, so that we get the desired results. We have a couple of different runtimes, and in particular there's one which was written in C++, which avoids having to switch contexts between the Python interpreter and the C++ execution engine. Something else that's really crucial for speed and performance
is the GPU. So, how to use a GPU in Theano: we wanted to make it as simple as possible in the usual cases. The new backend now supports a couple of different data types, not only float32 but double precision if you really need that, and integers as well, and we now have easier interaction with GPU arrays from Python itself, so you can just use Python code to handle GPU arrays outside of a Theano function if you'd like. All of that will be in the future 0.9 release that we hope to get out soon. To use it, well, you select the primary device that you want to use with just a configuration flag: for instance, you could ask for the first GPU that's available, or one specific one, and if you specify that in the configuration, then all shared variables will by default be created in GPU memory, and the optimizations that move the computation from CPU to GPU, that replace CPU operations by GPU operations, are going to be applied. Usually you want to make sure you use float32, or even float16 for storage, which is experimental, because most GPUs don't have good performance for
for for double precision so how you set
those configuration flags you have in
order that you never see configuration
file that you can it's just basic
configuration file from for for Python
you have an environment variable where
you can define those and the environment
variable overrides the config file and
you can also set things directly from
Python but some flags have to be known
in advance before you know is is
imported so for instance the device
itself you have to set it either in the
configuration file or throw flags
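For example (floatX and device are real Theano config options; the exact device string depends on which backend version you have, so treat this as a sketch):

```shell
# one-off, via the environment variable (overrides .theanorc):
#   THEANO_FLAGS=device=cuda0,floatX=float32 python train.py
#
# or persistently, in ~/.theanorc:
#   [global]
#   device = cuda0
#   floatX = float32
```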
So, I'm going to quickly go over more advanced topics, and if you want to learn more about those, there are other tutorials available online, and there's documentation on deeplearning.net. To have loops in the graph: we've seen that the expression graph is basically a directed acyclic graph, and we cannot have loops in there. One way, if you know in advance the number of iterations, is just to unroll the loop: use a for loop in Python that builds all the nodes for all the time steps.
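A sketch of that unrolling idea in NumPy (the step function here, a running tanh recurrence, is an arbitrary example of mine; in Theano the loop would build one graph node per time step instead of computing numbers):

```python
import numpy as np

n_steps = 5                        # known in advance, so we can unroll
x = np.linspace(-1.0, 1.0, n_steps)  # an input sequence
h = 0.0                            # initial state

# a plain Python for loop: unrolling builds n_steps copies of the step,
# rather than one scan node that loops at runtime
for t in range(n_steps):
    h = np.tanh(h + x[t])
```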
That doesn't work if you want, for instance, to have a dynamic size for the loop; for models that generate sequences, for instance, it can be an issue. What we have for that in Theano is called scan, and basically it's one node that encapsulates a whole other Theano function, and that inner function, or step function, represents the computation that has to be done at each time step. So you have an inner function that performs the computation for one time step, and you have the scan node that calls it in a loop, taking care of the bookkeeping of indices and sequences, feeding the right slice at the right point, and feeding back the outputs where needed. Having that structure also makes it possible to define a gradient for that node, which is basically another scan node, another loop, that goes backwards and applies backprop through time. And it can be transferred to GPU as well, in which case the internal function is going to be transferred to and recompiled on the GPU. There's an example of scan in the LSTM example later; this is just a small example, but we don't really have time for that. We also have visualization, debugging and diagnostic tools. One of the reasons this is important is that in Theano, like in TensorFlow, the definition of a function is separate from its execution, and if something doesn't work during the execution, if you encounter errors and so on, then it's not obvious how to connect that back to where the expression was actually defined. So we try to have informative error messages, and we have some compilation modes that make it possible, for instance, to check for NaN or infinite values. You can assign test values to the symbolic variables, so that each time you create a new symbolic intermediate variable, each time you define a new expression, the test value gets computed, and so you can evaluate on one piece of data at the same time as you build the graph, which can be useful to detect shape mismatch errors and things like that. It's possible to extend
Theano in a couple of ways: you can create an op just from Python, by calling Python wrappers for existing efficient libraries; you can extend Theano by writing C or CUDA code; and you can also add optimizations, either for increased numerical stability, for instance, or for more efficient computation, or for introducing your new ops in place of the naive versions that a user might have used. We have a couple of new features that have been recently added to Theano. I mentioned the new GPU backend with support for many data types; we've had some performance improvements, especially for convolutions, 2D and 3D, and especially on GPU; we made some progress on the time the graph optimization phase takes, and we have also introduced new ways of avoiding recompiling the same graph over and over again. And we have new diagnostic tools that are quite useful: an interactive graph-visualization tool, and pdb breakpoints that enable you to monitor a couple of variables and only break if some condition is met, rather than monitoring something every time, for every piece of data. In the future, well, we're still working on new operations on GPU; we still want to wrap more cuDNN operations for better performance, in particular the basic RNNs should be completed in the following days, hopefully, someone has been working on that a lot recently; and we want better support for 3D convolutions, still faster optimization, and more work on data parallelism as well. So, I want to thank most of my colleagues and the main Theano developers, and the people who contributed one way or another to the lab and the software development efforts, and of course the organizers of the summer school. The slides are available online, as I mentioned, as is the companion notebook, and there are more resources if you want to go further. And now I think that it's time to start the practical
examples. So, for those who have not cloned the repository yet, this is the command line you want to run; for those who had cloned it, you might want to do a git pull, just to make sure you have the latest version, and you can launch the Jupyter notebook on the repository itself. We have three examples that we are going to go over: logistic regression, the convnet, and the LSTM. So, I've launched the Jupyter notebook here; the intro Theano notebook is the companion notebook, there's nothing new in there, just the code snippets I've shown already. Okay, let's go with logistic regression. Is that big enough, or do we need to increase the font size? Okay. So,
I'm going to skip over the text, because you probably know already about the model. We've packaged the MNIST dataset with the repository on GitHub, so let's load the data, and here let's see how we define the model. It's basically the same way that we did in the slides: we define sizes that will be useful for the shared variables, we define an input variable, here a matrix, because we want to use mini-batches, and we have shared variables initialized from zeros. Then we define our model, so here's our predictor: the probability of the class given the input, here an affine model and then a softmax on top of it. And the prediction, if you want a hard prediction, is going to be the class of maximum probability, so an argmax over that axis, because we still want one prediction for each element of the mini-batch. Then we define the loss function, so here it's going to be the log-likelihood of the label given the input, or the cross-entropy, and we define it simply: we don't need to have one cross-entropy or log-likelihood operation by itself, we can just build it from the basic building blocks. We take the log of the probability, we take the index of the actual target, and then we take the mean of that to have the mean over the mini-batch. Then we derive the update rules: again, we don't have one gradient-descent object or something like that, we just build whatever rule we want. So yeah, we could use momentum by defining other shared variables that will hold the velocity, and then you have update expressions for both the velocity and the shared variable itself. Then we compile a training function going from x and y, outputting the loss, and updating W and b. So, while the code is getting generated and compiled and the graph is getting
optimized, let's see the next step. Well, we also want to monitor not only the log-likelihood but also the misclassification rate on the validation and test sets. That's simply how many elements are different between the prediction, which was the argmax, and the actual target, and the rate is the mean over the mini-batch, and we compile another Theano function outputting that, and not doing any updates, of course. To train the model, well, first we need to process the data a little bit: we want to feed the model one mini-batch of data at a time, so here we have, well, not really a generator, just a helper function that gives us mini-batch number i, and the same function is used for the training, validation and test sets.
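That helper can be sketched like this (a plain slicing function; the batch size and array shapes are my own assumptions):

```python
import numpy as np

batch_size = 100

def get_minibatch(data, targets, i):
    """Return minibatch number i of the given arrays."""
    start = i * batch_size
    stop = start + batch_size
    return data[start:stop], targets[start:stop]

# toy dataset: 500 examples of 2 features each
data = np.arange(1000).reshape(500, 2)
targets = np.arange(500)
xb, yb = get_minibatch(data, targets, 3)   # rows 300..399
```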
We define a couple of parameters for early stopping in the training loop; it's not necessary, it's just a way of knowing when to stop, and of using only the best model that was encountered during the optimization. So let's define that, and this is the main training loop. It's a bit more complex than it might be, but that's because we use this early stopping, and we only want to validate when we are confident that the training error has gone down enough. Basically, the most important part is: you loop over the epochs, unless you encounter the early-stopping condition, and during each epoch you loop over the mini-batches and call the train function; then every once in a while you validate and print some results, the validation error, so here we call the test function on the validation set for that, and then keep track of what the best model currently is, get the test error as well, and save the best one. To save the model, we usually just save the values of all the parameters, which is more robust than trying to pickle the whole Python object, and it also makes it easier to transfer to other frameworks, to visualization frameworks, and so on. So let's try to execute that.
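Saving parameter values rather than pickling objects can be sketched like this (get_value is the real shared-variable accessor; the surrounding dictionaries and values are illustrative):

```python
import numpy as np

# pretend current values of the shared variables W and b,
# as returned by their get_value() methods
params = {'W': np.ones((3, 2)), 'b': np.zeros(2)}

# snapshot just the numeric values of the best model so far
best_params = {name: value.copy() for name, value in params.items()}

# training keeps mutating the live parameters...
params['W'] += 0.5

# ...but the snapshot is unaffected, and could be written out with
# np.savez or restored later via set_value()
```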
Of course, it's a simple model and the data is not that big, so it should not take that long. You see that at the beginning, well, at almost every iteration we are better on the training set, and then after a while the progress is slower, and... okay, so just wait a little bit more, it seems to stall more and more, and... okay, here it's the end, after 96 epochs. So now, if we want to visualize what filters were learned, or what the final trained model looks like, we're just using a helper function here to visualize the filters. It's not really important, but what we do is call get_value on the weights to access the internal value of the shared variable, and then we use that to plot the different filters. And we can see it's kind of reasonable: this is the filter for class zero, and you can kind of see a zero; what's important for the two is to have, like, an opening here, and so on. So yeah, if we have a look at the final error, well, we're not plotting the training error, but the validation and the test error are quite high, and we know that the human-level error on this task is quite low, so it really means that the model is too simple and we should use something more
advanced. To use something more advanced, if you go back to the home of the Jupyter notebook, you can have a look at the convnet one and run the LeNet example. This new example uses basically the same data, it's still MNIST, because it has the advantage of training fast, even on an older laptop, but this time we're going to use a convolutional net: we'll have a couple of convolution layers, then fully connected layers, and then the final classifier. I'm going to make sure that floatX is float32 here, and let's see how we could use Theano to define helper classes that are layers, so it's easier for a user to compose them if they want to replicate some results or use some classical architectures. This is done usually in frameworks built on top of Theano, like Keras, like Blocks, like Lasagne; some people also develop their own mini-framework, with their own versions of layers and so on, that they find useful and intuitive. So,
this logistic regression layer basically holds, well, parameters, weight and bias, and computes, well, the conditional probability of classes and the prediction; it holds the params, and it has expressions for the negative log-likelihood and the errors. So if you were to use only that class, it would be doing essentially the same as what we did by hand in the previous notebook. In the same way, we can define a layer that has convolution and pooling: again, in the init method we pass it, well, the filter shape, the image shape, the size of the pooling, and so on; we initialize the weights using the formula from Glorot and Bengio (2010), and the biases from zeros; and then, from the inputs, well, we compute the convolution with the filters, we then compute max pooling, and we output, well, the tanh of the pooling plus the bias. And here the bias is only one number for each channel, which means that you don't have a different bias for each location in the image, so you could actually apply such a layer to images of various sizes without having to initialize new parameters or retrain it.
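The Glorot & Bengio (2010) initialization mentioned here draws weights uniformly from a bound based on the fan-in and fan-out of the layer; a NumPy sketch of the common form (the layer sizes are made-up examples):

```python
import numpy as np

rng = np.random.default_rng(1234)

fan_in, fan_out = 784, 500                     # example layer sizes
bound = np.sqrt(6.0 / (fan_in + fan_out))      # Glorot uniform bound

# weights uniform in [-bound, bound], biases start at zero
W = rng.uniform(-bound, bound, size=(fan_in, fan_out))
b = np.zeros(fan_out)
```

This keeps the variance of activations and gradients roughly constant across layers, which is why it became the default in many Theano-based frameworks.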
Then, in the same way, we define the hidden layer, which is just a fully connected layer, again initializing the weight and bias, and an expression, the symbolic expression, going from the input and the shared variables to the output after the activation; and again we collect the parameters, so that we know what we will want to train. Then here's a function that has the main training loop: we have a mini-batch generator again, the same as before, and here we are building the whole graph. So it's always the same process: we define symbolic input variables, a matrix and a vector of ints here; lvector is a vector of long integers, because the targets here are indices, and not one-hot vectors or masks or something like that. We create the first layer, which is a LeNet conv-pool layer, with its sizes, and we want to have the next one as well; so yeah, here the image size changes. This is mostly for efficiency, actually: you don't really have to pass that for those particular models, but you still need the shape of the filters; I mean, you have the filters anyway, and then it's useful to have those sizes still, because even if the convolution layers can handle arbitrarily sized images, after that we want to flatten the whole feature maps and feed that into a fully connected layer, and then into the projection layer, so this one has to be fixed: we have to know what dimensions the last conv layer will have. And here we go:
a fully connected layer, and the output layer, which is just the logistic regression class, same as before. We want the final cost to be the log-likelihood of that; we have again the errors, which is the misclassification rate; the parameters are the concatenation of the parameters of all layers; and once we have that we can build the gradients, so just one call of grad of the cost with respect to the parameters, and the updates.
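The plain SGD update the talk describes is just p ← p − lr·g for each parameter. A minimal sketch in plain numpy (names illustrative; in Theano this would be a list of (shared variable, update expression) pairs passed as `updates=` to `theano.function`):

```python
import numpy as np

def sgd_updates(params, grads, lr=0.1):
    # One plain SGD step: p <- p - lr * g for each parameter.
    return [p - lr * g for p, g in zip(params, grads)]

W = np.array([1.0, 2.0])
gW = np.array([0.5, -0.5])
(W_new,) = sgd_updates([W], [gW])
print(W_new)  # [0.95 2.05]
```

Swapping in momentum, Adagrad, or Adadelta only changes how the update expression is built from the gradient; the rest of the training loop stays the same.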
So again, just regular SGD, but we could have a class or something that performs momentum, Adagrad, Adadelta, whatever you need. We compile the function, and here we have again the early stopping routine, with the same main loop: for all epochs, until we are done, loop over the mini-batches, validate every once in a while, and stop when it's finished. So let's just declare that; loading the data is exactly the same as before, and here we can actually run that. So this was the result of a previous run, which took 5 minutes, so I will probably not have time to do that, but here you can see basically what happens, and if you want to run it or try that during the lunch break, or later, you're welcome to play with it. And after that, yeah, you can visualize the learned filters as well; here you have them for the first layer and for the second,
and here you have an example of the activations of the first layer for one example. So we have just a little bit more time to cover the LSTM tutorial, I mean, example. So if you go back to the home of the Jupyter notebook and go to the LSTM one. So this model is an LSTM network that tries to predict the next character of a sentence given the previous ones.
So I'm not going to go into details, but here you can see that the LSTM layer is defined, with shared variables for all the matrices that you need, and the different biases for the different gates and so on. So you have a lot of parameters; it would be possible, and sometimes more efficient, to actually define say only one variable that contains the concatenation of a couple of matrices, and that way you can do a more efficient, bigger matrix-matrix multiply, but this is just one simple implementation. And here's an example of how to use scan for the loop. So here we define the step function, which takes, well, a couple of different inputs: you have the different activations and so on from the previous time steps, you have the current sequence input, and so on. And from them, here are basically the LSTM formulas, where you have the dot products, and sigmoid or tanh of the different connections inside the cell, and in the end you have the hidden state and the cell state. So once you have that, that step function is going to be passed to theano.scan, where the sequences are the mask and the input. So the mask is useful because we're using mini-batches of sequences, and not all the sequences in the same batch have the same length. Also, for efficiency, we usually want to group examples of similar length together, but they may not always be exactly the same length, so in that case we pad, but only to the longest sequence in the mini-batch, not the longest sequence in the whole set, just for the mini-batch. But we still have to pad, and remember what the lengths of the different sequences are, in order for us to correctly predict and backpropagate.
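As a rough sketch of what such a step function computes, here is a standard LSTM step in plain numpy (the tutorial's actual variable names and gate ordering may differ; this version also uses the concatenation trick the talk mentions, with one (n_in, 4·n_hid) matrix giving all four gate pre-activations in a single multiply):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W: (n_in, 4*n_hid), U: (n_hid, 4*n_hid), b: (4*n_hid,)
    # One matrix multiply computes all four gate pre-activations at once.
    n_hid = h_prev.shape[-1]
    z = x_t.dot(W) + h_prev.dot(U) + b
    i = sigmoid(z[..., 0 * n_hid:1 * n_hid])   # input gate
    f = sigmoid(z[..., 1 * n_hid:2 * n_hid])   # forget gate
    o = sigmoid(z[..., 2 * n_hid:3 * n_hid])   # output gate
    g = np.tanh(z[..., 3 * n_hid:4 * n_hid])   # candidate cell value
    c_t = f * c_prev + i * g                   # new cell state
    h_t = o * np.tanh(c_t)                     # new hidden state
    return h_t, c_t

rng = np.random.RandomState(0)
n_in, n_hid = 5, 4
h, c = np.zeros(n_hid), np.zeros(n_hid)
W = rng.randn(n_in, 4 * n_hid) * 0.1
U = rng.randn(n_hid, 4 * n_hid) * 0.1
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.randn(n_in), h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

In the Theano version, a function with this signature is what gets passed to theano.scan, which applies it over the time axis and returns the hidden and cell states for every step.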
So let's define that. Here we define the cost function, which is the categorical cross-entropy of the sequence, and here again you see that the mask is used, so that we don't consider the predictions after the end of the sequence. Logistic regression, the same as before, gives the final cost. Here, for processing the data, we're using Fuel, which is another tool being developed by students at MILA, and it's nice because it can read from just plain text data and do some preprocessing on the fly, including the things that I mentioned earlier, like grouping sequences by similar length and then shuffling them and padding and doing all of that. And so it outputs a generator that you can then feed, in your main loop, through a Theano function, so that whole processing happens outside of Theano, and then the processed values are fed into the Theano function.
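The padding-and-masking part of that preprocessing can be sketched in plain numpy (illustrative only; Fuel's actual transformers have their own API, and `pad_batch` is a made-up name):

```python
import numpy as np

def pad_batch(seqs, pad_value=0):
    # Pad to the longest sequence *in this mini-batch* and build a 0/1 mask
    # that marks which positions are real data.
    maxlen = max(len(s) for s in seqs)
    batch = np.full((len(seqs), maxlen), pad_value, dtype='int64')
    mask = np.zeros((len(seqs), maxlen), dtype='float32')
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = 1.0
    return batch, mask

batch, mask = pad_batch([[5, 3, 7], [2, 9]])
print(batch.tolist())  # [[5, 3, 7], [2, 9, 0]]
print(mask.tolist())   # [[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]]

# The same mask then weights the per-timestep loss, so padded
# positions contribute nothing to the cross-entropy:
per_step_loss = np.ones_like(mask)   # stand-in for the real per-step losses
masked_mean = (per_step_loss * mask).sum() / mask.sum()
print(masked_mean)  # 1.0
```

This is why the mask is passed both to theano.scan (as a sequence) and to the cost: it tells the model which timesteps are padding.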
So yeah, here we build our final Theano graph. We have symbolic inputs for, well, the input and the mask; we create the LSTM layer and the logistic regression layer; we define our cost; the parameters are the concatenation of the parameters of the logistic regression and the recurrent layer; we take the gradients of the cost with respect to all parameters, so, as I mentioned, it's going to use backprop through time to get the gradient through the scan operation; the update rule, again, is simple SGD, no momentum, nothing, it's something that you could add if you want to play with it; and we compile the function to evaluate the model.
to evaluate the model so here the main
loop is training and we also have
another function that generates one
character at a time given the previous
ones that's why we will declare like
input here and so does that speak
function that get probability
predictions we normalize them because we
are working in float32 and sometimes if
you divide by the sum and RISM then it
doesn't add up to one so we want a
higher precision for just that operation
and then try to generate to generate a
sequence every once in a while so again
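The renormalization trick the talk describes can be sketched in numpy (names illustrative): upcast to float64 and divide by the sum, so the probabilities sum to one closely enough for `numpy.random.multinomial`, which is strict about this.

```python
import numpy as np

# float32 probabilities; after dividing by the sum they may still not
# add up to exactly 1.0 because of limited precision
p32 = np.random.RandomState(0).rand(50).astype('float32')
p32 /= p32.sum()

# redo the normalization in float64 before sampling
p64 = p32.astype('float64')
p64 /= p64.sum()
sample = np.random.RandomState(1).multinomial(1, p64)
print(sample.sum())  # 1
```

Only this one operation needs the higher precision; the rest of the model can stay in float32.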
So again, this is the result of a previous run. So for monitoring, we seed the prediction with "the meaning of life is" and then we let the network generate. So if I try to run it now it's going to take long, but here are some examples that I generated yesterday in the previous run. So it starts with not that much, and it has a couple of unusual characters, I mean, it's not usual to have like one Chinese character in the middle of words, or punctuation in the middle of a word, and so on. But then, as it progresses, you see that it's getting slowly better and better, and "the meaning of life is the dets" and so on. So of course this is not what's going to give you the actual meaning of life, but, yeah, why not. So, yeah, I
interrupted the training at some point, but you can play with it a little bit, and here are some suggestions of things you might want to do: better training algorithms, different nonlinearities inside the LSTM cell, different initializations of the weights, try to generate something else than "the meaning of life is". And yeah, so I hope I could give you a good introduction of what Theano is, what it can be used for, and what you can build on top of it. And if you have any questions later, we have the general users mailing list, we are answering questions on Stack Overflow as well, and we would be happy to have your feedback.
We have time for a few quick questions. Right here; could you go to the mic?

[Audience] Can you just give a quick example of what debugging might look like in Theano? Could you just break something in there and show us what happens and how you figure out what it was?
Actually, yeah, I think I had one. Okay, so let's go to a simpler example. Okay, so I'm just going to go to the logistic regression one, and say for instance that when I initialize my thing, I don't have the right shape. So you can still build the whole symbolic graph, and at the time when you want to actually execute it, then you get an error message that tells you: shape mismatch, x has this number of columns and some rows, but y has only that number of rows; and the apply node that caused the error is that dot product; and it gives the inputs again. And in that case, it tells you it's not really able to tell where it was defined, but if you remove the optimizations then it might. So we can do that: we can go back to where the train function was defined, train_model, the theano.function, and then I'll just say optimizer equals None.
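The same thing can also be set globally through the THEANO_FLAGS environment variable, instead of per-function (the script name here is just a placeholder); with optimizations disabled, Theano keeps the creation stack trace for each node and includes it in error messages:

```shell
# Disable graph optimizations so runtime errors point back to the
# Python line where each node was created (much slower; debug only).
THEANO_FLAGS='optimizer=None,exception_verbosity=high' python train.py
```

This is a config fragment rather than a runnable program; remember to remove the flags afterwards, since the unoptimized graph is considerably slower.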
Sorry, I have to do mode equals theano.Mode(optimizer=None), that's correct, yes. So it's recompiling the function; let's rerun everything. And then the updated error message says: backtrace when the node was created, and it's somewhere in my kernel, and it's on the line where p_y_given_x is defined. So of course we have lots of things in there, but you know that there's a dot product, and it's probably a mismatch between those. So that's one example, and then there are other techniques
example then there are other techniques
that we can use we can have the
breakpoints as I said and so on I don't
have right now tutorial about that but
have some one line and I could point you
to that I have some models I'd like to
distribute and I don't want to require
people to install Python and a bunch of
compilers and so unfortunately at the
time we're pretty intermingled with
Python a lot because all the memory
management during the execution is done
by Python and we use an umpire and
arrays for our intermediate values on
the CPU and the similar structure on the
GPU even though that one might be easier
to convert but yeah all our C code deals
with Python and does the ink ref and
Decker F and so on so that Python
manages the memory so if you want to
distribute that I would suggest like a
docker container something like that
recently even for GPU and video docker
is quite efficient and we don't have any
modest allowance that that we had seen
earlier so it's not ideal and if like
someone has some time and the wheel to
to help us disentangle tno from the
Python runtime it would be awesome but
that's a use project
Okay, let's thank Pascal again, and we reconvene in 55 minutes for the next talk. Have a good lunch.