FTI ITB Morning Lectures - Introduction to Bioinformatics
1Yp67Sywlcw • 2021-03-08
Kind: captions
Language: en
good morning ladies and gentlemen
this morning
we are glad to have dr
himanshu raje from nicholls
state university
usa and he will
share his experience he will let us know
more about
the interesting topic that is
bioinformatics
professor raje you know that
most of us are engineers here
with a little background in biology
bioinformatics is a
growing topic and we should
know about it
but we have very little experience with
it
so we will be glad if you can share with
us
and tell us more about what
bioinformatics is
okay the time is yours thank you
thank you so much for such a nice
introduction let me share my screen with
you all
real quick
okay great so like they said i'm dr
himanshu raje
i'm an assistant professor at nicholls state
university
and thank you for having me to share
some of my knowledge of bioinformatics
with you all
this lecture i have arranged it in such
a way that it's going to be a little
interactive okay so it's not going to be
just
me talking all the time i would much
appreciate
your interaction your responses we are
going to do
several activities on computer of course
um but your participation and your
outcomes of those activities would be
much appreciated they will essentially
help me
um judge whether you're understanding or
not and feel free to ask questions
as we move on through this presentation
i really like
to answer questions during the lecture
as well okay i'll give you time to
ask questions
so if you do have questions don't
hesitate to put them in the text chat window
of zoom and we'll take it from there
also i have a google drive folder
and i'm going to share that link with
you all so the activities that i told
you about
the activities that we are going to do
throughout this lecture
um we will have some files that i have
uploaded onto that folder and we will
use those
so as and when time comes i'll tell you
what
and when we are using that folder on
google drive and i'll share that link
with you all
in the text chat of the zoom
okay bioinformatics is a
fairly new science so put it this way
it evolved after the invention of
computers
much after in fact the invention of
computers because it involves computers
okay so biology has been advancing
for centuries i would say
but computers themselves are a fairly new
invention
from a few decades ago and that is when
the thought
started in people that can we use
computer in
other sciences like chemistry like
biology like
physics can we get help from this
tremendous technology of computers
to do certain tasks and the answer to
that was
yes and people actually started doing
that
but when i was in my bachelor's degree
this
word bioinformatics was completely new
i didn't even know about it it barely
existed
um and of course my bachelor's degree is
from india so i was completely unaware
of bioinformatics at that time
but when i was in my master's degree and
i had pretty much decided to go for
biology at that time that is when we
were actually
hearing or reading in news articles
about bioinformatics but still it was
not taught
at colleges at that time as well in fact
as you guys can probably understand that
misconceptions circulate when the topic
is like
very new or fresh and there were
misconceptions going on
in the community i still remember myself
reading an article
in the newspaper that of course it was a
question mark
future of biology might be at stake
because now computers will take over
everything
maybe computers can do experiments and
what would biologists do
and while reading that news article of
course i was just a master's degree
student at that time i was like
oh boy i decided to take biology as my
career
and is this field in trouble and at that
same time
i remember having discussion with my
parents you know we all have that kind
of a phase at some point of time in
our lives what to do with our career so
my parents and i came across
a
one-week workshop on bioinformatics
in a city in india called chennai
it's really good for education
so my parents told me why don't you fly
over there and why don't you see what
this bioinformatics is about
and see if your career is at stake
see what's the future of biology what
does it look like and i did fly
to chennai for that one week workshop of
bioinformatics
that trip to chennai was kind of
memorable
for several things first thing is it was
my first flight trip
second thing is i got introduced to
bioinformatics that was the first and
foremost important thing
and i made really good friends over
there and the
last but not the least in fact the most
important thing that i learned
from that one week workshop on
bioinformatics at chennai
is that biology
is not in trouble biology still holds
strong
we need biological experiments we need
to be in
lab but we also need
help from this new technology that's
coming up computers
and maybe those computers can actually
help us steer ourselves
in a correct way while doing our
experiments so that was my take home
message
from that one week workshop that i
attended on bioinformatics and
i was kind of intrigued by this science
so although i did not decide to do my
career in bioinformatics i stuck with
molecular cell biology
i always try to keep myself updated with
what's going on
in this science so like i said
we cannot proceed in bioinformatics
without biology
so let's have a little bit of background
on biology okay just what's required
so let me just start off with
important biomolecules there are several
molecules in our cells not just our
cells we have bacterial cells fungal
cells
these cells behave as they behave
and these cells interact using molecules
so it's pretty much
like i always tell my students these
molecules are non-living
but their interaction with each other
makes cells
that are living how does that happen
we really don't have full answer yet and
that's why still
research is going on so let's have a
quick introduction of some of these
important biomolecules that we are going
to deal with in bioinformatics
let's start with dna let's start with a
very
familiar biomolecule deoxyribonucleic
acid
the genetic material for most of the
cells
you name it prokaryote eukaryote
some viruses are different though they
are exceptions
however viruses are not cells so let's
keep viruses aside for a while
and let's just focus on dna the
structure of dna
is double-stranded i'm sure you have
seen a picture
or similar pictures like this um
on several occasions so double stranded
helical the strands are anti-parallel
dna is a long chain okay
people
the monomer of dna the single unit
is a nucleotide and there are four of
them in dna
adenine thymine guanine and cytosine
they are represented by the single
letter abbreviations
essentially coming from the first letter
in their name
so a for adenine t for thymine etc
the chemistry the rule of chemistry here
is
adenine a on one strand of dna pairs
with
thymine t on the other strand of dna
and guanine g on one strand pairs with
cytosine c on the other strand of dna so
keep this in mind
some of you might already know this if
you
know this fine if you don't know well
just have a quick introduction about
this but
notice here there are a couple of points
that we should note from this slide
the structure of dna that double helical
antiparallel strands is pretty much the
same
in any organism you talk about okay
so that's one major biomolecule that we
are going to look at
okay so we have a sequence of dna
on one strand of course it's a stretch
of nucleotides and of course the other
strand also has the same thing
the complementary sequence that's what
we call
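As a small illustration of the pairing rule just described (A with T, G with C), here is a minimal Python sketch that derives the other strand from one strand. The function names and the example sequence are made up for illustration and do not come from the lecture slides.

```python
# Pairing rule from the lecture: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complement read in the opposite direction.

    The two strands are antiparallel, so this is the other strand
    written in its own conventional reading direction.
    """
    return complement(strand)[::-1]


example = "ATGCGTTA"  # hypothetical sequence, for illustration only
print(complement(example))          # TACGCAAT
print(reverse_complement(example))  # TAACGCAT
```

Any letter other than A, T, G or C raises a KeyError here, which is a deliberate simplification for this sketch.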
the second biomolecule that we are going
to talk about
is rna ribonucleic acid
okay notice the very first thing on this
slide is
rather than dna just being a
common structure double helical
structure
rna has three types messenger rna
ribosomal rna and transfer rna
okay so there are three types of rna
another thing
that you should keep in mind is
whichever
rna we are talking about the rna is
single
stranded okay it is not double stranded
like dna so no matter what rna you talk
about
messenger rna single stranded ribosomal
rna
and trna are also single stranded
molecules
which means nothing prevents them
to fold onto themselves and they can
form structures like this
what's shown in this picture of course
not just this they can even form
several other kind of structures okay so
keep that in mind
rna comes in diverse forms because
each and every molecule is single
stranded
now again just to have a quick
introduction
every protein-making
gene
will have its own messenger rna
produced when the cell
expresses that gene so messenger rna is
the molecule
that's going to be produced from every
gene that can make protein
okay so when you think about messenger
rna
put it this way ribosomes are going to
read it in triplets
and they are going to call the corresponding
trnas with
amino acids and make proteins
so messenger rna
we call it messenger because it carries
message from genes
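To make the idea of reading a messenger RNA in triplets concrete, here is a minimal Python sketch. The codon table is deliberately a tiny subset of the standard genetic code (a full table has 64 entries), and the function name and the sample mRNA are assumptions made for illustration only.

```python
# Tiny subset of the standard genetic code; "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GCA": "A",  # alanine
    "UUC": "F",  # phenylalanine
    "GGU": "G",  # glycine
    "AAA": "K",  # lysine
    "UAA": "*",
    "UAG": "*",
    "UGA": "*",
}


def translate(mrna: str) -> str:
    """Read the mRNA three letters at a time and build the amino acid chain."""
    mrna = mrna.upper()
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        # "X" marks a codon that is missing from this deliberately small table.
        amino_acid = CODON_TABLE.get(codon, "X")
        if amino_acid == "*":
            break  # a stop codon ends the protein
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCAUUCUAA"))  # MAF
```

The sketch assumes the sequence already starts at the first codon; finding where to start reading is a separate problem.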
the other two rna molecules ribosomal
rna and
transfer rna they never make proteins
from themselves they just help with
protein formation
for example ribosomal rna
just goes and becomes part of ribosome
okay so along with some other proteins
it just sits in cytoplasm and that's
what ribosome is
it helps in formation of proteins it
helps to read
these messenger rnas and form proteins
but ribosomal
rna molecules are never going to form
proteins
from themselves ribosomes are not going
to read these
same thing applies to trna these are
again helper molecules they help with
protein synthesis
okay so trna will never be
formed into protein of itself just helps
so the genes
that code for ribosomal rna
the genes that make transfer rna they
never make proteins
they get expressed but they just stop
at rna formation and that rna actually
does
perform some action in the cell so
that's
rna of course it is also made up of
nucleotides so the monomer of rna are
also nucleotides
notice that there is no thymine instead
we have
uracil in rna molecules
and of course if you are a biologist you
might know
that trna has some other uncommon
nucleotides in it
but that's not really the part of our
lecture here
my attention is on uracil because that
is the unique nucleotide
in rna and that replaces thymine but
keep in mind
that rna is single stranded that should
be the take-home message and it comes in
three types
it can fold onto itself to assume
several different structures
so let's collect these points
make a note of them and move on
the third but the most diverse
biomolecule is proteins
proteins there are several proteins in
our cell
okay because every protein coding gene
will have its own messenger rna and
there are several of protein making
genes and those mrnas will be read by
ribosomes of course
ribosomes contain ribosomal rna and
transfer rnas are going to come into
picture with
loaded amino acid and we are going to
have proteins
so nonetheless the monomer here
is the amino acid each amino acid is going
to join to
the next amino acid with a peptide bond
and form a chain of amino acids
that is basically a protein however
just a simple chain of amino acid
is a primary structure of protein okay
these amino acids have their single
letter abbreviations just like
nucleotides do
and we are going to see i'm going to
point out to some of these
um single letter abbreviations to you
later on when i show you certain
bioinformatic things
but primary structure is a simple string
of amino acids if i keep
writing single letter abbreviations of
amino acids one after the other
that's a simple primary structure that
is not sufficient people for protein to
work in the cell
i'm sure you guys know that each
molecule
some of you are chemistry majors some of
you are biology majors so you know
that each molecule has its own
three-dimensional
shape and when it assumes that shape
when it forms um when it takes that
shape in the cell
that is when it can perform certain
actions
because that is when it can find its
binding
partners in the cell each protein is
looking
for something to bind to something it
could be an
ion it could be another protein it could
be
maybe sugar something and that structure
of the binding partner of protein should
perfectly fit
into the three-dimensional structure of
protein and that is why
no protein stops at primary structure
there are secondary structures like
alpha helices
beta sheets and even further
those secondary structures are folded in
the cell
to form a tertiary structure for every
single protein
okay so every single protein in the cell
will have its own three-dimensional
structure
now some proteins don't stop even here
some proteins need to attach themselves
to another protein the cell needs to bring
a couple of proteins together
and that's when they can act so they act
together they act in a group
bunch of proteins bound to each other
and so
some proteins not all few proteins have
something called as
quaternary structure so not all proteins
have this some proteins do
but the proteins that do have quaternary
structure it is basically
proteins different proteins attached to
each other and performing
a biological function so again for us
the take home message here is the
monomer of protein is amino acid
it's an amino acid sequence
three-dimensional structure of protein
comes into picture
it is critically important to know okay
so keep these things in mind and let's
move ahead
although i have shown you a eukaryotic
cell
you can see this this is a nucleus of
the cell
even prokaryotic cells have the same
process going on from dna this is where
we essentially start this is where the
genes are
this is the genetic material of every
cell
this houses all the genes now when
a cell
any cell prokaryote or eukaryote
us plants animals bacteria whatever
when the cell decides to activate
a gene it will form an rna molecule
from that gene okay and the process of
going from dna to rna is called as
transcription so when a gene is
activated
that gene is going to be transcribed now
i just told you a few seconds ago
that some genes go all the way down to
proteins so their
rna molecules are messenger rnas and
they will
further be translated with the help of
ribosomes and a protein will form
from them some genes can do that but
what if we are talking about a gene
that just makes ribosomal rna or a
transfer rna
that kind of rna will never ever form
its own protein
but still when a ribosomal rna gene is
activated
we will have transcription of that gene
and we will form
ribosomal rna okay but those genes will
stop here
so as a whole class we can settle on the
thought
that when a gene is activated it is at
least
getting transcribed now if we are
talking about
messenger rnas they will also get
translated
with the help of ribosomes and form
proteins
keep in mind that these three are
extremely diverse molecules and we are
talking about
a stretch of nucleotides double stranded
anti-parallel helical molecule here
with the process of transcription it is
forming a single stranded molecule
of rna okay and i'm going into
technicalities here
okay so keep that in mind rna are single
stranded
formed from a double-stranded molecule
still the monomer is nucleotide but
thymine is replaced by uracil and here
if messenger rnas are translated to
proteins
then this is a whole different
biomolecule in itself the monomer is
amino acids
so cell is doing an incredible thing
here
cell is creating three different kinds
of course this
is what cell has already but cell is
essentially creating
two completely diverse molecules and
there are several proteins so we have
tremendous amount of diversity
here and this whole process
together is called the central dogma
of molecular biology okay because this
holds true for prokaryotes as well as
eukaryotes
so we are going to stick to the central
dogma
and we are going to appreciate this
diversity
in biomolecules that we see and we are
going to try and see how that fits
within the information that we can
collect from these biomolecules
i want your attention for now on this
process
transcription how come
from a double stranded dna molecule we
have a single stranded rna
the process happens kind of like this
here we have
double stranded dna okay this is where
the gene so
in this picture a gene is shown to you
right here
just a cartoon of a gene two strands of
dna
since rna is single stranded
when the cell decides to activate mark my
words okay letter by letter
when cell decides to activate this gene
the two strands of dna are going to
separate
and cell is going to recruit an enzyme
to read
keep your focus on my mouse pointer to
read
just one strand of dna and form
an rna molecule so only
one strand of dna is going to be used
to form an rna molecule makes perfect
sense to us
because rna is single stranded cell is
never going to use both of these dna
strands and form
a double-stranded rna that's not how it
happens rna is
rarely double-stranded if it is
double-stranded then it is the
single-stranded rna folded onto itself
that's it otherwise rna is
single-stranded so only one strand of
dna is used
the strand of dna that cell is going to
use to make rna is called as template
strand
keep that name in your mind somewhere we
are going to
come to this name at least couple of
times today
and the other strand of dna is called as
a
coding strand or the sense strand this
strand of dna
is not used to form rna okay
maybe this figure would do
better justice to the point that i'm
trying to make
so this strand of dna it's shown in red
color to you there are no real colors in
dna this is just for our understanding
but this strand of dna has this sequence
let's say for example
it's not being used to make rna by cell
in fact this bottom strand of dna that's
being used that's a template strand it
acts as a template for rna formation
so as you can see the sequence of rna
is complementary
to the template strand okay if we have t
in the template strand cell will add
a in the rna and of course if there is
adenine in the template strand
rna doesn't have t but it has u instead
we learned that few seconds ago
so the cell will put u but the point to
note here
is the sequence of rna is going to be
complementary
these nucleotides pair to each other to
the template strand of dna
and if you go back a second and look at
this strand of dna the other coding
strand of dna
that strand of dna was also
complementary
to this strand of dna because usually
these two strands of dna bind to each
other
now we have rna which is complementary
to this strand
this strand of dna is also complementary
to the template strand
so the sequence look at the sequence of
rna
the sequence of rna perfectly matches
with the sequence
of the coding strand of dna apart from the t's
being replaced by u's
okay so keep that in mind the sequence
of rna
is exactly the same as the coding strand sequence because
both of these
are complementary to the template strand
and it is the template strand of dna
not the coding strand that is acting as a
template to form rna
and that is why most of the databases
in bioinformatics they will provide you
with this sequence
you will see the coding strand sequence
okay
the sequence that is very similar in
fact almost identical to the rna sequence
so if you are looking at a gene sequence
in the database and if you wonder
hey what would be the rna sequence here
all you have to do
is just replace those t's by u's and
that's your rna sequence
and that is why databases biological
databases
give you the coding strand sequence okay
so keep that in mind
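The relationship between the coding strand, the template strand and the RNA can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the sequences are made up, the function names are hypothetical, and the template strand is written in the same left-to-right order as the coding strand it pairs with.

```python
def rna_from_coding(coding_strand: str) -> str:
    """Database-style shortcut: take the coding strand and replace T with U."""
    return coding_strand.upper().replace("T", "U")


# What the cell pairs against the template strand while making RNA:
# A in the template gives U in the RNA (RNA has no T).
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def rna_from_template(template_strand: str) -> str:
    """Build the RNA as the complement of the template strand."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_strand.upper())


coding = "ATGGCATTC"    # hypothetical coding (sense) strand
template = "TACCGTAAG"  # its complement, position by position

print(rna_from_coding(coding))      # AUGGCAUUC
print(rna_from_template(template))  # AUGGCAUUC
```

Both routes give the same RNA, which is why a database only needs to show the coding strand.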
and again some of these things might
sound
um like you know um foreign to you right
now
but when we actually look at those
biological databases trust me it will
all make sense
as long as you're trying to keep up with
the pace
so coding strand is the sequence that we
see
all right when
biological inventions were taking place
when scientists were discovering
how are these genes expressed and a lot
of expression data
um was essentially piling up in
scientific community
when human genome project was going on
people had questions in their mind
humans have lots of genes human cells
have tremendous amount of genes in them
what is their sequence what is the dna
sequence of each gene
and there was a worldwide collaborative
project
to sequence the entire human genome
it generated tremendous amount of data
now where to keep that data
we needed some help to preserve that
data
we cannot just preserve that data on
paper if we do that it will just remain
in one lab
or maybe at one place the
scientific community wanted
worldwide access to that data
it wanted outreach
everybody in the world should have access to that
data and so where to store
that data that is when people looked
into some
other sciences like computer science can
we get help from computers
maybe to store this data is computer
science advanced enough
now luckily fortunately even computers
were evolving around the same time and
the answer came to be
yes yes we can get help from computers
and store this data furthermore
not only store it we can even
try and analyze this data to draw some
meaningful conclusions
we all do experiments people in lab at
some point of time
even otherwise our life is full of
experiments essentially
in no matter in what science you talk
about even in other subjects people do
some types of experiments we we know two
things
by doing an experiment no matter whether
it's chemistry whether it's physics
whether it's biology
experiment takes time
and sometimes the reagents that we use
for these experiments are costly
they take money now if
we and typically we don't know the
outcome of experiment we are doing
research
when we start off with an experiment we
don't know what's going to be an outcome
we can kind of
predict with our hypotheses but
we don't even know whether we are going
to be heading into right direction or
not typically
and that is where decades ago
people were trying to get any help
possible from computer science
can we at least virtually predict the
outcome of an experiment
can we at least know if we are heading
into right direction
in order for us to save time and money
there is no point in spending five years
doing an experiment
only to realize that i was chasing
shadows
if i can get periodic help from
computers computer will not be doing any
experiment for me
however i am going to just check with
computer maybe plan out my experiment
on a computer we call it an in silico
experiment the experiments that we do
with animals are in vivo
the experiments that we do in a test tube
are in vitro and the experiments that we
do with computers are in silico
because they have silicon chips so these
are three different words
that we need to kind of keep somewhere
in our mind
but can we do some of those in silico
experiments
and periodically judge maybe on a
monthly basis maybe on bi-monthly basis
just to see
if our experiments are going in the
right direction or not if we can do that
we can modify our hypothesis and always
steer
ourselves in the right direction and that's
what i'm going to focus
my lecture on today okay i'm going to
introduce you
again this is just introductory
bioinformatics so i'm just going to
introduce you
to some pre-existing tools in
bioinformatics how can we use those
in our day-to-day experiments day-to-day
biological experiments some of those
tools you can even use in chemistry
or you can even use in bio process
so it's going to be interesting some of
you are all
might already be familiar with some of
those tools so if you are
that's fantastic you will be able to do
those activities
very quickly if you are not um you will
learn those so that's going to be
knowledge to you all
okay so stay tuned some interesting
stuff is going to come to you
the point here on this slide that i want
to make before i leave this slide
is there is tremendous amount of data
in biology that is being generated okay
and we are actually going to talk about
what kind of data that we are talking
about
well first thing is right in front of us
the nucleic acid sequence
so let's see what type of data we can
gather
in biology shall we and this is where
informatics comes into picture
wherever we have data we have
information
and in this it is the information in
context of biology
and this combination of
biology and
information technology i.t or computers
is essentially what bioinformatics is
all about
and this led to starting of a whole new
field nowadays people do careers in
bioinformatics there are
majors named as bioinformatics in
colleges
because this has tremendous potential
keep in mind though
we can never ever do any earth shaking
discovery
with bioinformatics i mean just in
bioinformatics
we need to do biological experiments in
order to invent new things
we can use bioinformatics we can get
help from computers
only to assist us with our biological
experiments so that is one thing
to bear in our mind real well and now
it's time since we are now introduced to
informatics
it's time to look into what kind of data
we can collect in biology
and we are actually going to make this
slide together okay
as you can see this slide is almost
blank and it has those three
familiar biomolecules with us
so i'm going to um stop my slideshow for
a while
and i'm back to that text box that i
have there
with biomolecule dna what kind of data
can we get
and we are going to
complete this slide
together
so again i i would much appreciate your
input
as i complete this slide i'm going to
use dna
as my bio molecule okay i'm going to
complete dna
but you guys are going to help me with
rna and proteins
so let's start with dna of course we can
have
the nucleotide sequence
that is data
so nucleotide
sequence could be a data for dna
the structure of dna is pretty much the
same
in any organism we talk about so i
wouldn't put structure there is
barely any diversity there so i wouldn't
put it
as a diversified data for dna okay
uh however nucleotide sequence
definitely yes
how about this there are four different
types of nucleotides in dna a t
g and c sometimes it's important to know
how many a's how many adenines are there
in dna
how many thymines cytosines or guanines
so
percentage of each nucleotide
that could be some meaningful
information
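Counting how much of each nucleotide a DNA sequence contains is easy to sketch in Python. The function name and the example sequence below are assumptions made for illustration; they are not part of the lecture material.

```python
from collections import Counter


def nucleotide_percentages(sequence: str) -> dict:
    """Return the percentage of A, T, G and C in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    # Counter returns 0 for a base that never appears in the sequence.
    return {base: 100.0 * counts[base] / len(sequence) for base in "ATGC"}


example = "ATGCGTTA"  # hypothetical sequence, for illustration only
print(nucleotide_percentages(example))
# {'A': 25.0, 'T': 37.5, 'G': 25.0, 'C': 12.5}
```

Adding the G and C values together gives the GC content of the sequence, 37.5 percent in this example.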
the other meaningful information here
would be
if i have one dna molecule sequence
how similar that is with the other dna
molecule sequence for example
let's talk about us let's talk about
humans we have
several genes in our body i can take a
common example hemoglobin
it's the protein that carries oxygen in
us
of course it's a protein which is coming
from its own gene
so gene of hemoglobin there
there are several globins in our cells
but that gene
how similar is that gene in its
nucleotide sequence with
mouse hemoglobin if you have that kind
of a question
you need to first obtain the human
hemoglobin gene sequence
then obtain the mouse hemoglobin gene
sequence and compare
both of them with each other there is a
scientific word for it
there is a bioinformatic word for it you
have to align
those sequences with each other so
sequence
alignment that could be
a form of data for dna okay
if you can think of something else feel
free to put in the text chat window
okay as we speak so sequence alignment
or percent homology these are some words
that we should keep in mind
between
several
dna molecules
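Once two sequences have been aligned, a similarity percentage can be computed by comparing them position by position. The Python sketch below assumes the two sequences are already aligned to the same length, with "-" marking a gap; producing that alignment is the job of a dedicated alignment program and is not shown. The function name, the example sequences and the choice to divide by the full alignment length are all assumptions made for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns where both sequences have the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"  # a gap never counts as a match
    )
    return 100.0 * matches / len(aligned_a)


# Two hypothetical, already aligned sequences.
seq_one = "ATGCGT-A"
seq_two = "ATGAGTTA"
print(percent_identity(seq_one, seq_two))  # 75.0
```

Six of the eight columns match here: one column is a mismatch (C versus A) and one is a gap.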
okay can somebody tell me what kind of
data can we have
for rna we can always go back to dna if
something strikes to us
rna unlike dna
has several types we just learned about
that
so if you can put your thoughts in the
text chat window of zoom
i would much appreciate that
what kind of data can we have for rna of
course nucleotide sequence
any other thoughts
types of rna i love that
yes types of rna for sure
so let's put that right here there are
three types of rna
if i show you just a nucleotide sequence
of rna
i'm not telling you much here you might
ask me is this mrna
is this rrna or trna so type
of rna fantastic
any other thing that you can think of
rna is single stranded so i told you
some peculiarities about rna
rna structure fantastic we are all
learning together
structure
of
rna in parentheses i'm
going to write
folding pattern
transcriptomics yes
so we can have a set of rna molecules in
a cell
all of those rna molecules have
definitely come
from expression transcription of certain
genes
so if we have a question hey here is a
cell
how many different rna molecules are
there in the cell and what is their
sequence
that is transcriptome just like genome
genome is a set of genes in our cells
transcriptome is a set of rna
in our cells so set of
rna in a cell
let's stick to simple english in
parentheses
transcriptome fantastic people
this is going well any other things you
can think of
for rna
translation start point and end points
that holds true for mrna for sure
yes messenger rna the rna molecule that
forms proteins
it has to have some starting point for
protein synthesis and some ending point
which means it has to have a start codon
somewhere okay and it has to have a stop
codon
which tells ribosomes where to start
making protein and where to stop making
that protein
so that whole sequence of rna the whole
sequence of
messenger rna now i'm fine tuning my
words
i'm building upon this answer from start
to stop codon
is called as o r f
open reading frame the whole sequence of
messenger rna from start to stop codon
it is imperative to predict open reading
frames for messenger rnas fantastic yes
what else
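The open reading frame idea described above, the stretch of messenger RNA from a start codon to a stop codon, can be sketched in Python as follows. This is a deliberately simple version under stated assumptions: it scans one strand only, takes the first AUG that is followed by an in-frame stop codon, and uses a made-up example sequence. The function name is hypothetical, and real ORF prediction tools do considerably more.

```python
START_CODON = "AUG"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_first_orf(mrna: str):
    """Return the first stretch from AUG to an in-frame stop codon, or None."""
    mrna = mrna.upper()
    start = mrna.find(START_CODON)
    while start != -1:
        # Walk forward in steps of three from this start codon.
        for i in range(start, len(mrna) - 2, 3):
            if mrna[i:i + 3] in STOP_CODONS:
                return mrna[start:i + 3]
        # No in-frame stop for this AUG, so try the next AUG.
        start = mrna.find(START_CODON, start + 1)
    return None


example = "GGAUGGCAUUCUAAGG"  # hypothetical mRNA, for illustration only
print(find_first_orf(example))  # AUGGCAUUCUAA
```

The returned stretch includes both the start codon and the stop codon, matching the description of an ORF as running from start to stop.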
which mrna is expressed in certain
conditions
yes conditional expression
that kind of goes with transcriptomics
but yes conditional expression of
rna is kind of important
some genes are expressed only under
stressed conditions
so what are those genes it's important
to know that
okay so definitely nucleotide sequence
definitely the type
definitely the structure of rna
definitely
the um conditional expression the
transcriptome
that is good any other thing that you
can think of
for rna otherwise we will go to proteins
we can always go back
what about proteins people
what type of data can we have here
protein sequence and structure yes so
let's say
amino acid
sequence shall we that's the primary
structure of every protein simple amino
acid sequence by
their single letter abbreviation
structure
three-dimensional structure of protein
is critically important for its function
and so it is
tremendously important to be able to
predict it
if i just give you a simple amino acid
sequence of a protein
my question is will you be able to at
least predict
solving a full three-dimensional
structure that takes time
it takes involved and costly
techniques such as x-ray crystallography
or cryo-electron microscopy etc
before going to that can you at least
predict
has any other organism been shown to
have a similar protein
of the 3d structure so yes we we can
definitely look at the 3d structure
of the protein what else
what else can we look into protein
aha i like that protein function
thank you people i told you proteins
are the most diverse biomolecules in the
cell
and they come with variety of functions
functions are their own
so what is the function of the protein
of our interest
if we have just the sequence of the
protein
if we can predict the three-dimensional
structure of the protein
or just the sequence matches let's say
that we are looking into a human protein
human hemoglobin let's say for example
hemoglobin carries oxygen
let's say that we are looking into human
hemoglobin protein sequence just the
amino acid sequence
if we can somehow
match that sequence with all of the
plant proteins that are known
and if we do see some similarity with
that maybe that protein in plant can also
carry oxygen
maybe just because maybe this is just
prediction and that's what
bioinformatics helps us with
it helps us to do meaningful
to to generate meaningful predictions
and we can test those predictions
further more with real experiments
so yes the function of protein
essentially
any other
factor that might matter into biological
data for protein how about this
um conditional again
um
formation of protein
or synthesis
how about that just like rna some
proteins are made
only under certain conditions like most
of the antibodies are made
only when we have infection okay there
are some antibodies that are made even
without
but there are some proteins that are
made only under stress conditions
there are some proteins that are always
being made
homology fantastic yes
amino acid sequence
homology how
similar one protein is to the other
protein
and that kind of thing can
be in rna as well
wherever you have some kind of sequence
we can definitely
have sequence homology sequence
alignment
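A minimal sketch of what "how similar one protein is to the other" can mean in practice: percent identity between two sequences that are already aligned. The function name and the toy peptides are hypothetical, chosen only for illustration; real tools score alignments in more sophisticated ways.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared alignment columns where both residues are identical.

    Both inputs must already be aligned (same length, '-' marks a gap).
    Columns where both sequences have a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # nothing to compare in a gap-gap column
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy (made-up) peptides in single letter amino acid code.
print(percent_identity("MVHLT", "MVHLS"))
print(percent_identity("MV-LT", "MVHLT"))
```

A high percent identity is only a hint of homology, in the same spirit as the "maybe" in the hemoglobin example: it is a prediction to be tested further.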
any other thing you can think of this is
this is going tremendously well people
thank you
thank you for your feedback any other
things
aha protein interactions
that relates to what i just told you few
minutes ago
protein wants to find binding partners
so what other molecules does it bind to
how does it interact in the cell is it
now that brings us to another point
cellular
location of
the protein that's also important
some proteins are membrane proteins some
proteins
just remain within the cell some
proteins are secreted out of the cell
okay so where does this protein go that
is also
important any other
thing that you can think of
how about
rna splicing yes
fantastic rna
splicing
in parentheses we can write
alternative splicing
in eukaryotes
i deliberately avoided this because not
everyone knows this is kind of a
complicated topic alternative splicing
but i'm glad somebody mentioned this
fantastic
any other things in protein in fact i
would like to add something in dna now
how about
name of
the gene if we are really looking into
a gene sequence because there could be
several other
dna several other stretches of
nucleotides that may or may not be a
gene
if we are looking into the gene then
it's good to have the name of that gene
its location
remember dna is the genetic material
chromosomal location
where exactly on chromosome that gene is
present
are there any diseases associated with
that gene
what if that gene sequence might have
some mutations in some people
if they do have mutations then what
diseases could they have
so diseases
or disorders
associated with
specific genes
you know what people every single point
that we are putting on this slide
there is a database out there for that
there are databases out there
that correlate the name of the gene with
diseases
there are databases out there that have
um
homology between um several genes
between different organisms we are going
to touch upon some of those
there are databases out there to tell
you protein three-dimensional structure
there are databases out there to analyze
the whole transcriptome
no matter what organism you talk about
so things have advanced
quite far we are going to just touch
upon those databases and just some of
the widely used databases that's the
whole point of today's lecture
okay so this is going well let's move
on let's keep this slide this is a slide
in progress always
okay so now i think it's time for me to
introduce you to
two common biological databases
that house gene sequences
one of them was originated in america
the other one tells us essentially the
same thing but it's originated in europe
okay
so american one is ncbi genbank
national center for biotechnology
information and the european one
is embl-bank these two
databases house gene sequences
of course there are protein sequences
that are housed
by these two databases they essentially
are
different versions american and european
version of the same data so there is
redundancy
there is correlation between these two
databases
and there is interrelations in fact
for today we will stick with the
american version of the database just
because it is more user friendly
however we are going to use some cool
tools
from this embl-bank okay and
there are many more databases that i
have not even listed on this slide
there are some specific databases what
if
genbank houses the gene sequences from
all organisms that are sequenced like
humans
mice fruit flies worms
plants but what if we just want to look
into fruit fly sequences then there is
a database for that flybase what if we
want to look at just
plant gene sequences then there are some
small databases just for that so there
are some specific ones
but as of now we are going to stick to
genbank and again i'm going to just
stop slideshow for a while and i'm going
to share my browser screen with you
i'm going to show you one
critical thing to do how to search for a
gene sequence
in a genbank okay so let me stop sharing
my screen
and let me be back with my web browser
here we go
what i'm going to do is in the search
window
i'm just going to type ncbi
genbank
that is genbank notice
that there are several sister databases
in genbank
there are nucleotide sequence databases
in which case you will select a
nucleotide
there are genome databases there are
gene expression omnibus geo database
let's start off with gene
this also houses some free textbooks by
the way people there are book databases
as well in there
let's start with gene and you can pretty
much type your favorite organism and the
name of your favorite gene in here
my favorite gene in humans
is actin so let's search for human
actb beta actin
just an example later on we will search
for some other genes as well i'm just
going to show you
how to search for a gene and this is
where you will get a gene card for that
gene
beta actin make sure that we are looking
into human gene
click on that that will take you
to an ncbi page for that
gene the name of the gene right up front
there is some summary you can get some
meaningful information about this gene
okay um you can also get some
information about the expression pattern
of this gene
in which human tissues is this gene
expressed well it tells you ubiquitous
expression
several tissues first of all it is a
protein coding gene
keep going down i'm just going to
quickly scroll if you are a biologist
you will also actually appreciate
this little
interactive browser it tells you a
cartoon of the structure of that gene
and tells you some meaningful
information about how many exons
how many protein making sequences are
there how many introns are there
and um where do they start where do they
end
so as you hover your mouse pointer on
that
it tells you that information keep going
down
if the experiment is done by some people
some scientists out there in the lab
this graph will pop
up and this is the expression data
i like actin gene because it's expressed
in
every single human tissue in every
single pretty much human cell
and that's what this graph tells you
okay you can also change the type of
experiment
over here from the drop down menu and
see several other types of graphs
but again that's reserved for some
advanced things let's move on what other
things
does this webpage show you of course
several
references the people names of the
people that work
on um this data associated conditions
are there any mutations associated with
this gene
if so what kind of diseases or disorders
or syndromes
can humans get you can have that
information right there
okay so without any without going
anywhere on to google
just in this database itself you can do
this for any gene
and there are several other things down
there
what mutations you can have what kind of
interactions does this protein do
to which other proteins it can interact
maybe it can interact with some viral
proteins
so you will find all kinds of
interactions and the associated
research studies listed right here on
this web page
the most important thing that i want to
point out is
what if i want to know this gene
sequence in that case you have to go up
right here where you see this
interactive browser
if you are a biologist again feel free
to look around
but click on this link genbank
that will take you to the sequence of
that gene
and that is the sequence okay we are
getting to that
page it doesn't tell you that it's human
actin
it just tells you that human actin gene
is on human chromosome
7 and of course the earlier page also
had that information
this number though is a unique database
id
for human actin gene so if you are a
researcher working on this gene you
better note down this id
so that you can refer to this same gene
sequence in future you can almost just
put this number
in the search window and you will be
coming directly to this page
this also tells you how long is the
nucleotide sequence
so about um 3,454 base pairs
so about 3,400 base pairs it's a linear
dna
keep scrolling down it tells you the
names of people who submitted this
sequence
make a note of this section features
it tells you that
of course it's genomic dna all the way
starting from the first nucleotide to
the last nucleotide
it also tells you that it's a gene the
name of that gene is actb
and all of that sequence starting from
first to the last nucleotide
is the same gene it also tells you
the mrna sequence for that gene okay
it asks you to join several nucleotides
to make an mrna
now you might be thinking oh do i have
to manually join these nucleotides no no
no
look look under the subheading
under this mrna transcript id just click
on that
and it will take you to just
the mrna sequence for that gene
there you go mrna
if you scroll down that is just the mrna
sequence of course replace
t's by u's okay
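The two steps just described (joining the listed nucleotide ranges into one transcript, then replacing T with U) can be sketched as below. This is an illustration under assumptions: the function names, the toy sequence and the coordinates are made up, and the ranges are treated as 1-based and inclusive, the way positions appear in a genbank features table.

```python
from __future__ import annotations


def join_segments(dna: str, segments: list[tuple[int, int]]) -> str:
    """Concatenate 1-based, inclusive (start, end) ranges of a DNA string."""
    return "".join(dna[start - 1:end] for start, end in segments)


def to_rna(dna: str) -> str:
    """Write a DNA sequence in RNA letters by replacing T with U."""
    return dna.upper().replace("T", "U")


# Toy (made-up) gene: two ranges are joined, skipping what lies between them.
gene = "AAATGCCCGGGTTTAA"
spliced = join_segments(gene, [(3, 5), (9, 12)])
print(spliced)
print(to_rna(spliced))
```

As the lecture says, nobody needs to do this by hand for a real gene, since the transcript link already gives the joined sequence.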
it also gives you the coding
dna sequence just the exons
of the gene and what would the protein
sequence look like
so this is the protein sequence these
are single letter abbreviations
of the amino acids so every single
information that you need to know
is right there on this page if you want
to know just the coding sequence
separately just click on this external
link
ccds and that will take you to just
this sequence it asks you to manually
join these nucleotides you don't have to
just click on that link and then we have
a full gene sequence starting from first
nucleotide
to the very last nucleotide right here
now notice that this sequence has
numbers
okay nucleotide number of positions what
if you want to work
with this sequence what if you want to
do some analysis with the sequence and
you want to get rid of these numbers
you can do a simple trick just copy this
whole thing
copy and go to this website
this is a fantastic sequence
manipulation
suite online software developed by
university of alberta in canada
it has several free tools
for us to play around okay we are just
going to look at
some of these for example filter dna
whatever other
non-dna characters you might have what
if somebody gives you a word file of dna
sequence
with some other characters you don't
want those characters because those
characters will throw off any
bioinformatics software
in that case just run your sequence
through this
i mean it gives you an example sequence
just clear that off
and paste our sequence in here
and hit submit what it gives you
is the sequence without numbers and now
you can play around with this sequence
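For readers who prefer to do the same cleanup offline, here is a rough sketch of the idea behind a filter dna step. It is not the sequence manipulation suite's own code; the function name is hypothetical and the choice to keep only A, C, G, T and N is an assumption.

```python
import re


def filter_dna(raw: str) -> str:
    """Drop position numbers, spaces, line breaks and any other non-DNA characters.

    Only the letters A, C, G, T and N are kept, in upper case.
    """
    return re.sub(r"[^ACGTN]", "", raw.upper())


# Toy input laid out like a numbered sequence copied from a web page.
copied = """
        1 gcgtccgccc cgcgagcaca
       21 gagcctcgcc
"""
print(filter_dna(copied))
```

Paste in only the sequence lines: stray words would be filtered letter by letter, so a word like "origin" would leave a G and an N behind.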
okay it is also in some kind of a format
which i'll tell you what that format is
i think i'm not really sharing this
screen with you
so let me go
back and share my whole
desktop
i can see your screen before oh you
could
you could see the output i saw your cut
and paste
oh okay um now this is the output here
we go
so now we don't have numbers we have
just the sequence
okay filter dna sequence it gives a name
to it
and i want you guys to notice this sign
the greater than sign that starts off
with the sequence that greater than sign
signifies something
it's a format of a sequence that this
software online software converts our
sequence to it's called fasta format
and that's what we are going to come to
now so let me unshare my screen
and let me put the powerpoint back up
since we had some introduction about
genes right here
the fasta format
a lot of bioinformatics software doesn't
accept
just the dna sequence it has to be
in this fasta format
we call it fasta and what it means is
whatever sequence you are looking into
put this greater than sign
in front of it and that helps computer
to know
that this is where computer
should start reading the sequence and
now you can list
multiple sequences one after the other
as long as you start every sequence with
a greater than sign
you can even you are allowed to put
certain name for that sequence so
greater than sign
whatever unique name you want to put for
this sequence you can even type
in simple english like human beta actin
something like that
and um you know mouse beta actin
something like that
and don't be under impression that you
have to have a limited number of
nucleotides now
you can go on to like thousands of
nucleotides here
and then put another greater than sign
and put your second sequence
put another greater than sign put your
third sequence
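To make the layout concrete, here is a small sketch of how a program might read such a file: each line starting with a greater than sign is taken as the name of a new record, and the lines that follow are its sequence. The function name and the toy sequences are hypothetical.

```python
from __future__ import annotations


def parse_fasta(text: str) -> dict[str, str]:
    """Read FASTA text into a {name: sequence} dictionary."""
    parts: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # text after '>' is the record name
            parts[name] = []
        elif name is not None:
            parts[name].append(line)  # a sequence may span many lines
    return {n: "".join(chunks) for n, chunks in parts.items()}


# Toy (made-up) sequences, one record after the other.
example = """>human beta actin
ATGGATGATG
ATATCGCC
>mouse beta actin
ATGGATGACG
"""
for record_name, sequence in parse_fasta(example).items():
    print(record_name, len(sequence))
```

Because the name line marks where each record begins, the sequence itself can run to thousands of nucleotides over as many lines as needed.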
so that is the fasta format
of a sequence now how can we get that
well fortunately it is easy
for us to get any sequence on genbank
in fasta format genbank has made that
real easy um let me go
back to the browser that we were working
with
and that will be more clear to you i'll
show you right there how you can go to
fasta format of any sequence
from genbank
here we go we are back to that beta
actin gene sequence of humans
on chromosome 7. there is a link here
for every single genbank entry fasta
just click on that
and you will get the whole sequence into
fasta format
there you go you see that familiar
greater than sign you see some name so
all you can do is just copy paste
this sequence into any kind of
bioinformatics software
that you want to use it with you can do
one more thing
if you want a standalone file standalone
fasta file from this sequence all you
got to do is just click send to
file and select format fasta and say
create
file this will actually download
a fasta file of that sequence on your
computer
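The same download can also be scripted. The sketch below is not shown in the lecture: it assumes ncbi's public e-utilities efetch service, and the accession used in the comment (NC_045512.2, the sars-cov-2 reference genome) is only an example, not the record on screen.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession: str) -> str:
    """Build an efetch URL that returns one nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def download_fasta(accession: str, path: str) -> None:
    """Save the FASTA record for an accession to a local file (needs internet)."""
    with urlopen(fasta_url(accession)) as response:
        data = response.read()
    with open(path, "wb") as handle:
        handle.write(data)


print(fasta_url("NC_045512.2"))
# To actually download, uncomment the next line:
# download_fasta("NC_045512.2", "sars_cov_2.fasta")
```

For a handful of sequences the send to file button is just as good; a script mainly helps when many records are needed.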
if you have some software installed on
your computer you can use this
feature okay um
there are several other tools that you
can use in fact you can
sign into ncbi using one of your google
accounts and that can
save you can have your favorite searches
saved you can play around with this ncbi
it's all free
and in open domain if you want to search
something within this sequence
there is a feature in ncbi to do that
all you have to do is go here find
within this sequence
okay so the most important thing is now
you know
how to obtain fasta sequences
of any dna sequence from genbank
we are going to use that skill and i'm
going to share my screen back with you
and give you the very first assignment
okay and that's going to be kind of
interesting so let's go back to my
powerpoint we are going to play around a
little
now with these dna sequences it's time
for that
here we have i have assembled it's not
me i have just put together
these several coronavirus whole genome
sequences
here we have the sars-cov-2 right at
the bottom
okay that's causing covid we
are living with a pandemic these days
a similar one to that is sars-cov of
course there is
one middle eastern variant and there are
several other coronaviruses
and these links will take you to their
full
viral genome sequences on genbank
your job is to obtain them in fasta
format
and then go to this link to align them
with each other and we are going to
actually do a multiple sequence
alignment the software will do it for us
to see which of these viral sequences
are similar to each other
and which of them differ from each other
so let me go to this link real quick
and show you how it kind of looks like
um now again i might have
lost that shared screen but i will
share my entire desktop with you all
so you can see everything in there okay
here we have
the clustal omega web server i'll just
align sample sequences for you
we are aligning dna we just load example
sequences in there notice that they are
in fasta format so when you
copy and paste your whole coronavirus
genomes make sure you paste them with the
fasta format next to each other
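A small sketch of what "next to each other" looks like as text: every genome gets its own greater than sign line, followed by its sequence, and the records are stacked into one block ready to paste. The function name, record names and toy sequences are hypothetical, and the 60 character line width is only a common convention.

```python
from __future__ import annotations


def to_multi_fasta(records: dict[str, str], width: int = 60) -> str:
    """Combine named sequences into one multi-record FASTA text block."""
    blocks = []
    for name, seq in records.items():
        lines = [">" + name]
        # wrap the sequence so that no line is longer than `width`
        lines += [seq[i:i + width] for i in range(0, len(seq), width)]
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"


# Toy (made-up) sequences standing in for whole viral genomes.
print(to_multi_fasta({"virus_a": "ATGCATGCATGC", "virus_b": "GGCC"}, width=8))
```

Giving each record a short, distinct name helps later, because those names are what label the alignment and the guide tree.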
okay other than that leave the default
parameters
just like that and just hit submit
and let the server do its job it's going
to take some time to run
and it gives you the alignment when it
gives you the alignment the first thing
you should do is go to this guide
tree and that tells you which two
sequences are highly similar to each
other
and the third one is a little distant so
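To give a feel for what "highly similar" versus "a little distant" means, here is a deliberately simple sketch that compares sequences by the short words (k-mers) they share and reports the closest pair. It is not the method the web server uses to build its guide tree; the function names, the choice of k and the toy sequences are all assumptions.

```python
from __future__ import annotations

from itertools import combinations


def kmers(seq: str, k: int = 3) -> set[str]:
    """All distinct substrings of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """0.0 when two sequences share all their k-mers, 1.0 when they share none."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def closest_pair(seqs: dict[str, str], k: int = 3) -> tuple[str, str]:
    """Names of the two sequences with the smallest k-mer distance."""
    return min(
        combinations(sorted(seqs), 2),
        key=lambda pair: jaccard_distance(seqs[pair[0]], seqs[pair[1]], k),
    )


# Toy (made-up) sequences: a and b are nearly identical, c is unrelated.
toy = {"a": "ATGCATGCAT", "b": "ATGCATGCAA", "c": "GGGGCCCCGG"}
print(closest_pair(toy))
```

On a real guide tree the same idea shows up visually: the two closest sequences sit on neighboring branches and the distant one joins further away.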
people start doing this
start gathering those sequences from my
powerpoint slide i'm actually going to
share
um the google drive folder link with you
now it's time to do that
and i'll give you about 5-10 minutes for
this activity and we will further move
on to the next one okay
so let me stop sharing
and go back to my powerpoint
so that you can see
there you go if you can click on these
links you can get those sequences from
right here
otherwise i'm gonna show you
and i'm gonna share that link of google
drive folder with you the same text
you will find as activity one but start
doing this
and if you can get to that cladogram the
guide tree click on guide tree you
can get that
then screenshot it and maybe post it in
the same folder i have given you
edit access for that i'm posting that
link
um in a few seconds
okay here we go
here is the link
to the google drive folder
so this is your time start working on
that alignment
thank you
that
if you guys can get to um
the guide tree cladogram post it either
in zoom text chat
or post it into our google drive folder
right there
as long as we have a couple of responses
we can
safely move on to next activity but i'm
going to giv