FTI ITB Morning Lectures - Introduction to Bioinformatics
1Yp67Sywlcw • 2021-03-08
Good morning, ladies and gentlemen. This morning we are glad to have Dr. Himanshu Raje from Nicholls State University, USA. He will share his experience and tell us more about an interesting topic: bioinformatics. Professor Raje, most of us here are engineers with only a little background in biology. Bioinformatics is a growing topic and we should know about it, but we have very little experience with it, so we will be glad if you can tell us more about what bioinformatics is. The time is yours. Thank you.

Thank you so much for such a nice introduction. Let me share my screen with you all real quick. Okay, great. Like they said, I'm Dr. Himanshu Raje, an assistant professor at Nicholls State University, and thank you for having me to share some of my knowledge of bioinformatics with you all. I have arranged this lecture so that it is a little interactive; it's not going to be just me talking all the time. I would much appreciate your interaction and your responses. We are going to do several activities on the computer, and your participation and your outcomes from those activities will help me judge whether you are following. Feel free to ask questions as we move through the presentation; I really like to answer questions during the lecture as well. If you do have questions, don't hesitate to put them in the text chat window of Zoom and we'll take it from there. Also, I have a Google Drive folder, and I'm going to share that link with you all. For the activities that we are going to do throughout this lecture, we will use some files that I have uploaded to that folder. So, as and when the time comes,
I'll tell you what and when we are using that folder on Google Drive, and I'll share that link with you all in the Zoom text chat.

Bioinformatics is a fairly new science. Put it this way: it evolved after the invention of computers, well after in fact, because it depends on computers. Biology has been advancing for centuries, I would say, but computers themselves are a fairly recent invention, only some decades old. That is when people started to wonder: can we use computers in other sciences, like chemistry, biology, or physics? Can we get help from this tremendous technology to do certain tasks? The answer was yes, and people actually started doing that.

But when I was doing my bachelor's degree, the word bioinformatics was completely new. I didn't even know about it; it barely existed. My bachelor's degree is from India, so I was completely unaware of bioinformatics at that time. When I was in my master's degree, and had pretty much decided to go for biology, that is when we were actually hearing or reading about bioinformatics in news articles, but it still was not taught at colleges. As you can probably understand, misconceptions circulate when a topic is very new, and there were misconceptions going around in the community. I still remember reading a newspaper article that asked, with a question mark of course, whether the future of biology might be at stake, because now computers would take over everything: maybe computers can do the experiments, and then what would biologists do? Reading that article as just a master's student, I was like, oh boy, I decided to take biology as my career, and is this field in trouble? At that same time I remember having a discussion with my parents. You know, we all have that kind of phase at some point in our lives: what to do with our career. So
my parents and I came across a one-week workshop on bioinformatics in Chennai, a city in India that is really good for education. My parents told me, why don't you fly over there and see what this bioinformatics is about, see if your career is at stake, and see what the future of biology looks like? And I did fly to Chennai for that one-week workshop.

That trip to Chennai was memorable for several things. First, it was my first flight. Second, and foremost in importance, I got introduced to bioinformatics. I also made really good friends over there. And last but not least, in fact the most important thing I learned from that workshop, is that biology is not in trouble. Biology still holds strong. We need biological experiments, we need to be in the lab, but we also need help from this new technology that's coming up, computers, and maybe those computers can help us steer ourselves in the right direction while doing our experiments. That was my take-home message from that workshop, and I was intrigued by this science. Although I did not decide to make my career in bioinformatics (I stuck with molecular cell biology), I always try to keep myself updated with what's going on in this science.

Like I said, we cannot proceed in bioinformatics without biology, so let's have a little bit of background on biology, just what's required. Let me start off with the important biomolecules. There are several molecules in our cells, and not just our cells: bacterial cells, fungal cells. These cells behave as they do and interact using molecules. As I always tell my students, these molecules are non-living, but their interaction with each other makes cells that are living. How does that happen? We really don't have a full answer yet, and that's
why research is still going on. So let's have a quick introduction to some of the important biomolecules that we are going to deal with in bioinformatics.

Let's start with a very familiar biomolecule, DNA: deoxyribonucleic acid, the genetic material of cells, you name it, prokaryote or eukaryote. Some viruses are different; they are exceptions. However, viruses are not cells, so let's keep viruses aside for a while and just focus on DNA. The structure of DNA is double-stranded; I'm sure you have seen a picture like this on several occasions. So: double-stranded, helical, and the strands are antiparallel. DNA is a long chain, people, and its monomer, the single unit, is a nucleotide. There are four of them in DNA: adenine, thymine, guanine, and cytosine. They are represented by single-letter abbreviations taken from the first letter of their names, so A for adenine, T for thymine, and so on. The rule of chemistry here is that adenine (A) on one strand of DNA pairs with thymine (T) on the other strand, and guanine (G) on one strand pairs with cytosine (C) on the other. Keep this in mind. Some of you might already know this; if you do, fine, and if you don't, well, take this as a quick introduction.

There are a couple of points that we should note from this slide. The structure of DNA, that double helix with antiparallel strands, is pretty much the same in any organism you talk about. So that's one major biomolecule that we are going to look at. We have a sequence of DNA on one strand, which is a stretch of nucleotides, and of course the other strand carries what we call the complementary sequence.

The second biomolecule that we are going to talk about is RNA, ribonucleic acid. Notice the very first thing on this slide: rather than having one common double-helical structure like DNA, RNA has three types:
messenger RNA, ribosomal RNA, and transfer RNA. So there are three types of RNA. Another thing to keep in mind is that whichever RNA we are talking about, it is single-stranded, not double-stranded like DNA. Messenger RNA is single-stranded, and ribosomal RNA and tRNA are also single-stranded molecules, which means nothing prevents them from folding onto themselves. They can form structures like the one shown in this picture, and not just this; they can form several other kinds of structures. So keep that in mind: RNA comes in diverse forms because each and every molecule is single-stranded.

Again, just as a quick introduction: every protein-making gene will have its own messenger RNA produced when the cell expresses that gene. So messenger RNA is the molecule that is produced from every gene that can make a protein. When you think about messenger RNA, put it this way: ribosomes are going to read it in triplets, call in the corresponding tRNAs carrying amino acids, and make proteins. We call it messenger RNA because it carries the message from genes.

The other two RNA molecules, ribosomal RNA and transfer RNA, never make proteins from themselves; they just help with protein formation. For example, ribosomal RNA becomes part of the ribosome. Along with some proteins, it sits in the cytoplasm, and that's what a ribosome is. It helps to read messenger RNAs and form proteins, but ribosomal RNA molecules are never going to form proteins themselves; ribosomes are not going to read them. The same applies to tRNA. These are again helper molecules; they help with protein synthesis, so a tRNA will never be made into a protein itself. The genes that code for ribosomal RNA and the genes that make transfer RNA never make proteins. They get expressed, but they just stop at RNA
formation, and that RNA actually does perform some action in the cell. So that's RNA. It is also made up of nucleotides, so the monomers of RNA are also nucleotides. Notice that there is no thymine; instead we have uracil in RNA molecules. If you are a biologist, you might know that tRNA has some other uncommon nucleotides in it, but that's not really part of our lecture. My attention here is on uracil, because that is the unique nucleotide in RNA and it replaces thymine. But keep in mind that RNA is single-stranded; that should be the take-home message. It comes in three types, and it can fold onto itself to assume several different structures. So let's collect these points, make a note of them, and move on.

The third, and the most diverse, biomolecule is protein. There are several proteins in our cells, because every protein-coding gene will have its own messenger RNA, and there are many protein-making genes. Those mRNAs will be read by ribosomes (which contain ribosomal RNA), transfer RNAs will come into the picture loaded with amino acids, and we are going to have proteins. The monomer here is the amino acid. Each amino acid joins to the next through a peptide bond, forming a chain of amino acids, and that is basically a protein. However, a simple chain of amino acids is just the primary structure of a protein. These amino acids have single-letter abbreviations, just like nucleotides do, and I'm going to point out some of them later on when I show you certain bioinformatics tools. The primary structure is a simple string of amino acids: if I keep writing single-letter abbreviations of amino acids one after the other, that's the primary structure. That is not sufficient, people, for a protein to work in the cell. I'm sure you guys know this; some of you are chemistry majors, some of you are
biology majors, so you know that each molecule has its own three-dimensional shape, and only when it takes that shape in the cell can it perform its actions, because that is when it can find its binding partners. Each protein is looking for something to bind to: it could be an ion, another protein, maybe a sugar. The structure of the binding partner should fit perfectly into the three-dimensional structure of the protein, and that is why no protein stops at primary structure. There are secondary structures like alpha helices and beta sheets, and further, those secondary structures are folded in the cell to form a tertiary structure. So every single protein in the cell will have its own three-dimensional structure.

Some proteins don't stop even here. Some proteins need to attach themselves to other proteins; the cell needs to couple a few proteins together, and that's when they can act. They act in a group, a bunch of proteins bound to each other. So some proteins, not all, have something called quaternary structure: different proteins attached to each other and performing a biological function together. Again, the take-home message for us is that the monomer of a protein is the amino acid, a protein is an amino acid sequence, and the three-dimensional structure of the protein is critically important to know. Keep these things in mind and let's move ahead.

Although I have shown you a eukaryotic cell here (you can see the nucleus of the cell), prokaryotic cells have the same process going on. We essentially start from DNA. This is where the genes are; this is the genetic material of every cell, and it houses all the genes. Now, when a cell, any cell, prokaryote or
eukaryote, whether us, plants, animals, bacteria, whatever, decides to activate a gene, it will form an RNA molecule from that gene. The process of going from DNA to RNA is called transcription, so when a gene is activated, that gene is going to be transcribed. I told you a moment ago that some genes go all the way to proteins: their RNA molecules are messenger RNAs, which are further translated with the help of ribosomes to form a protein. But what if we are talking about a gene that just makes a ribosomal RNA or a transfer RNA? That kind of RNA will never form its own protein. Still, when a ribosomal RNA gene is activated, we will have transcription of that gene and we will form ribosomal RNA; those genes simply stop there. So as a whole class we can settle on the thought that when a gene is activated, it is at least getting transcribed. If we are talking about messenger RNAs, they will also get translated with the help of ribosomes and form proteins.

Keep in mind that these three are extremely diverse molecules. Here we are talking about a stretch of nucleotides in a double-stranded, antiparallel, helical molecule. Through transcription it forms a single-stranded molecule of RNA (and I'm going into technicalities here, so keep that in mind): RNA is single-stranded but formed from a double-stranded molecule, and the monomer is still the nucleotide, but thymine is replaced by uracil. And if messenger RNAs are translated to proteins, then that is a whole different biomolecule in itself, whose monomer is the amino acid. So the cell is doing an incredible thing here. The cell works with three different kinds of molecule. Of course, DNA is what the cell has already, but the cell is essentially creating two completely different molecules from it, and there are several proteins, so we have a tremendous amount of diversity here. This whole process together is called the central dogma of molecular biology, because this holds
true for prokaryotes as well as eukaryotes. So we are going to stick to the central dogma, appreciate the diversity in biomolecules that we see, and try to see how that fits with the information we can collect from these biomolecules.

For now I want your attention on this process, transcription. How come we get a single-stranded RNA from a double-stranded DNA molecule? The process happens kind of like this. Here we have double-stranded DNA, and in this picture a gene is shown to you right here, just a cartoon of a gene, with two strands of DNA. Since RNA is single-stranded, when the cell decides to activate this gene (mark my words, letter by letter), the two strands of DNA are going to separate, and the cell is going to recruit an enzyme to read (keep your focus on my mouse pointer) just one strand of DNA and form an RNA molecule. So only one strand of DNA is used to form an RNA molecule. That makes perfect sense to us because RNA is single-stranded. The cell is never going to use both of these DNA strands to form a double-stranded RNA; that's not how it happens. RNA is rarely double-stranded, and if it is, then it is single-stranded RNA folded onto itself, that's it. Otherwise RNA is single-stranded.

So only one strand of DNA is used. The strand that the cell uses to make RNA is called the template strand. Keep that name in your mind somewhere; we are going to come back to it at least a couple of times today. The other strand of DNA is called the coding strand, or the sense strand, and it is not used to form RNA. Maybe this figure will do better justice to the point I'm trying to make. This strand of DNA is shown in red (there are no real colors in DNA; this is just for our understanding) and has, let's say, this sequence. It is not being used by the cell to make RNA. In fact, it is this bottom strand of
DNA that's being used; that's the template strand. It acts as a template for RNA formation, so, as you can see, the sequence of the RNA is complementary to the template strand. If we have T in the template strand, the cell will add A in the RNA. And if there is adenine in the template strand, RNA doesn't have T but has U instead, as we learned a moment ago, so the cell will put U. The point to note is that the sequence of the RNA is complementary to the template strand of DNA; these nucleotides pair with each other. And if you go back a second and look at the other strand, the coding strand, it was also complementary to the template strand, because usually these two strands of DNA bind to each other. So now we have RNA which is complementary to the template strand, and the coding strand is also complementary to the template strand. Look at the sequence of the RNA: it perfectly matches the sequence of the coding strand of DNA, except that the T's are replaced by U's. It is the same exact sequence, simply because both are complementary to the template strand, the strand that acts as the template to form RNA.

That is why most databases in bioinformatics will provide you with the coding strand sequence, the sequence that is, apart from T versus U, exactly the same as the RNA sequence. So if you are looking at a gene sequence in a database and you wonder, hey, what would the RNA sequence be here, all you have to do is replace those T's with U's, and that's your RNA sequence. That is why biological databases give you the coding strand sequence, so keep that in mind. Again, some of these things might sound foreign to you right now, but when we actually look at those biological databases, trust me, it will all make sense, as long as you're trying to keep up with the pace.
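The relationship between the coding strand, the template strand, and the RNA described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are made up for this sketch and do not come from the lecture or from any particular database. All sequences are written 5' to 3', which is why the template strand is obtained by complementing and reversing.

```python
# Base-pairing rules from the lecture: A pairs with T, G pairs with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# When RNA is built on a DNA template, A in the template pairs with U in the RNA.
RNA_PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def template_from_coding(coding: str) -> str:
    """Return the template strand (5'->3') for a coding strand given 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(coding.upper()))


def transcribe(coding: str) -> str:
    """Database shortcut: the RNA equals the coding strand with T replaced by U."""
    return coding.upper().replace("T", "U")


def transcribe_from_template(template: str) -> str:
    """What the cell does: build RNA complementary to the template strand."""
    return "".join(RNA_PAIRING[base] for base in reversed(template.upper()))


# Hypothetical coding-strand sequence, used only for illustration.
coding = "ATGGCCATTGTAATGGGCCGCTGA"
template = template_from_coding(coding)
rna = transcribe(coding)

# Both routes give the same RNA, which is the point made in the lecture.
assert transcribe_from_template(template) == rna
print(rna)
```

The final assertion checks the claim directly: complementing the template strand and replacing T with U in the coding strand yield the same RNA sequence.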
So the coding strand is the sequence that we see.

All right. When these biological discoveries were taking place, when scientists were discovering how genes are expressed, a lot of expression data was piling up in the scientific community. When the Human Genome Project was going on, people had questions in mind: human cells have a tremendous number of genes, so what is the DNA sequence of each gene? There was a worldwide collaborative project to sequence the entire human genome, and it generated a tremendous amount of data. Now, where to keep that data? We needed help to preserve it. We cannot just preserve that data on paper; if we do that, it will remain in one lab or at one place. The scientific community wanted worldwide access to that data; everybody in the world should have access to it. So where to store it? That is when people looked to other sciences like computer science: can we get help from computers to store this data? Is computer science advanced enough? Fortunately, computers were evolving around the same time, and the answer came out to be yes, we can get help from computers to store this data. Furthermore, we can not only store it but also try to analyze it to draw some meaningful conclusions.

We all do experiments at some point, people in the lab and even otherwise; our lives are full of experiments, no matter what science you talk about. We know two things about doing an experiment, whether it's chemistry, physics, or biology: an experiment takes time, and sometimes the reagents we use are costly, so it takes money. And typically, when we are doing research, we don't know the outcome when we start an experiment. We can
kind of predict from our hypotheses, but typically we don't even know whether we are heading in the right direction. That is where, decades ago, people were trying to get any help possible from computer science. Can we at least virtually predict the outcome of an experiment? Can we at least know whether we are heading in the right direction, in order to save time and money? There is no point in spending five years on an experiment only to realize that I was chasing shadows. If I can get periodic help from computers, the computer will not be doing any experiment for me; however, I can check with the computer, maybe plan out my experiment in the computer. We call that an in silico experiment. The experiments that we do with animals are in vivo, the experiments that we do in a test tube are in vitro, and the experiments that we do with computers are in silico, because computers have silicon chips. These are three different words that we need to keep somewhere in our minds. So can we do some of those in silico experiments and periodically judge, maybe on a monthly or bimonthly basis, whether our experiments are going in the right direction? If we can do that, we can modify our hypothesis and always steer ourselves in the right direction.

That's what I'm going to focus my lecture on today. Again, this is just introductory bioinformatics, so I'm going to introduce you to some pre-existing tools in bioinformatics and how we can use them in our day-to-day biological experiments. Some of those tools you can even use in chemistry or in bioprocess work, so it's going to be interesting. Some of you might already be familiar with some of those tools. If you are, that's fantastic; you will be able to do the activities very quickly. If you are not, you will learn them, so that's going to be knowledge for you all. So stay tuned. Some interesting stuff is going to come
to you. The point I want to make before I leave this slide is that there is a tremendous amount of data being generated in biology, and we are actually going to talk about what kind of data that is. The first thing is right in front of us: the nucleic acid sequence. So let's see what types of data we can gather in biology, shall we? This is where informatics comes into the picture. Wherever we have data, we have information, and here it is information in the context of biology. This coming together of biology and information technology, IT, or computers, is essentially what bioinformatics is all about, and it led to the start of a whole new field. Nowadays people make careers in bioinformatics, and there are majors named bioinformatics in colleges, because this has tremendous potential. Keep in mind, though, that we can never make any earth-shaking discovery with bioinformatics alone. We need to do biological experiments in order to discover new things. We can use bioinformatics, we can get help from computers, only to assist us with our biological experiments. That is one thing to bear in mind really well.

Now that we have been introduced to informatics, it's time to look into what kind of data we can collect in biology, and we are actually going to make this slide together. As you can see, this slide is almost blank, and it has those three familiar biomolecules on it. So I'm going to stop my slideshow for a while and go back to the text box I have there, starting with the biomolecule DNA. What kind of data can we get? We are going to complete this slide together, so again, I would much appreciate your input. I'm going to complete DNA myself, but you guys are going to help me with RNA and proteins. So let's start with DNA. Of course, we can have the nucleotide sequence; that is data. So
the nucleotide sequence is one kind of data for DNA. The structure of DNA is pretty much the same in any organism we talk about, so there is barely any diversity there, and I wouldn't put structure down as a diversified kind of data for DNA. Nucleotide sequence, however, definitely yes. How about this: there are four different types of nucleotides in DNA, A, T, G, and C. Sometimes it's important to know how many adenines there are in a DNA molecule, and how many thymines, cytosines, or guanines, so the percentage of each nucleotide could be meaningful information.

Another piece of meaningful information would be this: if I have the sequence of one DNA molecule, how similar is it to the sequence of another DNA molecule? For example, let's talk about us, humans. We have several genes in our body, and I can take a common example, hemoglobin. It's the protein that carries oxygen in us, and of course it is a protein that comes from its own gene. (There are several globins in our cells.) How similar is that gene, in its nucleotide sequence, to mouse hemoglobin? If you have that kind of question, you first need to obtain the human hemoglobin gene sequence, then obtain the mouse hemoglobin gene sequence, and compare the two with each other. There is bioinformatic work involved in that: you have got to align those sequences with each other. So sequence alignment could be a form of data for DNA. If you can think of something else, feel free to put it in the text chat window as we speak. Sequence alignment, or percent homology, between several DNA molecules: these are some terms that we should keep in mind.

Can somebody tell me what kind of data we can have for RNA? We can always go back to DNA if something strikes us. RNA, unlike DNA, has several types; we just learned about that. So if you can put your thoughts in the text chat window of Zoom, I would much appreciate that. What kind of data can we have for RNA? Of course,
nucleotide sequence. Any other thoughts? Types of RNA, I love that. Yes, types of RNA for sure, so let's put that right here. There are three types of RNA; if I show you just a nucleotide sequence of an RNA, I'm not telling you much. You might ask me, is this mRNA, rRNA, or tRNA? So, type of RNA. Fantastic. Anything else you can think of? RNA is single-stranded, and I told you some peculiarities about RNA. RNA structure, fantastic, we are all learning together. For structure of RNA, I'm going to write folding pattern in parentheses.

Transcriptomics, yes. We can have a set of RNA molecules in a cell, and all of those RNA molecules have definitely come from expression, from transcription, of certain genes. So if we have a question like, here is a cell, how many different RNA molecules are there in it and what are their sequences, that is the transcriptome. Just as the genome is the set of genes in our cells, the transcriptome is the set of RNAs in our cells. So, set of RNA in a cell, let's stick to simple English, and in parentheses, transcriptome. Fantastic, people, this is going well.

Any other things you can think of for RNA? Translation start and end points; that holds true for mRNA for sure. Messenger RNA, the RNA molecule that forms proteins, has to have some starting point for protein synthesis and some ending point. That means it has to have a start codon somewhere and a stop codon, which tell ribosomes where to start making the protein and where to stop. Now I'm fine-tuning my words and building upon this answer: the whole sequence of messenger RNA from the start codon to the stop codon is called the ORF, the open reading frame, and it is imperative to predict open reading frames for messenger RNAs. Fantastic. What else? Which mRNA is expressed under certain conditions, yes. Conditional expression kind of goes with transcriptomics, but yes, conditional expression of RNA is kind of important.
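The idea of an open reading frame, the stretch of mRNA from a start codon to the first in-frame stop codon, can be illustrated with a deliberately naive scan. This is a sketch under stated assumptions, not the method used by real ORF-prediction tools: it assumes AUG is the only start codon and UAA, UAG, and UGA are the stop codons (the standard genetic code), it searches only the given strand, and the function name `find_orfs` is invented for this example.

```python
from typing import List

START_CODON = "AUG"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(mrna: str) -> List[str]:
    """Return every stretch that runs from an AUG to the first in-frame stop codon.

    Naive sketch: each AUG is treated as a possible start, so an AUG that lies
    inside a longer ORF in the same frame produces a second, nested entry.
    """
    mrna = mrna.upper()
    orfs = []
    for start in range(len(mrna) - 2):
        if mrna[start:start + 3] != START_CODON:
            continue
        # Read in triplets from the codon after the start, as a ribosome would.
        for pos in range(start + 3, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                orfs.append(mrna[start:pos + 3])  # include the stop codon
                break
    return orfs


# Hypothetical mRNA fragment: two leading bases, then AUG AAA UAG, then two more.
print(find_orfs("GGAUGAAAUAGCC"))
```

A start codon with no in-frame stop codon after it yields nothing, which matches the lecture's definition that an ORF needs both a start and a stop.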
Some genes are expressed only under stress conditions, so which genes are those? It's important to know that. So, definitely nucleotide sequence, definitely the type, definitely the structure of RNA, definitely the conditional expression and the transcriptome. That is good. Anything else you can think of for RNA? Otherwise we will go to proteins; we can always come back.

What about proteins, people? What type of data can we have here? Protein sequence and structure, yes. So let's say amino acid sequence, shall we? That's the primary structure of every protein: a simple amino acid sequence written in single-letter abbreviations. Then structure. The three-dimensional structure of a protein is critically important for its function, so it is tremendously important to be able to predict it. If I give you just a simple amino acid sequence of a protein, my question is, will you be able to at least predict its structure? Solving a full three-dimensional structure takes time; it takes time-consuming and expensive techniques such as X-ray crystallography or 3D cryo-electron microscopy. Before going to that, can you at least find out whether any other organism has been shown to have a similar protein with a known 3D structure? So yes, we can definitely look at the 3D structure of the protein.

What else can we look into for proteins? Aha, I like that: protein function. Thank you, people. I told you proteins are the most diverse biomolecules in the cell, and they come with a variety of functions, each its own. So what is the function of our protein of interest? Suppose we have just the sequence of the protein, and we can predict its three-dimensional structure, or just find sequence matches. Let's say, for example, that we are looking at human hemoglobin, which carries oxygen, and we have just its amino acid sequence. If we can somehow match that sequence against all of the plant proteins that are
known and if we do see some similarity with that maybe that protein in plants can also carry oxygen maybe just because maybe this is just prediction and that's what bioinformatics helps us with it helps us to generate meaningful predictions and we can test those predictions furthermore with real experiments so yes the function of protein essentially any other factor that might matter into biological data for protein how about this conditional again formation of protein or synthesis how about that just like rna some proteins are made only under certain conditions like most of the antibodies are made only when we have infection okay there are some antibodies that are made even without but there are some proteins that are made only under stress conditions there are some proteins that are always being made homology fantastic yes amino acid sequence homology how similar one protein is to the other protein and that kind of thing can be in rna as well wherever you have some kind of sequence we can definitely have sequence homology sequence alignment any other thing you can think of this is going tremendously well people thank you thank you for your feedback any other things aha protein interactions that relates to what i just told you few minutes ago protein wants to find binding partners so what other molecules does it bind to how does it interact in the cell is it no that brings to another point cellular location of the protein that's also important some proteins are membrane proteins some proteins just remain within the cell some proteins are secreted out of the cell okay so where does this protein go that is also important any other thing that you can think of how about rna splicing yes fantastic rna splicing in parentheses we can write alternative splicing in eukaryotes i deliberately avoided this because not everyone knows this is kind of a complicated topic alternative splicing but i'm glad somebody mentioned this fantastic
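As a rough illustration of the "how similar is one protein to another" question raised above, percent identity between two sequences that have already been aligned can be computed as below. This is a sketch under stated assumptions: the function name `percent_identity` is invented for this example, the two inputs are assumed to be the same length with `-` marking alignment gaps, and the short peptide strings in the usage note are toy data rather than real hemoglobin sequences. Actual homology searches rely on alignment tools, which this sketch does not replace.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    Returns 0.0 if no columns can be compared.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # gap column: nothing to compare
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

For the toy pair `MVLSPADK` and `MVLSGEDK`, six of the eight positions match, so the function returns 75.0. High identity only suggests a shared function, which, as the lecture says, is a prediction to test with real experiments.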
any other things in protein in fact i would like to add something in dna now how about name of the gene if we are really looking into a gene sequence because there could be several other dna several other stretches of nucleotides that may or may not be a gene if we are looking into the gene then it's good to have the name of that gene its location remember dna is the genetic material chromosomal location where exactly on chromosome that gene is present are there any diseases associated with that gene what if that gene sequence might have some mutations in some people if they do have mutations then what diseases could they have so diseases or disorders associated with specific genes you know what people every single point that we are putting on this slide there is a database out there for that there are databases out there that correlate the name of the gene with diseases there are databases out there that have homology between several genes between different organisms we are going to touch upon some of those there are databases out there to tell you protein three-dimensional structure there are databases out there to analyze the whole transcriptome no matter what organism you talk about so things have advanced quite far we are going to just touch upon those databases and just some of the widely used databases that's the whole point of today's lecture okay so this is going well let's move on let's keep this slide this is a slide in progress always okay so now i think it's time for me to introduce you to two common biological databases that house gene sequences one of them originated in america the other one tells us essentially the same thing but it originated in europe okay so the american one is ncbi genbank national center for biotechnology information and the european one is embl bank these two databases house gene sequences of course there are protein sequences that are housed by these two databases they essentially are different versions american and
european version of the same data so there is redundancy there is correlation between these two databases and there are interrelations in fact for today we will stick with the american version of the database just because it is more user friendly however we are going to use some cool tools from this embl bank okay and there are many more databases that i have not even listed on this slide there are some specific databases what if genbank houses the gene sequences from all organisms that are sequenced like humans mice fruit flies worms plants but what if we just want to look into fruit fly sequences then there is a database for that flybase what if we want to look at just plant gene sequences then there are some small databases just for that so there are some specific ones but as of now we are going to stick to genbank and again i'm going to just stop slideshow for a while and i'm going to share my browser screen with you i'm going to show you one critical thing to do how to search for a gene sequence in genbank okay so let me stop sharing my screen and let me be back with my web browser here we go what i'm going to do is in the search window i'm just going to type ncbi genbank that is genbank notice that there are several sister databases in genbank there are nucleotide sequence databases in which case you will select nucleotide there are genome databases there is the gene expression omnibus geo database let's start off with gene this also houses some free textbooks by the way people there are book databases as well in there let's start with gene and you can pretty much type your favorite organism and the name of your favorite gene in here my favorite gene in humans is actin so let's search for human a c t b beta actin just an example later on we will search for some other genes as well i'm just going to show you how to search for a gene and this is where you will get a gene card for that gene beta actin make sure that we are looking into human gene click on that
that will take you to an ncbi page for that gene the name of the gene right up front there is some summary you can get some meaningful information about this gene okay you can also get some information about the expression pattern of this gene in which human tissues is this gene expressed well it tells you ubiquitous expression several tissues first of all it is a protein coding gene keep going down i'm just going to quickly scroll if you are a biologist you will actually appreciate this little interactive browser it shows you a cartoon of the structure of that gene and tells you some meaningful information about how many exons how many protein making sequences are there how many introns are there and where do they start where do they end so as you hover your mouse pointer on that it tells you that information keep going down if the experiment is done by some people some scientists out there in the lab this graph will pop up and this is the expression data i like actin gene because it's expressed in every single human tissue in pretty much every single human cell and that's what this graph tells you okay you can also change the type of experiment over here from the drop down menu and see several other types of graphs but again that's reserved for some advanced things let's move on what other things does this webpage show you of course several references the names of the people that work on this data associated conditions are there any mutations associated with this gene if so what kind of diseases or disorders or syndromes can humans get you can have that information right there okay so without going anywhere on to google just in this database itself you can do this for any gene and there are several other things down there what mutations you can have what kind of interactions does this protein do to which other proteins it can interact maybe it can interact with some viral proteins so you will find all kinds of interactions
and the associated research studies listed right here on this web page the most important thing that i want to point out is what if i want to know this gene sequence in that case you have to go up right here where you see this interactive browser if you are a biologist again feel free to look around but click on this link genbank that will take you to the sequence of that gene and that is the sequence okay we are getting to that page it doesn't tell you that it's human actin it just tells you that human actin gene is on human chromosome 7 and of course the earlier page also had that information this number though is a unique database id for human actin gene so if you are a researcher working on this gene you better note down this id so that you can refer to this same gene sequence in future you can almost just put this number in the search window and you will be coming directly to this page this also tells you how long is the nucleotide sequence so about 3454 base pairs so about 3400 base pairs it's a linear dna keep scrolling down it tells you the names of people who submitted this sequence make a note of this section features it tells you that of course it's genomic dna all the way starting from the first nucleotide to the last nucleotide it also tells you that it's a gene the name of that gene is actb and all of that sequence starting from first to the last nucleotide is the same gene it also tells you the mrna sequence for that gene okay it asks you to join several nucleotides to make an mrna now you might be thinking oh do i have to manually join these nucleotides no no no look under the subheading under this mrna transcript id just click on that and it will take you to just the mrna sequence for that gene there you go mrna if you scroll down that is just the mrna sequence of course replace ts by us okay it also gives you the coding dna sequence just the exons of the gene and what would the protein sequence look like so this is the protein sequence
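The two steps mentioned above, writing the mrna by replacing ts with us and reading off the protein sequence from the coding sequence, can be sketched as follows. This is a minimal illustration assuming the standard genetic code; the helper names `transcribe` and `translate` are invented for this example, and the short input in the usage note is toy data, not the actb coding sequence.

```python
# Standard genetic code, built from the conventional T, C, A, G base ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Write the coding-strand DNA as mRNA by replacing every T with U."""
    return coding_dna.upper().replace("T", "U")


def translate(coding_dna):
    """Translate a coding DNA sequence into single-letter amino acids.

    Reads codon by codon from the first base and stops at the first
    stop codon. Incomplete trailing codons are ignored.
    """
    seq = coding_dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(seq) - 2, 3):
        residue = CODON_TABLE[seq[i:i + 3]]
        if residue == "*":  # stop codon ends translation
            break
        protein.append(residue)
    return "".join(protein)
```

For the toy input `ATGGCCTAA`, `transcribe` gives `AUGGCCUAA` and `translate` gives `MA`. The sketch assumes the input contains only A, C, G, T (or U); any other character would raise a `KeyError`.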
these are single letter abbreviations of the amino acids so every single information that you need to know is right there on this page if you want to know just the coding sequence separately just click on this external link ccds and that will take you to just this sequence it asks you to manually join these nucleotides you don't have to just click on that link and then we have a full gene sequence starting from first nucleotide to the very last nucleotide right here now notice that this sequence has numbers okay nucleotide position numbers what if you want to work with this sequence what if you want to do some analysis with the sequence and you want to get rid of these numbers you can do a simple trick just copy this whole thing copy and go to this website this is a fantastic sequence manipulation suite online software developed by university of alberta in canada it has several free tools for us to play around okay we are just going to look at some of these for example filter dna whatever other non-dna characters you might have what if somebody gives you a word file of dna sequence with some other characters you don't want those characters because any bioinformatics software will be thrown off by those characters in that case just run your sequence through this i mean it gives you an example sequence just clear that off and paste our sequence in here and hit submit what it gives you is the sequence without numbers and now you can play around with this sequence okay it is also in some kind of a format which i'll tell you what that format is i think i'm not really sharing this screen with you so let me go back and share my whole desktop i can see your screen before oh you could you could see the output i saw your cut and paste oh okay now this is the output here we go so now we don't have numbers we have just the sequence okay filter dna sequence it gives a name to it and i want you guys to notice this sign the greater than sign that starts off with the
sequence that greater than sign signifies something it's a format of a sequence that this online software converts our sequence to it's called fasta format and that's what we are going to come to now so let me unshare my screen and let me put the powerpoint back up since we had some introduction about genes right here the fasta format a lot of bioinformatics software doesn't accept just the dna sequence it has to be in this fasta format we call it fasta and what it means is whatever sequence you are looking into put this greater than sign in front of it and that helps computer to know that this is where computer should start reading the sequence and now you can list multiple sequences one after the other as long as you start every sequence with a greater than sign you are even allowed to put a certain name for that sequence so greater than sign whatever unique name you want to put for this sequence you can even type in simple english like human beta actin something like that and you know mouse beta actin something like that and don't be under the impression that you have to have a limited number of nucleotides now you can go on to like thousands of nucleotides here and then put another greater than sign and put your second sequence put another greater than sign put your third sequence so that is a fasta format of a sequence now how can we get that well fortunately it is easy for us to get any sequence on genbank in fasta format genbank has made that real easy let me go back to the browser that we were working with and that will be more clear to you i'll show you right there how you can go to fasta format of any sequence from genbank here we go we are back to that beta actin gene sequence of humans on chromosome 7.
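The two ideas above, stripping position numbers out of a copied sequence and writing several named sequences one after the other in fasta format, can be sketched in Python. This is a hedged illustration and not the code behind the online filter dna tool: the function names `clean_dna`, `to_fasta`, and `parse_fasta` are invented here, the 70-character line width is an arbitrary choice, and `clean_dna` assumes the pasted text contains only sequence letters, digits, and whitespace (stray words would leave their a, c, g, t, n letters behind).

```python
import re


def clean_dna(raw):
    """Keep only A, C, G, T, N; drop position numbers, spaces, and newlines."""
    return re.sub(r"[^ACGTN]", "", raw.upper())


def to_fasta(records, width=70):
    """Format (name, sequence) pairs as FASTA text.

    Each record starts with a '>' header line, followed by the
    sequence wrapped to `width` characters per line.
    """
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Read FASTA text back into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []  # a new '>' starts a new record
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

A typical use would be `to_fasta([("human beta actin", clean_dna(pasted_text))])`, where `pasted_text` is a hypothetical variable holding the numbered sequence copied from a genbank page. The greater than sign is what lets `parse_fasta` tell where each sequence begins.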
there is a link here for every single genbank entry fasta just click on that and you will get the whole sequence in fasta format there you go you see that familiar greater than sign you see some name so all you can do is just copy paste this sequence into any kind of bioinformatics software that you want to use it with you can do one more thing if you want a standalone file standalone fasta file from this sequence all you got to do is just click send to file and select format fasta and say create file this will actually download a fasta file of that sequence on your computer if you have some software installed on your computer you can use this feature okay there are several other tools that you can use in fact you can sign into ncbi using one of your google accounts and you can have your favorite searches saved you can play around with this ncbi it's all free and in open domain if you want to search something within this sequence there is a feature in ncbi to do that all you have to do is go here find within this sequence okay so the most important thing is now you know how to obtain fasta sequences of any dna sequence from genbank we are going to use that skill and i'm going to share my screen back with you and give you the very first assignment okay and that's going to be kind of interesting so let's go back to my powerpoint we are going to play around a little now with these dna sequences it's time for that here we have i have assembled it's not me i have just put together these several coronavirus whole genome sequences here we have the sars-cov-2 right at the bottom okay that's causing covid we are living with pandemic these days a similar one to that is sars-cov of course there is one middle eastern variant and there are several other coronaviruses and these links will take you to their full viral genome sequences on genbank your job is to obtain them in fasta format and then go to this link to align them with each other
and we are going to actually do a multiple sequence alignment the software will do it for us to see which of these viral sequences are similar to each other and which of them differ from each other so let me go to this link real quick and show you what it kind of looks like now again i might have lost that shared screen but i will share my entire desktop with you all so you can see everything in there okay here we have the clustal omega web server i'll just align sample sequences for you we are aligning dna we just load example sequences in there notice that they are in fasta format so when you copy and paste your whole coronavirus genomes make sure you paste them in fasta format next to each other okay other than that leave the default parameters just like that and just hit submit and let the server do its job it's going to take some time to run and it gives you the alignment when it gives you the alignment the first thing you should do is go to this guide tree and that tells you which two sequences are highly similar to each other and the third one is a little distant so people start doing this start gathering those sequences from my powerpoint slide i'm actually going to share the google drive folder link with you now it's time to do that and i'll give you about 5-10 minutes for this activity and we will further move on to the next one okay so let me stop sharing and go back to my powerpoint so that you can see there you go if you can click on these links you can get those sequences from right here otherwise i'm gonna show you and i'm gonna share that link of google drive folder with you the same text you will find as activity one but start doing this and if you can get to that cladogram the guide tree click on guide tree you can get that then screenshot it and maybe post it in the same folder i have given you edit access for that i'm posting that link in a few seconds okay here we go here is the link to the google drive folder so this is your
time start working on that alignment thank you that if you guys can get to um the guide tree cladogram post it either in zoom text chat or post it into our google drive folder right there as long as we have a couple of responses we can safely move on to next activity but i'm going to giv