GenBank files: take one

This part deals with the manipulation of GenBank files. These files are used by NCBI to store information about RNA, DNA and protein sequences. It is usually composed of an annotation section, that gives information about the sequence present in the particular file. I won't spend much time explaining the GenBank format, because it is not the goal of the site. The perl book has some good explanation about it and you can also find more information here. Also, we are going to see here some of the characteristics of such files.

The GenBank file we are going to manipulate from now on is this one

LOCUS       DQ283072                2414 bp    DNA     linear   VRT 23-MAR-2006
DEFINITION  Megaelosia goeldii 12S ribosomal RNA gene, partial sequence;
            tRNA-Val gene, complete sequence; and 16S ribosomal RNA gene,
            partial sequence; mitochondrial.
ACCESSION   DQ283072
VERSION     DQ283072.1  GI:90296241
KEYWORDS    .
SOURCE      mitochondrion Megaelosia goeldii (Rio big-tooth frog)
  ORGANISM  Megaelosia goeldii
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Amphibia; Batrachia; Anura; Neobatrachia; Hyloidea;
            Leptodactylidae; Cycloramphinae; Megaelosia.
REFERENCE   1  (bases 1 to 2414)
  AUTHORS   Frost,D.R., Grant,T., Faivovich,J., Bain,R., Haas,A.,
            Haddad,C.F.B., de Sa,R.O., Channing,A., Wilkinson,M.,
            Donnellan,S.C., Raxworthy,C., Campbell,J.A., Blotto,B.L., Moler,P.,
            Drewes,R.C., Nussbaum,R.A., Lynch,J.D., Green,D.M. and Wheeler,W.C.
  TITLE     The Amphibian Tree of Life
  JOURNAL   Bull. Am. Mus. Nat. Hist. 297, 1-291 (2006)
REFERENCE   2  (bases 1 to 2414)
  AUTHORS   Frost,D.R., Grant,T., Faivovich,J., Bain,R., Haas,A.,
            Haddad,C.F.B., de Sa,R.O., Channing,A., Wilkinson,M.,
            Donnellan,S.C., Raxworthy,C., Campbell,J.A., Blotto,B.L., Moler,P.,
            Drewes,R.C., Nussbaum,R.A., Lynch,J.D., Green,D.M. and Wheeler,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-OCT-2005) Herpetology, Division of Vertebrate
            Zoology, American Museum of Natural History, Central Park West at
            79th Street, New York, NY 10024, USA
FEATURES             Location/Qualifiers
     source          1..2414
                     /organism="Megaelosia goeldii"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /specimen_voucher="Paulo Nuin"
                     /db_xref="taxon:209670"
                     /country="Brazil: Rio de Janeiro, Teresopolis, Rio Beija
                     Flor, 910 m, 22'24'S, 42'69'W"
     misc_RNA        <1..>2414
                     /note="contains 12S ribosomal RNA, tRNA-Val, and 16S
                     ribosomal RNA"
ORIGIN      
        1 gtttggtcct aaccttgtaa tcaattttta cttaatatac acatgcaagt ctccgcaccc
       61 ctgtgaaaac gcccttaaat cccccatggg ataaggagct ggtatcaggc acgaaaatct
      121 gcccaaaaca cctagctatg ccacacccac aagggtactc agcagtgatt gacattattt
      181 ataagcgcca gcttgactca gttaaagtaa agagagccgg caaatctggt gccagccgcc
      241 gcggttaccc cacgtggctc aaattgattt ctttcggcgt aaagcgtgat taaagtgccc
      301 atcaacattg gagttaaact aaaattaagc tgtgacacgc ttattttaca gaaaagcaca
      361 aacgaaagtt acttcaattt aaacaacttg aattcacgac agtcaggaca caaactggga
      421 ttagataccc cactatgccc gaccgtaaac tttaatttac accatcaccg ccagagaact
      481 acgagcaaag cttaaaactc aaaggacttg acggtccccc acatccccct agaggagcct
      541 gtcctttaat cgataatccc cgcttaacct caccattctt agtctttcag cctgtatacc
      601 tccgtcgtca gcttaccccg tgagcgaaaa ttagtgagct taatgtccac acgtctacac
      661 gtcaggtcaa ggtgcagcaa atataatggg aagagatggg ctacactttc tagtctagaa
      721 tatacgaaag accacctatg aaacctggtc agaaggcgga tttagaagta aaaggaaacc
      781 agagcatccc ttttaatttg gcactggggc atgtacacac cgcccgtcac cctcttcaaa
      841 gcctaatttt agtatctaac caactaacgc ctagtagaag aggcaagtcg taacatggta
      901 agtataccgg aaggtgtgct tggaaacaaa atatagccta atcaaagcat ttcgcttaca
      961 ccgaaaagtt atctgtgaaa ttcagattat tttgagctaa aaatctagcc ccactttatt
     1021 ctataatccc ttatcactta aattcatgaa tcaaaacatt ttaataatca agtaaaggcg
     1081 attgaaaaat taataggagc aatatatact gtaccgcaag ggaaagatga aatagaaatg
     1141 aaataataat taaagcataa aaaagtaaag attaaatctt gtaccttttg catcatgatt
     1201 taactagtct acccaggcaa aatgatttta agtctgacct cccgaaacta agtgagctac
     1261 ttcaaggcag cttaatgagc aaatccgtct ctgtcgcaaa agagtggaga gaccttcaag
     1321 tagaagtgat agacctaacg aacttagtaa tagctggtta ttcaagaaaa ggatctcagt
     1381 ccaacctaaa gtcaaattaa tgtttaaaaa taaaaattct gaccttagag taattcaatt
     1441 aaggtacagc ctacttgaaa caggatacaa ccttaactaa tgggtaactt accccttcat
     1501 cttttaagtg ggcctaaaag cagccacctt taaaatagcg tcaaagctta gccgtcctat
     1561 acatctaata ccaaaaacat ctatgaaccc tatactcata ttgaataatt ctatattatt
     1621 atagagattt ttatgttaaa actagtaaca agaattaaat tttctctatt atgttcgtgt
     1681 acatcagaaa ggataaacca ctgataattg acatgcatga gtaaaaagca gtaacttaac
     1741 aagaaaaccc tcctaactct aatgttaacc taacacaagt acatctcaag aaagatttaa
     1801 agaaaaagaa ggaactcggc aaacattaac ctcgcctgtt taccaaaaac atcgcctctt
     1861 gtcaaaattt aagaggtcca gcctgcccag tgaccctgtt caacggccgc ggtatcctaa
     1921 ccgtgcgaag gtagcgtaat cacttgttct ttaaataagg actagtatga atggcaccac
     1981 gagggttata ctgtctcctt tttctaatca gtgaaactaa tcttcccgtg aagaagcggg
     2041 aatttttata taagacgaga agaccctatg gagctttaga cgagtaacaa ctgctaattt
     2101 tataatattt cagataatat ctctatccta gcattatgat tataagtctt tggttggggt
     2161 gaccgcggag aaaaaaataa cctccacatt gaaagaatat tattctaagc aaaaagacac
     2221 atctttaagc atcaacaaat tgacatctat tgacccaata ttttgatcaa cgaaccaagt
     2281 taccctaggg ataacagcgc aatccacttc gagagctctt atcgacaagt gggcttacga
     2341 cctcgatgtt ggatcagggt atcctagtgg tgtagccgct actaaaggtt cgtttgttca
     2401 acgattaaaa ccct
//

which stores a sequence of a mitochondrial gene of a stream frog from South America, called Megaelosia goeldii, also known as Rio big-tooth frog. (get a better formatted file here)

Our fast task will be to extract the DNA sequence from the file. This sounds easy, and not surprisingly it is. If we take a closer look at the file we will see that the sequence starts after the mark <strong>ORIGIN</strong>. From what we have seen before we just need to read the file, the a boolean variable that checks for <strong>ORIGIN</strong> and concatenate everything after that. Something like this

import sys
 
gbfile = open(sys.argv[1], 'r').readlines()
 
sequence = ''
issequence = False
for line in gbfile:
    if issequence == True:
        sequence += line
    elif line.find('ORIGIN') >= 0:
        issequence = True
 
print sequence

Quick and easy. When we find ORIGIN, issequence has its state changed to True and the lines below will concatenated into a string. We print at the end.

GenBank files: take two

Last time we saw how to extract the sequence from our GenBank file, but the final result might not be as nice as we wanted it to be. So, we need to make it prettier. Running last entry's script we end up with this sequence

        1 gtttggtcct aaccttgtaa tcaattttta cttaatatac acatgcaagt ctccgcaccc
       61 ctgtgaaaac gcccttaaat cccccatggg ataaggagct ggtatcaggc acgaaaatct
      121 gcccaaaaca cctagctatg ccacacccac aagggtactc agcagtgatt gacattattt
      181 ataagcgcca gcttgactca gttaaagtaa agagagccgg caaatctggt gccagccgcc
      241 gcggttaccc cacgtggctc aaattgattt ctttcggcgt aaagcgtgat taaagtgccc
      301 atcaacattg gagttaaact aaaattaagc tgtgacacgc ttattttaca gaaaagcaca
      361 aacgaaagtt acttcaattt aaacaacttg aattcacgac agtcaggaca caaactggga
      421 ttagataccc cactatgccc gaccgtaaac tttaatttac accatcaccg ccagagaact
      481 acgagcaaag cttaaaactc aaaggacttg acggtccccc acatccccct agaggagcct
      541 gtcctttaat cgataatccc cgcttaacct caccattctt agtctttcag cctgtatacc
      601 tccgtcgtca gcttaccccg tgagcgaaaa ttagtgagct taatgtccac acgtctacac
      661 gtcaggtcaa ggtgcagcaa atataatggg aagagatggg ctacactttc tagtctagaa
      721 tatacgaaag accacctatg aaacctggtc agaaggcgga tttagaagta aaaggaaacc
      781 agagcatccc ttttaatttg gcactggggc atgtacacac cgcccgtcac cctcttcaaa
      841 gcctaatttt agtatctaac caactaacgc ctagtagaag aggcaagtcg taacatggta
      901 agtataccgg aaggtgtgct tggaaacaaa atatagccta atcaaagcat ttcgcttaca
      961 ccgaaaagtt atctgtgaaa ttcagattat tttgagctaa aaatctagcc ccactttatt
     1021 ctataatccc ttatcactta aattcatgaa tcaaaacatt ttaataatca agtaaaggcg
     1081 attgaaaaat taataggagc aatatatact gtaccgcaag ggaaagatga aatagaaatg
     1141 aaataataat taaagcataa aaaagtaaag attaaatctt gtaccttttg catcatgatt
     1201 taactagtct acccaggcaa aatgatttta agtctgacct cccgaaacta agtgagctac
     1261 ttcaaggcag cttaatgagc aaatccgtct ctgtcgcaaa agagtggaga gaccttcaag
     1321 tagaagtgat agacctaacg aacttagtaa tagctggtta ttcaagaaaa ggatctcagt
     1381 ccaacctaaa gtcaaattaa tgtttaaaaa taaaaattct gaccttagag taattcaatt
     1441 aaggtacagc ctacttgaaa caggatacaa ccttaactaa tgggtaactt accccttcat
     1501 cttttaagtg ggcctaaaag cagccacctt taaaatagcg tcaaagctta gccgtcctat
     1561 acatctaata ccaaaaacat ctatgaaccc tatactcata ttgaataatt ctatattatt
     1621 atagagattt ttatgttaaa actagtaaca agaattaaat tttctctatt atgttcgtgt
     1681 acatcagaaa ggataaacca ctgataattg acatgcatga gtaaaaagca gtaacttaac
     1741 aagaaaaccc tcctaactct aatgttaacc taacacaagt acatctcaag aaagatttaa
     1801 agaaaaagaa ggaactcggc aaacattaac ctcgcctgtt taccaaaaac atcgcctctt
     1861 gtcaaaattt aagaggtcca gcctgcccag tgaccctgtt caacggccgc ggtatcctaa
     1921 ccgtgcgaag gtagcgtaat cacttgttct ttaaataagg actagtatga atggcaccac
     1981 gagggttata ctgtctcctt tttctaatca gtgaaactaa tcttcccgtg aagaagcggg
     2041 aatttttata taagacgaga agaccctatg gagctttaga cgagtaacaa ctgctaattt
     2101 tataatattt cagataatat ctctatccta gcattatgat tataagtctt tggttggggt
     2161 gaccgcggag aaaaaaataa cctccacatt gaaagaatat tattctaagc aaaaagacac
     2221 atctttaagc atcaacaaat tgacatctat tgacccaata ttttgatcaa cgaaccaagt
     2281 taccctaggg ataacagcgc aatccacttc gagagctctt atcgacaagt gggcttacga
     2341 cctcgatgtt ggatcagggt atcctagtgg tgtagccgct actaaaggtt cgtttgttca
     2401 acgattaaaa ccct
//

Notice that there were no mention to the double slash at the end of the file This double slash represents the EOF of a GenBank file. Ideally we need to remove it from the final output, as we need to remove the nucleotide number at the beginning of each line. In order to do that we need to modify our last script.

First we will add a condition on the for loop that will take care of the double slash at the end. Then we need to use a trick to remove the numbers from the lines. We should expect the numbers to be at the beginning of the line, and never in another place. The long way would be to create a regular expression and then replace every number occurrence with an empty string. But Python comes with batteries included, so we will take the short path. We use the lstrip string method that will remove characters from the left part of the string, and set it to remove numbers from 0 to 9 and the extra space before the nucleotides. And our script will like this:

import sys
 
gbfile = open(sys.argv[1], 'r').readlines()
 
sequence = ''
issequence = False
for line in gbfile:
    if issequence == True and not line.find('/') == 0:
        sequence += line.lstrip('0123456789 ')
    elif line.find('ORIGIN') >= 0:
        issequence = True
 
print sequence

Notice that the first if condition was modified to include a condition that we do <b>not</b> find a slash. And at the line where we concatenate the sequence we added a lstrip('0123456789 ') to modify the string on the fly. The final result looks much better now. But we still have a small “problem”: there are spaces between groups of 10 nucleotides and we want to get rid of them. We would be surprised if it was difficult, and as expected it is not. We used here the replace method and we can use it here too. We only need to modify the line that concatenates the sequence, and our final script will be

import sys
 
gbfile = open(sys.argv[1], 'r').readlines()
 
sequence = ''
issequence = False
for line in gbfile:
    if issequence == True and not line.find('/') == 0:
        sequence += line.lstrip('0123456789 ').replace(' ', '')
    elif line.find('ORIGIN') >= 0:
        issequence = True
 
print sequence

Notice that we add the replace method after the lstrip, but it can either way. The output should be only the nucleotides, with no space or numbers.

GenBank: parsing some features

We saw how to extract the sequence from a GenBank file. This time we are going to parse some other information from these files. Basically we will use the same idea of our last post to extract the Organism name, the Locus and the Accession number of the item. From our last entry we have to remember this

sequence = ''
issequence = False
for line in gbfile:
    if issequence == True and not line.find('/') == 0:
        sequence += line.lstrip('0123456789 ').replace(' ', '')
    elif line.find('ORIGIN') >= 0:
        issequence = True

and modify it to our needs. Looks simple, and it is. Let's see

import sys
 
gbfile = open(sys.argv[1], 'r').readlines()
 
locus = ''
organism = ''
accession = ''
for line in gbfile:
    if line.find('LOCUS') >= 0:
        locus = line
    elif line.find('ACCESSION') >= 0:
        accession = line
    elif line.find('ORGANISM') >= 0:
        organism = line
 
print locus.strip()
print organism.strip()
print accession.strip()

Just add a flag for each entry you want to parse and that's it. For longer entries, such as the sequence, we have to use the same approach used before, with a boolean flag and concatenating the lines until another flag is found.

Parsing a GenBank file, another approach

We saw how to extract some simple information contained in this type of files. I worked on a project that required a lot of work in GenBank files, and instead of learning how to use an available parser I decided to write a simple one on my own. The requirements were simple: extract all protein sequences from a whole genome GenBank file and also the DNA from the sequence available in it (this time we will see how to get the proteins and next time the latter).

We all know that GenBank files have an annotation section, with IDs, references, organism and some other information at the top. The DNA sequence at the bottom, after ORIGIN, and the protein sequences are in annotated regions of CDSs. An example is below:

     gene            8330..9475
                     /gene="COS1"
                     /locus_tag="YNL336W"
                     /db_xref="GeneID:855380"
     CDS             8330..9475
                     /gene="COS1"
                     /locus_tag="YNL336W"
                     /experiment="experimental evidence, no additional details
                     recorded"
                     /note="Protein of unknown function, member of the DUP380
                     subfamily of conserved, often subtelomerically-encoded
                     proteins"
                     /codon_start=1
                     /product="Cos1p"
                     /protein_id="NP_014063.1"
                     /db_xref="GI:6323993"
                     /db_xref="SGD:S000005280"
                     /db_xref="GeneID:855380"
                     /translation="MKENELKNEKSVDVLSFKQLESQKIVLPQDLFRSSFTWFCYEIY
                     KSLAFRIWMLLWLPLSVWWKLSNNCIYPLIVSLLVLFLGPIFVLVICGLSRKRSLSKQ
                     LIQFCKEVTENTPSSDPHDWEVVAANLNSYLYENKAWNTKYFFFNAMVCQEAFRTTLL
                     EPFSLKKDEAAKVKSFKDSVPYIEEALGVYFREVEKQWKLFNSEKSWSPVGLEDAKLP
                     KEAYRFKLTWFLKRISNIFMLIPFLNFLCCIYVSRGMCLLLRTFYLGWILFMLVQGFQ
                     NMRMIVLSVKMEHKMQFLSTIINEQESGANGWDEIAKKMNRYLFEKKVWKNEEFFFDG
                     IDCEWFFSHFFYRVLSAKKSMRALSLNVELWPYIKEAQLSCSEESLA"

In a GenBank file each coding gene corresponds to at least one protein. The first line, gene, contains the nucleotide region, as the CDS line. The other lines, starting with a slash provide information about the gene and the last piece of information is the DNA translation with the whole protein sequence. In my case, the first objective was to obtain the protein sequence, get the protein ID and the gi ID from each entry in a genome file.

An initial assessment, tell us that creating a class for each protein in the file is a good start.

class Protein:
    def __init__(self, gi, id, sequence):
        self.gi = gi
        self.id = id
        self.sequence = sequence

So we declare the class and have elements for the IDs and the amino acid sequence. The next problem is how do get each gene/CDS information, extract what we need and store in a list of class instances? This is much like we do with our FASTA entries, reading line by line and looking for FASTA sequence names, which start with >. For GenBank files we will look for lines containing the word gene. Of course this would mean that every gene that has this word in its name will be a problem, so we match against ' gene ' (the word gene flanked by two spaces).

proteins = []
index = 0
entry = ''
for line in gbfile:
    if line.find('  gene ') >= 0:
        if index >= 1:
            #parses the CDS and appends to a list
            proteins.append(parse_entry(entry))
            entry = ''
        index += 1
        entry += line
    elif line.find('ORIGIN') >= 0:
        #found the DNA sequence, we can stop now
        break
    else:
        entry += line

In the above excerpt we read the file line by line, having an index just to keep the code aware of which entry we are at. Every time we find a line with ' gene ', we check for the index and if it is larger than one we parse the entry that has been compiled already and start a new one. If we find 'ORIGIN' we bail out, breaking the loop, and in every other case we add a line to our current entry. When this is sent to the parsing function we clear the string and start compiling the information again. Using a string for an entry might be a second choice, as lists are easier to manipulate after (and eventually we will transform our string in a list by splitting it).

Notice that we have a call to a function

parse_entry

, let's see how this function is

def parse_entry(gene_data):
    prot_id = ''
    sequence = ''
    gi_id = ''
    gene_data = gene_data.split('\n')
    for line in gene_data:
        if line.find('/product') >=0:
            prot_id = line[line.find('=') + 2:-1]
        elif line.find('protein_id') >= 0:
            prot_id += '\t' + line[line.find('=') + 2: -1]
        elif line.find('GI:') >= 0:
            gi_id = 'gi' + line[line.find('GI:')+3:-1]
        elif line.find('/translation') >= 0:
            sequence = line[line.find('=') + 2:]
            temp = gene_data.index(line)
            for i in range(temp+1, len(gene_data)):
                    sequence += gene_data[i].strip()
 
    return Protein(gi_id, prot_id, sequence)

As mentioned above, we split the string in a list (we will check if there is any improvement in speed if we use a list from the beginning) and then analyse each one of the lines. In this case, we only need the protein, the gene name and, evidently, the protein sequence. Dissecting the excerpt above we have

if line.find('/product') >=0:
            prot_id = line[line.find('=') + 2:-1]
    elif line.find('protein_id') >= 0:
            prot_id += '\t' + line[line.find('=') + 2: -1]

This will get us the protein ID from /protein_id=“NP_014063.1” and also the gene name/info from /product=“Cos1p”, which them can be stored in the prot_id string that will be passed to the class instance. The gi ID can be extracted in similar fashion.

In order to get the sequence we have to do a small trick, that can be seen on the excerpt below

elif line.find('/translation') >= 0:
    sequence = line[line.find('=') + 2:]
    temp = gene_data.index(line)
    for i in range(temp+1, len(gene_data)):
        sequence += gene_data[i].strip()

We find the beginning of the translation information by checking the line that contains /translation and we extract the initial part of the sequence my finding the equal sign and getting everything after it. Then the trick. We know that the end of the translation information will be the last line of the whole gene/CDS entry. So we check for the index of the first sequence line and store it in a temp variable. Using this variable as the first value of a range we are able to get every other line of the sequence. That simple.

To do the output, we can just iterate the list of class objects and print all the information in FASTA format or any other that we wish. Put everything together and we have a simple but effective GenBank parser. Please notice that this script wouldn't work in every possible scenario, but with minor tweaks it should be applicable in most cases.

A comparison between Python and C(++): parsing a GenBank file

Last time we saw how to extract all the protein sequences from a GenBank genome file. A program is available in the WU-Blast package that does exactly the same thing, called gt2fasta. Let's compare the performance of our Python script and this compiled C++ (I believe it is C++, the source is not available). I am using Python 2.5 on the same machine I am running the already compiled gt2fasta, so don't worry about specs. The input file is the E. coli genome which was linked in the previous post, and I am timing with the time in Linux. Let's see how Python does (less time, the better):

gt2fasta 0m0.123s Python 0m5.466s

Yep, that's 44 times slower for Python. It is widely known that compiled languages are faster than interpreted ones, but maybe 44 times it is an exaggeration, what makes me think that there are ways to improve out Python script to make it a little bit faster. After next post we will see if this is possible.

Parsing GenBank: getting DNA from genome entries

Before we see how to improve the speed in our Python code (would it be possible?), let's go back to our GenBank parsing algorithm. Last time we saw how to extract the proteins from the CDS entries along the file. This time we will see how to get the DNA from the file. Usually DNA sequence is stored at the bottom of the GenBank file, in a long sequence (if the file is a genome, for instance). Basically, what we need to do is to get the start and end position of each gene (or any annotated piece) and cut the long sequence at the bottom. This will require a couple of changes in our previous script, basically to get the DNA and to cut it.

Previously, we had this to read the file

proteins = []
index = 0
entry = ''
for line in gbfile:
    if line.find('  gene ') >= 0:
        if index >= 1:
            #parses the CDS and appends to a list
            proteins.append(parse_entry(entry))
            entry = ''
        index += 1
        entry += line
    elif line.find('ORIGIN') >= 0:
        #found the DNA sequence, we can stop now
        break
    else:
        entry += line

Notice that we stop when we reach the bottom DNA sequence. We need to go further, but there is another issue: DNA sequence stored in a GenBank file are stored in blocks of nucleotides and the initial position in each line is also there. So we need to join these blocks and remove the number from the beginning of the line. In the end we would have this

is_seq = False
genes = []
for line in gbfile:
    if line.find('  gene ') >= 0:
        if index >= 1:
            genes.append(parse_entry(entry))
            entry = ''
        index += 1
        entry += line
    elif line.find('ORIGIN') >= 0:
        is_seq = True
        genes.append(parse_entry(entry))
    elif is_seq == True:
        line = line.split()
        sequence.append(line)
    else:
        entry += line
 
str_seq = ''
for i in sequence:
    str_seq += ''.join(i[1:]).upper()

We added a couple of things. Mainly a flag boolean variable (is_seq) that tells the program that we have reached “DNA level” and a routine to join the blocks from the second one to the end. sequence is list at first, appending blocks of actual DNA sequence and numbers obtained from the

split

method. We use a “blank” split, what means it will break the string in the spaces. Our join method above joining list elements from the element 1 and forward. Element 0 in this case would be the numbers indicating sequence position.

The function to parse the entries and the class to store the information also need to be changed. Previously, our class was simpler, with less objects. This time we need to extract more information.

class CDSinfo:
    def __init__(self, gi_id, id, start, end, complement):
        self.gi_id = gi_id
        self.id = id
        self.start = start
        self.end = end
        self.complement = complement

We removed the sequence object, but added a start, end and a complement, for the cases where the gene is on the reverse complement. The function to extract the info looks like

def parse_entry(gene_data):
    #changes a string to list, splitting at line ends
    gene_data = gene_data.split('\n')
    start, end = 0, 0
    gi_id = ''
    id = ''
    complement = False
    for line in gene_data:
        if line.find('  CDS  ') >=0:
            temp = line.split()
            if temp[1].find('complement') >= 0:
                complement = True
                temp[1] = temp[1].replace('complement(', '')
                temp[1] = temp[1].replace(')', '')
            temp2 = temp[1].split('..')
            start = temp2[0]
            end = temp2[1]
        elif line.find('GI:') >= 0:
            gi_id = 'gi' + line[line.find('GI:')+3:-1]
        elif line.find('/product') >=0:
            id = line[line.find('=') + 2:-1]
        elif line.find('protein_id') >= 0:
            id += '\t' + line[line.find('=') + 2: -1]
 
    return CDSinfo(gi_id, id, start, end, complement)

First things first. As in the CDS/protein parsing example, the function receives a string and it is easier for us to operate with lists, so we split the string at the carriage returns. This time we also need to check for the start and end positions of the gene in the long string that is the DNA sequence of a GenBank file, so we just declare a couple of integers start and end. We also have a couple of strings to receive the gi ID and the gene product (id). And finally a boolean, that will tell us if the sequence in question is on the DNA complement strand, meaning it is a reverse complement of the DNA that is stored in the file.

So we start our loop checking for the elements of each line. If we find a ' CDS ' we know that we can extract the start and end positions of the gene

if line.find('  CDS  ') >=0:
    temp = line.split()
    if temp[1].find('complement') >= 0:
        complement = True
        temp[1] = temp[1].replace('complement(', '')
        temp[1] = temp[1].replace(')', '')
    temp2 = temp[1].split('..')
    start = temp2[0]
    end = temp2[1]

We use temporary list and split the line at the spaces and then move to get the information from it. If the line is a reverse complement we remove the word from the entry and the parentheses. Whenever the gene annotation relates to the complement strand the entry is as such complement(10728..11330). We then use another temp object and split the second element in the split line and separate it at the double dots. From this latter split, we end up with two elements, the first is the start position and the second the end position of the gene.

To extract the other information everything works as the previous script, basically checking for the presence of certain words in the each line. In the end we return an instance of the class CDSinfo which will be used to extract the DNA sequence.

Both scripts to parse GenBank scripts are below.

First the one that extracts amino acid sequences, and then the latest to extract the DNA.

#! /usr/bin/env python
 
'''
input is a GenBank file. The script searches for gene annotations, extract all lines
from the file and then parses these lines in order to extract protein sequences
Ribosomal genes and other non-coding genes are not extracted - plan to do it later
output is a fasta formatted file    
'''
 
import sys
import fasta
 
class Protein:
    '''
    class that stores protein information
    '''
    def __init__(self, gi, id, sequence):
        self.gi = gi
        self.id = id
        self.sequence = sequence
 
def parse_entry(gene_data):
    '''
    parses the entry received from the main function
    in order to extract information as protein id
    gi, etc
    '''
    prot_id = ''
    sequence = ''
    gi_id = ''
    gene_data = gene_data.splitlines()
    for line in gene_data:
        if line.find('/product') >=0:
            prot_id = line[line.find('=') + 2:-1]
        elif line.find('protein_id') >= 0:
            prot_id += '\t' + line[line.find('=') + 2: -1]
        elif line.find('GI:') >= 0:
            gi_id = 'gi' + line[line.find('GI:')+3:-1]
        elif line.find('/translation') >= 0:
            sequence = line[line.find('=') + 2:]
            temp = gene_data.index(line)
            for i in range(temp+1, len(gene_data)):
                if gene_data[i].find('sig_peptide') >= 0:
                    break
                else:
                    sequence += gene_data[i].strip()
 
    return Protein(gi_id, prot_id, sequence)
 
#only input is a genbank file
gbfile = open(sys.argv[1])
 
proteins = []
index = 0
entry = ''
for line in gbfile:
    if line.find('  gene ') >= 0:
        if index >= 1:
            #parses the CDS and appends to a list
            proteins.append(parse_entry(entry))
            entry = ''
        index += 1
        entry += line
    elif line.find('ORIGIN') >= 0:
        #found the DNA sequence, we can stop now
        break
    else:
        entry += line
 
#parses the last entry after leaving the loop
proteins.append(parse_entry(entry))
 
#output
for i in proteins:
    if len(i.gi) > 2:
        print i.gi, i.id
        output = open(i.gi + '.fasta', 'w')
        output.write('>' + i.gi + '\t' + i.id + '\n')
        i.sequence = i.sequence.replace('\"', '')
        output.write(fasta.format_output(i.sequence, 80))
        print i.id
#! /usr/bin/env python
 
'''
script that extracts sequences from a GenBank file. Script reads the gene CDS from the file and
builds a list of start and end positions, and if gene is complement outputs the 5'3' sequence and
its reverse complement
only input is a GenBank file
outputs a fasta file with the GI ID as its name.
'''
 
import sys
import fasta
 
class CDSinfo:
    '''
    CDSinf class to store all the information from CDS
    '''
    def __init__(self, gi_id, id, start, end, complement):
        self.gi_id = gi_id
        self.id = id
        self.start = start
        self.end = end
        self.complement = complement
 
def parse_entry(gene_data):
    '''
    each CDS entry is obtained in the main function and a string of lines with
    information is passed to parse_entry to be parsed and have information extracted
    '''
 
    gene_data = gene_data.splitlines() #changes a string to list, splitting at line ends
    start, end = 0, 0
    gi_id = ''
    id = ''
    complement = False
    for line in gene_data: #searches for regions annotated as CDS
        if line.find('  CDS  ') >=0:
            temp = line.split()
            #checks for complement sequence, if true remove extra characters
            if temp[1].find('complement') >= 0:
                complement = True
                temp[1] = temp[1].replace('complement(', '')
                temp[1] = temp[1].replace(')', '')
            temp2 = temp[1].split('..')
            start = temp2[0]
            end = temp2[1]
        #checks for GI IDs 
        elif line.find('GI:') >= 0:
            gi_id = 'gi' + line[line.find('GI:')+3:-1]
        #get the gene name/function
        elif line.find('/product') >=0:
            id = line[line.find('=') + 2:-1]
        #and adds the protein id
        elif line.find('protein_id') >= 0:
            id += '\t' + line[line.find('=') + 2: -1]
 
    return CDSinfo(gi_id, id, start, end, complement)
 
#only input is the genbank file with annotation and sequence
gbfile = open(sys.argv[1])
 
index = 0
entry = ''
sequence = []
is_seq = False
 
genes = []
for line in gbfile:
    #reads the genbank file and whenever finds a gene annotation 
    #concatenate the lines up to the next gene
    if line.find('  gene ') >= 0:
        #if an entry is complete, send it to parse
        if index >= 1:
            #appends to a list of CDSinfo objects
            genes.append(parse_entry(entry))
            entry = ''
        index += 1
        entry += line
    elif line.find('ORIGIN') >= 0:
        #found sequence start, set the flag on and parses the last entry
        is_seq = True
        genes.append(parse_entry(entry))
    elif is_seq == True:
        #if flag is true keep going, usually sequences are store at the end of the file
        line = line.split()
        sequence.append(line)
    else:
        #this is an entry so append line
        entry += line
 
str_seq = ''
#make the sequence a string
for i in sequence:
    str_seq += ''.join(i[1:]).upper()
 
for i in genes:
    if len(i.gi_id) > 2:
        print i.id, i.start, i.end
        output = open(i.gi_id + '.DNA.fasta', 'w')
        output.write('>' + i.gi_id + '\t' + i.id + '\n')
        # if this is a complement, print both 5'-3' and reverse complement sequences
        if i.complement == True:
            output.write(fasta.format_output(fasta.invert(str_seq[int(i.start)-1:int(i.end)]), 80) + '\n')
        else:
            if not i.start.find('join') >= 0:
                output.write(fasta.format_output(str_seq[int(i.start)-1:int(i.end)], 80))

Checking the DNA extract script carefully, we notice that we have an extra function in the fasta module, invert.

 
part7.txt · Last modified: 2009/05/20 11:34 by nuin
 
Except where otherwise noted, content on this wiki is licensed under the following license:CC Attribution-Noncommercial-Share Alike 3.0 Unported
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki