Differences

This shows you the differences between two versions of the page.

part7 [2009/05/14 12:39]
nuin
part7 [2009/05/20 11:34] (current)
nuin
Line 358: Line 358:
Yep, that's 44 times slower for Python. It is widely known that compiled languages are faster than interpreted ones, but maybe 44 times it is an exaggeration, what makes me think that there are ways to improve out Python script to make it a little bit faster. After next post we will see if this is possible. Yep, that's 44 times slower for Python. It is widely known that compiled languages are faster than interpreted ones, but maybe 44 times it is an exaggeration, what makes me think that there are ways to improve out Python script to make it a little bit faster. After next post we will see if this is possible.
 +
 +
 +
 +==== Parsing GenBank: getting DNA from genome entries ====
 +
 +
 +
 +Before we see how to improve the speed in our Python code (would it be possible?), let's go back to our GenBank parsing algorithm. Last time we saw how to extract the proteins from the CDS entries along the file. This time we will see how to get the DNA from the file. Usually DNA sequence is stored at the bottom of the GenBank file, in a long sequence (if the file is a genome, for instance). Basically, what we need to do is to get the start and end position of each gene (or any annotated piece) and cut the long sequence at the bottom. This will require a couple of changes in our previous script,  basically to get the DNA and to cut it.
 +
 +Previously, we had this to read the file
 +
 +<code python>proteins = []
 +index = 0
 +entry = ''
 +for line in gbfile:
 +    if line.find('  gene ') >= 0:
 +        if index >= 1:
 +            #parses the CDS and appends to a list
 +            proteins.append(parse_entry(entry))
 +            entry = ''
 +        index += 1
 +        entry += line
 +    elif line.find('ORIGIN') >= 0:
 +        #found the DNA sequence, we can stop now
 +        break
 +    else:
 +        entry += line</code>
 +
 +Notice that we stop when we reach the bottom DNA sequence. We need to go further, but there is another issue: DNA sequence stored in a GenBank file are stored in blocks of nucleotides and the initial position in each line is also there. So we need to join these blocks and remove the number from the beginning of the line. In the end we would have this
 +
 +<code python>is_seq = False
 +genes = []
 +for line in gbfile:
 +    if line.find('  gene ') >= 0:
 +        if index >= 1:
 +            genes.append(parse_entry(entry))
 +            entry = ''
 +        index += 1
 +        entry += line
 +    elif line.find('ORIGIN') >= 0:
 +        is_seq = True
 +        genes.append(parse_entry(entry))
 +    elif is_seq == True:
 +        line = line.split()
 +        sequence.append(line)
 +    else:
 +        entry += line
 +
 +str_seq = ''
 +for i in sequence:
 +    str_seq += ''.join(i[1:]).upper()</code>
 +
 +We added a couple of things. Mainly a flag boolean variable (//is_seq//) that tells the program that we have reached "DNA level" and a routine to join the blocks from the second one to the end. //sequence// is list at first, appending blocks of actual DNA sequence and numbers obtained from the <code>split</code> method. We use a "blank" //split//, what means it will break the string in the spaces. Our //join// method above joining list elements from the element 1 and forward. Element 0 in this case would be the numbers indicating sequence position.
 +
 +The function to parse the entries and the class to store the information also need to be changed. Previously, our class was simpler, with less objects. This time we need to extract more information.
 +
 +<code python>class CDSinfo:
 +    def __init__(self, gi_id, id, start, end, complement):
 +        self.gi_id = gi_id
 +        self.id = id
 +        self.start = start
 +        self.end = end
 +        self.complement = complement</code>
 +
 +We removed the sequence object, but added a start, end and a complement, for the cases where the gene is on the reverse complement. The function to extract the info looks like
 +
 +<code python>def parse_entry(gene_data):
 +    #changes a string to list, splitting at line ends
 +    gene_data = gene_data.split('\n')
 +    start, end = 0, 0
 +    gi_id = ''
 +    id = ''
 +    complement = False
 +    for line in gene_data:
 +        if line.find('  CDS  ') >=0:
 +            temp = line.split()
 +            if temp[1].find('complement') >= 0:
 +                complement = True
 +                temp[1] = temp[1].replace('complement(', '')
 +                temp[1] = temp[1].replace(')', '')
 +            temp2 = temp[1].split('..')
 +            start = temp2[0]
 +            end = temp2[1]
 +        elif line.find('GI:') >= 0:
 +            gi_id = 'gi' + line[line.find('GI:')+3:-1]
 +        elif line.find('/product') >=0:
 +            id = line[line.find('=') + 2:-1]
 +        elif line.find('protein_id') >= 0:
 +            id += '\t' + line[line.find('=') + 2: -1]
 +
 +    return CDSinfo(gi_id, id, start, end, complement)</code>
 +
 +
 +First things first. As in the CDS/protein parsing example, the function receives a string and it is easier for us to operate with lists, so we split the string at the carriage returns. This time we also need to check for the start and end positions of the gene in the long string that is the DNA sequence of a GenBank file, so we just declare a couple of integers //start// and //end//. We also have a couple of strings to receive the gi ID and the gene product (//id//). And finally a boolean, that will tell us if the sequence in question is on the DNA complement strand, meaning it is a reverse complement of the DNA that is stored in the file.
 +
 +So we start our loop checking for the elements of each line. If we find a //'  CDS  '// we know that we can extract the start and end positions of the gene
 +
 +<code python>if line.find('  CDS  ') >=0:
 +    temp = line.split()
 +    if temp[1].find('complement') >= 0:
 +        complement = True
 +        temp[1] = temp[1].replace('complement(', '')
 +        temp[1] = temp[1].replace(')', '')
 +    temp2 = temp[1].split('..')
 +    start = temp2[0]
 +    end = temp2[1]</code>
 +
 +We use temporary list and split the line at the spaces and then move to get the information from it. If the line is a reverse complement we remove the word from the entry and the parentheses. Whenever the gene annotation relates to the complement strand the entry is as such //complement(10728..11330)//. We then use another temp object and split the second element in the split line and separate it at the double dots. From this latter split, we end up with two elements, the first is the start position and the second the end position of the gene.
 +
 +To extract the other information everything works as the previous script, basically checking for the presence of certain words in the each line. In the end we return an instance of the class //CDSinfo// which will be used to extract the DNA sequence.
 +
 +
 +Both scripts to parse GenBank scripts are below.
 +
 +First the one that extracts amino acid sequences, and then the latest to extract the DNA.
 +
 +
 +<code python>#! /usr/bin/env python
 +
 +'''
 +input is a GenBank file. The script searches for gene annotations, extract all lines
 +from the file and then parses these lines in order to extract protein sequences
 +Ribosomal genes and other non-coding genes are not extracted - plan to do it later
 +output is a fasta formatted file    
 +'''
 +
 +import sys
 +import fasta
 +
 +class Protein:
 +    '''
 +    class that stores protein information
 +    '''
 +    def __init__(self, gi, id, sequence):
 +        self.gi = gi
 +        self.id = id
 +        self.sequence = sequence
 +
 +def parse_entry(gene_data):
 +    '''
 +    parses the entry received from the main function
 +    in order to extract information as protein id
 +    gi, etc
 +    '''
 +    prot_id = ''
 +    sequence = ''
 +    gi_id = ''
 +    gene_data = gene_data.splitlines()
 +    for line in gene_data:
 +        if line.find('/product') >=0:
 +            prot_id = line[line.find('=') + 2:-1]
 +        elif line.find('protein_id') >= 0:
 +            prot_id += '\t' + line[line.find('=') + 2: -1]
 +        elif line.find('GI:') >= 0:
 +            gi_id = 'gi' + line[line.find('GI:')+3:-1]
 +        elif line.find('/translation') >= 0:
 +            sequence = line[line.find('=') + 2:]
 +            temp = gene_data.index(line)
 +            for i in range(temp+1, len(gene_data)):
 +                if gene_data[i].find('sig_peptide') >= 0:
 +                    break
 +                else:
 +                    sequence += gene_data[i].strip()
 +
 +    return Protein(gi_id, prot_id, sequence)
 +
 +#only input is a genbank file
 +gbfile = open(sys.argv[1])
 +
 +proteins = []
 +index = 0
 +entry = ''
 +for line in gbfile:
 +    if line.find('  gene ') >= 0:
 +        if index >= 1:
 +            #parses the CDS and appends to a list
 +            proteins.append(parse_entry(entry))
 +            entry = ''
 +        index += 1
 +        entry += line
 +    elif line.find('ORIGIN') >= 0:
 +        #found the DNA sequence, we can stop now
 +        break
 +    else:
 +        entry += line
 +
 +#parses the last entry after leaving the loop
 +proteins.append(parse_entry(entry))
 +
 +#output
 +for i in proteins:
 +    if len(i.gi) > 2:
 +        print i.gi, i.id
 +        output = open(i.gi + '.fasta', 'w')
 +        output.write('>' + i.gi + '\t' + i.id + '\n')
 +        i.sequence = i.sequence.replace('\"', '')
 +        output.write(fasta.format_output(i.sequence, 80))
 +        print i.id</code>
 +
 +
 +<code python>#! /usr/bin/env python
 +
 +'''
 +script that extracts sequences from a GenBank file. Script reads the gene CDS from the file and
 +builds a list of start and end positions, and if gene is complement outputs the 5'3' sequence and
 +its reverse complement
 +only input is a GenBank file
 +outputs a fasta file with the GI ID as its name.
 +'''
 +
 +import sys
 +import fasta
 +
 +class CDSinfo:
 +    '''
 +    CDSinf class to store all the information from CDS
 +    '''
 +    def __init__(self, gi_id, id, start, end, complement):
 +        self.gi_id = gi_id
 +        self.id = id
 +        self.start = start
 +        self.end = end
 +        self.complement = complement
 +
 +def parse_entry(gene_data):
 +    '''
 +    each CDS entry is obtained in the main function and a string of lines with
 +    information is passed to parse_entry to be parsed and have information extracted
 +    '''
 +
 +    gene_data = gene_data.splitlines() #changes a string to list, splitting at line ends
 +    start, end = 0, 0
 +    gi_id = ''
 +    id = ''
 +    complement = False
 +    for line in gene_data: #searches for regions annotated as CDS
 +        if line.find('  CDS  ') >=0:
 +            temp = line.split()
 +            #checks for complement sequence, if true remove extra characters
 +            if temp[1].find('complement') >= 0:
 +                complement = True
 +                temp[1] = temp[1].replace('complement(', '')
 +                temp[1] = temp[1].replace(')', '')
 +            temp2 = temp[1].split('..')
 +            start = temp2[0]
 +            end = temp2[1]
 +        #checks for GI IDs
 +        elif line.find('GI:') >= 0:
 +            gi_id = 'gi' + line[line.find('GI:')+3:-1]
 +        #get the gene name/function
 +        elif line.find('/product') >=0:
 +            id = line[line.find('=') + 2:-1]
 +        #and adds the protein id
 +        elif line.find('protein_id') >= 0:
 +            id += '\t' + line[line.find('=') + 2: -1]
 +
 +    return CDSinfo(gi_id, id, start, end, complement)
 +
 +#only input is the genbank file with annotation and sequence
 +gbfile = open(sys.argv[1])
 +
 +index = 0
 +entry = ''
 +sequence = []
 +is_seq = False
 +
 +genes = []
 +for line in gbfile:
 +    #reads the genbank file and whenever finds a gene annotation
 +    #concatenate the lines up to the next gene
 +    if line.find('  gene ') >= 0:
 +        #if an entry is complete, send it to parse
 +        if index >= 1:
 +            #appends to a list of CDSinfo objects
 +            genes.append(parse_entry(entry))
 +            entry = ''
 +        index += 1
 +        entry += line
 +    elif line.find('ORIGIN') >= 0:
 +        #found sequence start, set the flag on and parses the last entry
 +        is_seq = True
 +        genes.append(parse_entry(entry))
 +    elif is_seq == True:
 +        #if flag is true keep going, usually sequences are store at the end of the file
 +        line = line.split()
 +        sequence.append(line)
 +    else:
 +        #this is an entry so append line
 +        entry += line
 +
 +str_seq = ''
 +#make the sequence a string
 +for i in sequence:
 +    str_seq += ''.join(i[1:]).upper()
 +
 +for i in genes:
 +    if len(i.gi_id) > 2:
 +        print i.id, i.start, i.end
 +        output = open(i.gi_id + '.DNA.fasta', 'w')
 +        output.write('>' + i.gi_id + '\t' + i.id + '\n')
 +        # if this is a complement, print both 5'-3' and reverse complement sequences
 +        if i.complement == True:
 +            output.write(fasta.format_output(fasta.invert(str_seq[int(i.start)-1:int(i.end)]), 80) + '\n')
 +        else:
 +            if not i.start.find('join') >= 0:
 +                output.write(fasta.format_output(str_seq[int(i.start)-1:int(i.end)], 80))</code>
 +
 +Checking the DNA extract script carefully, we notice that we have an extra function in the //fasta// module, //invert//.
 
part7.txt · Last modified: 2009/05/20 11:34 by nuin
 
Except where otherwise noted, content on this wiki is licensed under the following license:CC Attribution-Noncommercial-Share Alike 3.0 Unported
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki