Differences

This shows you the differences between two versions of the page.

part10 [2009/05/14 12:13]
nuin created
part10 [2009/05/14 12:14] (current)
nuin
Line 4: Line 4:
Last couple of posts we started with some functional programming aspects of Python. I was away last week and couldn't create anything related to FP in the meantime, so I decided to post about a quick way to "cut", or extract segments, from chromosomes stored as FASTA files. Last couple of posts we started with some functional programming aspects of Python. I was away last week and couldn't create anything related to FP in the meantime, so I decided to post about a quick way to "cut", or extract segments, from chromosomes stored as FASTA files.
-This is a subject that I have been investigating for some time, as I needed a tool that would get some segment from a chromosome for further analysis. I have a short C++ code that does the trick and is really fast, but I wanted to find a Pythonic way of doing it, and quick. I have been trying different methods but no one were fast. Chromosomes are usually large, and stored in large files, so reading line by line and using <code>readlines</code> were extremely slow. Inspired by a [[http://www.goldb.org/goldblog/2007/11/07/PythonProcessingLargeTextFilesOneLineAtATime.aspx | post]] from Corey Goldberg, I decide to test the method explained there, with wonderful results. +This is a subject that I have been investigating for some time, as I needed a tool that would get some segment from a chromosome for further analysis. I have a short C++ code that does the trick and is really fast, but I wanted to find a Pythonic way of doing it, and quick. I have been trying different methods but no one were fast. Chromosomes are usually large, and stored in large files, so reading line by line and using //readlines// were extremely slow. Inspired by a [[http://www.goldb.org/goldblog/2007/11/07/PythonProcessingLargeTextFilesOneLineAtATime.aspx | post]] from Corey Goldberg, I decide to test the method explained there, with wonderful results.
Basically what we need to do is to use an iterator protocol on the file object Basically what we need to do is to use an iterator protocol on the file object
Line 32: Line 32:
print name, segment</code> print name, segment</code>
-We start getting the parameters from the command line, assigning a couple of variables and we start the loop. We open the file and use the file object iterator to check each line. We check each line for a FASTA sequence title. If found we store the title in the variable //name//, and if not we check the size of the line and increment the variable that will guide us on the file position. We use <code>size</code> as a line by line increment and if it is between //start// and //end// we store the read line into the //segment// variable. That's it.+We start getting the parameters from the command line, assigning a couple of variables and we start the loop. We open the file and use the file object iterator to check each line. We check each line for a FASTA sequence title. If found we store the title in the variable //name//, and if not we check the size of the line and increment the variable that will guide us on the file position. We use //size// as a line by line increment and if it is between //start// and //end// we store the read line into the //segment// variable. That's it.
A few drawbacks with this script: it will fail on FASTA files where the sequence is stored in one line, there is no precision to get the actual start and end points and it can only run on unique sequence files. On the next post we will address some of these concerns. A few drawbacks with this script: it will fail on FASTA files where the sequence is stored in one line, there is no precision to get the actual start and end points and it can only run on unique sequence files. On the next post we will address some of these concerns.
Line 39: Line 39:
Our last code had no precision in the cutting/extracting, because it didn't take in account the line size so it would extract more information than actually requested. Our last code had no precision in the cutting/extracting, because it didn't take in account the line size so it would extract more information than actually requested.
-We need to address precision them. First take into account the size of each line, the number of nucleotides on each line. Then use this this information to remove unwanted bases. Most of the changes we need to make on our script is outside the main loop that cuts/extracts information. We need to modify the post-processing of the sequence we nust read from the chromosome. Also, instead of storing the sequence in a <code>string</code> it is easier to use a //list// (in fact it is just a matter of personal taste, and maybe a little be of being Pythonic). In the end our script would look like+We need to address precision them. First take into account the size of each line, the number of nucleotides on each line. Then use this this information to remove unwanted bases. Most of the changes we need to make on our script is outside the main loop that cuts/extracts information. We need to modify the post-processing of the sequence we nust read from the chromosome. Also, instead of storing the sequence in a //string// it is easier to use a //list// (in fact it is just a matter of personal taste, and maybe a little be of being Pythonic). In the end our script would look like
<code python>#! /usr/bin/env python <code python>#! /usr/bin/env python
Line 81: Line 81:
After this checking we need to add some processing conditionals to make sure we are returning the exact segment requested. The % operator in Python return the remainder of a division, even if the two values are integers. We need to check that the entered starting and ending positions are or aren't multiple of the line size. If one or both are multiples we don't need to process the segment. On the other hand some processing is needed. Clearly we only need to cut extra nucleotides from the first and last returned lines. That's why the three clauses: one for both values not being multiple of line size, one for just the start and another for the end position. After this checking we need to add some processing conditionals to make sure we are returning the exact segment requested. The % operator in Python return the remainder of a division, even if the two values are integers. We need to check that the entered starting and ending positions are or aren't multiple of the line size. If one or both are multiples we don't need to process the segment. On the other hand some processing is needed. Clearly we only need to cut extra nucleotides from the first and last returned lines. That's why the three clauses: one for both values not being multiple of line size, one for just the start and another for the end position.
-And to remove unwanted nucleotides we basically multiply the number of the starting/ending line by the size of each line and subtract the actual start/end position. This should give us the number of unwanted nucleotides in those two lines. And the fact that we stored the segment as a <code>list</code> makes our life easier as we access position 0 and -1 in our list.+And to remove unwanted nucleotides we basically multiply the number of the starting/ending line by the size of each line and subtract the actual start/end position. This should give us the number of unwanted nucleotides in those two lines. And the fact that we stored the segment as a //list// makes our life easier as we access position 0 and -1 in our list.
On the final output, we print the name and join the list elements by adding a new line between them. On the final output, we print the name and join the list elements by adding a new line between them.
 
part10.txt · Last modified: 2009/05/14 12:14 by nuin
 
Except where otherwise noted, content on this wiki is licensed under the following license:CC Attribution-Noncommercial-Share Alike 3.0 Unported
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki