==== Splitting a multiple sequence FASTA file ====


We will see the easiest way to split a FASTA file. There are many ways to name each file after the split; we will use a simple //original filename// + number scheme. We are also going to use the FASTA class and functions defined previously in [[http://python.genedrift.org/code/fasta.py | fasta.py]]. To refresh our memories, here is the code:

<code python>class Fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

def read_fasta(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(">"):
            if index >= 1:
                items.append(aninstance)
            index += 1
            name = line[:-1]
            seq = ''
            aninstance = Fasta(name, seq)
        else:
            seq += line[:-1]
            aninstance = Fasta(name, seq)

    items.append(aninstance)
    return items</code>
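
As a quick reminder of how the module is used, here is a minimal sketch from the interactive console (assuming a file named test.fa in the current directory whose first header line is >sequence1):

<code python>>>> import fasta
>>> seqs = fasta.read_fasta(open('test.fa').readlines())
>>> seqs[0].name
'>sequence1'</code>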

So, basically we need to create a script that reads the file and then, in a loop, saves every sequence in a separate file. For the sake of simplicity all saved files will go in the same directory. On to the script:

<code python>#!/usr/bin/env python

import sys
import fasta

#define a variable with the input filename
file = sys.argv[1]
#open the file and read the sequences in one line
sequences = fasta.read_fasta(open(file, 'r').readlines())

#counter to number the sequences; the number is
#appended to the output filenames
count = 1
for i in sequences:
    #open a file to output
    output = open(file + str(count), 'w')
    #write the name and sequence
    output.write(i.name + '\n')
    output.write(i.sequence)
    output.close()
    #increment the counter
    count += 1</code>

A very simple but useful script. In the end we will have a set of files, each containing a single sequence, named after the original file with a number appended at the end.
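
For example, assuming the script above is saved as split.py (a hypothetical name) and that mysequences.fa contains three sequences, a run would look like this:

<code>python split.py mysequences.fa</code>

leaving mysequences.fa1, mysequences.fa2 and mysequences.fa3 in the current directory.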


==== Splitting a multiple sequence FASTA file, making it better ====


We saw how to split a multiple sequence FASTA file with our previous FASTA module. The previous script saved one sequence per file, with a filename identical to the original file except for a number added at the end. If we had the file mysequences.fa, our script would save mysequences.fa1, mysequences.fa2, ..., mysequences.fa//n//.

This is fine, but we can make it better. Ideally we should ask the user for a prefix or a filename for the output. It would also be an improvement to put the number before the file extension (the "tag"). We can do that in a couple of lines. First we need to separate the tag from the actual filename. We could go the longer route and use the //find// method to locate the dot and then slice the string into name and tag (a sketch of that route follows below), but we can simply use //split//:


<code python>file = sys.argv[1]
temp = file.split('.')
filename = temp[0]
tag = temp[1]</code>
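
For comparison, the longer //find//-based route mentioned above could look like this minimal sketch (assuming the filename contains a single dot):

<code python>file = sys.argv[1]
#find the position of the dot and slice the string around it
dot = file.find('.')
filename = file[:dot]
tag = file[dot + 1:]</code>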

With //split// we cut the original string at the dot: the first item of the returned list is the filename and the second is the tag. Now we just need to change the output code a little and we are done. The final script is:

<code python>#!/usr/bin/env python

import sys
import fasta

file = sys.argv[1]
temp = file.split('.')
filename = temp[0]
tag = temp[1]

sequences = fasta.read_fasta(open(file, 'r').readlines())

count = 1
for i in sequences:
    #this is the only different line
    output = open(filename + '_' + str(count) + '.' + tag, 'w')
    output.write(i.name + '\n')
    output.write(i.sequence)
    output.close()
    count += 1</code>
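
Running it on the same hypothetical mysequences.fa as before would now produce mysequences_1.fa, mysequences_2.fa and mysequences_3.fa:

<code>python split.py mysequences.fa</code>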


==== Splitting multiple FASTA files, or why shouldn't we reinvent the wheel? ====


[[eridanus.net/blog | Daniel Swan]] posted a comment on the previous entry saying that splitting multiple sequence FASTA files is "one of those ‘bioinformatics’ tasks where people are seriously guilty of reinventing the wheel".

Biologically speaking, let's dissect his comment. As there are not many comments on this blog, we take advantage of the few that are posted.

I don't disagree with the fact that many applications in bioinformatics reinvent the wheel, but at the same time this blog is not meant to teach anyone advanced bioinformatics development. Here we deal with Python, and I try to use real life examples to show how powerful (or not, depending on your point of view) Python is. The FASTA-splitting example is not the first one that "reinvents the wheel". Who needs a new GenBank parser? Or a restriction site finder? Or a new class to read FASTA files? Basically, who needs Python if we can do everything with Perl or C++? What the heck, let's just learn Assembler.

Diversity, that's the first rule here. Perl mongers usually say that there are many ways to do the same thing, and that's very close to the truth. What would become of blue if everyone liked green?

At the same time he presents us with an alternative for the job, which is "‘csplit’", and this "should probably be your first port of call for context based splitting of files!". Yes, I know csplit. I also know dozens of other methods (see the next post) to do the same thing. But maybe he didn't notice the title of the site: Beginning **Python** for Bioinformatics. If the main purpose of the blog were to show *nix commands, I would have titled it Beginning *nix Commands for Bioinformatics.

We also learn from his comment that csplit is "far more widely applicable than for just this example." I know that in bioinformatics we are in a microuniverse with its own idiosyncrasies. Because we live in this bioinfo universe we might think that every other person in the world has immediate access to a *nix terminal. Biologists //don't use// Windows, right? Because of that, I cannot disagree more that csplit is far more widely applicable than any other method. Again, diversity. Not every system is equal, not every solution is widely applicable.

And lastly, he tells us "Remember - if it seems like a logical operation and you are in the ‘why hasn’t anyone made a utility to do this?’ frame of mind - look around because they probably have!". Oh, I do look around. And I see the diversity of systems, applications and programming languages. I even see diversity among biologists and bioinformaticians.

Again, what I try to emphasize with this post is diversity. Since the beginning my main focus has been to show a Pythonic view of basic bioinformatics development. I am probably reinventing the wheel every week, in every post. For those who are learning Python or bioinformatics, it is a seed. For those who are already established bioinformaticians, it might be a new option on how to accomplish things. Diversity is the key.

Out there it is evolve or perish. In (bio)informatics that's also true.


==== Alternative methods to split a FASTA file ====


As Daniel didn't enlighten us on how to use csplit, I am posting several ways to split a multiple sequence FASTA file. This post gets out of our focus (if you haven't noticed, our focus here is **Python**) and maybe suffers from the NIH effect: not invented here. We will be back to our normal programming after this.

We start with our "widely" available option **csplit**:

<code>csplit -kf seq file '%^>%' '/^>/' '{*}'</code>

where //k// tells the program to keep files in case of an error, //f// sets a prefix for the output files (otherwise xx00 is used) and two regex patterns are used: the one between %s is the pattern to skip up to, and the one between /s is the pattern to copy up to, but not including the matching line. In {} is the number of times we want the previous pattern repeated, a * meaning as many times as possible.
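
As a concrete example, assuming a file seqs.fa containing three sequences, the command should leave one sequence per output file (the exact numbering scheme may vary with the csplit version):

<code>csplit -kf seq seqs.fa '%^>%' '/^>/' '{*}'
ls seq*
seq00  seq01  seq02</code>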

Easy, eh? Now we can check how Perl fares. I am not a Perl monger, so I googled it and found a one-liner. Quoting the original [[http://www.softpanorama.org/Scripting/Perlorama/perl_in_command_line.shtml| site]]:

Split a multi-sequence FastA file into individual files named after their description lines.

<code perl>perl -ne 'BEGIN{$/=">"}if(/^\s*(\S+)/){open(F,">$1")||warn"$1 write failed:$!\n";chomp;print F ">", $_}'</code>


I am not daring to explain this one, but cariaso did, and he also sent a better version of it:
 +
 +<code perl>
 +# This magic variable makes perl read lines
 +# that end with >
 +# instead of a newline n
 +$/=">";
 + 
 +while (<>) {           # foreach line in the input files
 +     if (/^>(w+)/) {      # grab the first word of text
 +         open(F,">$1") ||  # open a file named that word
 +              warn "$1 write failed:$!n";
 +         chomp;            # strip off the > at the end
 +         print F ">", $_;  # print >text to the file
 +     }
 + }</code>

You can also do it in Python, without any OO programming, in 5 lines:

<code python>file = open('myfile').readlines()
str = ''.join(file)
temp = str.split('>')
for i in temp:
    print '>' + i</code>

This can be run with no difficulties from the interactive console. cariaso corrected and improved it, reading the first //>// separately so that the split does not produce an empty first record:

<code python>fileobj = open("myfile.fasta")
ignore  = fileobj.read(1)
text    = fileobj.read()
records = text.split(">")
for i in records:
    print '>' + i</code>

along with the offer of a nicer approach:

<code python>def eachfasta(fileobj):
    sofar = fileobj.readline()
    for line in fileobj:
        if '>' == line[0]:
            yield sofar
            sofar = line
        else:
            sofar += line
    yield sofar

fileobj = open("myfile.fasta")

for i in eachfasta(fileobj):
    print i</code>
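
Since the point of the exercise is splitting into separate files, a minimal sketch (not part of the original comment, just assuming the //eachfasta// generator above) that writes each record to its own numbered file could be:

<code python>count = 1
for record in eachfasta(open("myfile.fasta")):
    #each record already contains the header line and the sequence lines
    out = open("myfile." + str(count), 'w')
    out.write(record)
    out.close()
    count += 1</code>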

Joe also wanted to dip in and sent his approach:

<code python># f6smod.py 10-26-07 jto
# purpose: modify program by Michael Cariaso, given in Paolo Nuin's blog:
# http://python.genedrift.org

import sys
fileobj = open('fasta.seq')
ignore = fileobj.read(1)
# The above line removes the first char from the file object.
# It can also be used to check that the first char is a '>'
if ignore != '>':
    print "The first character in the supposed FASTA file is: " + ignore
    print "not '>', so sys.exit() is being invoked."
    sys.exit()
text = fileobj.read()
records = text.split('>')

# Here, rather than use the for loop to just print out the sequences, it
# is used to store them in a list. After that they can be printed out,
# stored in separate files, or further split into header line and sequence
# (using the newline at the end of the header line).
seqlist = []
listcount = 0

# store each header/sequence in a list
for i in records:
    i = '>' + i
    seqlist.append(i)
    listcount += 1

# print the list
for seq in seqlist:
    print seq

# Split into header line and sequence, and make the sequence a single string.
for seq in seqlist:
    splitCR = seq.split('\n')
    print "header: " + splitCR[0]
    sequence = ''.join(splitCR[1:])
    print "sequence: "
    print sequence</code>

OK, so that's six more options available. I am pretty sure sed and awk would also be able to do it, glued together by a bash script, but I am far from being an expert in sed or awk.



==== Splitting a FASTA file using awk (no sed required), or do we care about csplit? ====


We saw that "[[http://eridanus.net/ | top notch bioinformaticians]]" use csplit to split FASTA files, so I decided to [[http://python.genedrift.org/2007/10/10/alternative-methods-to-split-a-fasta-file/|post as many alternatives as possible]] to split these files. Like csplit, awk is something found more often on Linux machines than on Windows, but it can be installed on Windows (even Vista) and it runs fine. Awk is usually presented alongside sed, as both utilities have text processing as their main objective.

I browsed the web searching for a sed/awk solution to split FASTA files, something that I thought could be done in one line. And I found [[http://www.unix.com/shell-programming-scripting/14576-split-file-using-awk.html | it]]. With a little tweak I ended up with a one-liner using awk that splits the sequences into multiple files:

<code>awk '/^>/{f="fasta."++d} {print > f}' <infile></code>

Awk breaks each line of its input into fields that can be checked, printed or ignored. In the command line above, awk's commands are between the single quotes. The first command is a regex that matches an initial //>//; inside the following curly braces we assign a variable //f// to store the output filename we want, using an integer that is incremented every time we pass through it. The second pair of curly braces holds the actual output, where we use awk's //print// command and redirect the output to a file named after the variable we created in the first block. Finally, we give an input file, outside the region delimited by the single quotes. And that's it. As output we will have a series of files called fasta.#, each containing one sequence.



==== Merging single (or multiple) sequence FASTA files ====


We saw how to split a multiple sequence FASTA file. How can we achieve the opposite, merging single (or sometimes multiple) sequence files into one larger multiple sequence file? In Python it is very simple, but we need to introduce a new module to get the desired result.

In most scripts we have seen in previous posts we end up reading a single file from the input (in some cases two), and that's not enough when merging multiple files. Python has a module called //glob// that is very handy for reading the contents of a directory. We can use pattern matching, wildcards allowed, to find only the files we want. Let's test glob and see what we can get.

We start with a simple script:

<code python>import glob

for i in glob.glob('*.txt'):
    print i</code>

If we run this script in a directory (Windows, Linux or Mac) we should see a list of the filenames that match the pattern passed to glob.glob, in this case every file with a txt extension. This should be enough for us, for now. Basically we need to create a list of all FASTA files that match a certain name pattern, and then take care of the sequences using our already known FASTA module. To make things more flexible, we should ask on the command line for a pattern to match, and maybe an output filename. Our script would look like this:

<code python>import sys
import glob
import fasta

filepattern = sys.argv[1]
output = open(sys.argv[2], 'w')

#initialize lists
names = []
seqs = []

#glob.glob returns a list of files that match the pattern
for file in glob.glob(filepattern):
    #we read the contents and a list of Fasta instances is returned
    contents = fasta.read_fasta(open(file).readlines())
    #a file can contain more than one sequence so we read them in a loop
    for item in contents:
        names.append(item.name)
        seqs.append(item.sequence)

#we print the output
for i in range(len(names)):
    output.write(names[i] + '\n' + seqs[i] + '\n\n')

output.close()</code>

Except for the glob part, we have already seen all of this in previous entries. Here, the input //filepattern// should be enclosed in single or double quotes so that it reaches the script intact (otherwise the shell expands the wildcards before Python sees them). In the first //for// loop we go through the list of files returned by //glob.glob//, extract the sequence name(s) and sequence(s) from each file and append them to lists, which are then used to print the final output. An example command line would look like this:

<code python>python merge.py '*.fas' merged.fa</code>


Of course, there are several ways of merging FASTA files, but you don't always have Linux's //cat// available.
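
For comparison, on a system where //cat// is available, the same merge (minus the blank line between records) is a one-liner:

<code>cat *.fas > merged.fa</code>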