This time we will see a different example of set usage that includes some methods available for this type of objects. The initial problem was how to determine which genomes are represented in protein/DNA clusters obtained with CD-HIT. Basically CD-HIT uses a multiple fasta file to generate clusters of proteins/DNA using their sequence identity. It clusters the sequences, keeping their fasta title and assigning an ID for each cluster obtained. We won't see how is CD-HIT's output, not how to parse it. For this example, let's say three genomes (A, B and C) were analysed in CD-HIT and we want to determine which clusters contain sequences from the combinations AB, AC and CB (we won't touch unique items and clusters with sequences from all genomes this time).
After reading the clusters and sequence IDs in each one of them we basically need to create unique lists of cluster IDs for each one of the genomes. That's where sets are used
from sets import Set genA, genB, genC = Set([]), Set([]), Set([])
We declare three emtpy sets, that will store cluster IDs for each one of the genomes. The important part here is that the sequences' fasta titles should have a unique identifier for each one of the genomes, just to make it easier to read each cluster contents. In each of our genomes the sequences were tagged with their letter in the first character
>genA_sequence1 ACGT >genB_sequence1 ACGTT >genC_sequence1 ACGTT ...
We run CD-HIT and parse the results, maybe creating a class to store information about each cluster and its sequences. Then we analyse this list
for i in clusters: for id in i.sequence_id: if id.startswith('>genA'): genA.add(i.cluster_number) elif id.startswith('genB'): genB.add(i.cluster_number) elif id.startswith('genC'): o.add(i.cluster_number)
Remembering that sets are unordered unique lists, we don't expect to see repeated cluster IDs in each of the three sets. Now to the fun part. Our first thought to determine how many clusters contain sequences of A and B would be to create two loops and iterate checking for equalities in cluster IDs. But with sets, our task is easier. What we need is the intersection from genA and genB, and that's the method available to do that.
print len(Set.intersection(genA, genB)) print len(Set.intersection(genA, genC)) print len(Set.intersection(genB, genC))
That's it. Three lines of code, one method. The output will be the number of clusters that contain sequences of A and B, A and C and B and C, respectively.
Last time we saw how to use a set intersection to check for clusters of DNA/protein sequences and their genome intersections. We going to use the same example but this time we will see how we can change it and create a function that would calculate an arbitrary number of intersections at a time and also be able to check the intersection of more than two sets.
Our previous code to calculate intersections was
from sets import Set #for Python 2.3 and below genA, genB, genC = Set([]), Set([]), Set([]) #populate the sets ... print len(Set.intersection(genA, genB)) print len(Set.intersection(genA, genC)) print len(Set.intersection(genB, genC))
and that's does not gives us the most important piece of information: the intersection of A, B and C. Let's see how we can do it. Python has a function that can help us this time: reduce. This function allows to reduce a series of values to a single one, by employing a function defined in its parameters. From the file tutorial
reduce(function, iterable[, initializer])
Basically reduce will make the function modify the values of a iterable (i.e. list). In our case reduce helps us calculate the intersection of many sets at once (our sets are genA, genB and genC)
reduce(Set.intersection, [genA, genB, genC])
where reduce is applying the Set method intersection to a list of sets passed to the function. Python's reduce might not be available in 3.0, but it is worth knowing.
Now that we have a Pythonic way of checking the intersections we need to put reduce in a function, to make our code more elegant.
def get_intersection(my_set1, my_set2, my_set3): return reduce(Set.intersection, [my_set1, my_set2, my_set3])
Done. In this function reduce will operate by first checking the intersection between my_set1 and my_set2 and storing it internally, then checking for the intersection between the previous result and my_set3.
But wait … get_intersection receives three arguments, so it will be able to return the intersection only when three sets are sent, thus we would have to create another function that can receive two items in order to calculate the intersection between two items. And if we have more than three genomes to compare …
There is a trick. If we define our function receiving only one argument and put a star/asterisk on it, this means that an arbitrary number of arguments will be provided to this function in a tuple (and a tuple is iterable, so it can be used in a reduce). Changing our function accordingly, we will have
def get_intersection(*my_sets): return reduce(Set.intersection, my_sets)
so we can send 2, 3, 4, 5 or <em>n</em> number of sets to this function and it will return the intersection among all of them. Our simple script would be (including the above function)
from sets import Set #for Python 2.3 and below def get_intersection(*my_sets): return reduce(Set.intersection, my_sets) genA, genB, genC = Set([]), Set([]), Set([]) #populate the sets ... print len(get_intersection(genA, genB)) print len(get_intersection(genA, genC)) print len(get_intersection(genB, genC)) print len(get_intersection(genA, genB, genC))
A colleague came with a “problem”: what would be the most efficient way to merge Pfam alignments? He had FASTA files containing sequences and he wanted to find identical IDs in two files and merge the related sequences from different domains of the same protein. His C++ approach was taking too long to run so I jumped in to help with some Python tricks.
Fasta headers of Pfam alignments look like this
>P00526.2|SRC_RSVP/147-229
where the first section, before the pipe, is the protein family, the section between the pipe and the slash is the protein ID and the after the slash are the start and end positions. Basically we want to match the protein ID, between | and /, which is the only section that should not change from one alignment from the other, if there are similar sequences in both files.
His dataset size was around 8000 FASTA files and he wanted to compare all-versus-all. The first idea that comes to mind is to read the two files and basically run nested loops checking for matches between two sequence names. Simple and easy. But it would be good to find something else to do in the meantime, instead of checking for its progress.
Another idea, to simplify the process, is to read both files, extract the protein IDs and check for the intersection of the two sets of IDs and then just extract the sequences from a list of matches.
def merge_seqs(data1, data2): myset1, myset2 = Set([]), Set([]) for i in data1: myset1.add(i.name[i.name.find('|')+1:i.name.find('/')]) for i in data2: myset2.add(i.name[i.name.find('|')+1:i.name.find('/')]) mylist = Set.intersection(myset1, myset2) flist = [] for i in mylist: for j in data1: if j.name[j.name.find('|')+1:j.name.find('/')] == i: for k in data2: if k.name[k.name.find('|')+1:k.name.find('/')] == j.name[j.name.find('|')+1:j.name.find('/')]: tempname = j.name + '-' + k.name + '->' + str(len(j.sequence)) tempseq = j.sequence + k.sequence flist.append(tempname + '\n' + tempseq) return flist
This is the function that matches the sequence IDs. As we are reading the FASTA file with the functions/classes in the fasta module there is no* direct way to transform the list of sequence names into a set, that would make easier to extract the intersection between two sets of IDs from distinct files. In the above function, data1 and data2 are instances of the fasta class, and we read these lists and add the fasta.name to a couple of sets and get the intersection of them. The intersection method returns a list that we use in the nested loops to find the matching sequences, as we need to merge them, we add both names and residues. Simple and easy. All merged sequences, and their respective merged names, are appended to a list that is returned at the end of the function.
Mike and Luke left comments and suggestions on how to attack the problem from the previous entry. I copied their comments here in order to maintain code formatting a little bit better (comments do not have code highlighting and formatting).
Mike: Why not build a dictionary of the fasta data keyed by protein ID. The value of each dictionary entry would be a list of (name,sequence) tuples. Then iterate over the dictionary items and build the output list from dictionary entries of length == 2.
Obviously, I don’t have any data sets to test it on, but something like this might work:
def merge_seqs(data1, data2): from collections import defaultdict from itertools import chain data = defaultdict(list) for item in chain(data1, data2): ident = i.name[i.name.find(’|')+1:i.name.find(’/')] data[ident].append( (i.name, i.sequence) ) format = “%s-%s->%d\n%s%s” flist = [ ] for key,value in data.iteritems(): if len(value) == 2: jname,jseq = value[0] kname,kseq = value[1] flist.append(fmt % (jname, kname, len(jseq), jseq, kseq) ) return flist
Luke:
Are there duplicate IDs in a single file? If not, no reason to perform the intersection twice (once by set, once by nested loop). Sets and dictionaries are both hashes, so just remember the instances with the id:
def merge_seqs(data1, data2): first, second = dict(), dict() for i in data1: first[i.name[i.name.find(’|')+1:i.name.find(’/')]] = i for i in data2: second[i.name[i.name.find(’|')+1:i.name.find(’/')]] = i shared_ids = set(first).intersection(set(second)) flist = [] for i in shared_ids: j = first[i] k = second[i] tempname = j.name + ‘-’ + k.name + ‘->’ + str(len(j.sequence)) tempseq = j.sequence + k.sequence flist.append(tempname + ‘\n’ + tempseq) return flist
If there are duplicate IDs in a file, and you want the behavior of your original script of giving every combination _between_ files, how about a dictionary of lists and the cross-product of j & k? (I’m going to use python2.5’s collections.defaultdict, if you’re on an earlier version use .setdefault . The cross product brings back a nested loop, but it’s only over values we want rather than the whole datasets.)
from collections import defaultdict def merge_seqs(data1, data2): first, second = defaultdict(list), defaultdict(list) for i in data1: first[i.name[i.name.find(’|')+1:i.name.find(’/')]].append(i) for i in data2: second[i.name[i.name.find(’|')+1:i.name.find(’/')]].append(i) shared_ids = set(first).intersection(set(second)) flist = [] for i in shared_ids: cross = [(a,b) for a in first[i] for b in second[i]] for j,k in cross: tempname = j.name + ‘-’ + k.name + ‘->’ + str(len(j.sequence)) tempseq = j.sequence + k.sequence flist.append(tempname + ‘\n’ + tempseq) return flist
And if you want combinations from within a file as well, just use one dictionary of lists and cross each list value with itself, excluding identical elements:
cross = [(a,b) for a in data[i] for b in data[i] if a.name != b.name]
To your question of more directly creating the sets of names, can’t be much more direct, but perhaps more concise using a generator (or list, with the extra cost) comprehension:
myset1 = set(i.name… for i in data1)
Could do equivalently for the dict-based first example above, noting that the dict() constructor can take a sequence of (key, value) tuples. This would not work for the defaultdict or setdefault based version.
In order to make it clear to anyone that is chronologically following the entries, we will see what new things were suggested in the comments to BPforB.
We will start with Mike's comment. The code is below.
def merge_seqs(data1, data2): from collections import defaultdict from itertools import chain data = defaultdict(list) for item in chain(data1, data2): ident = item.name[item.name.find('|') + 1 : item.name.find('/')] data[ident].append((item.name, item.sequence)) format = "%s-%s->%d\n%s%s" flist = [] for key, value in data.iteritems(): if len(value) == 2: jname, jseq = value[0] kname, kseq = value[1] flist.append(format % (jname, kname, len(jseq), jseq, kseq) ) return flist
Mike's code would only run on Python 2.5, due to the import of collections. He imports only defaultdict from collections, which is similar to a regular dictionary except for the fact that is takes an extra argument, a factory function (that we will examine soon), and this factory function is called every time a dict key is found for the the first time, initializing the object. The line
data = defaultdict(list)
creates and intializes a defaultdict from an empty list. This defaultdict is used to store data from a chain formed by both input files being compared. A chain makes an iterator that “returns elements from the first iterable until it is exhausted, the proceeds to the next iterable” (from Python docs). Basically a chain in our case will make the loop iterate first through data1 until its last item then will start the iteration through data2.
In the loop it will use the region in the FASTA sequence name as a defaultdict key, and store the fasta name and sequence as the value. Notice that the defaultdict, as it has been initialized as a list, allows appending more than one value to each key, so every time that it finds a sequence with a name identical to some key already present in the defaultdict, it will append the information to the value of that key in a list. So, all the matching is done with this short loop, we only need then to iterate through the keys and print the result.
For the results, Mike set up a format string (we will see in a future post) that receives the merged sequences and their names in a formatted way. The iteration along the defaultdict checks for both keys and values and every time it finds a value with a length of 2 (meaning two sequences shared identical names) it puts the items of the defaultdict's value list into two tuples. These tuples, containing sequence and name, are then formatted for output and appended to a list of merged sequences.
There are a lot of new concepts, objects and methods but Mike's suggestion shows how powerful Python is. For a set of 25 alignments, there is little performance improvement, but our coding ability improves exponentially with clever suggestions.
Continuing on the Pfam alignment sequence merge, Luke provided two solutions, one for case where we have duplicates in the file (that happens) and another one where duplicates are not tested. First for the case with no duplicates:
def merge_seqs(data1, data2): first, second = dict(), dict() for i in data1: first[i.name[i.name.find('|') + 1:i.name.find('/')]] = i for i in data2: second[i.name[i.name.find('|') + 1:i.name.find('/')]] = i shared_ids = set(first).intersection(set(second)) flist = [] for i in shared_ids: j = first[i] k = second[i] tempname = j.name + '-' + k.name + '->' + str(len(j.sequence)) tempseq = j.sequence + k.sequence flist.append(tempname + '\n' + tempseq) return flist
Basically, his approach is to create two dictionaries, first and second to store the data from our lists of FASTA objects. The dictionaries keys are the sections fo each sequence name, and the values are the FASTA object created when we read the file. After that, a set of both dictionaries is created, and this represents everything we need in order to find shared sequences between both files. The output is trivial, basically iterating the set and getting the values from both dictionaries.
But, in some cases there are some duplicates in each Pfam alignment, and this would make the above approach to miss some sequences. Luke also provided a solution for this case
from collections import defaultdict def merge_seqs(data1, data2): first, second = defaultdict(list), defaultdict(list) for i in data1: first[i.name[i.name.find('|') + 1:i.name.find('/')]].append(i) for i in data2: second[i.name[i.name.find('|') + 1:i.name.find('/')]].append(i) shared_ids = set(first).intersection(set(second)) flist = [] for i in shared_ids: cross = [(a,b) for a in first[i] for b in second[i]] for j,k in cross: tempname = j.name + '-' + k.name + '->'’ + str(len(j.sequence)) tempseq = j.sequence + k.sequence flist.append(tempname + '\n' + tempseq) return flist
which is similar to what Mike suggested. In his approach defaultdict is used instead of a regular dict, initialized as a list. Again we can take advantage of the fact that we that the value of defaultdict has the same features as the object that was used to initialized it. The same set approach is used and we have a nice set of merged sequences from two Pfam alignments. Remember that defaultdict were introduced in Python 2.5, it won't work on older versions.
One of the things I like about Python and the Python community is the search for the making code simple and clear. Tal left a comment in the last post about merging Pfam alignment sequences suggesting another approach to our problem. The code is below
def merge_seqs(data1, data2): from itertools import chain, groupby format = "%s-%s->%d\n%s%s" flist = [] keyfunc = lambda it: it.name[it.name.find('|') + 1 : it.name.find('/')] for it, g in groupby(sorted(chain(data1, data2), key=keyfunc), keyfunc): values = list(g) if len(values) == 2: jname, jseq = values[0].name, values[0].sequence kname, kseq = values[1].name, values[1].sequence flist.append(format % (jname, kname, len(jseq), jseq, kseq) ) return flist
The code also uses the itertools module, importing chains and groupby. We already saw chains in the previous post, but groupby is new to us here. groupby was introduced in the 2.4 version of Python and is a method returns keys and groups from an iterable. An Python iterable is any object that can return its elemements at given time, for instance in a for loop, while the index of this loop is the iterator. So, in our case groupby will return the sequence names based on the lambda function defined before the groupby and the chain method. Usually groupby has this syntax
groupby(iterable[, key])
The key is optional, and in our case it is the lambda function. Another method new to use that uses the same lambda function is sorted. As its name hints, sorted returns a sorted list of iterables. The key in this case is the sorting algorithm, that actually creates the comparison between items.
Basically in the code above, a lambda function extracts the desired regions from the sequence names, which are them iterated in a groupby method that returns they key values, one value when the sequence is unique, two values when there are two sequences, of a sorted iterable generated by a chain that read both input lists in one pass. After this we just need to check the number of returned values and we have our list of matching sequences.