This time we will check how set works in Python. set is another object type available in Python (version 2.3 and up) that brings a lot of features to the language.
From the Python Library Reference: “A set object is an unordered collection of immutable values. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.” Yep, all of these are possible with set, and let me stress that a set never holds duplicate items.
So, a set is basically a collection of items, but it is unordered and not indexed, because it does not record element position or insertion order. Methods available for a set include union(), intersection() and difference(), which we will check next time. First let's see some basic set functionality.
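As a quick illustration of those properties, here is a minimal sketch. The gene IDs are made up for the example and are not taken from any real data set.

```python
# Hypothetical gene IDs, made up for illustration only.
genes = set(["geneA", "geneB", "geneA", "geneC"])

# Duplicates are dropped when the set is built.
n_unique = len(genes)

# Membership testing is one of the things sets are good at.
has_b = "geneB" in genes
has_z = "geneZ" in genes

# A set has no positions, so indexing raises TypeError.
try:
    genes[0]
    indexable = True
except TypeError:
    indexable = False
```

Here n_unique ends up as 3 even though four items went in, and indexable ends up False.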
In Python 2.4, set is available as a built-in object type. In Python 2.3, we need to import a library in order to use set, like this:
from sets import Set as set
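If the same script has to run on both versions, one possible pattern (a sketch, not the only way to do it) is to try the built-in name first and fall back to the sets module only when it is missing. The gene IDs are again made up.

```python
# Use the built-in set when it exists (Python 2.4 and later);
# otherwise fall back to the sets module that ships with Python 2.3.
try:
    set
except NameError:
    from sets import Set as set

unique_ids = set(["geneA", "geneA", "geneB"])
```

On any Python with a built-in set the except branch never runs, so the import of the old module is skipped.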
A first use for set would be to uniquify a list. Let's say that you have the gene IDs of two different clusters and you want to merge these lists and keep only the unique IDs, eliminating possible duplicates. We could do that with a dictionary and a simple function (we will also check this later on), but a set makes our life easier.
import sys

# Each input file holds one gene ID per line.
cluster1 = open(sys.argv[1]).readlines()
cluster2 = open(sys.argv[2]).readlines()
allgenes = cluster1 + cluster2
uniqueset = set(allgenes)
And that's all. Of course a set does not have the flexibility of a list, but we can easily convert it back to a list and manipulate it as before.
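Here is a minimal sketch of that conversion, using small in-memory lists in place of the two files (the gene IDs are made up). It also strips the line endings before building the set. That is an assumption on my part about what we want: readlines() keeps the trailing newline, so an ID on a last line without a newline would otherwise count as different from the same ID followed by one.

```python
# Stand-ins for the lines returned by readlines(); the IDs are hypothetical.
cluster1 = ["geneA\n", "geneB\n", "geneC"]      # last line has no newline
cluster2 = ["geneB\n", "geneC\n", "geneD\n"]

# Strip whitespace so "geneC" and "geneC\n" are treated as the same ID.
allgenes = [line.strip() for line in cluster1 + cluster2]

uniqueset = set(allgenes)

# A set has no order, so sort while converting back to a list
# if a reproducible order matters.
finalist = sorted(uniqueset)
```

Without the strip() step, the set built from these two lists would contain five entries instead of four.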
We are going to use our previous example to compare the use of sets and dictionaries to create unique lists. We've already seen that with sets it is very simple to transform a list with repeated items into a list of unique items. The only hassle is to create the set and then transform it back into a list.
Like last time (with one small addition):
import sys

cluster1 = open(sys.argv[1]).readlines()
cluster2 = open(sys.argv[2]).readlines()
allgenes = cluster1 + cluster2
uniqueset = set(allgenes)
finalist = list(uniqueset)  # the small addition: back to a list
We can accomplish an identical result by using a dictionary. To make our code clearer we create a small function that takes a list and returns the dictionary keys. Remember that Python dictionaries have keys and values and that keys cannot be repeated, so the keys of a dictionary are basically a collection of unique entries. Our function would look like:
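def make_unique_list(mylist):
    # Named "unique" rather than "dict" to avoid shadowing the built-in type.
    unique = {}
    for word in mylist:
        unique[word] = 1  # the value is arbitrary; only the key matters
    return unique.keys()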
In this function we create an empty dictionary and then, looping over every item of the list, assign an arbitrary value to the dictionary key for that item. As pointed out above, no repeated keys are allowed, so when an item that was already seen comes up again, the assignment simply overwrites the existing entry instead of adding a new one. Finally we return only the dictionary keys, which form our final unique list.
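To see that key behaviour on its own, here is a small sketch with made-up IDs. The variable names are mine, and the final list() call relies on the fact that iterating over a dictionary yields its keys.

```python
ids = ["geneA", "geneB", "geneA", "geneC", "geneB"]  # hypothetical IDs

seen = {}
for gene_id in ids:
    # Assigning to a key that already exists only overwrites its value,
    # so the dictionary never gains a second "geneA".
    seen[gene_id] = 1

unique_ids = list(seen)  # iterating a dict yields its keys
```

Five items go in, but the dictionary ends up with only three keys.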
Our small script would be:
import sys

# make_unique_list() is the function defined above.
cluster1 = open(sys.argv[1]).readlines()
cluster2 = open(sys.argv[2]).readlines()
allgenes = cluster1 + cluster2
allgenes = make_unique_list(allgenes)
Both methods are very effective and usually fast. I will post some comparisons and benchmarks, just for fun.
Nathan posted another approach using dictionaries in the comments. Here it is:
dict.fromkeys(mylist).keys()
Basically, in one line you pass a list of elements to a dictionary and get back all the keys that are in that dictionary. Very Pythonic.
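A short sketch of the one-liner in action, with made-up IDs. One assumption to flag: the result is wrapped in list() here because on Python 3 keys() returns a view object rather than a list. On Python 2 the wrapper is harmless.

```python
mylist = ["geneC", "geneA", "geneC", "geneB", "geneA"]  # hypothetical IDs

# fromkeys() builds a dictionary whose keys are the list items
# (every value is None); repeated items collapse into a single key.
lookup = dict.fromkeys(mylist)

unique_ids = list(lookup.keys())
```

On Python 3.7 and later, dictionaries keep insertion order, so this version also preserves the order of first appearance. Older versions make no such promise.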
Just for fun, let's see if there is any advantage (apart from shorter code) in using either of the approaches to create a unique list. A list of 741 gene IDs and another one with 1322 (which contained all 741 IDs from the first) were used. Instead of hard-coding the lists in the script, normal I/O was used and the lists were read from files. Using a for loop, each script was run 10 times and the times were averaged.
Here are the results:
Using dictionaries 0.08540
Using sets 0.15890
Almost twice as fast for the dictionaries. Why? One thing, and one thing only: the Python version. The default Python on my system is 2.3.4, which has one of the first implementations of set (there it lives in the pure-Python sets module). I also have a “personal” Python 2.5 install, so the test was redone with the newer Python:
Using dictionaries 0.04520
Using sets 0.03770
Ok. That shows a small advantage for sets, which is expected. It also shows that different Python versions can differ hugely in speed. This is a rather crude and simple test with a small list of entries, but there is a large gain from version 2.3 to version 2.5 for both dicts and sets.
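A comparison along these lines could be sketched with the standard timeit module. This is not the script used for the numbers above: the ID lists are synthetic stand-ins of the same sizes, the function names are made up, and only the uniquifying step is timed (no file I/O or interpreter start-up), so the figures it prints will not match the ones reported here.

```python
import timeit

# Synthetic stand-ins for the two ID lists: 741 IDs, and 1322 IDs that
# include all of the first 741. The real gene IDs are not reproduced here.
cluster1 = ["gene%04d" % i for i in range(741)]
cluster2 = ["gene%04d" % i for i in range(1322)]
allgenes = cluster1 + cluster2

def unique_with_dict(items):
    # Same idea as make_unique_list(): keys cannot repeat.
    seen = {}
    for item in items:
        seen[item] = 1
    return list(seen)

def unique_with_set(items):
    return list(set(items))

# Time many repetitions of each call. Absolute numbers depend on the
# machine and the Python version, so only compare them to each other.
runs = 1000
dict_time = timeit.timeit(lambda: unique_with_dict(allgenes), number=runs)
set_time = timeit.timeit(lambda: unique_with_set(allgenes), number=runs)

print("Using dictionaries %.5f s per call" % (dict_time / runs))
print("Using sets         %.5f s per call" % (set_time / runs))
```

Running the same file under different interpreters is an easy way to see how much the version matters.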
For a more comprehensive test and more functions with a similar objective check this page.