Table of Contents

1

This part is based on the conversion of some small C++ programs I developed in the past. These small programs (I call them modules because they were part of a larger application) were used to count motifs, short nucleotide <i>words</i> up to 10-12 base pairs, and then calculate statistical overrepresentation of these words by comparing a foreground set of DNA sequences against a background set.

We will start comparing the different approaches of the C++ and the Python codes and point out advantages and disadvantages of doing it in one language or the other. First thing we need to do is to count the motifs in all sequences from our foreground and background sets. For the project I was working on, the ideal word length was 10 nucleotides. Basically our C++ approach to increase speed was to transform the character DNA sequences into numbers and then, while sliding a window with the desired word length, hash the base-four numbers into base 10 and increment a vector position, previously initialized with 0. For four nucleotides and a word size of 10 there are 1,048,576 permutations possible, from AAAAAAAAAA to TTTTTTTTTT. Initially the C++ program would do

for(j = 0; j < seqsize; j++)
{
    seqfile.get(base);
    if(base != '\n')
    {
        switch(base)
        {
            case 'A':
                bseq.push_back(0);
                break;
            case 'C':
                bseq.push_back(1);
                break;
            case 'G':
                bseq.push_back(2);
                break;
            case 'T':
                bseq.push_back(3);
                break;
        }
    }
}

reading all sequences and pushing an figure for each nucleotide in a vector, and then sliding a window on this vector and hashing the base-four number

int hashSeq(vector<short> subseq)
{
    int w, i, hashvalue = 0, power;
    w = subseq.size() - 1;
    for(i = 0; i &lt; subseq.size(); i++)
    {
            power = 0;
            hashvalue += subseq[i] * pow((double)4,(double)w);
            w--;
    }
    return hashvalue;
}
 
if(binseq[i].size() &gt; motifwidth)
{
    for(j = 0; j &lt; binseq[i].size()-motifwidth+1; j++)
    {
        sub.assign(binseq[i].begin() + j, binseq[i].begin() + j+motifwidth);
        hashed = hashSeq(sub);
        nmercount[hashed]++;
        sub.clear();
    }
}

The whole C++ code has about 400 lines, including all the possible output formatting and printing. Timing with time the C++ executable takes a little bit less than 2 seconds to read, count and output different files.

For Python, we will use a different approach and gain a lot in code simplicity. As we want to count the number of times size-10 words appear in all sequences, we first need to generate all possible permutations (with replacement) of four nucleotides. This can be easily accomplished by using generator functions. Regular functions run until completion and the return a value. For instance, a function that calculates the factorial of 10 will return the last value only, after multiplying 10.9.8.7.6.5.4.3.2. A generator function runs until a value is available to return, yielding it and then suspending its operation until called again. The yielding part was emphasized because yield is the command used by Python to return the value and suspend the function until further notice. In the factorial function, a generator would return the intermediary factorial values up to 10.

To generate all 1 million plus permutations of 4 nucleotides we need a function similar to the one below (modified from here)

def permutations(items, n):
    if n == 0:
        yield ''
    else:
        for i in range(len(items)):
            for base in permutations(items, n - 1):
                yield str(items[i]) + str(base)

Basically, what this generator function does is to combine all four nucleotides in words of size 10. This is a recursive function, where the result of the function is dependent on the n-1 value calculated by the function until n equals 0. The first for loop over the items that we want to permute (the nucleotides) and the second for recursively calls permutations starting with the initial n passed (10) until we reach 0. Debugging this function we will see that i is constant for each iteration of the second loop and only n changes from 10 to 0, while one by one nucleotides are joined to form a motif. It starts with AAAAAAAAAA, then AAAAAAAAAC, then AAAAAAAAAG, until it gets to a poly-T.

Our final code would look like the one below

import fasta
import sys
 
def permutations(items, n):
    if n == 0:
        yield ''
    else:
        for i in range(len(items)):
            for base in permutations(items, n - 1):
                yield str(items[i]) + str(base)
 
seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = sys.argv[2]
 
nucleotides = ['A', 'C', 'G', 'T']
 
merged_seqs = ''
for i in seqs:
    merged_seqs += i.sequence
 
for i in permutations(nucleotides, int(length)):
    print i + '\t' + merged_seqs.count(i)

where we read the input sequence(s), merge them in one long string and as we generate all possible combinations we count the number of times they appear. This code running on the same input file used on the C++ executable is 60 times slower, taking in average one full minute to count motifs in 8 500 bp DNA sequences. The slowest section is the count, as the generation of all possible combinations is straightforward. We lose some speed, but gain a lot on code simplicity and clarity.

2

The main focus of the part one was to show the decrease in code length from C++ to Python, introduce generator functions and yield and also show the nice permutation function. Turns out by using the approach of the previous entry there is a big drawback in code execution, which is around 60 times slower than C++. So we need to find another approach, and Mike showed us how (with some small modifications)

#!/usr/bin/env python
 
from collections import defaultdict
import sys
import fasta
 
seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = int(sys.argv[2])
 
#for a missing key, the dict entry is initialized to zero
counts = defaultdict(int)
 
#count the length-element subsequences in each sequence
for i in seqs:
	for n in range(len(i.sequence) - length):
		counts[i.sequence[n : n + length]] += 1
 
#counts.keys() will then return the nucleotide sequences 
#that were actually in merged_seqs
 
#print out the sequences that occur more than once
for count in counts:
        print ''.join(count), counts[count]

This time the counting is done with a defaultdict initialized with integers (all zeroes), and instead of generating all possible combinations (which was cool and fast, but in the end made our script slower) a window with the input length is slided along the each sequence and the key of the default dictionary is the motif and the value is the actual count incremented as we determine each motif.

Checking this code performance using Linux's time, we get 0.2 seconds, an improvement of 5 fold on the C++ code and 300 fold on our previous Python code

What is different here is that the output is not ordered and only the seen motifs are printed in the end.

3

We take a break of developing code and check for performance (non-scientific testing!). In the previous entry a simple file was used as input: 8 DNA sequences of 500 bases each. That's not enough to test the performance of the Python script against the C++ compiled executable. So, we use a larger file; two larger files to be more exact. First, we use a 555 sequence file with the sequences averaging 19371 nucleotides and another with 3854 sequences averaging 20000 nucleotides in length. Those files were the largest foreground and background clusters used in analysis. Let's see how the Pyhton and C++ fared (Linux's time was used in the comparison, for simplicity).

Foreground cluster

Average of 10 runs

C++: 45.66 seconds Python: 36.4 seconds

Background cluster

Average of 10 runs

C++: 5 minutes and 4 seconds Python: 2 minutes and 44 seconds

The C++ analysis of the background cluster was done on a cluster's node, and not on my desktop computer, due to the fact that my desktop could not handle it. C++ defense: the code was developed by me and I am not a computer scientist. Clearly there is room for improvement and performance gain.

But, definitely we can see that by using some advanced techniques in Python we are able to outperform C++ (at least C++ code developed by me) and still have a short and easy to read script.

4

We found a way to make the Python script as good as or better than the C++ executable. But for the analysis we need to do, motif counts are not the value we want. We need the quorum: the number of sequences the motif is present at least once. For instance, if the desired motifs was AAACCCTTTG we will check in which sequences this word was present. Let's say in a cluster of 10 sequences, we would find it in sequences 1, 2, 3, 4 and 5, giving us a quorum of 5 out of 10, or 50%. The quorum will be used in the future in the statistical calculation in order to determine the overrepresented motifs.

With only a couple of modifications, we can adapt the script used to get the motif counts to get the quorum.

#!/scratch/python/bin/python
 
from collections import defaultdict
import sys
import fasta
 
seqs = fasta.get_seqs(open(sys.argv[1]).readlines())
length = int(sys.argv[2])
 
quorum = defaultdict(list)
 
seq_number = 0
for i in seqs:
    seq_number += 1
    for n in range(len(i.sequence) - int(length)):
        if not seq_number in quorum[i.sequence[n : n + length]]:
            quorum[i.sequence[n : n + length]].append(seq_number)
 
for i in quorum:
    print ''.join(i).upper(), len(quorum[i])

Basically, we change the way the defaultdict is initialized, this time as a list instead of int and we also change the procedure that used to get the counts. The loop does identical work, iterating along the sequences, with a window (of the input length) sliding on them and checking each word. This time instead of incrementing the value of the defaultdict, we append to the list the sequence number, obtained from a index integer variable (incremented in each iteration of the loop), if this number is not already in the list value. In the end each value of quorum will be a list of numbers and by printing the list length we obtain the quorum. Testing the above script there is no performance loss when comparing to the previous count script.

5

Now that we have the script to generate the word quorums working (and working fast!) we need then to calculate the a p value for each motif based on the fore and background quorums. A p value cut-off will determine the statistically significant words, or overrepresented. These overrepresented words then can be analysed in more details (that we won't see here) and for instance determine new or already known transcription factor binding sites.

A well established statistical method to determine such overrepresented words is the Hypergeometric Distribution (HD for short). HD measures “success” and “failures” for values that do not fit in the binomial distribution, and depend on the measurements without replacement.

Basically, HD's equation has a a series of binomial coefficients/combinations

http://www.genedrift.org/hd.gif

where N is the population size, m is foreground cluster size, k is the motif quorum in the background gene set and x is the word quorum in the foreground set. Note that the above equation is for the cumulative HD, where a sum of probabilities is calculated.

All the combinations in the above equation have to be expanded to factorials that depending on the value to be calculated are very computer intensive and sometimes don't fit in the memory (either a float or integer). But Python is able to handle very large numbers and the calculation of large factorials is relatively fast.

In C++, I had to use a couple of tricks to achieve a good speed in the factorial determination, and specially in the HD calculation that requires multiple factorials and multiplication, division and subtraction of large numbers. I didn't want to use any mathematical trick such as Stirling's approximation. 13! in C++ already blows the size of long, so I had to use the MAPM, A Portable Arbitrary Precision Math Library in C. This library is quite fast to calculate the factorial values but when one needs to calculate more than 200,000 factorials the speed is unbearable. So, I decided to pre-calculate a series of factorial values, keeping 10 decimal places as precision and saving in another column their exponential. Then using this table as an input I was able to multiply, divide and subtract the factorials and by employing the first law of exponents do the same operations with their exponential. This speeds up the process tremendously.

In Python, we don't need any extra third-party library, we just use Python itself, without importing an extra module. A factorial function in Python can be written in one line, but for clarity is better to define it separately. We can try throwing any number at it and see the result.

def fac(n):
    value = reduce(lambda i, j : i * j, range(1, n + 1))
    return value

We already saw reduce and lambda and using these two methods make the factorial function clear and simple. And why are we not using a recursive function? Because Python has a limit recursion depth (1000).

6

We will take a break on developing the statistical module to obtain overrepresented motifs, and take a deeper look at the possibilities on obtaining the motif quorums. Mike DeHaemer, a regular commenter and contributor to the BPforB blog, sent me a Python script with 8 different ways distributed in 13 distinct functions for obtaining the motif quorums. I will take advantage of his contribution and post all of them, with some quick comments on each one of them (his code comments were kept in each function). After, a small benchmarking will be posted.

Most of the functions need to import a couple of module

from collections import defaultdict, deque
from itertools import count, izip, tee

and they have two parameters, a sequence list and the length of the motifs.

The first function uses again the defaultdict and it is very similar to the one used in the final version of the quorum script. The defaultdict is initialized as a list and the ids are added to a the list, keys are motifs, only if they are not already present in it. The sequence id is generated in a variable incremented each time the loop iterates.

def get_quorums_01(seqs, mlen):
    """
    append seq id_no to list after checking to see if already present
    use explicit counter to create seq_no
    """
    quorum = defaultdict(list)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq) - mlen):
            if id not in quorum[seq[n:n+mlen]]:
                quorum[seq[n:n + mlen]].append(id_no)
    return quorum

The second function is very similar to the first one, with the caveat that sequence id numbers are generated with enumerate.

def get_quorums_02(seqs, mlen):
    """
    append seq id_no to list after checking to see if already present
    use 'enumerate' to create seq_no 
    """
    quorum = defaultdict(list)
 
    for id_no, seq in enumerate(seqs):
        for n in range(len(seq) - mlen):
            if id_no not in quorum[seq[n:n+mlen]]:
                quorum[seq[n:n+mlen]].append(id_no)
    return quorum

enumerate is a object based on another iterable object. When called enumerate always returns a tuple of an indexed series. For instance, in our case above, enumerate will return a series of tuples (0, sequence1), (1, sequence2) … (n, sequenceN). That's the reason the enumerate loop uses a tuple as its index

for id_no, seq in enumerate(seqs)

Continuing on Mike's functions to obtain motif quorums. Function get_quorums_03, uses an old friend of the blog, sets. Recall that sets are very similar to lists, but their are unordered and items are unique.

def get_quorums_03(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(set)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq)-mlen):
            quorum[seq[n:n+mlen]].add(id_no)
    return quorum

Basically, the sequence numbers (an incremented counter) are added to a defaultdict which was initialized as a set. This way you don't need to check for the existence of the sequence number in the defaultdict list and count on the ability of set of being unique. Function 4 is very similar to function 3 with the difference of using enumerate (as in function 02) to make the sequence numbers.

def get_quorums_04(seqs, mlen):
    """
    add seq id_no to a set
    use 'enumerate' to create seq_no 
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for n in range(len(seq)-mlen):
            quorum[seq[n:n+mlen]].add(id_no)
    return quorum

Function 5 adds a twist, which is to have an enumerate to set the sequence range (motif/word width) start and stop. This way the window is sliding based on the tuple created by the enumerate method and not on the slicing that were used in all other functions. Again, a defaultdict is initialized as set and the sequence numbers are generated by an enumerate.

def get_quorums_05(seqs, mlen):
    """
    add seq id_no to a set
    use 'enumerate' to create seq_no
    use enumerate(range(...)) to create start/stop indices for motif
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for s, e in enumerate(range(mlen, len(seq))):
            quorum[seq[s:e]].add(id_no)
    return quorum

There are a couple of Python methods that we haven't seen here and need some introduction, izip and count. To use these two we also need to import new modules

from itertools import count, izip

count returns consecutive integers starting at a defined point (the method's parameter). If empty it starts from zero. Basically, by starting a count it will give an iterable with a increasing integer values, in a fashion similar to a function with yield. Every time our loop accesses the count it will “remember” the last return value and increment it by one.

izip also returns an iterator, but from a list of iterables. It is basically used to iterate through a list of many iterables at the same time. In the function below it is used twice: one to generate a tuple (with count) with a sequence number and the sequence itself. The sequence in the tuple is then used in another izip to create the windows on the sequences to count motifs.

def get_quorums_06(seqs, mlen):
    """
    add seq id_no to a set
    use 'izip(count(),...) to create seq_no
    use 'izip(count(),range(...)) to create start/stop indices for motifs
    """
    quorum = defaultdict(set)
    for id_no, seq in izip(count(), seqs):
        for s, e in izip(count(), range(mlen, len(seq))):
            quorum[seq[s:e]].add(id_no)
    return quorum

We jump function 7 in order to explain “simpler” ones, 8 and 9. Both functions use generators. We've already seen here generators, which are functions that use the yield statement to generate iterators. The generator is very similar to a function but instead of returning a value, it yields one and waits for another call to resume. In function 8, a generator is used to return the motif sequence that is used as a key in the defaultdict. Notice the scope of the generator that is coded inside a function.

def get_quorums_08(seqs, mlen):
    """
    add seq id_no to a set
    use enumerate to create seq_no
    use an explicit generator to create the motifs
    """
    def motif_gen(seq):
        for n in range(len(seq)-mlen):
            yield seq[n:n+mlen]
 
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for motif in motif_gen(seq):
            quorum[motif].add(id_no)
 
    return quorum

In function 9 a very similar structure is used but in this cases instead of a “pure” generator it uses generator expressions which a very similar to list comprehensions but with parentheses instead of square brackets. Generator expressions are generators that can be written in one line and have identical behaviour as generators coded in the “regular” inception. In the function below the generator expressions provide the iterator for the loop with motif as index. Very simple and elegant.

def get_quorums_09(seqs, mlen):
    """
    add seq id_no to a set
    use enumerate to create seq_no
    use a generator expression to create the motifs
    """
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs):
        for motif in (seq[n:n+mlen] for n in range(len(seq)-mlen)):
            quorum[motif].add(id_no)
 
    return quorum

incomplete set of functions, will return to them

7

Let's get back to the statistical module, that will calculate an Hypergeometric Distribution (HD) p value so we can define the overrepresented motifs. Last time we saw it, we just had defined the factorial function, which is immensely helpful in this case due to the number of factorial calculations needed in the HD. The factorial function was the one below

def fac(n):
    value = reduce(lambda i, j : i * j, range(1, n + 1))
    return value

but as mentioned in the comments by Dave and by Mike via email the method used is not the best method to calculate factorial in Python. The best approach in this case is to use operator.mul. All functions in the operator modules are in implemented in pure C and they mimic the same operators in Python. So in this module we can find mul for multiplication, sub for subtraction, add for additions, etc.

The operator.mul needs two arguments to multiply, and in our case we still need to use reduce to sum all the results from a series of multiplications. As parameters we should use a range, that can start with 2, that should go up to the number we want the factorial plus one. Finally our function would be

import operator
 
def fac(n):
    value = reduce(operator.mul, xrange(2, n+1))
    return value

The time gain, quickly measured in a non-scientific fashion in my system, is around 5 to 15%, depending on the factorial being calculated. It may seem a small gain, but when you need to calculate almost a million factorials for all possible motifs the amount of time saved is crucial.

8

So let's modify a little bit the factorial function and then benchmark both by using timeit. Ideally our factorial function would need to calculate a value similar to the binomial expansion, as we have three factorials to calculate in for each binomial in the Hypergeometric Distribution.

So we can add two extra factorial calculations to our function and perform the multiplication and division to return the equivalent to the binomial calculation. So the function would be

def fac(n, m): 
    value1 = 1 
    for i in xrange(2, n + 1): 
        value1 *= i 
    value2 = 1
    for i in xrange(2, m + 1): 
        value2 *= i 
    value3 = 1
    for i in xrange(2, (n - m) + 1): 
        value3 *= i 
 
    return  value1 / (value2 * value3)

m and n are both values of the binomial and n - m is the subtraction of one by the other that forms the last factorial to be calculated. This way it makes easier to time the performance of both functions. In the end the complete script would look like

#!/usr/bin/env python
 
import timeit
 
def fac(n, m): 
    result1 = 1 
    for i in xrange(2, n + 1): 
        result1 *= i 
    result2 = 1
    for i in xrange(2, m + 1): 
        result2 *= i 
    result3 = 1
    for i in xrange(2, (n - m) + 1): 
        result3 *= i 
 
    return  result1 / (result2 * result3) 
 
def binom(n, m): 
    b = [0] * (n + 1) 
    b[0] = 1 
    for i in xrange(1, n + 1): 
        b[i] = 1 
        j = i - 1 
        while j &gt; 0: 
            b[j] += b[j - 1] 
            j -= 1 
    return b[m] 
 
def choose(n, k):
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in xrange(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        #print ntok // ktok
        return ntok // ktok
    else:
        return 0
 
if __name__ == "__main__":
 
    stmt = "fac(3000, 7)"
    t = timeit.Timer(stmt = stmt, setup='from __main__ import fac')
    stmt2 = "binom(3000, 7)"
    t2 = timeit.Timer(stmt = stmt2, setup = 'from __main__ import binom')
    stmt3 = "choose(3000, 7)"
    t3 = timeit.Timer(stmt = stmt3, setup = 'from __main__ import choose')
 
    print 'fac: %.9f' % (t.timeit(100)/100)
    print 'binom: %.2f' % (t2.timeit(10)/10)
    print 'choose %.9f' % (t3.timeit(100)/100)

The final result of the average for ten repetitions is as follow

fac = 0.10 s

binom = 43.24 s

choose = 0.000005 s

9

Now that we have the best quorum determination function and the ideal function to calculate the binomial expansions it is easy to program a script to calculate the p value of motifs in DNA sequences. To the script

#!/usr/bin/env python
 
import fasta
import sys
from collections import defaultdict
 
def choose(n, k):
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in xrange(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        #print ntok // ktok
        return ntok // ktok
    else:
        return 0
 
def get_quorums(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(set)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq) - mlen):
            quorum[seq[n:n + mlen]].add(id_no)
    return quorum
 
input_seqs = fasta.read_seqs(open(sys.argv[1]).readlines())
input_seqs2 = fasta.read_seqs(open(sys.argv[2]).readlines())
 
foreground = get_quorums(input_seqs, 10)
background = get_quorums(input_seqs2, 10)
 
N = len(input_seqs) + len(input_seqs2)
 
for i in foreground:
    term1 = choose(len(background[i]), len(foreground[i]))
    term2 = choose((N - len(background[i])), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 < p <= 0.0001:
        print i, len(foreground[i]), len(background[i]), p

We already defined choose in the last post (more information in the link from the Python's cookbook) and earlier Mike sent us a series of quorum-determination functions and one of the best was portrayed and explained here. We also need our fasta module to read the sequences (and only the sequences) in order to use it in the quorum function.

Basically we use the foreground and background files as input, determine the quorum of the different words (width 10) and then we iterate over the results, calculating the <em>p</em> value for each motif found in the foreground set. The tree terms of the Hypergeometric Distribution are calculated separately and we test for a <em>p</em> value smaller that 0.0001 (this can be modified) and we only print the results that fall in this category.

Gran Finale

In this final part, let's do some very simple refactoring and modify the output section to make the result a little bit better. There are not many options about the functions to calculate the binomial expansion. But Andrew posted some opinions on how to slight change the quorum function.

def get_quorums(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(int)
    for seq in seqs:
        for n in range(len(seq) - mlen):
            quorum[seq[n:n + mlen]] += 1
    return quorum

His modifications were small but improved the code a bit, as you remove one variable/object from the function. At the same time there is need to change a bit our output section of the code, as we don't use a defaultdict initialized with a set, but with an integer.

for i in foreground:
    term1 = choose(background[i], foreground[i])
    term2 = choose((N - background[i]), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 < p <= 0.0001:
        print i, foreground[i], background[i], p

Notice that in the term1 line we don't check for the set length anymore and just use the integer stored in foreground and background. Again a small change, that can make the code a little bit more clear. But we need to modify this section so the output is a little bit more clear, maybe ordered by motif sequence.

But as we are reading the sequences as they are our results are not ordered. It would be great to have a final list starting with AAAAAAAA and ending with TTTTTTTTT. There is an easy way to do that, and very inexpensive regarding code and final performance. Basically we append each one of the motifs (and their extra information) to a list and use the sort method for lists. So our output section of the code will be

res_motifs = []
for i in foreground:
    term1 = choose(background[i], foreground[i])
    term2 = choose((N - background[i]), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 < p <= 0.0001:
        res_motifs.append(i + '\t' + str(foreground[i]) + '\t' + str(background[i]) + '\t' + str(p))
 
res_motifs.sort()
for i in res_motifs:
    print i

Putting everything together our final motif determination script is (batteries included):

#!/usr/bin/env python
 
import fasta
import sys
from collections import defaultdict
 
def choose(n, k):
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in xrange(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        return ntok // ktok
    else:
        return 0
 
def get_quorums(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(int)
    for seq in seqs:
        for n in range(len(seq) - mlen):
            quorum[seq[n:n + mlen]] += 1
    return quorum
 
input_seqs = fasta.read_seqs(open(sys.argv[1]).readlines())
input_seqs2 = fasta.read_seqs(open(sys.argv[2]).readlines())
 
foreground = get_quorums(input_seqs, 10)
background = get_quorums(input_seqs2, 10)
 
N = len(input_seqs) + len(input_seqs2)
 
res_motifs = []
for i in foreground:
    term1 = choose(background[i], len(foreground[i])
    term2 = choose((N - background[i]), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 < p <= 0.0001:
        res_motifs.append(i + '\t' + str(foreground[i]) + '\t' + str(background[i]) + '\t' + str(p))
 
res_motifs.sort()
for i in res_motifs:
    print i
 
part14.txt · Last modified: 2009/05/21 11:23 by nuin
 
Except where otherwise noted, content on this wiki is licensed under the following license:CC Attribution-Noncommercial-Share Alike 3.0 Unported
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki