Suppose my long sequence looks like 5’-AGGGTTTCCC*TGACCT*TCACTGC*AGGTCA*TGCA-3’. The two italic subsequences (shown here between stars) are together called an inverted repeat pattern. The length of the two subsequences, and their combination of the four letters A, T, G, C, will vary, but there is a relation between them. When you take the first subsequence, its complementary subsequence is ACTGGA (A pairs with T and G pairs with C), and when you invert this complementary subsequence (i.e. the last letter comes first) it matches the second subsequence. A large number of such patterns are present in a FASTA sequence (containing 10 million ATGC letters), and I want to find these patterns along with their start and end positions. Could anyone help me in this regard?
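
To make the relation concrete, here is a minimal sketch in Python 3 that checks it on the example sequence. The helper name `reverse_complement` is only an illustrative choice, and the first subsequence is taken to be TGACCT, the string whose complement is ACTGGA.

```python
# Complement table: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(seq):
    """Return the complement of seq, read in reverse order."""
    return seq.translate(COMPLEMENT)[::-1]

seq = "AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA"
first = "TGACCT"
second = reverse_complement(first)

print(first.translate(COMPLEMENT))        # complement: ACTGGA
print(second)                             # inverted complement: AGGTCA
print(seq.find(first), seq.find(second))  # 0-based starts: 10 23
```

So the pair sits at positions 10 to 15 and 23 to 28 when counting from zero.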

When running this script, I get this error:

Traceback (most recent call last):
File "irc.py", line 12, in <module>
print list(ivp('AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA', 6, 6))
File "irc.py", line 10, in ivp
if sub.translate(mapping)[::-1] in s :
TypeError: expected a character buffer object

Can anyone help me fix this?

def substrings(s, lmin, lmax):
    for i in range(len(s)):
        for l in range(lmin, lmax+1):
            subst = s[i:i+l]
            if len(subst) == l:
                yield i, l, subst
def ivp(s, lmin, lmax):
    mapping = {'A': 'T', 'G': 'C', 'T': 'A', 'C': 'G'}
    for i, l, sub in substrings(s, lmin, lmax):
        if sub.translate(mapping)[::-1] in s :
            yield i, l, sub
print list(ivp('AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA', 6, 6))
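
For what it is worth, the `TypeError` most likely comes from passing a dict to `str.translate`: in Python 2 that method expects a 256-character table such as the one built by `string.maketrans`. Below is a hedged sketch of one possible rewrite in Python 3, where `str.maketrans` builds the table. The name `inverted_repeats`, the tuple layout, and the rule that the second half must start after the first half ends are all assumptions, not part of the original script. It also looks windows up in a dictionary rather than using `in s`, which rescans the whole sequence for every window.

```python
from collections import defaultdict

# Translation table for complementing (Python 2: string.maketrans).
COMPLEMENT = str.maketrans("ATGC", "TACG")

def inverted_repeats(s, lmin, lmax):
    """Yield (start1, end1, start2, end2, sub) for each inverted repeat.

    Positions are 0-based with the end exclusive. The second half must
    begin at or after the end of the first half (an assumption).
    """
    for l in range(lmin, lmax + 1):
        # Index every window of length l by its start positions.
        index = defaultdict(list)
        for i in range(len(s) - l + 1):
            index[s[i:i + l]].append(i)
        for sub, starts in index.items():
            rc = sub.translate(COMPLEMENT)[::-1]  # reverse complement
            for i in starts:
                for j in index.get(rc, ()):
                    if j >= i + l:
                        yield i, i + l, j, j + l, sub

hits = sorted(inverted_repeats("AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA", 6, 6))
print(hits)  # [(10, 16, 23, 29, 'TGACCT')]
```

For a sequence of 10 million letters the dictionary of windows can use a lot of memory, so this is a starting point rather than a tuned solution.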

Edited 3 Years Ago by sudipta.mml
