Hi,

I am working on Protein and DNA sequences. When you have a protein sequence it can be translated in to DNA sequence in many ways. A nice way to express this is using regular expression. I would like to create this long regular expression for a protein sequence, and than seach for a smaller string. So that is the other way around of what regular expression does. Is there a way to do this?

a short example would be

DNAsequence
r'atata[atg]ca[atgc]...atatagtagcctgt'

search for
'agcacg'

Thanks

Edited 4 Years Ago by aint

I see a way, but there is some work to do. First you convert your regular expression to a NFA (non deterministic finite automaton) with a start state and
an accept state. In this NFA, you add an epsilon-transition from the start state to every other state and an epsilon transition from any state to the accept state.
This NFA should be able to find your subsequences.

A second step could be to convert the NFA into a deterministic automaton (DFA) to produce a fast tool. You could use a program like this one for the conversion (again there is some work, because the NFA in this link doesn't seem to have epsilon transitions).

Edited 4 Years Ago by Gribouillis

You can use an NDFA (NFA) where each transition character in your short string triggers a transition in the NFA, if possible. The google knows about how you might want to start doing that.

If you have only not optional alternatives without repeats in [], you could do something like this to avoid complicated transformations (you could improve the matching with more advanced string matching algorithm) :

dna = r'atata[atg]ca[atgc]atatagtagcctgt'

grouping = []
s = set()
reading = False
for acid in dna:
    if acid == '[':
        s = set()
        reading = True
    elif acid == ']':
        reading = False
        grouping.append(s)
    elif reading:
        s.add(acid)
    else:
        grouping.append(acid)

print grouping

target = 'tat'
matches = list()

for index in range(len(grouping) - len(target) + 1):
    if all((c1 in c2) for c1, c2 in zip(target, grouping[index:])):
        g = grouping[index:index+len(target)]
        matches.append((index, g))

if matches:
    print 'Match at:'
    for index, g in matches:
        print index,':',g
else:
    print 'Match not found'

Edited 4 Years Ago by pyTony

This article has been dead for over six months. Start a new discussion instead.