Hi,

I am working on Protein and DNA sequences. When you have a protein sequence it can be translated in to DNA sequence in many ways. A nice way to express this is using regular expression. I would like to create this long regular expression for a protein sequence, and than seach for a smaller string. So that is the other way around of what regular expression does. Is there a way to do this?

a short example would be

DNAsequence
r'atata[atg]ca[atgc]...atatagtagcctgt'

search for
'agcacg'

Thanks

I see a way, but there is some work to do. First you convert your regular expression to a NFA (non deterministic finite automaton) with a start state and
an accept state. In this NFA, you add an epsilon-transition from the start state to every other state and an epsilon transition from any state to the accept state.
This NFA should be able to find your subsequences.

A second step could be to convert the NFA into a deterministic automaton (DFA) to produce a fast tool. You could use a program like this one for the conversion (again there is some work, because the NFA in this link doesn't seem to have epsilon transitions).

You can use an NDFA (NFA) where each transition character in your short string triggers a transition in the NFA, if possible. The google knows about how you might want to start doing that.

If you have only not optional alternatives without repeats in [], you could do something like this to avoid complicated transformations (you could improve the matching with more advanced string matching algorithm) :

dna = r'atata[atg]ca[atgc]atatagtagcctgt'

grouping = []
s = set()
reading = False
for acid in dna:
    if acid == '[':
        s = set()
        reading = True
    elif acid == ']':
        reading = False
        grouping.append(s)
    elif reading:
        s.add(acid)
    else:
        grouping.append(acid)

print grouping

target = 'tat'
matches = list()

for index in range(len(grouping) - len(target) + 1):
    if all((c1 in c2) for c1, c2 in zip(target, grouping[index:])):
        g = grouping[index:index+len(target)]
        matches.append((index, g))

if matches:
    print 'Match at:'
    for index, g in matches:
        print index,':',g
else:
    print 'Match not found'