I am working on Protein and DNA sequences. When you have a protein sequence it can be translated in to DNA sequence in many ways. A nice way to express this is using regular expression. I would like to create this long regular expression for a protein sequence, and than seach for a smaller string. So that is the other way around of what regular expression does. Is there a way to do this?

a short example would be


search for


Edited by aint

4 Years
Discussion Span
Last Post by pyTony

I see a way, but there is some work to do. First you convert your regular expression to a NFA (non deterministic finite automaton) with a start state and
an accept state. In this NFA, you add an epsilon-transition from the start state to every other state and an epsilon transition from any state to the accept state.
This NFA should be able to find your subsequences.

A second step could be to convert the NFA into a deterministic automaton (DFA) to produce a fast tool. You could use a program like this one for the conversion (again there is some work, because the NFA in this link doesn't seem to have epsilon transitions).

Edited by Gribouillis


You can use an NDFA (NFA) where each transition character in your short string triggers a transition in the NFA, if possible. The google knows about how you might want to start doing that.


If you have only not optional alternatives without repeats in [], you could do something like this to avoid complicated transformations (you could improve the matching with more advanced string matching algorithm) :

dna = r'atata[atg]ca[atgc]atatagtagcctgt'

grouping = []
s = set()
reading = False
for acid in dna:
    if acid == '[':
        s = set()
        reading = True
    elif acid == ']':
        reading = False
    elif reading:

print grouping

target = 'tat'
matches = list()

for index in range(len(grouping) - len(target) + 1):
    if all((c1 in c2) for c1, c2 in zip(target, grouping[index:])):
        g = grouping[index:index+len(target)]
        matches.append((index, g))

if matches:
    print 'Match at:'
    for index, g in matches:
        print index,':',g
    print 'Match not found'

Edited by pyTony

This topic has been dead for over six months. Start a new discussion instead.
Have something to contribute to this discussion? Please be thoughtful, detailed and courteous, and be sure to adhere to our posting rules.