A quadruplex sequence in a genome has the form Gx Ny1 Gx Ny2 Gx Ny3 Gx, where G is the guanine base and each N stands for other bases. Here x, y1, y2 and y3 are integers. A segment counts as a quadruplex sequence if x >= 2. I want to count how many such segments are present in a FASTA sequence and what their types are in terms of x, y1, y2 and y3. Any help with this would be appreciated.

You may want to start using biopython as it has many classes and methods for handling sequence data, especially from a fasta-formatted file. It may already do some of the work for you.
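
To illustrate the kind of housekeeping such a library takes care of, here is a minimal plain-Python sketch. It is not Biopython's API: `read_fasta_sequence` is a made-up helper name, and it assumes the usual FASTA layout of a `>` header line followed by sequence lines. It drops the header and the line breaks so that match positions refer to the sequence itself rather than to the raw file text.

```python
def read_fasta_sequence(text):
    """Return the sequence of the first record in FASTA-formatted text.

    Hypothetical helper for illustration only; assumes '>' header lines
    followed by one or more sequence lines.
    """
    parts = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if seen_header or parts:
                break  # a second record starts here; keep only the first
            seen_header = True
            continue
        parts.append(line.upper())  # normalise case so 'g' and 'G' agree
    return ''.join(parts)


example = ">demo record\nGGATGG\nccGGTT\n"
print(read_fasta_sequence(example))
```

For real genome files a maintained parser is probably the safer choice; the sketch only shows why the raw file text should not be searched directly.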

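import re

# Read the whole FASTA file as one string (header line and newlines included)
fasta = open('e-coli-k12.fasta', 'r').read()
# Each match is a run of two or more Gs followed by one or more non-G characters
segments = re.findall('GG+[^G]+', fasta)
print(segments)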

This script finds every segment of the sequence that starts with two or more G letters. But I can't extract the start and end positions of those matches.

I have tried with

import re
fasta = open('e-coli-k12.fasta', 'r').read()
segments = re.compile('GG+[^G]+')
for item in segments.findall(fasta):
    print(item)
    # re.search returns only the first occurrence of item in the whole text
    found = re.search(item, fasta)
    print(found.span())

but without success. It searches for each matched string over the whole sequence again, so a string that occurs several times does not get its own position. I want the one exact start and end position of each match, corresponding to its place in the FASTA sequence. How do I get the start and end position of a particular pattern?

As @dashing.adamhughes posted, there are a lot of libraries that can help you with this; another example is pyfasta.

How to get the start and end position of a particular pattern

You can use re.finditer for this.

>>> import re
>>> s = '11ATGC1111ATGC11111ATGC'
>>> p = 'ATGC'
>>> [m.start() for m in re.finditer(p, s)]
[2, 10, 19]
>>> # To also find the end position
>>> [(m.start(),m.end()) for m in re.finditer(p, s)]
[(2, 6), (10, 14), (19, 23)]
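
Applied to the original question, the same `re.finditer` idea can report the type and position of each quadruplex in one pass. The following is only a sketch under assumptions the thread does not confirm: all four G-runs have the same length x >= 2 (enforced with the backreference `\1`), each loop is one or more non-G bases (as in the `[^G]` of the original attempt), the sequence is upper-case with header and newlines already removed, and matches do not overlap. `find_quadruplexes` and `QUADRUPLEX` are names invented for this example.

```python
import re
from collections import Counter

# Assumed motif: Gx Ny1 Gx Ny2 Gx Ny3 Gx with the same x (>= 2) in all four
# runs and loops made only of A, C or T. Adjust the character class if the
# loops are allowed to contain other symbols.
QUADRUPLEX = re.compile(r'(G{2,})([ACT]+)\1([ACT]+)\1([ACT]+)\1')


def find_quadruplexes(sequence):
    """Return (start, end, x, y1, y2, y3) for each non-overlapping match."""
    hits = []
    for m in QUADRUPLEX.finditer(sequence):
        x = len(m.group(1))
        y1, y2, y3 = (len(m.group(i)) for i in (2, 3, 4))
        hits.append((m.start(), m.end(), x, y1, y2, y3))
    return hits


demo = 'TTGGAGGCTGGAAAGGTTGGGAGGGTGGGCGGGA'
hits = find_quadruplexes(demo)
for hit in hits:
    print(hit)

# Count how often each (x, y1, y2, y3) type occurs
type_counts = Counter(hit[2:] for hit in hits)
print(type_counts)
```

Because each match object carries its own `start()` and `end()`, there is no need to search for the matched string a second time, which is what caused the wrong positions in the earlier attempt.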

Edited 3 Years Ago by snippsat
