Searching quadruplex sequence from a FASTA file by using Python

Question

sudipta.mml 0 Newbie Poster

12 Years Ago

A quadruplex sequence in a genome has the form Gx Ny1 Gx Ny2 Gx Ny3 Gx, where G is the guanine base and each N stands for any other base. The values x, y1, y2 and y3 are integers, and a segment counts as a quadruplex sequence if x >= 2. I want to count how many such segments are present in a FASTA sequence and find their types in terms of x, y1, y2 and y3. Any help would be appreciated.

python

All 4 Replies

Matej_1 14 Newbie Poster

7 Years Ago

If you don't mind calling R from Python (or vice versa), the pqsfinder package for R (http://bioconductor.org/packages/pqsfinder/) handles most quadruplex sequence search and manipulation tasks. For example, it already has functions to retrieve loop lengths, positions, etc.

Gribouillis commented: Thanks for sharing. +14

dashing.adamhughes 0 Light Poster · Answer 1 · 2013-01-11T20:05:57+00:00

You may want to start using Biopython, as it has many classes and methods for handling sequence data, especially from FASTA-formatted files. It may already do some of the work for you.
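To give an idea of what such a parser takes care of, here is a rough plain-Python sketch. It is not Biopython's API: `parse_fasta` is a hypothetical helper written for illustration, and it assumes a well-formed file in which every sequence is preceded by a `>` header line. With Biopython installed, `SeqIO.parse(handle, "fasta")` is the usual way to do the same job.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration only. Sequence lines that
    appear before the first '>' header are silently dropped.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# A list of lines stands in for an open file here; an open file
# handle can be passed in exactly the same way.
demo = [">seq1 demo", "ATGG", "GACG", "", ">seq2", "GGTT"]
for header, sequence in parse_fasta(demo):
    print(header, sequence)
```

Joining the sequence lines matters for pattern searches, because a raw `read()` of the file keeps the header text and the line breaks inside the string.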

sudipta.mml 0 Newbie Poster · Answer 2 · 2013-01-12T04:38:15+00:00

import re
fasta = open('e-coli-k12.fasta', 'r').read()
# Two or more G's followed by one or more non-G characters
segments = re.findall(r'GG+[^G]+', fasta)
print(segments)

This script finds every segment that starts with two or more G letters. But I can't extract the start and end position of each match.

I have tried with

import re
fasta = open('e-coli-k12.fasta', 'r').read()
segments = re.compile(r'GG+[^G]+')
for item in segments.findall(fasta):
    print(item)
    # re.search returns only the first place where item occurs
    found = re.search(item, fasta)
    print(found.span())

but without success. It searches for each pattern across the whole sequence and prints several positions for the same pattern. I want only the single start and end position that corresponds to each match in the FASTA sequence. How do I get the start and end position of a particular pattern?

snippsat 661 Master Poster · Answer 3 · 2013-01-13T02:11:05+00:00

As @dashing.adamhughes posted, there are a lot of tools that can help you with this; another example is pyfasta.

How to get the start and end position of a particular pattern

You can use re.finditer for this.

>>> import re
>>> s = '11ATGC1111ATGC11111ATGC'
>>> p = 'ATGC'
>>> [m.start() for m in re.finditer(p, s)]
[2, 10, 19]
>>> # To also find the end position
>>> [(m.start(),m.end()) for m in re.finditer(p, s)]
[(2, 6), (10, 14), (19, 23)]
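Applied to the pattern from the question, re.finditer can report both the position and the x, y1, y2, y3 values of each segment. The following is only a minimal sketch under assumptions that are not stated in the thread: the loops contain no G at all, the four G runs may differ in length (so x would be the shortest run), the sequence is one string with the header and line breaks already removed, and matches do not overlap. The names `QUADRUPLEX` and `find_quadruplexes` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: four runs of two or more G, separated by three loops of
# one or more non-G characters (Gx Ny1 Gx Ny2 Gx Ny3 Gx with x >= 2).
QUADRUPLEX = re.compile(r"(G{2,})([^G]+)(G{2,})([^G]+)(G{2,})([^G]+)(G{2,})")


def find_quadruplexes(sequence):
    """Return (start, end, g_run_lengths, loop_lengths) for each match."""
    hits = []
    for m in QUADRUPLEX.finditer(sequence.upper()):
        g_runs = tuple(len(m.group(i)) for i in (1, 3, 5, 7))
        loops = tuple(len(m.group(i)) for i in (2, 4, 6))
        hits.append((m.start(), m.end(), g_runs, loops))
    return hits


demo = "ATGGGACGGGTTAGGGCGGGAT"
hits = find_quadruplexes(demo)
for start, end, g_runs, loops in hits:
    print(start, end, g_runs, loops)

# Count how often each (G-run lengths, loop lengths) type occurs.
types = Counter((g_runs, loops) for _, _, g_runs, loops in hits)
print(len(hits), types)
```

For the demo string the loop prints `2 20 (3, 3, 3, 3) (2, 3, 1)`: one segment from position 2 to 20 with x = 3 and loops of length 2, 3 and 1. Because finditer returns non-overlapping matches, a stretch with five or more G runs is reported once, using its first four runs.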
