extracting subsequence from the sequence

Question

parijat24 0 Newbie Poster

13 Years Ago

Hey, thanks to everyone who has helped me learn this language.
I have another problem, again involving text files. The first file is
file1.txt:

>sp|P81928[/B]|140U_DROME

67 198 Tim17 8.9e-19 No_clan

>sp|P20905|5HT1R_DROME

179 507 7tm_1 1.1e-97 CL0192

>sp|P28285|5HT2A_DROME

243 805 7tm_1 3.2e-73 CL0192

>sp|P28286|5HT2B_DROME

107 588 7tm_1 7.2e-82 CL0192

* Here the two numbers are the start and end positions of the subsequence that has to be extracted.

The next file is the sequence file, file2.txt:

>sp|P81928|140U_DROME RPII140-upstream gene protein OS=Drosophila melanogaster GN=140up PE=2 SV=2
MNFLWKGRRFLIAGILPTFEGAADEIVDKENKTYKAFLASKPPEETGLERLKQMFTIDEF
GSISSELNSVYQAGFLGFLIGAIYGGVTQSRVAYMNFMENNQATAFKSHFDAKKKLQDQF
TVNFAKGGFKWGWRVGLFTTSYFGIITCMSVYRGKSSIYEYLAAGSITGSLYKVSLGLRG
MAAGGIIGGFLGGVAGVTSLLLMKASGTSMEEVRYWQYKWRLDRDENIQQAFKKLTEDEN
PELFKAHDEKTSEHVSLDTIK
>sp|P20905|5HT1R_DROME 5-hydroxytryptamine receptor 1 OS=Drosophila melanogaster GN=5-HT7 PE=2 SV=1
MALSGQDWRRHQSHRQHRNHRTQGNHQKLISTATLTLFVLFLSSWIAYAAGKATVPAPLV
EGETESATSQDFNSSSAFLGAIASASSTGSGSGSGSGSGSGSGSGSYGLASMNSSPIAIV
SYQGITSSNLGDSNTTLVPLSDTPLLLEEFAAGEFVLPPLTSIFVSIVLLIVILGTVVGN
VLVCIAVCMVRKLRRPCNYLLVSLALSDLCVALLVMPMALLYEVLEKWNFGPLLCDIWVS
FDVLCCTASILNLCAISVDRYLAITKPLEYGVKRTPRRMMLCVGIVWLAAACISLPPLLI
LGNEHEDEEGQPICTVCQNFAYQIYATLGSFYIPLSVMLFVYYQIFRAARRIVLEEKRAQ
THLQQALNGTGSPSAPQAPPLGHTELASSGNGQRHSSVGNTSLTYSTCGGLSSGGGALAG
HGSGGGVSGSTGLLGSPHHKKLRFQLAKEKKASTTLGIIMSAFTVCWLPFFILALIRPFE
TMHVPASLSSLFLWLGYANSLLNPIIYATLNRDFRKPFQEILYFRCSSLNTMMRENYYQD
QYGEPPSQRVMLGDERHGARESFL
>sp|P28285|5HT2A_DROME 5-hydroxytryptamine receptor 2A OS=Drosophila melanogaster GN=5-HT1A PE=2 SV=2
MAHETSFNDALDYIYIANSMNDRAFLIAEPHPEQPNVDGQDQDDAELEELDDMAVTDDGQ
LEDTNNNNNSKRYYSSGKRRADFIGSLALKPPPTDVNTTTTTAGSPLATAALAAAAASAS
VAAAAARITAKAAHRALTTKQDATSSPASSPALQLIDMDNNYTNVAVGLGAMLLNDTLLL
EGNDSSLFGEMLANRSGQLDLINGTGGLNVTTSKVAEDDFTQLLRMAVTSVLLGLMILVT
IIGNVFVIAAIILERNLQNVANYLVASLAVADLFVACLVMPLGAVYEISQGWILGPELCD
IWTSCDVLCCTASILHLVAIAVDRYWAVTNIDYIHSRTSNRVFMMIFCVWTAAVIVSLAP
QFGWKDPDYLQRIEQQKCMVSQDVSYQVFATCCTFYVPLLVILALYWKIYQTARKRIHRR
RPRPVDAAVNNNQPDGGAATDTKLHRLRLRLGRFSTAKSKTGSAVGVSGPASGGRALGLV
DGNSTNTVNTVEDTEFSSSNVDSKSRAGVEAPSTSGNQIATVSHLVALAKQQGKSTAKSS
AAVNGMAPSGRQEDDGQRPEHGEQEDREELEDQDEQVGPQPTTATSATTAAGTNESEDQC
KANGVEVLEDPQLQQQLEQVQQLQKSVKSGGGGGASTSNATTITSISALSPQTPTSQGVG
IAAAAAGPMTAKTSTLTSCNQSHPLCGTANESPSTPEPRSRQPTTPQQQPHQQAHQQQQQ
QQQLSSIANPMQKVNKRKETLEAKRERKAAKTLAIITGAFVVCWLPFFVMALTMPLCAAC
QISDSVASLFLWLGYFNSTLNPVIYTIFSPEFRQAFKRILFGGHRPVHYRSGKL

I want to extract the subsequence from each of these sequences, matching the records by protein ID.

python


jcao219 18 Posting Pro in Training

13 Years Ago

I see you are working with Drosophila genetics.
I'm not sure what you are trying to do.. can you explain it further?

ultimatebuster 14 Posting Whiz in Training

13 Years Ago

wow that's some huge geek cred for knowing what that is.

Anyhow i don't exactly know what you want either so yeah.


TrustyTony 888 pyMod Team Colleague Featured Poster · Answer 1 · 2010-07-18T14:53:57+00:00

For example, the first line of the first file looks like it is in a different format (a closing bold tag with no opening tag). Otherwise, is it that you want to pick the identifier between > and |, i.e. the first nine characters after the > (>sp|P20905| gives id = sp|P20905), and when it is found, take the start and end indexes from two lines down (0- or 1-based?)

Something like this for file1:

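# Parse file1.txt into (id, start, end) tuples.
with open('file1.txt') as f:
    inp = f.read()

sep = '>'
data = []
while sep:
    part, sep, inp = inp.partition(sep)
    # Test only the part: the last record has no '>' after it.
    fields = part.split(None, 1)  # [header, rest], with or without blank lines
    if len(fields) == 2:
        data.append(fields)

# Keep the whole 'sp|P81928' prefix (9 characters).
idend = len('sp|P81928')
info = [(ident[:idend],) + tuple(loc.split()[:2])
        for ident, loc in data if ident.startswith('s')]
print(info)
# Convert the start and end fields from strings to integers.
info = [(a, int(b), int(c)) for a, b, c in info]
print(info)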
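
The second file could probably be read in a similar spirit. Here is a minimal sketch, assuming file2.txt is ordinary FASTA (a header line starting with > followed by sequence lines) and that keying each sequence by the same 'sp|ACCESSION' prefix is what is wanted. The function name read_fasta and the sample records are made up for illustration; they are not from the original files.

```python
def read_fasta(lines):
    """Map 'sp|ACCESSION' to the full sequence for each FASTA record."""
    chunks = {}
    key = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # '>sp|P81928|140U_DROME RPII140-...' -> 'sp|P81928'
            key = '|'.join(line[1:].split('|')[:2])
            chunks[key] = []
        elif key is not None:
            # Sequence lines are wrapped, so collect and join them later.
            chunks[key].append(line)
    return {k: ''.join(parts) for k, parts in chunks.items()}


# Tiny made-up records, only to show the shape of the result.
sample = [
    '>sp|P00001|DEMO1_DROME hypothetical record',
    'MNFLW',
    'KGRRF',
    '>sp|P00002|DEMO2_DROME hypothetical record',
    'MALSG',
]
seqs = read_fasta(sample)
print(seqs['sp|P00001'])  # MNFLWKGRRF
```

With the real file, passing the open file object should work the same way, since the function only needs an iterable of lines: with open('file2.txt') as f: seqs = read_fasta(f).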

parijat24 0 Newbie Poster · Answer 2 · 2010-07-18T18:06:22+00:00

No, it is not like that, sorry.

I just want this: from a given sequence of length 400, I have to cut out the part of the sequence ranging from one index to another index.
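
A cut like that can be sketched with a plain string slice. This assumes the start and end numbers are 1-based and inclusive, which nobody in the thread has confirmed; the helper name cut is only for illustration.

```python
def cut(seq, start, end):
    """Return residues start..end of seq.

    Assumption: start and end are 1-based and inclusive, so they are
    shifted to Python's 0-based, end-exclusive slice.
    """
    if not 1 <= start <= end <= len(seq):
        raise ValueError('range %d-%d does not fit a sequence of length %d'
                         % (start, end, len(seq)))
    return seq[start - 1:end]


seq = 'MNFLWKGRRF'
print(cut(seq, 3, 6))  # FLWK
```

If the numbers turn out to be 0-based instead, the slice would be seq[start:end + 1] (still assuming the end is inclusive).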

TrustyTony 888 pyMod Team Colleague Featured Poster · Answer 3 · 2010-07-18T18:08:52+00:00

If those were not the right ids and indexes, I am afraid I cannot help you.

jcao219 18 Posting Pro in Training · Answer 4 · 2010-07-19T06:52:28+00:00

ultimatebuster wrote: "wow that's some huge geek cred for knowing what that is. Anyhow i don't exactly know what you want either so yeah."

Heh, I'm great at Biology (got first place in a state competition), so Drosophila melanogaster is very familiar to me. It's a fruitfly. I've helped raise them before.

Anyways, OP, can you provide us with some kind of desired output, so that we know exactly what you want?
