
Hi, I am trying to parse a file of multiple pairwise alignments into a table. For example, I want to turn this:


Query= m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/32965/0_332_clipped_50:0
(282 letters)

Query: 8 TTTTTGAACAGCCCCAACAACTCTTCCGCTGCCGGTTGCTGCA-TTCCAGTTGTTCCACA 66
||||||||||||||||||||||||||||||||||||||||||| |||||||||||| |||
Sbjct: 4045830 TTTTTGAACAGCCCCAACAACTCTTCCGCTGCCGGTTGCTGCACTTCCAGTTGTTC-ACA 4045772

Query: 67 GTCCAGCTCCAGTTCAACGTCGGTTTAAATCGTCG--AGCT-GTATGAGAGATAAGCATA 123
| ||||||||||||||||||||||| |||||||| |||| |||||||||||||||| |
Sbjct: 4045771 GGTCAGCTCCAGTTCAACGTCGGTTTTAATCGTCGCCAGCTGGTATGAGAGATAAGCA-A 4045713


Query= m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/56521/6_684_clipped_527:0
(151 letters)

Query: 1 CTTCAAAGAGGGAGAATTACGTCGATATTACCGAAGGCTGGGAGAAGGGTGAAAATACAA 60
||||||||||||||||||||||||||||||| ||||||||||||| |||||||| |||||
Sbjct: 1500035 CTTCAAAGAGGGAGAATTACGTCGATATTAC-GAAGGCTGGGAGA-GGGTGAAA-TACAA 1500091

Query: 61 G--AGACGCTCGGCGAGCTGGCGCCG-ACCGACGCCCACGTTAATCG-ATTAAACTGCGT 116
||||| ||||||||||| | ||| ||||||||| ||| |||||| ||||||||||||
Sbjct: 1500092 TGAAGACG-TCGGCGAGCTG-CACCGCACCGACGCCAACGGTAATCGTATTAAACTGCGT 1500149


into a table like the one below:

m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/32965/0_332_clipped_50:0 '\t' TTTTTGAACAGCCCCAACAACTCTTCCGCTGCCGGTTGCTGCA-TTCCAGTTGTTCCACAGTCCAGCTCCAGTTCAACGTCGGTTTAAATCGTCG--AGCT-GTATGAGAGATAAGCATA
||||||||||||||||||||||||||||||||||||||||||| |||||||||||| |||| ||||||||||||||||||||||| |||||||| |||| |||||||||||||||| |
TTTTTGAACAGCCCCAACAACTCTTCCGCTGCCGGTTGCTGCACTTCCAGTTGTTC-ACAGGTCAGCTCCAGTTCAACGTCGGTTTTAATCGTCGCCAGCTGGTATGAGAGATAAGCA-A

m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/56521/6_684_clipped_527:0 '\t'

CTTCAAAGAGGGAGAATTACGTCGATATTACCGAAGGCTGGGAGAAGGGTGAAAATACAAG--AGACGCTCGGCGAGCTGGCGCCG-ACCGACGCCCACGTTAATCG-ATTAAACTGCGT
||||||||||||||||||||||||||||||| ||||||||||||| |||||||| ||||| ||||| ||||||||||| | ||| ||||||||| ||| |||||| ||||||||||||
CTTCAAAGAGGGAGAATTACGTCGATATTAC-GAAGGCTGGGAGA-GGGTGAAA-TACAATGAAGACG-TCGGCGAGCTG-CACCGCACCGACGCCAACGGTAATCGTATTAAACTGCGT


I tried to write a Python program to do this, shown below:


#!/usr/bin/env python

import sys

class Fasta:
    def __init__(self, name, pwiseseq):
        self.name = name
        self.pwiseseq = pwiseseq

def read_pw(file):
    items = []
    index = 0
    for line in file:
        if not line.strip():
            continue

        if line.startswith("Query="):
            if index >= 1:
                items.append(aninstance)
            index+=1
            name = line[7:-1]
        if line.find('Query:') >= 0:
            QseqPW = ''
            QseqPW = (line[7:-1]).strip('0123456789 ')

            aninstance = Fasta(name, QseqPW)

    items.append(aninstance)
    return items


filePW = open(sys.argv[1], 'r').readlines()

mydatasets = read_pw(filePW)


for i in mydatasets:
    print i.name + '\t' + i.pwiseseq

Unfortunately, the output I get shows only the last sequence alignment line for each sequence header, like below:
m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/32965/0_332_clipped_50:0 '\t' GTCCAGCTCCAGTTCAACGTCGGTTTAAATCGTCG--AGCT-GTATGAGAGATAAGCATA
m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/56521/6_684_clipped_527:0 '\t'
G--AGACGCTCGGCGAGCTGGCGCCG-ACCGACGCCCACGTTAATCG-ATTAAACTGCGT

Can anybody help me solve this? Thanks.

Last Post by griswolf

I have two problems understanding:

  1. You must use the (CODE) button to make sure that indentation and line numbers are correct. Without indentation, Python code is very hard to read.
  2. Your data is too complicated to understand easily. Can you make up something much shorter that has the same format? Put the made-up file data in a (CODE) segment too, please.
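
That said, the symptom described (only the last alignment line surviving for each header) is what one would expect if every `Query:` line replaces the stored sequence instead of being appended to it. Below is a minimal sketch of the accumulating approach, not a definitive fix. It assumes the real file keeps BLAST's column alignment, so that the match line sits directly under the `Query:` line with its leading spaces intact (the whitespace in the post above looks collapsed). The names `read_pairwise` and `format_record` and the dictionary keys are made up for illustration.

```python
import re

# Matches e.g. "Query: 8 TTTTG-AAC 66" or "Sbjct: 4045830 TTTTGCAAC 4045772".
ALIGNED = re.compile(r"^(Query|Sbjct):\s*\d+\s+(\S+)")


def read_pairwise(lines):
    """Collect one record per 'Query=' header, concatenating every chunk."""
    records = []
    span = None  # (start, end) columns of the sequence on the latest Query: line
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("Query="):
            # New header: start a fresh record with empty strings to append to.
            name = line[len("Query="):].strip()
            records.append({"name": name, "query": "", "match": "", "sbjct": ""})
            span = None
            continue
        if not records:
            continue  # ignore anything before the first header
        m = ALIGNED.match(line)
        if m and m.group(1) == "Query":
            records[-1]["query"] += m.group(2)  # append, do not overwrite
            span = m.span(2)
        elif m:
            records[-1]["sbjct"] += m.group(2)
            span = None
        elif span is not None:
            # The line right after a Query: line is taken to be the match line;
            # cut the same columns and pad in case trailing spaces were stripped.
            start, end = span
            records[-1]["match"] += line[start:end].ljust(end - start)
            span = None
    return records


def format_record(rec):
    """Name, a tab, then the three concatenated strings on separate lines."""
    return "\n".join([rec["name"] + "\t" + rec["query"], rec["match"], rec["sbjct"]])
```

To produce the table, pass the open file (or the result of `readlines()`) to `read_pairwise` and print `format_record(rec)` for each record. Keeping the three strings in one record per header is what lets each new chunk be appended rather than overwriting the previous one.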

