Hi, I am trying to parse a file of multiple pairwise alignments into a table. For example, I want to turn this:


Query= m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/32965/0_332_clipped_50:0
(282 letters)

Query: 8 TTTTTGAACAGCCCCAACAACTCTTCCGCTGCCGGTTGCTGCA-TTCCAGTTGTTCCACA 66
||||||||||||||||||||||||||||||||||||||||||| |||||||||||| |||
Sbjct: 4045830 TTTTTGAACAGCCCCAACAACTCTTCCGCTGCCGGTTGCTGCACTTCCAGTTGTTC-ACA 4045772

Query: 67 GTCCAGCTCCAGTTCAACGTCGGTTTAAATCGTCG--AGCT-GTATGAGAGATAAGCATA 123
| ||||||||||||||||||||||| |||||||| |||| |||||||||||||||| |
Sbjct: 4045771 GGTCAGCTCCAGTTCAACGTCGGTTTTAATCGTCGCCAGCTGGTATGAGAGATAAGCA-A 4045713


Query= m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/56521/6_684_clipped_527:0
(151 letters)

Query: 1 CTTCAAAGAGGGAGAATTACGTCGATATTACCGAAGGCTGGGAGAAGGGTGAAAATACAA 60
||||||||||||||||||||||||||||||| ||||||||||||| |||||||| |||||
Sbjct: 1500035 CTTCAAAGAGGGAGAATTACGTCGATATTAC-GAAGGCTGGGAGA-GGGTGAAA-TACAA 1500091

Query: 61 G--AGACGCTCGGCGAGCTGGCGCCG-ACCGACGCCCACGTTAATCG-ATTAAACTGCGT 116
||||| ||||||||||| | ||| ||||||||| ||| |||||| ||||||||||||
Sbjct: 1500092 TGAAGACG-TCGGCGAGCTG-CACCGCACCGACGCCAACGGTAATCGTATTAAACTGCGT 1500149


into a table like the one below:

m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/32965/0_332_clipped_50:0 '\t' TTTTTGAACAGCCCCAACAACTCTTCCGCTGCCGGTTGCTGCA-TTCCAGTTGTTCCACAGTCCAGCTCCAGTTCAACGTCGGTTTAAATCGTCG--AGCT-GTATGAGAGATAAGCATA
||||||||||||||||||||||||||||||||||||||||||| |||||||||||| |||| ||||||||||||||||||||||| |||||||| |||| |||||||||||||||| |
TTTTTGAACAGCCCCAACAACTCTTCCGCTGCCGGTTGCTGCACTTCCAGTTGTTC-ACAGGTCAGCTCCAGTTCAACGTCGGTTTTAATCGTCGCCAGCTGGTATGAGAGATAAGCA-A

m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/56521/6_684_clipped_527:0 '\t' CTTCAAAGAGGGAGAATTACGTCGATATTACCGAAGGCTGGGAGAAGGGTGAAAATACAAG--AGACGCTCGGCGAGCTGGCGCCG-ACCGACGCCCACGTTAATCG-ATTAAACTGCGT
||||||||||||||||||||||||||||||| ||||||||||||| |||||||| ||||| ||||| ||||||||||| | ||| ||||||||| ||| |||||| ||||||||||||
CTTCAAAGAGGGAGAATTACGTCGATATTAC-GAAGGCTGGGAGA-GGGTGAAA-TACAATGAAGACG-TCGGCGAGCTG-CACCGCACCGACGCCAACGGTAATCGTATTAAACTGCGT


I tried to write a Python program to do this, shown below:


#!/usr/bin/env python

import sys

class Fasta:
    def __init__(self, name, pwiseseq):
        self.name = name
        self.pwiseseq = pwiseseq

def read_pw(file):
    items = []
    index = 0
    for line in file:
        if not line.strip():
            continue

        if line.startswith("Query="):
            if index >= 1:
                items.append(aninstance)
            index += 1
            name = line[7:-1]
        if line.find('Query:') >= 0:
            QseqPW = ''
            QseqPW = (line[7:-1]).strip('0123456789 ')

            aninstance = Fasta(name, QseqPW)

    items.append(aninstance)
    return items


filePW = open(sys.argv[1], 'r').readlines()

mydatasets = read_pw(filePW)


for i in mydatasets:
    print i.name + '\t' + i.pwiseseq

Unfortunately, the output I get shows only the last sequence alignment line for each sequence header, like below:
m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/32965/0_332_clipped_50:0 '\t' GTCCAGCTCCAGTTCAACGTCGGTTTAAATCGTCG--AGCT-GTATGAGAGATAAGCATA
m100529_140129_SMRT1_c0000010190006406181231110_s0_p0/56521/6_684_clipped_527:0 '\t' G--AGACGCTCGGCGAGCTGGCGCCG-ACCGACGCCCACGTTAATCG-ATTAAACTGCGT

Can anybody help me solve this? Thanks.

I have two problems understanding:

  1. You must use the (CODE) button so that indentation and line numbers are preserved. Without indentation, Python code is very hard to follow.
  2. Your data is too complicated to understand easily. Can you make up something much shorter that has the same format? Please put the made-up file data in a (CODE) segment too.

Edited 4 Years Ago by mike_2000_17: Fixed formatting
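
For what it's worth, a likely cause of the behaviour described in the question: every 'Query:' line builds a brand-new Fasta object, so the one built from the previous 'Query:' line is thrown away, and only the last alignment block of each record is still around when the next 'Query=' header (or the end of the file) is reached. The sketch below accumulates the pieces instead. It is only a rough illustration under some assumptions: the names parse_pairwise, ALIGN_RE and SAMPLE are made up for this example, the sample data is invented and much shorter than the real file, and the match line is assumed to sit directly under the 'Query:' line with its columns lined up, as in standard BLAST pairwise output (the data pasted above seems to have lost that padding).

```python
import re

# Made-up, shortened sample in the same layout as the question's input.
# Assumption: sequence columns of Query/match/Sbjct lines are aligned.
SAMPLE = """\
Query= read_1
(18 letters)

Query: 1   ACGT-ACGTA 9
           |||| || ||
Sbjct: 101 ACGTAAC-TA 109

Query: 10  GG-TTACCAA 18
           || |||| ||
Sbjct: 110 GGATTAC-AA 118


Query= read_2
(9 letters)

Query: 1  TTGACCA-GT 9
          ||||| | ||
Sbjct: 55 TTGAC-AAGT 63
"""

# "Query: <start> <sequence> <end>" or "Sbjct: <start> <sequence> <end>"
ALIGN_RE = re.compile(r'^(Query|Sbjct):\s*\d+\s+(\S+)\s+\d+\s*$')


def parse_pairwise(lines):
    """Return one (name, query, match, sbjct) tuple per 'Query=' record."""
    records = []
    current = None   # record being filled in
    span = None      # columns of the sequence on the last 'Query:' line
    for raw in lines:
        line = raw.rstrip('\r\n')
        if line.startswith('Query='):
            # New record: start fresh lists that later blocks append to.
            current = {'name': line[len('Query='):].strip(),
                       'query': [], 'match': [], 'sbjct': []}
            records.append(current)
            span = None
            continue
        m = ALIGN_RE.match(line)
        if m and current is not None:
            if m.group(1) == 'Query':
                current['query'].append(m.group(2))
                span = m.span(2)      # remember where the sequence sits
            else:
                current['sbjct'].append(m.group(2))
                span = None
        elif span is not None and current is not None:
            # The line right after a 'Query:' line is the match line.
            # Cut out the same columns and pad in case trailing
            # spaces were stripped.
            start, end = span
            current['match'].append(line[start:end].ljust(end - start))
            span = None
    return [(r['name'], ''.join(r['query']),
             ''.join(r['match']), ''.join(r['sbjct']))
            for r in records]


records = parse_pairwise(SAMPLE.splitlines())
for rec in records:
    # One tab-separated row per record: name, query, match, subject.
    print('\t'.join(rec))
```

To run it on a real file, something like parse_pairwise(open(sys.argv[1])) should work in place of the SAMPLE string (with import sys added). If you would rather have the three strings on separate lines, as in the table in the question, only the final print loop needs to change.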
