Parsing a text file and comparing the lines in the text

Question

parijat24 0 Newbie Poster

14 Years Ago

The actual file is as follows

>sp|P81928|140U_DROME
67      198     Tim17   8.9e-19 No_clan
>sp|P20905|5HT1R_DROME
179     507     7tm_1   1.1e-97 CL0192
>sp|P28285|5HT2A_DROME
243     805     7tm_1   3.2e-73 CL0192
>sp|P28286|5HT2B_DROME
107     588     7tm_1   7.2e-82 CL0192
>sp|P13368|7LESS_DROME
439     520     fn3     1.4e-10 CL0159
1313    1380    fn3     3.4e-05 CL0159
1800    1890    fn3     3.6e-12 CL0159
2209    2481    Pkinase_Tyr     3.7e-91 CL0016
>sp|P14599|A4_DROME
26      198     A4_EXTRA        4.9e-75 No_clan
826     884     APP_amyloid     1.5e-24 No_clan
>sp|P91927|A60DA_DROME
178     446     LETM1   3.7e-114        No_clan
>sp|Q24093|ABHD2_DROME
158     362     Abhydrolase_1   5.5e-07 CL0028
>sp|Q9VIP7|ACASE_DROME
19      266     aPHC    8e-64   No_clan
>sp|P07140|ACES_DROME
26      601     COesterase      3.6e-172        CL0028
>sp|P09478|ACH1_DROME
26      240     Neur_chan_LBD   1.3e-78 No_clan
247     530     Neur_chan_memb  5.8e-78 No_clan
>sp|P17644|ACH2_DROME
46      261     Neur_chan_LBD   2.3e-76 No_clan
268     543     Neur_chan_memb  1.1e-73 No_clan
>sp|P04755|ACH3_DROME
28      236     Neur_chan_LBD   9.5e-71 No_clan
243     498     Neur_chan_memb  1.9e-69 No_clan
>sp|P25162|ACH4_DROME
31      245     Neur_chan_LBD   3e-73   No_clan
252     479     Neur_chan_memb  2.2e-71 No_clan
>sp|P16395|ACM1_DROME
121     318     7tm_1   2.2e-57 CL0192
695     772     7tm_1   3.3e-20 CL0192
>sp|Q9VAC5|ADA17_DROME
32      160     Pep_M12B_propep 3.7e-13 No_clan
394     464     Reprolysin      2.2e-05 CL0126
477     555     Disintegrin     2.9e-13 No_clan
>sp|Q9VW60|ADCY2_DROME
310     490     Guanylate_cyc   7.2e-54 CL0276
1102    1300    Guanylate_cyc   1.6e-55 CL0276
>sp|Q9VCY8|ADRL_DROME
198     419     HlyIII  3e-71   No_clan
>sp|Q26365|ADT_DROME
21      114     Mito_carr       8.1e-27 No_clan
127     217     Mito_carr       5.3e-23 No_clan
224     312     Mito_carr       8.4e-14 No_clan

The program I wrote is given below:

# Python 2 script: print the domain names of each multi-domain protein.
infile = open('memb_protein2.hmmout', 'r')
rec = infile.read()
# Records start with '>'; drop the empty chunk before the first one.
records = rec.split('>')[1:]
line = []
protein = ''
for record in records:
    # One newline ends the header line; each remaining one ends a domain row.
    domains = record.count('\n') - 1
    if domains != 1:
        protein = record.split('\n', 1)[0]
        dom_line = record.split('\n', 1)[1]
        dom_present = dom_line.split('\n')[:-1]
        #print dom_present
        for dom_row in dom_present:
            #print dom_row
            dom = dom_row.split('\t')
            #print dom[2]
            #dom_list = dom.split('\t')
            #dom_name = dom_list[2]
            #print dom_name
            line.append(dom[2])  # third column holds the domain name
        print protein + '\t',
        for name in line:
            print name + '\t',
        line = []
        print '\n'
            #seq = ''.join(line)
            #file('multi_dom','a').write(seq)
infile.close()
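
For readers on Python 3, the same parsing could be sketched as a function that returns the domain names per protein instead of printing them. This is only a minimal sketch, not the script that produced the results below: the name `parse_multi_domain` is hypothetical, it assumes the domain name is the third whitespace-separated column (as in the sample file above), and it keeps only proteins with two or more domain rows.

```python
def parse_multi_domain(text):
    """Map each protein header to its list of domain names.

    Only proteins with two or more domain rows are kept.
    Assumes the domain name is the third whitespace-separated column.
    """
    architectures = {}
    # Every record starts with '>'; the chunk before the first '>' is skipped.
    for record in text.split('>')[1:]:
        lines = [ln for ln in record.splitlines() if ln.strip()]
        if not lines:
            continue
        protein, rows = lines[0].strip(), lines[1:]
        # Skip malformed rows that have fewer than three columns.
        domains = [row.split()[2] for row in rows if len(row.split()) >= 3]
        if len(domains) > 1:
            architectures[protein] = domains
    return architectures


# Two records copied from the sample file above.
sample = (
    ">sp|P81928|140U_DROME\n"
    "67\t198\tTim17\t8.9e-19\tNo_clan\n"
    ">sp|P09478|ACH1_DROME\n"
    "26\t240\tNeur_chan_LBD\t1.3e-78\tNo_clan\n"
    "247\t530\tNeur_chan_memb\t5.8e-78\tNo_clan\n"
)
for protein, domains in parse_multi_domain(sample).items():
    print(protein + '\t' + '\t'.join(domains))
```

Returning a dictionary rather than printing makes it easier to feed the domain lists into a later comparison step.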

The results are as follows:

tr|B7YZE8|B7YZE8_DROME	Ion_trans_N	Ion_trans	cNMP_binding	

tr|B7Z145|B7Z145_DROME	EGF_2	EGF_2	EGF_2	EGF_2	EGF_2	EGF_2	EGF_2	

tr|Q8MQM7|Q8MQM7_DROME	ANF_receptor	Lig_chan-Glu_bd	Lig_chan	

tr|Q9VVS0|Q9VVS0_DROME	Mito_carr	Mito_carr	Mito_carr	

tr|O18367|O18367_DROME	Na_Ca_ex	Calx-beta	Calx-beta	Na_Ca_ex	

tr|Q8IQW3|Q8IQW3_DROME	Mito_carr	Mito_carr	

tr|Q8IGX4|Q8IGX4_DROME	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	

tr|Q9VFH5|Q9VFH5_DROME	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	

tr|Q7KT97|Q7KT97_DROME	   Neur_chan_LBD	Neur_chan_memb	

tr|Q8SZN6|Q8SZN6_DROME	Mito_carr	Mito_carr	

tr|Q8T0K9|Q8T0K9_DROME	SBP_bac_3	Lig_chan	

tr|Q9BML7|Q9BML7_DROME	ANF_receptor	7tm_3	

tr|Q0KIB1|Q0KIB1_DROME	V-set	I-set	C2-set_2	I-set	

tr|A1Z855|A1Z855_DROME	Ion_trans	KCNQ_channel	

tr|A1Z9L9|A1Z9L9_DROME	Mito_carr	Mito_carr	Mito_carr	

tr|A8QI34|A8QI34_DROME	Cation_ATPase_N	E1-E2_ATPase	Hydrolase	Cation_ATPase_C	

tr|Q9VSV5|Q9VSV5_DROME	ANF_receptor	Lig_chan-Glu_bd	Lig_chan	

tr|Q95SN5|Q95SN5_DROME	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	EGF_CA	Ldl_recept_b	Ldl_recept_b	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_aLdl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	EGF_CA	Ldl_recept_b	

tr|Q24273|Q24273_DROME	V-set	C2-set_2	C2-set_2	I-set	

tr|Q8IQQ6|Q8IQQ6_DROME	SNF	SNF	

tr|Q8IPK1|Q8IPK1_DROME	Mito_carr	Mito_carr	

tr|Q7PLW4|Q7PLW4_DROME	Na_Ca_ex	Na_Ca_ex	

tr|Q5BHU9|Q5BHU9_DROME	Mito_carr	Mito_carr	

tr|Q2PDZ0|Q2PDZ0_DROME	Neur_chan_LBD	Neur_chan_memb	

tr|Q0E8N6|Q0E8N6_DROME	ANF_receptor	SBP_bac_3	Lig_chan	

tr|Q0KIF2|Q0KIF2_DROME	ANF_receptor	Lig_chan-Glu_bd	Lig_chan	

tr|A1Z7J1|A1Z7J1_DROME	I-set	C2-set_2	C2-set_2	C2-set_2	C2-set_2	C2-set_2	I-set	I-set	I-set	fn3	

tr|A8DYR5|A8DYR5_DROME	K_tetra	Ion_trans	

tr|B7YZR4|B7YZR4_DROME	Ion_trans	KCNQ_channel	

tr|B7Z015|B7Z015_DROME	C2-set_2	I-set	I-set	fn3	

tr|Q9VYY5|Q9VYY5_DROME	Ion_trans_2	Ion_trans_2	

tr|Q8IN24|Q8IN24_DROME	ANF_receptor	7tm_3	

tr|Q7KUV2|Q7KUV2_DROME	Neur_chan_LBD	Neur_chan_memb	

tr|Q3ZZY0|Q3ZZY0_DROME	Ion_trans_2	Ion_trans_2	

tr|Q0E9F2|Q0E9F2_DROME	I-set	C2-set_2	C2-set_2	C2-set_2	C2-set_2	C2-set_2	I-set	I-set	I-set	fn3	

tr|Q0E8B8|Q0E8B8_DROME	V-set	C2-set_2	C2-set_2	I-set	

tr|A1Z9M0|A1Z9M0_DROME	Mito_carr	Mito_carr	Mito_carr	

tr|A1Z9P0|A1Z9P0_DROME	Ion_trans_N	Ion_trans	cNMP_binding	

tr|A8DYJ6|A8DYJ6_DROME	V-set	C2-set_2	I-set	

tr|B7Z0Z2|B7Z0Z2_DROME	Ion_trans	DUF3451	Ion_trans	Na_trans_assoc	Ion_trans	Ion_trans	

tr|Q9VVM6|Q9VVM6_DROME	Sulfate_transp	STAS	

tr|Q9VB20|Q9VB20_DROME	PSI	PSI	

tr|Q967X6|Q967X6_DROME	V-set	C2-set_2	C2-set_2	I-set	fn3	

tr|Q8IQN2|Q8IQN2_DROME	Voltage_CLC	CBS	CBS

Ignore the commented-out (green) lines in the code. The results above are only an example: from this program I obtained similar output containing more than 30,000 lines. I want to compare the result lines with each other and find out how many entries in the results are similar to each other.

python


TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 1 · 2010-08-18T15:31:48+00:00

Check the difflib module in the standard library; it provides tools for comparing sequences such as lines of text.
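
As a rough illustration of how difflib might be applied to this problem, the sketch below scores pairs of domain lists with `difflib.SequenceMatcher`, which accepts any sequences of hashable items, so tuples of domain names can be compared directly. The helper names `parse_result_line` and `similar_pairs` and the 0.8 threshold are assumptions made for illustration; the input lines are copied from the results posted in the question. Identical architectures are counted first with `collections.Counter`, because comparing every pair among 30,000 lines would be slow.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def parse_result_line(line):
    """Split one tab-separated result line into (protein, tuple of domains)."""
    fields = [f.strip() for f in line.split('\t') if f.strip()]
    return fields[0], tuple(fields[1:])


def similar_pairs(architectures, threshold=0.8):
    """Yield (a, b, ratio) for pairs of distinct architectures whose
    similarity ratio is at or above the (assumed) threshold."""
    for a, b in combinations(sorted(architectures), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            yield a, b, ratio


# A few lines copied from the results in the question.
results = [
    "tr|Q9VVS0|Q9VVS0_DROME\tMito_carr\tMito_carr\tMito_carr\t",
    "tr|A1Z9L9|A1Z9L9_DROME\tMito_carr\tMito_carr\tMito_carr\t",
    "tr|Q8IQW3|Q8IQW3_DROME\tMito_carr\tMito_carr\t",
    "tr|Q24273|Q24273_DROME\tV-set\tC2-set_2\tC2-set_2\tI-set\t",
    "tr|Q967X6|Q967X6_DROME\tV-set\tC2-set_2\tC2-set_2\tI-set\tfn3\t",
]

# Step 1: count how many proteins share exactly the same domain architecture.
counts = Counter(parse_result_line(line)[1] for line in results)
for architecture, n in counts.most_common():
    print(n, '\t'.join(architecture))

# Step 2: compare only the distinct architectures with each other.
for a, b, ratio in similar_pairs(counts):
    print('%.2f' % ratio, ' | '.join(['\t'.join(a), '\t'.join(b)]))
```

Working on distinct architectures keeps the number of pairwise comparisons well below what the raw 30,000 lines would require, although it can still be large if most lines are unique.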