Hi, I am again stuck on what is probably a trivial problem. I have a text file with a large number of entries like this:


proteinid
sp|P13368|7LESS_DROME fn3 fn3 fn3 Pkinase_Tyr

sp|P14599|A4_DROME A4_EXTRA APP_amyloid

sp|P09478|ACH1_DROME Neur_chan_LBD Neur_chan_memb

sp|P17644|ACH2_DROME Neur_chan_LBD Neur_chan_memb

sp|P04755|ACH3_DROME Neur_chan_LBD Neur_chan_memb

sp|P25162|ACH4_DROME Neur_chan_LBD Neur_chan_memb

sp|P16395|ACM1_DROME 7tm_1 7tm_1

sp|Q9VAC5|ADA17_DROME Pep_M12B_propep Reprolysin Disintegrin

sp|Q9VW60|ADCY2_DROME Guanylate_cyc Guanylate_cyc

sp|Q26365|ADT_DROME Mito_carr Mito_carr Mito_carr

sp|Q8INB9|AKT1_DROME PH Pkinase Pkinase_C

sp|P15364|AMAL_DROME I-set I-set I-set

sp|P91926|AP2A_DROME Adaptin_N Alpha_adaptinC2 Alpha_adaptin_C

sp|P54362|AP3D_DROME Adaptin_N BLVR

sp|P18824|ARM_DROME Arm Arm Arm Arm

sp|P22700|ATC1_DROME Cation_ATPase_N E1-E2_ATPase Hydrolase Cation_ATPase_C

sp|P13607|ATNA_DROME Cation_ATPase_N E1-E2_ATPase Hydrolase Cation_ATPase_C

sp|P35381|ATPA_DROME ATP-synt_ab_N ATP-synt_ab ATP-synt_ab_C

sp|Q05825|ATPB_DROME ATP-synt_ab_N ATP-synt_ab ATP-synt_ab_C

sp|P12428|BROWN_DROME ABC_tran ABC2_membrane

sp|Q7KT91|C3390_DROME GBA2_N DUF608

sp|Q9VYY4|C4G15_DROME p450 p450

sp|P91645|CAC1A_DROME Ion_trans Ion_trans Ion_trans Ion_trans

In this file, each line gives a protein ID followed by the domains present in that protein. I want to know which proteins are similar in domain content and which are different. The file has more than 30,000 entries, so it is impossible to check by eye.
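One possible way to approach this, offered as a rough sketch rather than a definitive answer: read each line, use its list of domains as a dictionary key, and collect the protein IDs that share that key. The code is written for Python 3. The function name `group_by_domains` and its `ordered` flag are made up for illustration, and the sample lines are copied from the listing above.

```python
from collections import defaultdict

def group_by_domains(lines, ordered=True):
    """Map each domain combination to the protein IDs that share it.

    ordered=True keeps order and repeat counts (fn3 fn3 fn3 differs from fn3);
    ordered=False compares only the set of distinct domain names.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines and the 'proteinid' header, which carry no domains.
        if len(fields) < 2:
            continue
        protein, domains = fields[0], fields[1:]
        key = tuple(domains) if ordered else frozenset(domains)
        groups[key].append(protein)
    return groups

sample = """\
proteinid
sp|P09478|ACH1_DROME Neur_chan_LBD Neur_chan_memb
sp|P17644|ACH2_DROME Neur_chan_LBD Neur_chan_memb
sp|P16395|ACM1_DROME 7tm_1 7tm_1
sp|P22700|ATC1_DROME Cation_ATPase_N E1-E2_ATPase Hydrolase Cation_ATPase_C
sp|P13607|ATNA_DROME Cation_ATPase_N E1-E2_ATPase Hydrolase Cation_ATPase_C
""".splitlines()

groups = group_by_domains(sample)
for domains, proteins in groups.items():
    # A group with more than one protein means identical domain content.
    label = "shared" if len(proteins) > 1 else "unique"
    print(label, " ".join(domains), "->", ", ".join(proteins))
```

Because the function only iterates over lines, an open file object can be passed in place of `sample`. A single pass over the file is enough, so 30,000 entries should not be a problem.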


Seems easy.

Can you post the file?

Text pasted into a thread normally doesn't keep all of its original formatting.

Cheers and Happy coding.

I looked through all 10 of your previous posts and you haven't produced one line of code. It's about time you did or find another site to get free programming.

The actual file is as follows

>sp|P81928|140U_DROME
67      198     Tim17   8.9e-19 No_clan
>sp|P20905|5HT1R_DROME
179     507     7tm_1   1.1e-97 CL0192
>sp|P28285|5HT2A_DROME
243     805     7tm_1   3.2e-73 CL0192
>sp|P28286|5HT2B_DROME
107     588     7tm_1   7.2e-82 CL0192
>sp|P13368|7LESS_DROME
439     520     fn3     1.4e-10 CL0159
1313    1380    fn3     3.4e-05 CL0159
1800    1890    fn3     3.6e-12 CL0159
2209    2481    Pkinase_Tyr     3.7e-91 CL0016
>sp|P14599|A4_DROME
26      198     A4_EXTRA        4.9e-75 No_clan
826     884     APP_amyloid     1.5e-24 No_clan
>sp|P91927|A60DA_DROME
178     446     LETM1   3.7e-114        No_clan
>sp|Q24093|ABHD2_DROME
158     362     Abhydrolase_1   5.5e-07 CL0028
>sp|Q9VIP7|ACASE_DROME
19      266     aPHC    8e-64   No_clan
>sp|P07140|ACES_DROME
26      601     COesterase      3.6e-172        CL0028
>sp|P09478|ACH1_DROME
26      240     Neur_chan_LBD   1.3e-78 No_clan
247     530     Neur_chan_memb  5.8e-78 No_clan
>sp|P17644|ACH2_DROME
46      261     Neur_chan_LBD   2.3e-76 No_clan
268     543     Neur_chan_memb  1.1e-73 No_clan
>sp|P04755|ACH3_DROME
28      236     Neur_chan_LBD   9.5e-71 No_clan
243     498     Neur_chan_memb  1.9e-69 No_clan
>sp|P25162|ACH4_DROME
31      245     Neur_chan_LBD   3e-73   No_clan
252     479     Neur_chan_memb  2.2e-71 No_clan
>sp|P16395|ACM1_DROME
121     318     7tm_1   2.2e-57 CL0192
695     772     7tm_1   3.3e-20 CL0192
>sp|Q9VAC5|ADA17_DROME
32      160     Pep_M12B_propep 3.7e-13 No_clan
394     464     Reprolysin      2.2e-05 CL0126
477     555     Disintegrin     2.9e-13 No_clan
>sp|Q9VW60|ADCY2_DROME
310     490     Guanylate_cyc   7.2e-54 CL0276
1102    1300    Guanylate_cyc   1.6e-55 CL0276
>sp|Q9VCY8|ADRL_DROME
198     419     HlyIII  3e-71   No_clan
>sp|Q26365|ADT_DROME
21      114     Mito_carr       8.1e-27 No_clan
127     217     Mito_carr       5.3e-23 No_clan
224     312     Mito_carr       8.4e-14 No_clan

The program I wrote is given below:

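# Print every multi-domain protein followed by the names of its domains.
infile = open('memb_protein2.hmmout', 'r')
rec = infile.read()
infile.close()
# Records start with '>'; the chunk before the first '>' is empty, so skip it.
records = rec.split('>')[1:]
for record in records:
        # strip() drops the trailing newline, so the last record is parsed
        # correctly even when the file does not end with a newline
        rows = [row for row in record.strip().split('\n') if row.strip()]
        protein = rows[0]
        dom_present = rows[1:]
        # same filter as before: skip proteins with exactly one domain hit
        if len(dom_present) != 1:
                line = []
                for row in dom_present:
                        #print row
                        # columns: start, end, domain name, e-value, clan
                        # split() handles tabs and runs of spaces alike
                        dom = row.split()
                        #print dom[2]
                        line.append(dom[2])
                # one tab after every field and a blank line after each protein,
                # as before; single-argument print() runs on Python 2 and 3
                print(protein + '\t' + ''.join(name + '\t' for name in line) + '\n')
                #seq = ''.join(line)
                #file('multi_dom','a').write(seq)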
The results are as follows:

tr|B7YZE8|B7YZE8_DROME	Ion_trans_N	Ion_trans	cNMP_binding	

tr|B7Z145|B7Z145_DROME	EGF_2	EGF_2	EGF_2	EGF_2	EGF_2	EGF_2	EGF_2	

tr|Q8MQM7|Q8MQM7_DROME	ANF_receptor	Lig_chan-Glu_bd	Lig_chan	

tr|Q9VVS0|Q9VVS0_DROME	Mito_carr	Mito_carr	Mito_carr	

tr|O18367|O18367_DROME	Na_Ca_ex	Calx-beta	Calx-beta	Na_Ca_ex	

tr|Q8IQW3|Q8IQW3_DROME	Mito_carr	Mito_carr	

tr|Q8IGX4|Q8IGX4_DROME	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	

tr|Q9VFH5|Q9VFH5_DROME	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	Cadherin	

tr|Q7KT97|Q7KT97_DROME	Neur_chan_LBD	Neur_chan_memb	

tr|Q8SZN6|Q8SZN6_DROME	Mito_carr	Mito_carr	

tr|Q8T0K9|Q8T0K9_DROME	SBP_bac_3	Lig_chan	

tr|Q9BML7|Q9BML7_DROME	ANF_receptor	7tm_3	

tr|Q0KIB1|Q0KIB1_DROME	V-set	I-set	C2-set_2	I-set	

tr|A1Z855|A1Z855_DROME	Ion_trans	KCNQ_channel	

tr|A1Z9L9|A1Z9L9_DROME	Mito_carr	Mito_carr	Mito_carr	

tr|A8QI34|A8QI34_DROME	Cation_ATPase_N	E1-E2_ATPase	Hydrolase	Cation_ATPase_C	

tr|Q9VSV5|Q9VSV5_DROME	ANF_receptor	Lig_chan-Glu_bd	Lig_chan	

tr|Q95SN5|Q95SN5_DROME	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	EGF_CA	Ldl_recept_b	Ldl_recept_b	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	Ldl_recept_a	EGF_CA	Ldl_recept_b	

tr|Q24273|Q24273_DROME	V-set	C2-set_2	C2-set_2	I-set	

tr|Q8IQQ6|Q8IQQ6_DROME	SNF	SNF	

tr|Q8IPK1|Q8IPK1_DROME	Mito_carr	Mito_carr	

tr|Q7PLW4|Q7PLW4_DROME	Na_Ca_ex	Na_Ca_ex	

tr|Q5BHU9|Q5BHU9_DROME	Mito_carr	Mito_carr	

tr|Q2PDZ0|Q2PDZ0_DROME	Neur_chan_LBD	Neur_chan_memb	

tr|Q0E8N6|Q0E8N6_DROME	ANF_receptor	SBP_bac_3	Lig_chan	

tr|Q0KIF2|Q0KIF2_DROME	ANF_receptor	Lig_chan-Glu_bd	Lig_chan	

tr|A1Z7J1|A1Z7J1_DROME	I-set	C2-set_2	C2-set_2	C2-set_2	C2-set_2	C2-set_2	I-set	I-set	I-set	fn3	

tr|A8DYR5|A8DYR5_DROME	K_tetra	Ion_trans	

tr|B7YZR4|B7YZR4_DROME	Ion_trans	KCNQ_channel	

tr|B7Z015|B7Z015_DROME	C2-set_2	I-set	I-set	fn3	

tr|Q9VYY5|Q9VYY5_DROME	Ion_trans_2	Ion_trans_2	

tr|Q8IN24|Q8IN24_DROME	ANF_receptor	7tm_3	

tr|Q7KUV2|Q7KUV2_DROME	Neur_chan_LBD	Neur_chan_memb	

tr|Q3ZZY0|Q3ZZY0_DROME	Ion_trans_2	Ion_trans_2	

tr|Q0E9F2|Q0E9F2_DROME	I-set	C2-set_2	C2-set_2	C2-set_2	C2-set_2	C2-set_2	I-set	I-set	I-set	fn3	

tr|Q0E8B8|Q0E8B8_DROME	V-set	C2-set_2	C2-set_2	I-set	

tr|A1Z9M0|A1Z9M0_DROME	Mito_carr	Mito_carr	Mito_carr	

tr|A1Z9P0|A1Z9P0_DROME	Ion_trans_N	Ion_trans	cNMP_binding	

tr|A8DYJ6|A8DYJ6_DROME	V-set	C2-set_2	I-set	

tr|B7Z0Z2|B7Z0Z2_DROME	Ion_trans	DUF3451	Ion_trans	Na_trans_assoc	Ion_trans	Ion_trans	

tr|Q9VVM6|Q9VVM6_DROME	Sulfate_transp	STAS	

tr|Q9VB20|Q9VB20_DROME	PSI	PSI	

tr|Q967X6|Q967X6_DROME	V-set	C2-set_2	C2-set_2	I-set	fn3	

tr|Q8IQN2|Q8IQN2_DROME	Voltage_CLC	CBS	CBS

Ignore the commented-out lines in the code. The output shown above is only an example; the full output of this program has more than 30,000 lines of the same kind. I want to compare the entries with each other and find out how many of them are similar to each other.
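To put a number on "how similar", one common choice (an assumption here, not the only option) is the Jaccard index: the number of domain names two proteins share, divided by the number of distinct domain names found in either. The sketch below is for Python 3 and reads lines in the tab-separated form shown above. The names `read_domains`, `jaccard`, `similar_pairs` and the `threshold` default are invented for illustration. Note that comparing sets ignores domain order and repeat counts.

```python
from collections import defaultdict
from itertools import combinations

def read_domains(lines):
    """Parse 'protein<TAB>dom1<TAB>dom2...' lines into {protein: set of domains}."""
    proteins = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank lines and proteins without domains
            proteins[fields[0]] = set(fields[1:])
    return proteins

def jaccard(a, b):
    """Shared domains divided by all distinct domains in either protein."""
    return len(a & b) / len(a | b)

def similar_pairs(proteins, threshold=0.5):
    """Yield (score, id1, id2) for pairs scoring at least `threshold`."""
    # Inverted index: only proteins listed under the same domain can score
    # above zero, so pairs with nothing in common are never examined.
    by_domain = defaultdict(set)
    for pid, doms in proteins.items():
        for dom in doms:
            by_domain[dom].add(pid)
    seen = set()
    for members in by_domain.values():
        for p, q in combinations(sorted(members), 2):
            if (p, q) not in seen:
                seen.add((p, q))
                score = jaccard(proteins[p], proteins[q])
                if score >= threshold:
                    yield score, p, q

sample = [
    "tr|Q8MQM7|Q8MQM7_DROME\tANF_receptor\tLig_chan-Glu_bd\tLig_chan\t",
    "tr|Q0E8N6|Q0E8N6_DROME\tANF_receptor\tSBP_bac_3\tLig_chan\t",
    "tr|Q8T0K9|Q8T0K9_DROME\tSBP_bac_3\tLig_chan\t",
    "tr|Q9BML7|Q9BML7_DROME\tANF_receptor\t7tm_3\t",
]
proteins = read_domains(sample)
result = sorted(similar_pairs(proteins), reverse=True)
for score, p, q in result:
    print("%.2f\t%s\t%s" % (score, p, q))
```

The index avoids most of the all-against-all comparison, but a very common domain (Mito_carr, for example) still produces a large group whose members are all compared with each other, and the `seen` set grows with the number of pairs examined. For 30,000 entries that may or may not be acceptable, so it is worth trying on a subset first.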
