I am new to Python programming and struggling with a problem I would like help with. I have multiple text files that I would like to join, using the first column in each file as the key to align the files. Each file could be several hundred lines long. The files SHOULD have the same number of lines. The first line in each file can be omitted in the output file. There may be extra text (Possible superstructure of XX) in one or both files, which can be omitted. The first character in the first column can be dropped as well.

File 1 looks like
>SU
>PD-98059 PD-98059 Tanimoto from SU = 0.129213
>BML-265 BML-265 Tanimoto from SU = 0.163743
>BML-257 BML-257 Tanimoto from SU = 0.156627
>SU 4312 SU 4312 Tanimoto from SU = 1
Possible superstructure of SU
>AG-370 AG-370 Tanimoto from SU = 0.264286
>AG-490 AG-490 Tanimoto from SU = 0.347826

File 2 looks like
>GF
>PD-98059 PD-98059 Tanimoto from GF = 0.118483
>BML-265 BML-265 Tanimoto from GF = 0.164179
>BML-257 BML-257 Tanimoto from GF = 0.213904
>SU 4312 SU 4312 Tanimoto from GF = 0.436364
>AG-370 AG-370 Tanimoto from GF = 0.284848
>AG-490 AG-490 Tanimoto from GF = 0.307692

The output file including headers would look like

ID SU GF
PD-98059 0.129213 0.118483
BML-265 0.163743 0.164179
BML-257 0.156627 0.213904
SU 4312 1 0.436364
AG-370 0.264286 0.284848
AG-490 0.347826 0.307692

At this point I would like to join this output file and add a third column and header. I will need to repeat this process many times building a large text file with the number of columns equal to the number of lines. I am trying to build a distance matrix for another application. I hope someone can find this a challenge and offer a solution. Any help will be appreciated.
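In other words, each input line needs to be reduced to a (name, value) pair and the files joined on name. Here is a rough sketch of what I mean (I am new to Python, so I am only guessing at the code; the parsing rules — drop the leading '>', skip the "Possible superstructure" lines, the name appears twice before "Tanimoto from" — are the ones described above):

```python
def parse(lines):
    """Parse one file's lines into (column_id, {name: value})."""
    col_id = lines[0].lstrip(">").strip()               # header line, e.g. '>SU' -> 'SU'
    values = {}
    for line in lines[1:]:
        line = line.strip()
        if not line.startswith(">"):                    # skip 'Possible superstructure of ...'
            continue
        head = line[1:].rsplit(" Tanimoto from", 1)[0]  # e.g. 'PD-98059 PD-98059'
        name = head[:len(head) // 2].rstrip()           # the name is printed twice
        values[name] = line.rsplit("=", 1)[-1].strip()
    return col_id, values

file_1 = [">SU",
          ">PD-98059 PD-98059 Tanimoto from SU = 0.129213",
          ">SU 4312 SU 4312 Tanimoto from SU = 1",
          "Possible superstructure of SU"]
file_2 = [">GF",
          ">PD-98059 PD-98059 Tanimoto from GF = 0.118483",
          ">SU 4312 SU 4312 Tanimoto from GF = 0.436364"]

id_1, d1 = parse(file_1)
id_2, d2 = parse(file_2)
print("ID %s %s" % (id_1, id_2))
for name in d1:                        # the files should share the same names
    print("%s %s %s" % (name, d1[name], d2.get(name, "NA")))
```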

Hundreds of records is not much in today's world, so you can read each file into a dictionary and go from there. A simple example to associate the two files, because I am too tired to do more today. You can omit some of the unnecessary records from the dictionary, or use it as-is and filter before writing to the third file.

## simulate 2 files read into lists using readlines()
file_1 = ['SU',
'PD-98059 PD-98059 Tanimoto from SU = 0.129213',
'BML-265 BML-265 Tanimoto from SU = 0.163743',
'BML-257 BML-257 Tanimoto from SU = 0.156627',
'SU 4312 SU 4312 Tanimoto from SU = 1',
'AG-370 AG-370 Tanimoto from SU = 0.264286',
'AG-490 AG-490 Tanimoto from SU = 0.347826',
'PD-98060 PD-98059 Tanimoto from SU = 0.129213',
'BML-265 BML-265 Tanimoto from SU = 0.163743',
'BML-257 BML-257 Tanimoto from SU = 0.156627',
'SU 4312 SU 4312 Tanimoto from SU = 1',
'AG-370 AG-370 Tanimoto from SU = 0.264286',
'AG-490 AG-490 Tanimoto from SU = 0.347826',
'PD-98061 PD-98060 Tanimoto from SU = 0.129213',
'BML-265 BML-265 Tanimoto from SU = 0.163743',
'BML-257 BML-257 Tanimoto from SU = 0.156627',
'SU 4312 SU 4312 Tanimoto from SU = 1',
'AG-370 AG-370 Tanimoto from SU = 0.264286',
'AG-490 AG-490 Tanimoto from SU = 0.347826']


file_2 = ['GF',
'PD-98059 PD-98059 Tanimoto from GF = 0.118483',
'BML-265 BML-265 Tanimoto from GF = 0.164179',
'BML-257 BML-257 Tanimoto from GF = 0.213904',
'SU 4312 SU 4312 Tanimoto from GF = 0.436364',
'AG-370 AG-370 Tanimoto from GF = 0.284848',
'AG-490 AG-490 Tanimoto from GF = 0.307692',
'PD-98061 PD-98059 Tanimoto from GF = 0.118483',
'BML-265 BML-265 Tanimoto from GF = 0.164179',
'BML-257 BML-257 Tanimoto from GF = 0.213904',
'SU 4312 SU 4312 Tanimoto from GF = 0.436364',
'AG-370 AG-370 Tanimoto from GF = 0.284848',
'AG-490 AG-490 Tanimoto from GF = 0.307692']
   
def groups(list_in):
    """ break the file into groups of records from "PD" to
        the next "PD"
    """
    return_dict = {}
    group_list = []
    for rec in list_in:
        rec = rec.strip()
        if rec.startswith("PD") and len(group_list):     ## new group
            dict_in = to_dict(group_list, return_dict)
            group_list = []
        group_list.append(rec)

    ## process the final group
    dict_in = to_dict(group_list, return_dict)

    return return_dict

def to_dict(group_list, dict_in):
    """ add to the dictionary
        key = "PD"+number
        values = list of lists = all records associated with this key
    """
    ## the first record contains the "PD" key
    substrs = group_list[0].split()
    key = substrs[0]
    if key in dict_in:
        print "DUPLICATE record", group_list[0]
    else:
        dict_in[key] = []
        ## add all of the records to the dictionary
        ## including the "PD" record
        for rec in group_list:
            dict_in[key].append(rec)

    return dict_in

ID = file_1[0].strip()     ## "SU"
file_1_dict = groups(file_1[1:])

ID += " " + file_2[0].strip()     ## "GF"
file_2_dict = groups(file_2[1:])

print "ID =", ID
## not printed in any particular order
for key in file_1_dict:
    print key
    for rec in file_1_dict[key]:
        print "  ", rec
    if key in file_2_dict:
        for rec in file_2_dict[key]:
            print "     ", rec     ## additional indent
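To carry it through to the joined output the original post asks for, the last step is to pull the number after the '=' out of each record and write one row per shared key. A sketch of just that step — note the flat name-to-record dictionaries here are a simplification of the list-of-records dictionaries built above, and the 'SU'/'GF' headers come from the sample files:

```python
def tanimoto(record):
    """Pull the numeric value out of a '... Tanimoto from X = 0.129213' record."""
    return record.rsplit("=", 1)[-1].strip()

# simplified join: each dictionary maps name -> its full record in one file
dict_1 = {"PD-98059": "PD-98059 PD-98059 Tanimoto from SU = 0.129213",
          "AG-370":   "AG-370 AG-370 Tanimoto from SU = 0.264286"}
dict_2 = {"PD-98059": "PD-98059 PD-98059 Tanimoto from GF = 0.118483",
          "AG-370":   "AG-370 AG-370 Tanimoto from GF = 0.284848"}

rows = ["ID SU GF"]
for key in sorted(dict_1):
    if key in dict_2:                  # keep only keys present in both files
        rows.append("%s %s %s" % (key, tanimoto(dict_1[key]), tanimoto(dict_2[key])))
print("\n".join(rows))                 # or write to the third file
```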

Here is my start also, but it sorts the lines into alphabetical order. Input files are supposed to start with the same letters (here 'file0') and end with '.txt'.

import os
import itertools
ids = []
lines = []
for fn in (f for f in os.listdir(os.curdir) if f.startswith('file0') and f.endswith('.txt')):
    with open(fn) as infile:
        id=next(infile)[1:].rstrip()
        ids.append(id)
        for line in ((line[1:4]+line[4:].split(None, 1)[0], line.rsplit('=',1)[-1].rstrip()) for line in infile):
            lines.append(line)
        if line[0]==id:
            break # ignore superstructure
lines.sort()
# process or write to file
print 'ID',' '.join(ids)
for group,line in itertools.groupby(lines, key=lambda x: x[0]):
    print group, ' '.join(data for _,data in reversed(list(line)))
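A note on the lines.sort() before the groupby: itertools.groupby only groups consecutive items with equal keys, so without the sort one compound would be split across several groups. A tiny illustration (made-up data):

```python
import itertools

data = ["SU", "GF", "SU"]
# without sorting, the two 'SU' items land in separate groups
unsorted_groups = [k for k, g in itertools.groupby(data)]
sorted_groups = [k for k, g in itertools.groupby(sorted(data))]
print(unsorted_groups)  # ['SU', 'GF', 'SU']
print(sorted_groups)    # ['GF', 'SU']
```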

Thanks woooee and tonyjv. I will try these and report back as soon as I can. Be patient with me as I come up to speed.

Post a link to a sample file that we can use for testing, when/if you want to take this thread any further.

I think the preferred way here on DaniWeb is to go to the Advanced view and attach a file. If the file type does not fit, zip it and attach.

I have uploaded two files. One file is the desired output format, and the zip file contains 6 typical files of 1,000-plus lines. Ultimately I would like to merge multiple files. They would start with a common name of mostsim_*.txt. I hope this helps in testing your code. Thank you.

Looks OK to me:

import os
import itertools
ids = []
lines = []
for fn in (f for f in os.listdir(os.curdir) if f.startswith('MostSim') and f.endswith('.txt')):
    with open(fn) as infile:
        name = next(infile)[1:].rstrip()
        ids.append(name)
        for line in ((line[1:].split(None, 1)[0],
                        line.rsplit('= ',1)[-1].rstrip())
                            for line in infile if line.startswith('>')):
                lines.append((name,line))
        if line[0] == name:
            break # ignore superstructure
lines.sort(key = lambda x: x[1][0])
lines = [(a, list(b)) for a, b in itertools.groupby(lines, key=lambda x: x[1][0])]
#print 'Lines begin',lines[0] # debug
# process or write to file
with open('outp_mostsim.txt','w') as outp:
  outp.write('ID\t'+'\t\t'.join(sorted(ids))+'\n')
  for group,line in lines:
      outp.write('%s\t%s' % (group,'\t'.join(b[1] for a,b in sorted(list(line)))+'\n'))

Thank you! This works VERY well and just what I needed. What is the significance of the "extra" indentations on line 10? I will work with this and study it. I appreciate your time and help.

It is a one-line list comprehension which is divided into multiple lines for clarity. You can continue lines, when the expression is in parentheses or square brackets, without the line continuation sign \.

I should not post early in the morning: I meant it is a generator expression, split over multiple lines the way the IDLE environment likes to do it automatically. It is to make the expression more readable, as wide lines are nasty to read and understand.
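To illustrate with a tiny standalone example (made-up data): inside parentheses a generator expression may span several lines with no backslash, which is exactly what the code above does.

```python
# the parentheses of sum() let the generator expression
# continue over three lines without a \ continuation sign
total = sum(len(word)
            for word in ["su", "gf"]
            if word)
print(total)  # 4
```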